18 IPCW Classification
The inverse probability of censoring weights (IPCW) reduction transforms a survival task into a weighted classification task (Vock et al. 2016). It is applicable to single-event right-censored data when the task of interest is to predict the survival probability at a single time point \(\tau\) (sometimes referred to as \(\tau\)-year prediction in survival analysis). The method is therefore particularly useful when predictions are needed at a single meaningful time point and a prediction of the entire survival distribution is not required. The reduction also assumes that censoring is independent of the event time conditional on covariates.
Consider a right-censored dataset \(\mathcal{D} = \{(\mathbf{x}_i, t_i, \delta_i)\}_{i=1}^n\) with \(\mathbf{x}_i \in \mathbb{R}^p\) as introduced in Chapter 3. The probability of an event occurring by time \(\tau\) is given by the cumulative distribution function: \[ P(Y \leq \tau\mid\mathbf{x}_i) = F(\tau\mid\mathbf{x}_i) = 1-S(\tau\mid\mathbf{x}_i). \] It might be tempting to estimate this probability by defining a binary target variable \[ e_i(\tau) := \mathbb{I}(t_i \leq \tau \wedge \delta_i = 1) \tag{18.1}\] where all observations with an observed event at or before time \(\tau\) are coded as ones (events) and all other observations as zeros (non-events). The quantity of interest could then be estimated with any binary classification method that outputs (calibrated) probabilities, as \[ P(Y \leq \tau\mid\mathbf{x}_i) = P(e_i(\tau) = 1\mid\mathbf{x}_i) =: \pi(\mathbf{x}_i;\tau). \tag{18.2}\]
This approach would work if there were no censoring before \(\tau\) in the data. However, in the presence of censoring, (18.1) does not define a valid target variable: an observation \(i\) censored before \(\tau\) is neither an event nor a confirmed non-event; it is simply unknown whether the event would have occurred by \(\tau\) had follow-up continued. Removing such observations would introduce a form of selection bias (Chapter 1): observations with early events are less likely to be censored than observations that survive longer, so the remaining sample over-represents events and the average survival time is skewed to be shorter than in reality. Treating them as non-events, as (18.1) implicitly does, would artificially decrease the probability of experiencing the event, whereas treating them as events would artificially increase it.
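To make the problem with the naive target concrete, the following minimal sketch computes \(e_i(\tau)\) from (18.1) for a small made-up dataset and flags the observations whose status at \(\tau\) is unknown. The function names and the toy data are illustrative assumptions, not part of any library.

```python
def naive_target(times, events, tau):
    """Binary target e_i(tau) from (18.1): 1 if an event was observed by tau, else 0."""
    return [int(t <= tau and d == 1) for t, d in zip(times, events)]


def status_unknown(times, events, tau):
    """True where an observation was censored before tau, so its status at tau is unknown."""
    return [t < tau and d == 0 for t, d in zip(times, events)]


# Hypothetical toy data: observed times and event indicators (1 = event, 0 = censored).
times = [2.0, 3.0, 4.0, 6.0, 7.0]
events = [1, 0, 1, 0, 1]
tau = 5.0

e = naive_target(times, events, tau)          # [1, 0, 1, 0, 0]
unknown = status_unknown(times, events, tau)  # [False, True, False, False, False]
```

The second observation is coded as a non-event by (18.1) even though it was only followed until time 3, whereas the fourth observation (censored at time 6) is a confirmed non-event at \(\tau = 5\).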
Vock et al. (2016) adapt the estimation procedure to obtain unbiased estimates of (18.2) by first calculating weights for each observation as \[ \tilde{w}_i(\tau) = \begin{cases} 0 & \text{if } t_i < \tau \wedge \delta_i = 0,\\ \left[\hat{G}_{KM}(\min(t_i, \tau))\right]^{-1} & \text{otherwise}, \end{cases} \tag{18.3}\] where \(\hat{G}_{KM}\) is the Kaplan-Meier estimate of the censoring distribution (Section 3.6.2). Hence, observations censored before \(\tau\) are effectively removed (their weight is zero), and all remaining observations (events before \(\tau\) and observations still under follow-up at \(\tau\)) are upweighted to compensate for the information loss. Each non-zero weight is the inverse of the estimated probability of remaining uncensored up to \(\min(t_i, \tau)\).
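The weights in (18.3) can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: `km_censoring` and `ipcw_weights` are hypothetical helper names, the Kaplan-Meier estimate is taken to be right-continuous, the risk set at time \(s\) contains all observations with \(t_i \geq s\), and the toy data contain no ties.

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate of the censoring distribution G(t).

    Censoring (events[i] == 0) plays the role of the 'event' here.
    Returns a right-continuous step function G.
    """
    censor_times = sorted({t for t, d in zip(times, events) if d == 0})
    steps = []  # pairs (censoring time s, value of G from s onwards)
    g = 1.0
    for s in censor_times:
        at_risk = sum(1 for t in times if t >= s)
        censored = sum(1 for t, d in zip(times, events) if t == s and d == 0)
        g *= 1.0 - censored / at_risk
        steps.append((s, g))

    def G(t):
        value = 1.0
        for s, g_s in steps:
            if s <= t:
                value = g_s
            else:
                break
        return value

    return G


def ipcw_weights(times, events, tau):
    """IPCW weights from (18.3) for the prediction time tau."""
    G = km_censoring(times, events)
    weights = []
    for t, d in zip(times, events):
        if t < tau and d == 0:
            weights.append(0.0)  # censored before tau: status unknown
        else:
            g = G(min(t, tau))
            if g <= 0.0:
                raise ValueError("estimated G is zero; weight is undefined")
            weights.append(1.0 / g)
    return weights


# Hypothetical toy data: observed times and event indicators (1 = event, 0 = censored).
times = [2.0, 3.0, 4.0, 6.0, 7.0]
events = [1, 0, 1, 0, 1]
weights = ipcw_weights(times, events, tau=5.0)
```

In this toy example the observation censored at time 3 receives weight zero, the event at time 2 (before any censoring) keeps weight 1, and the three remaining observations each receive weight \(1/0.75\), so the weights sum to the original sample size of 5.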
These weights are integrated into the estimation procedure by minimizing a weighted objective function, which allows the binary target variable in (18.1) to be used as a valid object of prediction. Let \(\ell(e_i(\tau), \pi(\mathbf{x}_i;\tau))\) be some pointwise loss; the learner is then trained to minimize \[ L(\pi, \tau) = \sum_{i=1}^n \tilde{w}_i(\tau) \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)). \tag{18.4}\]
Thus, (18.4) can be optimized by any classification learner that can handle weights (which is the case for most popular machine learning methods). The exact form of (18.4) depends on the choice of learner and objective function. For example, let \(\ell\) be the log loss, \[ \ell(e_i, \pi(\mathbf{x}_i;\tau)) = -\left[e_i\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i)\log(1 - \pi(\mathbf{x}_i;\tau))\right], \]
then (18.4) becomes
\[ L(\pi, \tau) = -\sum_{i=1}^n \tilde{w}_i(\tau) \left[e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right]. \]
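As a minimal sketch, the weighted log loss above can be evaluated directly for given targets, predicted probabilities, and weights. The function name `weighted_log_loss` and the clipping constant `eps` are assumptions made for this illustration; clipping only guards against taking the logarithm of zero.

```python
import math


def weighted_log_loss(e, pi, w, eps=1e-12):
    """Weighted log loss: the objective (18.4) with the log loss as pointwise loss.

    e  : binary targets e_i(tau)
    pi : predicted event probabilities pi(x_i; tau)
    w  : IPCW weights w_i(tau)
    """
    total = 0.0
    for e_i, p_i, w_i in zip(e, pi, w):
        p_i = min(max(p_i, eps), 1.0 - eps)  # keep log() finite
        total -= w_i * (e_i * math.log(p_i) + (1 - e_i) * math.log(1.0 - p_i))
    return total


# Hypothetical inputs: two observations with uninformative predictions of 0.5.
loss = weighted_log_loss(e=[1, 0], pi=[0.5, 0.5], w=[1.0, 2.0])
```

In practice one would rarely code the loss by hand; many learners accept observation weights at fitting time, for example through the `sample_weight` argument of `fit` in scikit-learn's `LogisticRegression`.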
This simple reduction allows practitioners to estimate \(\tau\)-year survival probabilities for right-censored data using classification learners. However, there are some important aspects to consider when using it. Firstly, while the weighting in (18.4) accounts for censoring, the quality of the predictions still depends on the choice of model and loss function. For example, the estimates might not be well-calibrated if the chosen loss function does not yield well-calibrated probabilities (for example, the hinge loss used by support vector machines). Secondly, when evaluating predictions \(\hat{\pi}(\mathbf{x};\tau)\) on unseen test data, one cannot use standard evaluation metrics for binary classification, as the test data will still contain censored observations. However, one can use survival metrics for evaluation, particularly rank-based metrics like concordance indices (Chapter 6) and some scoring rules (Chapter 8). Finally, (18.3) implies that the contributions of observations censored before time \(\tau\) are multiplied by zero in (18.4). Algorithmically, it may appear more efficient to remove these observations from the training data to avoid unnecessary computations of the loss. However, this should only be done with care, especially when using resampling techniques such as cross-validation, as censored observations are still required to estimate \(\hat{G}_{KM}\) and for evaluation.
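The last point can be illustrated with a short sketch: once the weights have been computed on the full training data, dropping the zero-weight observations leaves the value of (18.4) unchanged. The numbers below are made up for illustration; the weights are assumed to have been obtained from (18.3) before any observation was removed, and `objective` is a hypothetical helper for the weighted log loss.

```python
import math


def objective(e, pi, w):
    """Weighted log loss, i.e. (18.4) with the log loss as pointwise loss."""
    return -sum(
        w_i * (e_i * math.log(p_i) + (1 - e_i) * math.log(1.0 - p_i))
        for e_i, p_i, w_i in zip(e, pi, w)
    )


# Hypothetical targets, weights (computed on the FULL data), and predictions.
e = [1, 0, 1, 0, 0]
w = [1.0, 0.0, 4 / 3, 4 / 3, 4 / 3]
pi = [0.8, 0.3, 0.6, 0.2, 0.4]

full = objective(e, pi, w)

# Drop zero-weight rows only AFTER the weights have been computed.
keep = [i for i, w_i in enumerate(w) if w_i > 0]
reduced = objective([e[i] for i in keep], [pi[i] for i in keep], [w[i] for i in keep])

same = math.isclose(full, reduced)  # zero-weight rows contribute nothing
```

The order of operations matters: filtering the data first and estimating the censoring distribution afterwards would discard exactly the censoring times that \(\hat{G}_{KM}\) is estimated from.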