18 IPCW Classification
The inverse probability of censoring weights (IPCW; Section 3.6.2) reduction transforms a survival task into a weighted classification task (Vock et al. 2016). It is applicable to single-event, right-censored data when the task of interest is to predict the survival probability at a single time point \(\tau\) (sometimes referred to as \(\tau\)-year prediction in survival analysis). The method is therefore particularly useful when a prediction is needed at a single meaningful time point and the entire survival distribution is not required. The reduction assumes that censoring is independent of the event time conditional on covariates.
Consider a right-censored dataset \(\mathcal{D}= \{(\mathbf{x}_i, t_i, \delta_i)\}_{i=1}^n\) with \(\mathbf{x}_i \in \mathbb{R}^p\) as introduced in Chapter 3. The probability of an event occurring by time \(\tau\) is given by the cumulative distribution function,
\[ \Pr(Y \leq \tau \mid \mathbf{x}_i) = F(\tau \mid \mathbf{x}_i) = 1-S(\tau \mid \mathbf{x}_i), \]
where \(Y\) is the random variable representing the true event time. It might be tempting to estimate this probability by defining a binary target variable:
\[ e_i(\tau):= \mathbb{I}(t_i \leq \tau \wedge \delta_i = 1), \tag{18.1}\]
where all observations with an observed event at or before time \(\tau\) are coded as ones (events) and all other observations as zeros (non-events). Using (18.1), the quantity of interest could naively be estimated with any binary classification method that outputs probabilities as \[ \Pr(Y \leq \tau \mid \mathbf{x}_i) = \Pr(e_i(\tau) = 1 \mid \mathbf{x}_i) =: \pi(\mathbf{x}_i;\tau). \tag{18.2}\]
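As a minimal sketch, the target in (18.1) can be computed as follows. The function name `binary_target` and the toy data are hypothetical and serve only as illustration. Note that the observation censored at time 3 is coded as zero here even though its status at \(\tau\) is unknown.

```python
def binary_target(times, events, tau):
    """e_i(tau) from (18.1): 1 if an event was observed at or before tau, else 0."""
    return [int(t <= tau and d == 1) for t, d in zip(times, events)]


# Hypothetical toy data (not from the text): observed times and status indicators
times = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
events = [1, 0, 1, 0, 1, 1]  # 1 = event observed, 0 = censored

e = binary_target(times, events, tau=4.5)
print(e)  # [1, 0, 1, 0, 0, 0]
```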
This approach would only work if there were no censoring before \(\tau\) in the data. In the presence of censoring, (18.1) does not define a valid target variable: an observation \(i\) censored before \(\tau\) is neither an event nor a confirmed non-event; it is simply unknown whether the event would have occurred had follow-up been allowed to continue. Removing such observations would introduce a form of selection bias (Chapter 1): observations with early events are retained more often than observations that would have survived past \(\tau\), so the probability of experiencing the event is overestimated. Treating them as events would also artificially increase this probability, whereas treating them as non-events, as (18.1) implicitly does, would artificially decrease it.
Vock et al. (2016) adapt the estimation procedure to obtain unbiased estimates of (18.2) by calculating weights for each observation,
\[ \tilde{w}_i(\tau) = \begin{cases} 0, & \text{if } t_i < \tau \wedge \delta_i = 0,\\ \left[\hat{G}_{KM}(\min(t_i, \tau))\right]^{-1}, & \text{otherwise}, \end{cases} \tag{18.3}\]
where \(\hat{G}_{KM}\) is the Kaplan-Meier estimate of the censoring distribution (Section 3.6.2). Hence, observations censored before \(\tau\) receive weight zero (and are thus effectively removed), while all remaining observations are upweighted to compensate for the information loss. Each non-zero weight is the inverse of the estimated probability of remaining uncensored up to \(\min(t_i, \tau)\).
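The weights in (18.3) can be sketched in plain Python as below. This is an illustration under stated assumptions, not a reference implementation: the helper names `km_censoring` and `ipcw_weights` and the toy data are hypothetical, the code assumes no ties between event and censoring times (tie conventions differ between implementations), and it does not guard against \(\hat{G}_{KM}\) reaching zero.

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    Returns a function t -> G_hat(t). Assumes no tied event/censoring times.
    """
    cens_times = sorted({t for t, d in zip(times, events) if d == 0})
    steps = []  # (censoring time u, value of G_hat from u onwards)
    g = 1.0
    for u in cens_times:
        at_risk = sum(1 for t in times if t >= u)
        n_cens = sum(1 for t, d in zip(times, events) if t == u and d == 0)
        g *= (at_risk - n_cens) / at_risk
        steps.append((u, g))

    def g_hat(t):
        value = 1.0
        for u, g_u in steps:
            if u > t:
                break
            value = g_u
        return value

    return g_hat


def ipcw_weights(times, events, tau):
    """Weights from (18.3): zero if censored before tau, else 1 / G_hat(min(t, tau))."""
    g_hat = km_censoring(times, events)
    return [
        0.0 if (t < tau and d == 0) else 1.0 / g_hat(min(t, tau))
        for t, d in zip(times, events)
    ]


# Hypothetical toy data: censoring occurs at times 3 and 5
times = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
events = [1, 0, 1, 0, 1, 1]

w = ipcw_weights(times, events, tau=4.5)
print([round(w_i, 4) for w_i in w])  # [1.0, 0.0, 1.25, 1.25, 1.25, 1.25]
```

In this toy example only the observation censored at time 3 is dropped. Every observation still under follow-up after time 3 receives weight \(1/0.8 = 1.25\), so the weights sum to the original sample size of 6.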
These weights are integrated into the estimation procedure by optimizing a weighted objective function, which allows the binary target variable in (18.1) to be used as a valid object of prediction. Let \(\ell(e_i(\tau), \pi(\mathbf{x}_i;\tau))\) be some pointwise loss. Then any learner that can handle observation weights (true for most popular machine learning methods) can be trained to optimize: \[ L(\pi,\tau) = \sum_{i=1}^n \tilde{w}_i(\tau) \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)). \tag{18.4}\]
The exact form of (18.4) depends on the choice of learner and objective function. For example, let \(\ell\) be the log loss,
\[ \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)) = -\left[e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right], \]
then (18.4) becomes:
\[ L(\pi,\tau) = -\sum_{i=1}^n \tilde{w}_i(\tau) \left[e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right]. \]
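To make the weighted log loss concrete, the sketch below evaluates it for a constant prediction \(\pi\) (a model without covariates). In that special case, setting the derivative of the objective to zero gives the weighted mean \(\sum_i \tilde{w}_i e_i / \sum_i \tilde{w}_i\) as the minimizer. The targets and weights are hypothetical toy values (six observations at times 2 to 7, censored at times 3 and 5, with \(\tau = 4.5\)), and the grid search is included only as a numerical check.

```python
import math


def weighted_log_loss(e, w, pi):
    """Objective (18.4) with log loss for a constant prediction pi in (0, 1)."""
    return -sum(
        w_i * (e_i * math.log(pi) + (1 - e_i) * math.log(1 - pi))
        for e_i, w_i in zip(e, w)
    )


# Hypothetical toy values: targets from (18.1) and weights from (18.3), tau = 4.5
e = [1, 0, 1, 0, 0, 0]
w = [1.0, 0.0, 1.25, 1.25, 1.25, 1.25]

# Closed-form minimizer for a constant prediction: the weighted mean of e
pi_hat = sum(w_i * e_i for e_i, w_i in zip(e, w)) / sum(w)

# Crude grid search over (0, 1) as a numerical check of the closed form
grid = [k / 1000 for k in range(1, 1000)]
pi_grid = min(grid, key=lambda p: weighted_log_loss(e, w, p))

# Unweighted alternatives for comparison
naive = sum(e) / len(e)  # early-censored rows counted as non-events
kept = [e_i for e_i, w_i in zip(e, w) if w_i > 0]
complete_case = sum(kept) / len(kept)  # early-censored rows dropped, no reweighting

print(pi_hat, pi_grid)  # 0.375 0.375
print(round(naive, 3), complete_case)  # 0.333 0.4
```

For these toy data the weighted estimate 0.375 coincides with the Kaplan-Meier estimate \(1 - \hat{S}_{KM}(4.5) = 1 - \tfrac{5}{6}\cdot\tfrac{3}{4}\), while the two unweighted alternatives fall on either side of it. With covariates, the same weights would be passed to the learner's weight argument; many scikit-learn classifiers, for instance, accept `sample_weight` in `fit`.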
This simple reduction lets practitioners estimate \(\tau\)-year survival probabilities for right-censored data using classification learners, but several important aspects warrant consideration. Firstly, while the weighting in (18.4) accounts for censoring, the quality of the predictions still depends on the choice of model and loss function. For example, the estimates might not be well-calibrated if the chosen loss function does not yield well-calibrated probabilities (such as the hinge loss used by support vector machines). Secondly, predictions cannot be evaluated using binary classification metrics, as the test data will still include censored observations. Instead, one should use survival metrics, such as survival scoring rules (Chapter 8) and concordance indices (Chapter 6). Finally, (18.3) implies that the contributions of observations censored before time \(\tau\) are multiplied by zero in (18.4). Algorithmically, it may be more efficient to remove these zero-weight observations from the training data to avoid unnecessary computations of the loss. However, this should be done with care and only after the weights have been computed, especially when using resampling techniques such as cross-validation, as censored observations are still required for evaluation.
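The last point can be checked with a short sketch: dropping zero-weight rows after the weights have been computed leaves the training objective unchanged. The targets, weights, and predictions below are hypothetical toy values, and the filtering is meant to apply to training data only, with test data left intact for evaluation.

```python
import math


def weighted_log_loss(e, w, p):
    """Objective (18.4) with log loss; p[i] is the predicted probability for row i."""
    return -sum(
        w_i * (e_i * math.log(p_i) + (1 - e_i) * math.log(1 - p_i))
        for e_i, w_i, p_i in zip(e, w, p)
    )


# Hypothetical training data: targets (18.1), weights (18.3), and some predictions
e = [1, 0, 1, 0, 0, 0]
w = [1.0, 0.0, 1.25, 1.25, 1.25, 1.25]
p = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

# Keep only rows with non-zero weight (weights were computed on the full data first)
keep = [i for i, w_i in enumerate(w) if w_i > 0]

full = weighted_log_loss(e, w, p)
reduced = weighted_log_loss(
    [e[i] for i in keep], [w[i] for i in keep], [p[i] for i in keep]
)

print(len(keep), math.isclose(full, reduced))  # 5 True
```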