18  IPCW Classification

Abstract
This chapter introduces the inverse probability of censoring weighted (IPCW; first introduced in Chapter 3) classification reduction, a principled approach for estimating survival probabilities at a fixed time point in the presence of right censoring. The method addresses the challenge that standard binary classification targets are ill-defined when censoring occurs before the time horizon of interest. By reweighting observations using inverse estimates of the censoring distribution, the IPCW reduction yields unbiased estimates of the event probability while retaining compatibility with a wide range of machine learning classifiers. The chapter defines the weighted loss function underlying the reduction and shows how common classification objectives, such as the log loss, can be adapted without modifying the learning algorithm itself. Practical considerations are discussed, including model calibration, evaluation under censoring, and computational implications when handling censored observations. The IPCW classification reduction provides a simple yet powerful tool for practitioners seeking to estimate absolute risk at a clinically meaningful time point using off-the-shelf classification methods, while remaining consistent with the statistical structure of survival data.

The inverse probability of censoring weights (IPCW) based reduction transforms a survival task into a weighted classification task (Vock et al. 2016). The IPCW reduction is applicable to single-event right-censored data when the task of interest is to predict the survival probability at a single time point \(\tau\) (sometimes referred to as \(\tau\)-year prediction in survival analysis). This method is therefore particularly useful when predictions are needed at one meaningful time point and a prediction of the entire survival distribution is not required. The reduction also assumes censoring is independent of the event time conditional on covariates.

Consider a right-censored dataset \(\mathcal{D} = \{(\mathbf{x}_i, t_i, \delta_i)\}_{i=1}^n\) with \(\mathbf{x}_i \in \mathbb{R}^p\) as introduced in Chapter 3. The probability of an event occurring by time \(\tau\) is given by the cumulative distribution function: \[ P(Y \leq \tau\mid\mathbf{x}_i) = F(\tau\mid\mathbf{x}_i) = 1-S(\tau\mid\mathbf{x}_i). \] It might be tempting to estimate this probability by defining a binary target variable \[ e_i(\tau):= \mathbb{I}(t_i \leq \tau \wedge \delta_i = 1) \tag{18.1}\] where all observations with an observed event by time \(\tau\) are coded as ones (events) and all other observations as zeros (non-events). Then the quantity of interest could be estimated using any binary classification method that outputs (calibrated) probabilities as \[ P(Y \leq \tau\mid\mathbf{x}_i) = P(e_i(\tau) = 1\mid\mathbf{x}_i) =: \pi(\mathbf{x}_i;\tau). \tag{18.2}\]

This approach would work if there were no censoring before \(\tau\) in the data. However, in the presence of censoring, (18.1) does not define a valid target variable: an observation \(i\) censored before \(\tau\) is neither an event nor a confirmed non-event; it is simply unknown whether the event would have occurred by \(\tau\) had follow-up continued. Coding such observations as non-events, as (18.1) does, biases the estimated event probability downwards. Removing them would introduce a form of selection bias (Chapter 1): events before \(\tau\) are always retained, whereas event-free observations are only retained if they are not censored before \(\tau\), so the event probability would be overestimated. Treating them as events would introduce another form of bias by artificially increasing the probability of experiencing the event.
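The effect of these naive strategies can be checked in a small simulation. The sketch below is illustrative only: the exponential event and censoring distributions, the sample size, and all variable names are assumptions chosen for the example, not part of the original method.

```python
import math
import random

rng = random.Random(0)
n, tau = 50_000, 1.0

# Hypothetical data: independent Exp(1) event and censoring times.
event_time = [rng.expovariate(1.0) for _ in range(n)]
cens_time = [rng.expovariate(1.0) for _ in range(n)]
t = [min(y, c) for y, c in zip(event_time, cens_time)]              # observed time
delta = [1 if y <= c else 0 for y, c in zip(event_time, cens_time)]  # 1 = event observed

# True event probability by tau (only known because the data are simulated).
true_prob = sum(y <= tau for y in event_time) / n

# Naive target (18.1): censored observations are coded as non-events.
e = [1 if (t_i <= tau and d_i == 1) else 0 for t_i, d_i in zip(t, delta)]
naive_as_nonevent = sum(e) / n

# Alternative naive strategy: drop observations censored before tau.
kept = [e_i for e_i, t_i, d_i in zip(e, t, delta) if not (t_i < tau and d_i == 0)]
naive_dropped = sum(kept) / len(kept)

print(f"true: {true_prob:.3f}  theory: {1 - math.exp(-tau):.3f}")
print(f"coded as non-events: {naive_as_nonevent:.3f}")
print(f"censored dropped:    {naive_dropped:.3f}")
```

Under these simulation assumptions the true probability is \(1 - e^{-1} \approx 0.63\), while coding censored observations as non-events gives roughly 0.43 and dropping them gives roughly 0.76, so the two naive estimates err in opposite directions.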

Vock et al. (2016) adapt the estimation procedure to obtain unbiased estimates of (18.2) by first calculating weights for each observation as \[ \tilde{w}_i(\tau) = \begin{cases} 0 & \text{if } t_i < \tau \wedge \delta_i = 0,\\ \left[\hat{G}_{KM}(\min(t_i, \tau))\right]^{-1} & \text{otherwise}, \end{cases} \tag{18.3}\] where \(\hat{G}_{KM}\) is the Kaplan-Meier estimate of the censoring distribution (Section 3.6.2). Hence, observations censored before \(\tau\) are effectively removed (their weight is zero), while all remaining observations, namely those with an event by \(\tau\) and those still under observation at \(\tau\), are upweighted to compensate for the information loss. The weight of a retained observation is the inverse of its estimated probability of being uncensored at \(\min(t_i, \tau)\).
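As a rough illustration of (18.3), the following sketch computes the weights in plain Python. The function names `km_censoring` and `ipcw_weights` and the five-observation dataset are hypothetical, and the Kaplan-Meier step assumes that event and censoring times are never tied; it is not the implementation used by Vock et al. (2016).

```python
from bisect import bisect_right

def km_censoring(t, delta):
    """Kaplan-Meier estimate of G(s) = P(C > s), treating censoring as the 'event'.

    Simplifying assumption: no ties between event and censoring times.
    """
    cens_times = sorted({t_i for t_i, d_i in zip(t, delta) if d_i == 0})
    surv, g = [], 1.0
    for s in cens_times:
        at_risk = sum(t_i >= s for t_i in t)
        n_cens = sum(t_i == s and d_i == 0 for t_i, d_i in zip(t, delta))
        g *= 1.0 - n_cens / at_risk
        surv.append(g)

    def G(s):
        k = bisect_right(cens_times, s)  # number of distinct censoring times <= s
        return 1.0 if k == 0 else surv[k - 1]

    return G

def ipcw_weights(t, delta, tau):
    """Weights from (18.3): zero if censored before tau, else 1 / G(min(t_i, tau))."""
    G = km_censoring(t, delta)
    weights = []
    for t_i, d_i in zip(t, delta):
        if t_i < tau and d_i == 0:
            weights.append(0.0)
        else:
            # G can reach zero if tau lies beyond the last observed time;
            # this sketch assumes tau is well inside the follow-up range.
            weights.append(1.0 / G(min(t_i, tau)))
    return weights

# Hypothetical toy data: observed times and event indicators.
t = [1.0, 2.0, 3.0, 4.0, 5.0]
delta = [1, 0, 1, 0, 1]
w = ipcw_weights(t, delta, tau=3.5)
print(w)
```

In this toy example the observation censored at time 2 receives weight zero, the event at time 1 (before any censoring) keeps weight one, and the remaining three observations each receive weight \(1/0.75\), so the weights sum to the original sample size of five.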

These weights are integrated into the estimation procedure by optimizing a weighted objective function, allowing the binary target variable in (18.1) to be used as a valid object of prediction. Let \(\ell(e_i(\tau), \pi(\mathbf{x}_i;\tau))\) be some pointwise loss, then the learner is trained to optimize \[ L(\pi, \tau) = \sum_{i=1}^n \tilde{w}_i(\tau) \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)). \tag{18.4}\]

Thus, (18.4) can be optimized by any classification learner that can handle weights (which is the case for most popular machine learning methods). The exact form of (18.4) depends on the choice of learner and objective function. For example, let \(\ell\) be the log loss, \[ \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)) = -\left[e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right], \]

then (18.4) becomes

\[ L(\pi, \tau) = -\sum_{i=1}^n \tilde{w}_i(\tau) \left[e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right]. \]
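To make the weighted log loss concrete, the sketch below evaluates it for a hypothetical five-observation dataset whose targets and weights are assumed to have been obtained from (18.1) and (18.3). It then checks, by a simple grid search, that the best constant prediction (a model without covariates) is the weighted mean of the targets. The function name `weighted_log_loss` and the data are illustrative assumptions.

```python
import math

def weighted_log_loss(e, pi, w, eps=1e-12):
    """Weighted log loss: -sum_i w_i [e_i log(pi_i) + (1 - e_i) log(1 - pi_i)]."""
    total = 0.0
    for e_i, pi_i, w_i in zip(e, pi, w):
        p = min(max(pi_i, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= w_i * (e_i * math.log(p) + (1 - e_i) * math.log(1.0 - p))
    return total

# Hypothetical targets e_i(tau) and IPCW weights for five observations;
# the second observation was censored before tau and has weight zero.
e = [1, 0, 1, 0, 0]
w = [1.0, 0.0, 4 / 3, 4 / 3, 4 / 3]

# Closed-form minimizer for a constant prediction: the weighted mean of e.
weighted_mean = sum(w_i * e_i for w_i, e_i in zip(w, e)) / sum(w)

# Grid search over constant predictions as a numerical check.
grid = [k / 1000 for k in range(1, 1000)]
best = min(grid, key=lambda p: weighted_log_loss(e, [p] * len(e), w))

print(f"weighted mean: {weighted_mean:.4f}, grid minimizer: {best:.3f}")
```

With a real learner the constant prediction is replaced by \(\pi(\mathbf{x}_i;\tau)\) and the weights are passed to the fitting routine; in scikit-learn, for instance, many estimators accept them through the `sample_weight` argument of `fit`.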

This simple reduction allows practitioners to estimate \(\tau\)-year survival probabilities for right-censored data using classification learners. However, when using this reduction, there are some important aspects to consider. Firstly, while the weighting in (18.4) takes censoring into account, the success of the predictions still depends on the choice of model and loss function. For example, the estimates might not be well-calibrated if the chosen loss function does not yield well-calibrated probabilities (for example, the hinge loss used by support vector machines). Secondly, when evaluating predictions \(\hat{\pi}(\mathbf{x};\tau)\) on unseen test data, one cannot use standard evaluation metrics for binary classification, as the test data will still contain censored observations. However, one can use survival metrics for evaluation, particularly rank-based metrics like concordance indices (Chapter 6) and some scoring rules (Chapter 8). Finally, (18.3) implies that contributions of observations censored before time \(\tau\) are multiplied by zero in (18.4). Algorithmically, it may appear more efficient to remove these observations from the training data to avoid unnecessary computations of the loss. However, this should only be done with care: the weights must be computed before any observations are removed, since the censored observations are needed to estimate \(\hat{G}_{KM}\), and, especially when using resampling techniques such as cross-validation, censored observations are still required for evaluation.
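One way to drop zero-weight rows safely is to filter only the training fold, and only after its weights have been computed. The following minimal sketch assumes the weights were already obtained from (18.3) on the full training fold; the helper `drop_zero_weight` and the data are hypothetical.

```python
def drop_zero_weight(features, target, weights):
    """Keep only training rows with positive weight (hypothetical helper)."""
    keep = [i for i, w_i in enumerate(weights) if w_i > 0]
    return (
        [features[i] for i in keep],
        [target[i] for i in keep],
        [weights[i] for i in keep],
    )

# Hypothetical training fold; weights were computed on the full fold beforehand.
x_train = [[0.2], [1.5], [0.7], [1.1], [0.4]]
e_train = [1, 0, 1, 0, 0]
w_train = [1.0, 0.0, 4 / 3, 4 / 3, 4 / 3]

x_fit, e_fit, w_fit = drop_zero_weight(x_train, e_train, w_train)

# Weighted sums entering the objective are unchanged by the filtering step.
full = sum(w_i * e_i for w_i, e_i in zip(w_train, e_train))
reduced = sum(w_i * e_i for w_i, e_i in zip(w_fit, e_fit))
print(len(x_fit), full, reduced)
```

The test fold is left untouched in this scheme, so its censored observations remain available for survival-specific evaluation metrics.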

18.1 Conclusion

Key takeaways
  • The IPCW reduction transforms a survival task to a weighted classification task and can greatly simplify the estimation of survival probabilities at a specific time point of interest.
  • Many learners for binary classification can be used out-of-the-box without further modifications.
  • Learners trained by gradient-based optimization, such as gradient boosting and deep learning, are particularly well suited for this task, as they typically allow both (custom) loss functions and observation weights to be specified.
Limitations
  • Currently, the IPCW approach has only been described in detail for right-censored data. Extensions to other settings might be possible, but have not been explored yet.
  • Extensions to event-history analysis have also not been well explored, though an extension to competing risks has recently been proposed (see further reading).
Further reading
  • Vock et al. (2016) provide the main reference where they explicitly show how different learners (logistic regression, Bayesian networks, decision trees and k-nearest neighbors) can be adapted to obtain unbiased estimates of the event probability in the presence of censoring based on adapted IPC weights. They also discuss suitable evaluation metrics.
  • Gonzalez Ginestet et al. (2021) extend the approach to the competing risks setting.