16 IPC weighted classification
The reduction based on inverse probability of censoring weights (IPCW; Section 3.6.2) transforms a survival task into a weighted classification task (Vock et al. 2016). Conceptually, it is one of the simplest reductions, but currently also the least general, as it only applies to single-event, right-censored data. The method is useful when one is not interested in estimating the entire event time distribution, but only the probability that an event occurs before a given time point \(\tau\) (sometimes referred to as \(\tau\)-year prediction in survival analysis).
Consider a right-censored data set \(\mathcal{D} = \{(\mathbf{x}_i, t_i, \delta_i)\}_{i=1}^n\) with \(\mathbf{x}_i \in \mathbb{R}^p\) as introduced in Section 3.2. The probability of an event occurring by time \(\tau\) is given by the complement of the survival probability at time \(\tau\): \[ P(Y \leq \tau\mid\mathbf{x}_i) = F(\tau\mid\mathbf{x}_i) = 1-S(\tau\mid\mathbf{x}_i). \] It might be tempting to estimate this probability by defining a binary target variable \[ e_i(\tau):= \mathbb{I}(t_i \leq \tau \wedge \delta_i = 1) \tag{16.1}\] where all observations with an observed event at or before time \(\tau\) are coded as ones (events) and all other observations as zeros (non-events). The quantity of interest could then be estimated using any binary classification method that outputs (calibrated) probabilities as \[ P(Y \leq \tau\mid\mathbf{x}_i) = P(e_i(\tau) = 1\mid\mathbf{x}_i) := \pi(\mathbf{x}_i;\tau) \tag{16.2}\]
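As a minimal sketch (in Python, with hypothetical arrays `t` for observed times and `delta` for event indicators), the naive binary target in (16.1) could be constructed as follows:

```python
import numpy as np

def naive_binary_target(t, delta, tau):
    """Binary target e_i(tau) from (16.1): 1 if an event was observed
    at or before tau, 0 otherwise (including censored observations)."""
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    return ((t <= tau) & (delta == 1)).astype(int)

# Example: the third observation is censored before tau = 5 and is
# (incorrectly) treated as a non-event by this naive construction.
t = np.array([2.0, 7.0, 3.0, 6.0])
delta = np.array([1, 1, 0, 0])
print(naive_binary_target(t, delta, tau=5.0))  # [1 0 0 0]
```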
This approach would work if there were no censoring before \(\tau\) in the data. However, in the presence of censoring, (16.1) does not define a valid target variable, as observations censored before time \(\tau\) (\(t_i < \tau \wedge \delta_i = 0\)) are neither events nor non-events at time \(\tau\) (the event may have occurred between \(t_i\) and \(\tau\)). Treating those observations as non-events or removing them from the data without further modification would introduce bias.
Vock et al. (2016) suggest adapting the estimation procedure to obtain unbiased estimates of (16.2) by first calculating weights for each observation as \[ \tilde{w}_i(\tau) = \begin{cases} 0 & \text{if } t_i < \tau \wedge \delta_i = 0,\\ \hat{w}_i(\min(t_i, \tau)) = \frac{1}{\hat{G}_{KM}(\min(t_i, \tau))} & \text{else} \end{cases} \tag{16.3}\] where \(\hat{G}_{KM}\) is the Kaplan-Meier estimate of the censoring distribution (Section 3.6.1) and \(\hat{w}_i(\min(t_i, \tau))\) are the IPC weights (3.22) introduced in Section 3.6.2. In words, observations censored before \(\tau\) are removed (their weight is zero) and all other observations are upweighted to compensate for the resulting loss of information: the higher an observation's probability of being censored by \(\min(t_i, \tau)\), the larger its weight. These weights then need to be integrated into the estimation procedure by optimizing a weighted objective function.
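The following sketch illustrates how the weights in (16.3) could be computed in Python, using a hand-rolled Kaplan-Meier estimator of the censoring distribution (obtained by treating censorings as the "events" of interest); the function names and the tie handling are illustrative simplifications, not a reference implementation:

```python
import numpy as np

def km_censoring_survival(t, delta, eval_times):
    """Kaplan-Meier estimate G_hat of the censoring survival function,
    obtained by treating censorings (delta == 0) as the events of interest.
    Ties between events and censorings are handled in sort order only."""
    t = np.asarray(t, dtype=float)
    cens = 1 - np.asarray(delta, dtype=int)       # censoring indicator
    order = np.argsort(t, kind="stable")
    t_sorted, cens_sorted = t[order], cens[order]
    n = len(t_sorted)
    at_risk = n - np.arange(n)                    # size of the risk set
    factors = np.where(cens_sorted == 1, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)                    # G_hat after each sorted time
    # right-continuous step-function lookup of G_hat at the evaluation times
    idx = np.searchsorted(t_sorted, eval_times, side="right") - 1
    return np.where(idx >= 0, surv[np.clip(idx, 0, n - 1)], 1.0)

def ipc_weights(t, delta, tau):
    """IPC weights from (16.3): zero for observations censored before tau,
    1 / G_hat(min(t_i, tau)) for all other observations."""
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    g = km_censoring_survival(t, delta, np.minimum(t, tau))
    w = 1.0 / np.clip(g, 1e-12, None)             # guard against division by zero
    w[(t < tau) & (delta == 0)] = 0.0             # censored before tau
    return w
```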
Incorporating this weighting scheme allows the binary target variable in (16.1) to be used as a valid object of prediction. Let \(\ell(e_i(\tau), \pi(\mathbf{x}_i;\tau))\) be the pointwise loss; the learner then needs to optimize the weighted objective function \[ \mathcal{L}(\pi, \tau) = \sum_{i=1}^n \tilde{w}_i(\tau) \ell(e_i(\tau), \pi(\mathbf{x}_i;\tau)). \tag{16.4}\]
Thus, (16.4) can be optimized by any classification learner that can handle observation weights (which is the case for most popular machine learning methods). The exact form of (16.4) depends on the choice of learner and objective function. For example, using the log loss (binary cross-entropy) \(\ell(e_i, \pi(\mathbf{x}_i)) = -\left(e_i\log(\pi(\mathbf{x}_i)) + (1-e_i)\log(1 - \pi(\mathbf{x}_i))\right)\) as loss function, (16.4) becomes \[ \mathcal{L}(\pi, \tau) = -\sum_{i=1}^n \tilde{w}_i(\tau) \left(e_i(\tau)\log(\pi(\mathbf{x}_i;\tau)) + (1-e_i(\tau))\log(1 - \pi(\mathbf{x}_i;\tau))\right). \tag{16.5}\]
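As an illustration under the assumptions of the previous sketches (hypothetical objects `X`, `t`, `delta`, and a test feature matrix `X_new`), a scikit-learn logistic regression fitted with these weights minimizes a (regularized) weighted log loss of the form (16.5):

```python
from sklearn.linear_model import LogisticRegression

tau = 5.0
e = naive_binary_target(t, delta, tau)   # binary target (16.1)
w = ipc_weights(t, delta, tau)           # IPC weights (16.3)

# Logistic regression minimizes the log loss; passing sample_weight
# turns this into the weighted objective (16.5).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, e, sample_weight=w)

# Estimated probability of an event by time tau, pi_hat(x; tau)
pi_hat = clf.predict_proba(X_new)[:, 1]
```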
This simple reduction allows practitioners to estimate \(\tau\)-year survival probabilities for right-censored data using classification learners. However, when using this reduction, there are some important aspects to consider:
- While the weighting in (16.4) accounts for censoring, the quality of the predictions still depends on the choice of model and loss function. In particular, the estimates might not be well calibrated if the chosen loss function does not yield calibrated probabilities (as is the case for the hinge loss used by support vector machines).
- When evaluating predictions \(\hat{\pi}(\mathbf{x};\tau)\) on unseen test data, one cannot use standard evaluation metrics for binary classification, because the test data contains not only zeros (non-events) and ones (events) but also censored observations. However, survival metrics can be used for evaluation, in particular rank-based metrics like concordance indices (Chapter 6) and some scoring rules (Chapter 8); see the sketch after this list. Using standard classification metrics on only the uncensored test observations would again require IPC weighting, analogous to the training phase.
- Equation (16.3) implies that the contributions of observations censored before time \(\tau\) are multiplied by zero in (16.4), while uncensored observations are upweighted. Algorithmically, it would be more efficient to remove the zero-weight observations from the data set before training in order to avoid unnecessary loss computations. However, this might interfere with the usual training and evaluation procedures, for example when used in combination with cross-validation.
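As referenced in the second point above, the sketch below shows one way to evaluate \(\tau\)-year predictions on censored test data with a rank-based metric, assuming the `lifelines` package is available and using hypothetical test arrays `t_test`, `delta_test`, and predictions `pi_hat_test`; note that `concordance_index` expects scores that are larger for longer survival, hence \(1 - \hat{\pi}\) is passed:

```python
from lifelines.utils import concordance_index

# pi_hat_test: predicted P(Y <= tau | x) on the test set;
# t_test, delta_test: observed test times and event indicators.
# concordance_index treats larger scores as indicating longer survival,
# so the complement of the predicted event probability is passed.
cindex = concordance_index(t_test, 1.0 - pi_hat_test, event_observed=delta_test)
print(f"Concordance index: {cindex:.3f}")
```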