10 Survival Time Measures
When it comes to evaluating survival time predictions, there are few measures at our disposal. Because survival time predictions are uncommon compared to other prediction types (Chapter 6), the literature offers only a limited set of dedicated evaluation measures. The most common approach (Wang, Li, and Reddy 2019) is to repurpose regression measures by ignoring censored observations and, optionally, adding IPC weighting.
10.1 Uncensored distance measures
Survival time measures are often referred to as ‘distance’ measures as they measure the distance between the true, \((t, \delta=1)\), and predicted, \(\hat{t}\), values. These are presented in turn with brief descriptions of their interpretation. The measures below can be inflated with an IPC weighting by dividing each uncensored summand by \(\hat{G}_{KM}(t_i)\). However, evidence suggests that this weighting does not improve the measures’ ability to rank one model above another (Qi et al. 2023).
For all measures, let \(n\) be the number of observations in the dataset and define \(d := \sum_i \delta_i\), the number of uncensored observations.
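For concreteness, the weight \(\hat{G}_{KM}(t_i)\) mentioned above is the Kaplan-Meier estimate of the censoring survival distribution, that is, a Kaplan-Meier fit in which censorings (\(\delta = 0\)) are treated as the events. Below is a minimal numpy sketch; the function name `km_censoring_survival` and its interface are illustrative, not taken from any particular library.

```python
import numpy as np

def km_censoring_survival(time, delta, eval_times):
    """Kaplan-Meier estimate of the censoring survival function G(t),
    obtained by treating censorings (delta == 0) as the 'events'."""
    time = np.asarray(time, dtype=float)
    cens = 1 - np.asarray(delta)                  # censoring indicator

    uniq = np.unique(time[cens == 1])             # distinct censoring times
    # number at risk just before t: observations with time >= t
    at_risk = np.array([(time >= t).sum() for t in uniq])
    # number censored exactly at t
    d = np.array([((time == t) & (cens == 1)).sum() for t in uniq])
    factors = 1.0 - d / at_risk

    # G(t) is the product of factors over censoring times <= t
    return np.array([factors[uniq <= t].prod()
                     for t in np.asarray(eval_times, dtype=float)])
```

An IPC-weighted version of any measure below then divides each uncensored summand by \(\hat{G}_{KM}(t_i)\); in practice these weights are usually clipped away from zero to avoid instability at late event times.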
Censoring-ignored mean absolute error, \(MAE_C\)
In regression, the mean absolute error (MAE) is popular because it is easy to interpret: it measures the absolute difference between true and predicted outcomes. For example, for a person with a true height of 174cm, a model predicting 175cm is clearly better than one predicting 180cm.
\[ MAE_C(\hat{\mathbf{t}}, \mathbf{t}, \boldsymbol{\delta}) = \frac{1}{d} \sum^n_{i=1} \delta_i|t_i - \hat{t}_i| \]
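Translated directly into code, this is just a masked mean over the uncensored observations. A minimal numpy sketch (the function name `mae_c` is ours):

```python
import numpy as np

def mae_c(t_true, t_pred, delta):
    """Censoring-ignored MAE: mean absolute error over the
    uncensored observations (delta == 1) only."""
    mask = np.asarray(delta).astype(bool)
    t = np.asarray(t_true, dtype=float)[mask]
    t_hat = np.asarray(t_pred, dtype=float)[mask]
    return np.abs(t - t_hat).mean()
```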
Censoring-ignored mean squared error, \(MSE_C\)
In comparison to the MAE, the mean squared error (MSE) computes the squared differences between true and predicted values. The MAE applies a linear ‘penalty’ to increasingly poor predictions: an error of 2 is penalised by 2 and an error of 5 by 5, a difference of 3. The square in the MSE magnifies larger errors: the same errors are penalised by 4 and 25 respectively, a difference of 21. By taking the mean over all predictions, the effect of this inflation is to increase the MSE sharply as larger mistakes are made.
\[ MSE_C(\hat{\mathbf{t}}, \mathbf{t}, \boldsymbol{\delta}) = \frac{1}{d}\sum^n_{i=1}\delta_i(t_i - \hat{t}_i)^2 \]
Censoring-ignored root mean squared error, \(RMSE_C\)
Finally, the root mean squared error (RMSE) is simply the square root of the MSE, which allows interpretation on the original scale (as opposed to the squared scale of the MSE). Given the inflation effect of the MSE, the RMSE grows faster than the MAE as increasingly poor predictions are made; it is common practice to report the MAE and RMSE together.
\[ RMSE_C(\hat{\mathbf{t}}, \mathbf{t}, \boldsymbol{\delta}) = \sqrt{MSE_C(\hat{\mathbf{t}}, \mathbf{t}, \boldsymbol{\delta})} \]
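The squared-error variants follow the same pattern as the MAE sketch above; the toy example at the end illustrates how a single large error (an absolute error of 4 alongside errors of 1 and 0) pushes the RMSE well above the MAE. Again a hedged sketch with illustrative names:

```python
import numpy as np

def mse_c(t_true, t_pred, delta):
    """Censoring-ignored MSE over the uncensored observations."""
    mask = np.asarray(delta).astype(bool)
    t = np.asarray(t_true, dtype=float)[mask]
    t_hat = np.asarray(t_pred, dtype=float)[mask]
    return ((t - t_hat) ** 2).mean()

def rmse_c(t_true, t_pred, delta):
    """Censoring-ignored RMSE: square root of MSE_C."""
    return np.sqrt(mse_c(t_true, t_pred, delta))

# toy data: three uncensored observations with absolute errors 1, 0, 4
t, t_hat, d = [2., 3., 8., 5.], [3., 3., 4., 4.], [1, 1, 1, 0]
print(mse_c(t, t_hat, d))   # 17/3, about 5.67
print(rmse_c(t, t_hat, d))  # about 2.38, vs. an MAE_C of 5/3, about 1.67
```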
Note that these equations completely remove censored observations from the dataset under evaluation. This contrasts with how IPC is used in the C-index (Section 7.1.1) and various scoring rules (Chapter 9), where censored observations are at least partially included in the calculation. By failing to include censored observations at all, these measures evaluate survival models only on a (potentially biased) sample of the test data, and hence can never be used fairly to estimate a model’s performance on new data.
10.2 Over- and under-predictions
The distance measures just discussed assume that the error for an over-prediction (\(\hat{t} > t\)) should equal that of an under-prediction (\(\hat{t} < t\)). That is, they assume it is ‘as bad’ for a model to predict an outcome time 10 years longer than the truth as 10 years shorter. In the survival setting this assumption is often invalid: models are generally preferred to be overly cautious, that is, to predict negative events sooner (e.g., predict a life-support machine fails after three years, not five, if the truth is four) and positive events later (e.g., predict a patient recovers after four years, not two, if the truth is three). A simple method to incorporate this imbalance between over- and under-predictions is to add a weighting factor to any of the above measures; for example, the \(MAE_C\) might become
\[ MAE_C(\hat{\mathbf{t}}, \mathbf{t}, \boldsymbol{\delta}, \lambda, \mu, \phi) = \frac{1}{d} \sum^n_{i=1} \delta_i|(t_i - \hat{t}_i) [\lambda\mathbb{I}(t_i>\hat{t}_i) + \mu\mathbb{I}(t_i<\hat{t}_i) + \phi\mathbb{I}(t_i=\hat{t}_i)]| \]
where \(\lambda, \mu, \phi\) are real-valued weights for under-, over-, and exact-predictions respectively, and \(d\) and \(n\) are as above. The choice of these weights is highly context-dependent and they could even be tuned (Section 3.5).
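A sketch of this weighted variant (the interface and default weights are ours):

```python
import numpy as np

def mae_c_weighted(t_true, t_pred, delta, lam=1.0, mu=1.0, phi=1.0):
    """Censoring-ignored MAE with separate weights for under-predictions
    (lam, where t > t_hat), over-predictions (mu, where t < t_hat) and
    exact predictions (phi)."""
    mask = np.asarray(delta).astype(bool)
    t = np.asarray(t_true, dtype=float)[mask]
    t_hat = np.asarray(t_pred, dtype=float)[mask]
    # phi multiplies a zero difference, so it mirrors the formula above
    # without ever changing the result
    w = lam * (t > t_hat) + mu * (t < t_hat) + phi * (t == t_hat)
    return np.abs((t - t_hat) * w).mean()

# penalise over-predictions (t_hat > t) three times as heavily
mae_c_weighted([2., 8.], [5., 4.], [1, 1], lam=1.0, mu=3.0)  # 6.5
```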
10.3 Extensions
10.3.1 Competing risks
In a competing risks setting, there is no obvious metric to evaluate survival time predictions, primarily because there is no meaningful interpretation for an ‘all-cause survival time’ or a ‘cause-specific survival time’. If an observation could realistically experience one of multiple, mutually-exclusive events, then predicting the time to one particular event has no inherent meaning without first attaching a probability of the event taking place (i.e., the CIF); evaluating this probability is therefore a more sensible approach for assessing a competing risks model’s predictive performance.
10.3.2 Other censoring and truncation types
Given that the above measures simply remove censored observations, they can easily be applied to datasets with left-censoring and interval-censoring, with the same caveats. To handle truncation, the formulae above can be extended to also remove truncated observations, which introduces the same biases as removing censored ones.