9 Distance Measures

As discussed in Chapter 5, evaluating the distance between point predictions of survival time and ground truth event times is fundamentally problematic when censoring is present. In particular, there is no principled way to define a metric that accurately reflects the average discrepancy between predicted survival times and all event times.

The most commonly discussed distance measure in the literature is the mean absolute error (MAE):

\[ \operatorname{MAE}(\mathbf{t}, \mathbf{\hat{t}}) = \frac{1}{n} \sum^{n}_{i = 1} \vert t_i - \hat{t}_i \vert, \tag{9.1}\]

where \(\mathbf{t} = (t_1 \ t_2 \cdots t_n)^\top\) and \(\mathbf{\hat{t}} = (\hat{t}_1 \ \hat{t}_2 \cdots \hat{t}_n)^\top\) are observed follow-up times and predicted survival times, respectively. Equation (9.1) is occasionally adapted for survival analysis by restricting evaluation to uncensored observations only (Wang et al. 2019). However, restricting evaluation in this way introduces substantial bias: observations with longer event times are typically more likely to be censored, so the uncensored subset over-represents early events. Qi et al. (2023) show that naively applying (9.1) to uncensored data can induce model performance rankings inconsistent with those obtained from the oracle MAE, that is, the MAE computed against the true event times of all observations. The inconsistency persists even when inverse probability of censoring weights are incorporated.
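The ranking problem can be illustrated with a toy simulation. The sketch below is not the experiment of Qi et al. (2023); it assumes exponentially distributed event and censoring times, and two hypothetical "models" (`pred_a`, `pred_b`) that each predict a constant survival time. All names and parameter values are chosen for illustration only.

```python
import numpy as np

def mae(t, t_hat):
    """Mean absolute error as defined in (9.1)."""
    t = np.asarray(t, dtype=float)
    t_hat = np.asarray(t_hat, dtype=float)
    return float(np.mean(np.abs(t - t_hat)))

rng = np.random.default_rng(0)
n = 10_000

# Assumed data-generating process: independent exponential event and
# censoring times, both with mean 10. True event times are only known
# here because the data are simulated.
event_time = rng.exponential(scale=10.0, size=n)
censor_time = rng.exponential(scale=10.0, size=n)
observed_time = np.minimum(event_time, censor_time)
is_event = event_time <= censor_time

# Two hypothetical models that predict a constant survival time.
pred_a = np.full(n, 7.0)  # near the true median, 10 * ln(2)
pred_b = np.full(n, 3.0)

# Oracle MAE: uses the true event times of all observations.
oracle_a = mae(event_time, pred_a)
oracle_b = mae(event_time, pred_b)

# Naive MAE: (9.1) restricted to uncensored observations.
naive_a = mae(observed_time[is_event], pred_a[is_event])
naive_b = mae(observed_time[is_event], pred_b[is_event])

print(f"oracle MAE: A={oracle_a:.2f}, B={oracle_b:.2f}")
print(f"naive MAE:  A={naive_a:.2f}, B={naive_b:.2f}")
```

Under these assumptions the oracle MAE favours model A while the uncensored-only MAE favours model B, because the uncensored subset contains mostly short event times and so rewards predictions that are too early.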

A pseudo-value-based approach (see Chapter 19 for details about pseudo-values) has been shown, under controlled simulation settings, to induce the same model rankings as the oracle MAE (Qi et al. 2023). However, this approach is not intended to recover a meaningful measure of absolute distance between predicted survival times and the ground truth event times, and it does not do so: pseudo-values are themselves estimates and do not represent the true outcome.
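To make the idea of a pseudo-value concrete, the following is a minimal sketch of jackknife pseudo-values for the restricted mean survival time, computed from a Kaplan-Meier estimate. It is not claimed to be the exact procedure of Qi et al. (2023) or the definition used in Chapter 19; the function names `km_rmst` and `pseudo_values`, the truncation time `tau`, and the assumption of no tied times are all illustrative choices.

```python
import numpy as np

def km_rmst(time, event, tau):
    """Area under the Kaplan-Meier curve on [0, tau] (assumes no tied times)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    order = np.argsort(time, kind="stable")
    time, event = time[order], event[order]
    n = len(time)
    at_risk = n - np.arange(n)                # number at risk at each ordered time
    surv = np.cumprod(1.0 - event / at_risk)  # S(t) just after each ordered time
    # Step function: height 1 before the first time, then surv[k] until the next.
    grid = np.concatenate(([0.0], np.minimum(time, tau), [tau]))
    heights = np.concatenate(([1.0], surv))
    return float(np.sum(np.diff(grid) * heights))

def pseudo_values(time, event, tau):
    """Jackknife pseudo-values: n * theta - (n - 1) * theta_without_i."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n = len(time)
    full = km_rmst(time, event, tau)
    keep = np.ones(n, dtype=bool)
    out = np.empty(n)
    for i in range(n):
        keep[i] = False  # leave observation i out
        out[i] = n * full - (n - 1) * km_rmst(time[keep], event[keep], tau)
        keep[i] = True
    return out

time = np.array([2.0, 3.0, 5.0, 7.0, 11.0])

# Without censoring, the pseudo-values reproduce the observed event times.
no_censoring = pseudo_values(time, np.ones(5, dtype=bool), tau=11.0)

# With censoring (second and last observations), they are estimates only.
with_censoring = pseudo_values(time, np.array([1, 0, 1, 1, 0], dtype=bool), tau=11.0)

print(no_censoring)
print(with_censoring)
```

When no observation is censored, each pseudo-value equals the observed event time. Once censoring is present, the pseudo-values are derived from the estimated survival curve rather than from the unobserved event times, which is why a distance computed against them should not be read as a distance to the truth.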

Given these limitations, distance-based evaluations should generally be avoided. Instead, discrimination-based measures (Chapter 6) may be used to evaluate survival time predictions, provided it is made clear that such measures assess only the relative ordering of predictions (relative risks) and do not attempt to quantify the distance between predictions and truth.
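As an illustration of what a discrimination-based evaluation does and does not capture, the sketch below computes a Harrell-style concordance index directly on predicted survival times. Whether this matches the specific measures recommended in Chapter 6 is an assumption; the function name `concordance` and the example data are hypothetical.

```python
import numpy as np

def concordance(observed_time, is_event, predicted_time):
    """Harrell-style concordance for predicted survival times.

    A pair (i, j) is comparable if i had an observed event and j was
    still under observation after i's event time. It is concordant if
    the model predicts a shorter survival time for i than for j.
    Tied predictions count as half concordant.
    """
    observed_time = np.asarray(observed_time, dtype=float)
    is_event = np.asarray(is_event, dtype=bool)
    predicted_time = np.asarray(predicted_time, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in np.flatnonzero(is_event):
        later = observed_time > observed_time[i]
        comparable += int(later.sum())
        concordant += float(np.sum(predicted_time[later] > predicted_time[i]))
        concordant += 0.5 * float(np.sum(predicted_time[later] == predicted_time[i]))
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

observed_time = [2.0, 4.0, 6.0, 8.0]
is_event = [True, False, True, True]
predicted = np.array([1.0, 2.0, 3.0, 4.0])

c_original = concordance(observed_time, is_event, predicted)
c_rescaled = concordance(observed_time, is_event, 100.0 * predicted)
c_reversed = concordance(observed_time, is_event, predicted[::-1])

print(c_original, c_rescaled, c_reversed)
```

Multiplying every prediction by 100 leaves the concordance unchanged, although the predictions are now far from the observed times. The measure evaluates ordering only and says nothing about how close the predicted times are to the truth.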

9.1 Conclusion

Key takeaways

Point predictions of survival time cannot be evaluated using standard distance-based error measures in a principled way when censoring is present.
Existing approaches may support relative comparison of models but do not resolve the fundamental ambiguity created by unobserved event times.
As a result, discrimination-based evaluation should be preferred, with careful interpretation of what is, and is not, being evaluated.