10  Choosing Measures

Abstract
This chapter provides practical guidance on selecting evaluation measures in survival analysis, emphasizing alignment between the chosen metric and the goal of the analysis. Discrimination measures are recommended for assessing ranking performance, calibration methods for evaluating agreement between predicted and observed outcomes, and scoring rules for capturing overall predictive accuracy. For discrimination, Harrell’s C and Uno’s C are highlighted as robust default choices, with the latter preferred under heavier censoring. Calibration is best assessed visually using comparisons to the Kaplan–Meier estimator for single models, while D-calibration is recommended for comparing models. For overall performance, the integrated survival Brier score and the right-censored log-loss are proposed, with results reported using explained residual variation to aid interpretation. The chapter also discusses measure selection for model tuning, recommending scoring rules for general use and Uno’s C for ranking-focused tasks, while noting ongoing uncertainty in evaluating survival time predictions.

There are many survival measures to choose from, and selecting the right one for the task can seem daunting. We have put together a few heuristics to support decision-making. Evaluation should always be aligned with the goals of the analysis: use discrimination measures to evaluate rankings, calibration measures to evaluate average performance, and scoring rules to evaluate overall performance and distribution predictions.

For discrimination, we recommend Harrell’s C and Uno’s C, with Uno’s C preferred under heavier censoring. Whilst other discrimination measures can assess time-dependent trends, these trends are also captured by scoring rules. In practice, the choice of measure matters less than ensuring that your reporting is transparent and honest (Therneau and Atkinson 2024; Sonabend et al. 2021).
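To make this concrete, the sketch below computes both measures with the mlr3proba package; the rats task, the Cox learner, the train/test split, and the measure key surv.cindex (with weight_meth = "G2" for Uno’s weighting) are illustrative assumptions rather than prescriptions.

```r
# A minimal sketch, assuming the mlr3proba package and its "rats" task;
# the Cox model, the split, and the measure keys are illustrative only.
library(mlr3)
library(mlr3proba)

set.seed(42)
task  <- tsk("rats")                   # example right-censored survival task
split <- partition(task, ratio = 0.7)  # simple holdout split
cox   <- lrn("surv.coxph")$train(task, row_ids = split$train)
pred  <- cox$predict(task, row_ids = split$test)

# Harrell's C: the default concordance index
pred$score(msr("surv.cindex"))

# Uno's C: IPCW ("G2") weighting of the concordance index; the task and
# train set are supplied so the censoring distribution can be estimated
# from the training data
pred$score(msr("surv.cindex", weight_meth = "G2"),
           task = task, train_set = split$train)
```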

To assess a single model’s calibration, graphical comparison to the Kaplan-Meier estimator provides a useful and interpretable way to quickly see whether a model is a good fit to the data (Section 7.2.1). When choosing between models, we recommend D-calibration, which summarizes calibration in a single statistic that can be meaningfully optimized and therefore compared across models.
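As an illustration of the graphical check, the sketch below overlays the average of a Cox model’s predicted survival curves on the Kaplan-Meier estimate using the survival package; the lung data and the choice of covariates are purely illustrative.

```r
# A minimal sketch of the graphical Kaplan-Meier comparison, assuming the
# survival package; the lung data and the Cox model are illustrative only.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Average the model's predicted survival curves over all subjects
sf  <- survfit(fit, newdata = lung)
avg <- rowMeans(sf$surv)

# Kaplan-Meier estimate on the same data
km <- survfit(Surv(time, status) ~ 1, data = lung)

# A well-calibrated model's average curve should track the Kaplan-Meier curve
plot(km, conf.int = FALSE, xlab = "Time", ylab = "Survival probability")
lines(sf$time, avg, col = 2, lty = 2)
legend("topright", c("Kaplan-Meier", "Average Cox prediction"),
       col = c(1, 2), lty = c(1, 2))
```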

When picking scoring rules, we recommend using both the integrated survival Brier score (ISBS) and the right-censored log-loss (RCLL). If a model outperforms another with respect to both measures, then that can be a strong indicator of better overall performance. When reporting scoring rules, we recommend the explained residual variation (ERV) representation, which provides a meaningful interpretation as ‘performance increase over baseline’.
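Continuing the earlier mlr3proba sketch, both scoring rules (and D-calibration) can be computed in a single call; the measure keys surv.graf (ISBS), surv.rcll, and surv.dcalib, as well as the ERV option, are assumptions about that package’s interface rather than part of the text.

```r
# A minimal sketch, reusing task, split, and pred from the earlier mlr3proba
# sketch; the measure keys and the ERV option are assumed, not guaranteed.
scores <- pred$score(
  msrs(c("surv.graf", "surv.rcll", "surv.dcalib")),  # ISBS, RCLL, D-calibration
  task = task, train_set = split$train
)
scores

# ERV reporting: the same scoring rule expressed as the improvement over a
# Kaplan-Meier baseline, assuming the measure exposes an ERV flag
pred$score(msr("surv.graf", ERV = TRUE), task = task, train_set = split$train)
```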

Given the lack of research into evaluating survival time predictions, treat their evaluation with caution and check the literature for new developments.

For automated model optimization, we recommend tuning with a scoring rule, which should capture discrimination and calibration simultaneously (Rindt et al. 2022; Yanagisawa 2023; FIXME ECML). If, however, you are only ever using a model for ranking, then we recommend tuning with Uno’s C. Whilst it has higher variance than other concordance measures (Rahman et al. 2017; Schmid and Potapov 2012), it performs better than Harrell’s C as censoring increases (Rahman et al. 2017).
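As a sketch of such a workflow with mlr3tuning, the example below tunes a random survival forest on the ISBS; the surv.rfsrc learner (assumed to come from mlr3extralearners), the search space, and the evaluation budget are illustrative assumptions.

```r
# A minimal sketch of tuning on a scoring rule with mlr3tuning; the learner,
# search space, and budget are illustrative assumptions, reusing task and
# split from the earlier sketch.
library(mlr3tuning)
library(mlr3extralearners)   # assumed to provide lrn("surv.rfsrc")

at <- auto_tuner(
  tuner      = tnr("random_search"),
  learner    = lrn("surv.rfsrc",
                   mtry  = to_tune(1, 3),
                   ntree = to_tune(250, 1000)),
  resampling = rsmp("cv", folds = 3),
  measure    = msr("surv.graf"),   # tune on the ISBS
  term_evals = 20
)
at$train(task, row_ids = split$train)
```

For a ranking-only workflow, the measure argument could instead be msr("surv.cindex", weight_meth = "G2") for Uno’s C, under the same assumptions about measure keys.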