10 Choosing Measures
There are many survival measures to choose from and selecting the right one for the task might seem daunting. We have put together a few heuristics to support decision making. Evaluation should always follow the goals of the analysis: use discrimination measures to evaluate rankings, calibration measures to evaluate average performance, and scoring rules to evaluate overall performance and distribution predictions.
For discrimination measures, we recommend Harrell’s and Uno’s C. Whilst others can assess time-dependent trends, these are also captured in scoring rules. In practice the choice of measure matters less than ensuring your reporting is transparent and honest (Therneau and Atkinson 2024; Sonabend et al. 2021).
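To make the ranking idea concrete, the following is a minimal sketch of Harrell’s C in plain Python. It assumes the common definition in which a pair is comparable when the observation with the shorter time experienced the event, and in which tied risk predictions count as half concordant. The function name `harrell_c` and the toy data are illustrative assumptions, not taken from any library. Uno’s C additionally weights pairs by the inverse probability of censoring, which is not shown here.

```python
def harrell_c(time, event, risk):
    """Harrell's C: share of comparable pairs that are ranked correctly.

    time  : observed times (event or censoring)
    event : 1 if the event was observed, 0 if censored
    risk  : predicted risk, higher means the event is expected sooner
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored observation cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the third observation is censored.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.1]
print(harrell_c(time, event, risk))  # risks are perfectly ordered, so C is 1.0
```

The quadratic loop is deliberate for readability; established implementations use faster algorithms and should be preferred in practice.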
To assess a single model’s calibration, graphical comparisons to the Kaplan-Meier estimator provide a useful and interpretable method to quickly see if a model is a good fit to the data (Section 7.2.1). When choosing between models, we recommend D-calibration, which can be meaningfully optimized and thus used for comparison.
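As a rough illustration of the comparison to the Kaplan-Meier estimator, the sketch below computes the estimate with NumPy and contrasts it with the average of a model’s predicted survival curves. The function name `kaplan_meier`, the matrix `pred_surv`, and all numbers are hypothetical; in practice the two curves would usually be plotted together rather than summarized by a single gap.

```python
import numpy as np


def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate evaluated at each distinct event time."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # still under observation just before t
        deaths = np.sum((time == t) & event)     # events observed exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


# Toy data (hypothetical): the second observation is censored.
time = [1, 2, 3, 4]
event = [1, 0, 1, 1]
km_times, km_surv = kaplan_meier(time, event)

# Hypothetical predicted survival probabilities: one row per observation,
# one column per time in km_times.
pred_surv = np.array([
    [0.70, 0.30, 0.05],
    [0.80, 0.40, 0.10],
    [0.75, 0.35, 0.05],
    [0.85, 0.45, 0.10],
])
mean_pred = pred_surv.mean(axis=0)

# Largest vertical distance between the average prediction and Kaplan-Meier.
max_gap = np.max(np.abs(mean_pred - km_surv))
print(km_times, km_surv, mean_pred, max_gap)
```

A small gap across all time points suggests the model is well calibrated on average; it says nothing about how well individual observations are ranked.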
When picking scoring rules, we recommend using both the ISBS and RCLL. If one model outperforms another with respect to both measures, that can be a strong indicator of better performance. When reporting scoring rules, we recommend the ERV representation, which provides a meaningful interpretation as ‘performance increase over baseline’.
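The ERV representation can be sketched as follows, assuming the usual form of one minus the ratio of the model’s loss to a baseline’s loss, with a baseline such as the Kaplan-Meier estimator. The function name `erv` and the loss values are hypothetical, and the sketch assumes a loss where lower is better and the baseline loss is strictly positive.

```python
def erv(model_loss, baseline_loss):
    """Relative improvement of a model's loss over a baseline's loss.

    Assumes lower loss is better and baseline_loss > 0.
    A value of 0 means no improvement over the baseline; 1 means a perfect score.
    """
    if baseline_loss <= 0:
        raise ValueError("baseline_loss must be strictly positive")
    return 1.0 - model_loss / baseline_loss


# Hypothetical ISBS values for a model and a Kaplan-Meier baseline.
improvement = erv(model_loss=0.15, baseline_loss=0.20)
print(f"{improvement:.0%} improvement over baseline")
```

A negative value indicates that the model performs worse than the baseline, which is itself a useful warning sign when reporting results.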
Given the lack of research, if you are interested in survival time predictions then treat evaluation with caution and check for new developments in the literature.
For automated model optimization, we recommend tuning with a scoring rule, which should capture discrimination and calibration simultaneously (Rindt et al. 2022; Yanagisawa 2023; FIXME ECML). However, if a model will only ever be used for ranking, we recommend tuning with Uno’s C. Whilst Uno’s C has higher variance than other concordance measures (Rahman et al. 2017; Schmid and Potapov 2012), it performs better than Harrell’s C as censoring increases (Rahman et al. 2017).