10 Choosing Measures

There are many survival measures to choose from, and selecting the right one for a given task may seem daunting. Below are a few heuristics to support decision-making. First and foremost, evaluation should always be aligned with the goals of the analysis. This means using discrimination measures to evaluate relative risk predictions, calibration measures to evaluate average agreement between predicted and empirical outcome distributions, and scoring rules to evaluate overall performance and distribution predictions.

For discrimination measures, consider Harrell’s C and Uno’s C. Other discrimination measures can assess time-dependent ranking performance, but time-dependent predictive performance can also be examined through scoring rules. If only one measure is to be reported, then Uno’s C may be preferred. Although Uno’s C can have higher variance than Harrell’s C (Rahman et al. 2017; Schmid and Potapov 2012), it is less likely to overestimate concordance as censoring increases (Rahman et al. 2017). Moreover, the variance can be transparently captured and reported through techniques such as bootstrapping. In practice, the choice of measure matters less than ensuring that reporting is transparent and honest (Sonabend et al. 2022; Therneau and Atkinson 2024).
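As a rough illustration of reporting a concordance estimate together with its bootstrap variability, the sketch below computes Harrell's C with plain NumPy and wraps it in a percentile bootstrap. The function names `harrell_c` and `bootstrap_c` are hypothetical, pairs with tied times are simply skipped, and higher `risk` is assumed to mean earlier expected events. It is a minimal sketch under those assumptions, not a reference implementation, and it does not apply the censoring weights used by Uno's C.

```python
import numpy as np

def harrell_c(time, event, risk):
    """Harrell's C: share of comparable pairs that are ranked concordantly.

    Assumes a higher `risk` means an earlier expected event. Pairs with
    tied times are ignored in this simplified version.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in np.flatnonzero(event):
        # (i, j) is comparable when i has an observed event before j's time.
        later = time > time[i]
        comparable += int(later.sum())
        concordant += (risk[i] > risk[later]).sum()
        concordant += 0.5 * (risk[i] == risk[later]).sum()  # tied predictions
    return concordant / comparable if comparable > 0 else np.nan

def bootstrap_c(time, event, risk, n_boot=500, seed=0):
    """Bootstrap mean and 95% percentile interval for Harrell's C."""
    time, event, risk = map(np.asarray, (time, event, risk))
    rng = np.random.default_rng(seed)
    n = len(time)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample observations with replacement
        stats[b] = harrell_c(time[idx], event[idx], risk[idx])
    return np.nanmean(stats), np.nanpercentile(stats, [2.5, 97.5])
```

Reporting the interval alongside the point estimate makes the variance trade-off between measures visible to the reader, whichever concordance measure is chosen.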

To assess the calibration of a single model, graphical comparisons to the Kaplan-Meier estimator provide a useful and interpretable way to assess model fit (Section 7.2.1). When choosing between models, D-calibration provides a quantitative statistic with a meaningful minimum and therefore can be used as a tool for model comparison.
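The following sketch shows one way a D-calibration statistic could be computed, assuming the common binning construction: each observation's predicted survival probability at its observed time is placed into one of `n_bins` equal-width bins on [0, 1], and a censored observation is spread over its own bin and all lower bins. The function name `d_calibration` and its arguments are hypothetical, and the exact form of the statistic should be checked against the definition used elsewhere in this book. A value of zero corresponds to perfectly uniform bin counts.

```python
import numpy as np

def d_calibration(surv_at_time, event, n_bins=10):
    """Chi-squared-style D-calibration statistic (lower is better, minimum 0).

    surv_at_time: predicted S_i(t_i) evaluated at each observed time.
    event: True for an observed event, False for a censored observation.
    """
    s = np.clip(np.asarray(surv_at_time, dtype=float), 0.0, 1.0)
    event = np.asarray(event, dtype=bool)
    n = len(s)
    width = 1.0 / n_bins
    counts = np.zeros(n_bins)
    # Index of the bin containing each probability; s = 1 goes in the last bin.
    bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
    for si, ki, ei in zip(s, bins, event):
        if ei or si == 0.0:
            counts[ki] += 1.0
        else:
            # Censored: the unknown event-time probability lies in [0, si],
            # so one unit of mass is spread uniformly over that interval.
            counts[ki] += (si - ki * width) / si
            counts[:ki] += width / si
    expected = n / n_bins
    return float(np.sum((counts - expected) ** 2) / expected)
```

Because the statistic is computed on the same scale for every model fitted to the same data, the model with the smaller value is the better D-calibrated one under these assumptions.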

When choosing scoring rules, the survival Brier score (SBS) (8.3) provides a useful score for developing prediction error curves (Section 8.3). For a quantity that summarizes overall predictive ability, the integrated SBS (8.5) is most commonly used in the literature, although whether it is proper depends on the censoring pattern. The RCLL (Section 8.2.2) is intuitive and has been proven strictly proper when censoring is conditionally event-independent (Rindt et al. 2022; Yanagisawa 2023), but may require numerical approximation of the density function (Section 3.1.1). Most importantly, the choice of evaluation measures should be decided up front and not selected based on preliminary results. When reporting scoring rules, consider the ERV representation, which provides a meaningful interpretation as relative performance compared with a standard baseline.
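To make the ERV representation concrete, the sketch below assumes the usual form, one minus the ratio of the model's loss to the loss of a baseline such as the Kaplan-Meier estimator, evaluated on the same test data. The function name `erv` is hypothetical, and the formula assumes a loss where lower is better and the baseline loss is strictly positive, which may not hold for every scoring rule (a log-loss can be negative, for example).

```python
def erv(model_loss, baseline_loss):
    """Relative improvement of a model over a baseline: 1 - L_model / L_baseline.

    Assumes both losses come from the same scoring rule on the same test data,
    that lower is better, and that the baseline loss is strictly positive.
    """
    if baseline_loss <= 0:
        raise ValueError("baseline_loss must be strictly positive")
    return 1.0 - model_loss / baseline_loss
```

Under these assumptions, a value of 0 means the model is no better than the baseline, a value of 1 means a perfect score, and negative values mean the model is worse than the baseline. For example, a model loss of 0.15 against a baseline loss of 0.20 corresponds to a 25% improvement.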

Given the limited research currently available, evaluation of survival time predictions should be treated with caution and new developments in the literature should be monitored.

For automated model optimization, tuning with a scoring rule should capture discrimination and calibration simultaneously (Kopper et al. 2026; Rindt et al. 2022; Yanagisawa 2023). However, if you are developing a model to maximize separation and calibration truly matters less, then prefer Harrell’s C over Uno’s C, as any bias in the measure should affect models similarly, making lower variance potentially preferable. Finally, avoid using a measure of calibration alone for tuning, as this can favor models that approximate marginal non-parametric estimates rather than individualized predictions.