# 7 What are Survival Measures?

**This page is a work in progress and minor changes will be made over time.**

In this part of the book we discuss one of the most important parts of the machine learning workflow, model evaluation (Foss and Kotthoff 2024). In the next few chapters we will discuss different metrics that can be used to measure a model’s performance but before that we will just briefly discuss why model evaluation is so important.

In the simplest case, without evaluation there is no way to know if predictions from a trained machine learning model are any good. Whether one uses a simple Kaplan-Meier estimator, a complex neural network, or anything in between, there is no guarantee any of these methods will actually make useful predictions for a given dataset. This could be because the dataset is inherently difficult for any model to be trained on, perhaps because it is very ‘noisy’, or because a model is simply ill-suited to the task, for example using a Cox Proportional Hazards model when its key assumptions are violated. Evaluation is therefore crucial to trusting any predictions made from a model.

## 7.1 Survival Measures

Evaluation can be used to assess in-sample and out-of-sample performance.

In-sample evaluation measures the quality of a model’s ‘fit’ to data, i.e., whether the model has accurately captured relationships in the training data. However, in-sample measures often cannot be applied to complex machine learning models so this part of the book omits these measures. Readers who are interested in this are are directed to Collett (2014) and Hosmer Jr, Lemeshow, and May (2011) for discussion on residuals; Choodari-Oskooei, Royston, and Parmar (2012), Kent and O’Quigley (1988) and Royston and Sauerbrei (2004) for \(R^2\) type measures; and finally Volinsky and Raftery (2000), Hurvich and Tsai (1979), and Liang and Zou (2008) for information criterion measures.

Out-of-sample measures evaluate the quality of model predictions on new and previously unseen (by the model) data. By following established statistical methods for evaluation, and ensuring that robust resampling methods are used (James et al. 2013), evaluation provides a method for estimating the ‘generalisation error’, which is the expected model performance on new datasets. This is an important concept as it provides confidence about future model performance without limiting results to the current data. Survival measures are classified into measures of:

- Discrimination (aka ‘separation’) – A model’s discriminatory power refers to how well it separates observations that are at a higher or lower risk of event. For example, a model with good discrimination will predict that (at a given time) a dead patient has a higher probability of being dead than an alive patient.
- Calibration – Calibration is a roughly defined concept (Collins et al. 2014; Harrell, Lee, and Mark 1996; Rahman et al. 2017; Van Houwelingen 2000) that generally refers to how well a model captures average relationships between predicted and observed values.
- Predictive Performance – A model is said to have good predictive performance (or sometimes ‘predictive accuracy’) if its predictions for new data are ‘close to’ the truth.

These measures could also be categorised into how they evaluate predictions. Discrimination measures compare predictions pairwise where pairs of observations are created and then the predictions for these pairs are compared within and across each other in some way. Calibration measures evaluate predictions holistically by looking at some ‘average’ performance across them to provide an idea of how well suited the model is to the data. Measures of predictive performance evaluate individual predictions and usually take the sample mean of these to estimate the generalisation error.

In the next few chapters we categorise measures by the type of survival prediction they evaluate, which is a more natural taxonomy for selecting measures, but we use the above categories when introducing each measure.

## 7.2 How are Models Evaluated?

As well as using measures to evaluate a model’s performance on a given dataset, evaluation can also be used to measure future performance, to compare and select models, and to tune internal processes. In most cases, models should not be trained/predicted/evaluated on their own, instead a number of simpler reference models should be simultaneously trained and evaluated on the same data, which is known as a ‘benchmark experiment’. This is especially important for survival models, as *all* survival measures depend on the censoring distribution and therefore cannot be interpreted out of context and without comparison to other models. Benchmark experiments are used to empirically compare models across the same data and measures, meaning that if one model outperforms another then that model can be selected for future experiments (though simpler models are preferred if the performance difference is marginal). A model is usually said to ‘outperform’ another if it has a lower generalisation error.

The process of model evaluation is dependent on the measure itself. Measures that are ‘decomposable’ (predictive performance measures) calculate scores for individual predictions and take the sample mean over all scores, on the other hand ‘aggregate’ measures (discrimination and calibration) return a single score over all predictions. The simplest method to estimate the generalisation error is ‘holdout’ resampling, where a dataset \(\mathcal{D}\) is split into non-overlapping subsets for training \(\mathcal{D}_{train}\) and testing \(\mathcal{D}_{test}\). The model is trained on \(\mathcal{D}_{train}\) and predictions, \(\hat{\mathbf{y}}\) are made based on the features in \(\mathcal{D}_{test}\). The model is evaluated by using a measure, \(L\), to compare the predictions to the observed data in the test set, \(L(\mathbf{y}_{test}, \hat{\mathbf{y}})\).

Where possible, (repeated) k-fold cross-validation (kCV) should be used for more robust estimation of the generalisation error and for model comparison. In kCV, the data is partitioned into \(k\) folds (often \(k\) is \(5\) or \(10\)), which are non-overlapping subsets. A model is trained on \(k-1\) folds and evaluated on the \(k\)th fold, this process is repeated until each of the \(k\) folds has acted as the test set exactly once, the computed loss from each iteration is averaged into the final loss, which provides a good estimate of the generalisation error.

For the rest of this part of the book we will introduce different survival measures, discuss their advantages and disadvantages, and in Chapter 12 we will provide some recommendations for choosing measures. We will not discuss the general process of model resampling or evaluation further but recommend Casalicchio and Burk (2024) to readers interested in this topic.