8  Evaluating Distributions by Scoring Rules

Abstract
TODO (150-200 WORDS)
Minor changes expected!

This page is a work in progress and minor changes will be made over time.

Scoring rules evaluate probabilistic predictions and (attempt to) measure the overall predictive ability of a model in terms of both calibration and discrimination (Gneiting and Raftery 2007; Murphy 1973). In contrast to calibration measures, which assess the average performance across all observations on a population level, scoring rules evaluate the sample mean of individual predictions across all observations in a test set. As well as being able to provide information at an individual level, scoring rules are also popular because probabilistic forecasts are widely recognised to be superior to deterministic predictions for capturing uncertainty in predictions (A. P. Dawid 1984; A. Philip Dawid 1986). Formalisation and development of scoring rules has primarily been due to Dawid (A. P. Dawid 1984; A. Philip Dawid 1986; A. Philip Dawid and Musio 2014) and Gneiting and Raftery (2007), though the earliest measures promoting “rational” and “honest” decision making date back to the 1950s (Brier 1950; Good 1952). Few scoring rules have been proposed in survival analysis, although the past few years have seen an increase in the popularity of these measures. Before delving into these measures, we will first describe scoring rules in the simpler classification setting.

8.1 Classification Losses

In the simplest terms, a scoring rule compares two values and assigns them a score (hence ‘scoring rule’); formally we’d write \(L: \mathbb{R}\times \mathbb{R}\rightarrow \bar{\mathbb{R}}\). In machine learning, this usually means comparing a prediction for an observation to the ground truth, so \(L: \mathbb{R}\times \mathcal{P}\rightarrow \bar{\mathbb{R}}\) where \(\mathcal{P}\) is a set of distributions. Crucially, scoring rules usually refer to comparisons of true and predicted distributions. For example, let’s construct a scoring rule as follows:

  1. Let \(y \in \{0,1\}\) be the ground truth and let \(\hat{p}\) be the predicted probability mass function such that \(\hat{p}(y)\) is the probability of the observed event occurring;
  2. Define \(\hat{y} := \mathbb{I}(\hat{p}(y) \geq 0.5)\), i.e., \(\hat{y}\) is \(1\) if the predicted probability of event \(1\) is greater than or equal to 0.5;
  3. Then define our scoring rule such that we score \(1\) if \(\hat{y}\) equals \(y\) and \(0\) otherwise: \(SR := \mathbb{I}(\hat{y} = y)\).

In practice, minimisation is often the goal in automated machine learning processes, so we usually talk about ‘losses’ (which are minimised) instead of scoring rules (which are maximised). Let’s therefore adapt SR slightly to the loss \(L := \mathbb{I}(\hat{y} \neq y)\). Putting all the above together, we get

\[L_P(\hat{p}, y) = \mathbb{I}(y \neq \mathbb{I}(\hat{p}(y) \geq 0.5))\]

This loss is interpretable and has a real-world meaning; in fact, averaged over a test set it’s just the mean misclassification error after discretising a probabilistic classification prediction. Now consider the following loss:

\[L_I(\hat{p}, y) = 1 - L_P(\hat{p}, y)\]

This follows the definition of a scoring rule/loss as it maps a distribution and value to a real-valued number, but the loss is also terrible as it assigns lower scores to worse predictions!

The difference between these losses is that the first is ‘proper’ whereas the latter is ‘improper’. A ‘proper’ loss is a loss that is minimised by the ‘correct’ prediction.

Another important property is strict properness. A loss is strictly proper if the loss is uniquely minimised by the ‘correct’ prediction. Let’s modify \(L_P\) slightly to become the squared difference between the true value and predicted probability (in fact this is the widely used Brier score (Brier 1950)):

\[L_S(\hat{p}, y) = (y - \hat{p}(y))^2\]

Now if we compare \(L_P\) and \(L_S\) across different values of \(y\) and \(\hat{p}_y\) (Table 8.1), we can easily see that whilst \(L_P\) provides some utility, this is limited as we’d have no way to know that some predictions are closer to the truth than others. On the other hand, \(L_S\) provides a quantitative method to compare predictions against the truth and between each other.

Table 8.1: Comparing proper (\(L_P\)) and strictly proper (\(L_S\)) scoring rules across different qualities of predictions.

| | \(y = 0\) | \(y = 1\) |
|---|---|---|
| \(\hat{p}_y = 0\) | \(L_P = 0; L_S = 0\) | \(L_P = 1; L_S = 1\) |
| \(\hat{p}_y = 0.3\) | \(L_P = 0; L_S = 0.09\) | \(L_P = 1; L_S = 0.49\) |
| \(\hat{p}_y = 0.6\) | \(L_P = 1; L_S = 0.36\) | \(L_P = 0; L_S = 0.16\) |
| \(\hat{p}_y = 1\) | \(L_P = 1; L_S = 1\) | \(L_P = 0; L_S = 0\) |
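As a rough illustration, the two losses can be evaluated over the same grid of predictions in a few lines of Python. This is a minimal sketch, assuming the prediction is summarised by `p1`, the predicted probability of event 1; the function names `loss_p` and `loss_s` are introduced here for illustration only.

```python
def loss_p(p1: float, y: int) -> int:
    """Misclassification loss L_P: threshold the predicted probability of event 1 at 0.5."""
    y_hat = 1 if p1 >= 0.5 else 0
    return int(y_hat != y)


def loss_s(p1: float, y: int) -> float:
    """Squared loss L_S between the outcome and the predicted probability of event 1."""
    return (y - p1) ** 2


# Evaluate both losses for each combination of prediction and outcome.
for p1 in (0.0, 0.3, 0.6, 1.0):
    for y in (0, 1):
        print(f"p1={p1:.1f}, y={y}: L_P={loss_p(p1, y)}, L_S={loss_s(p1, y):.2f}")
```

Note that `loss_p` cannot distinguish between predictions on the same side of the threshold, whereas `loss_s` changes smoothly with the prediction.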

Mathematically, a classification loss \(L: \mathcal{P}\times \mathcal{Y}\rightarrow \bar{\mathbb{R}}\) is proper if for any distributions \(p_Y,p\) in \(\mathcal{P}\) and for any random variable \(Y \sim p_Y\), it holds that \(\mathbb{E}[L(p_Y, Y)] \leq \mathbb{E}[L(p, Y)]\). The loss is strictly proper if, in addition, \(p = p_Y\) uniquely minimises the expected loss.

Proper losses provide a method of model comparison as, by definition, predictions closest to the true distribution will result in lower expected losses. Strictly proper losses have additional important uses such as in model optimisation, as minimisation of the loss will result in the ‘optimum score estimator based on the scoring rule’ (Gneiting and Raftery 2007). Whilst properness is usually a minimal acceptable property for a loss, it is generally not sufficient on its own. For example, consider the measure \(L(\hat{p}, y) = 0\), which is proper, as every prediction (including the correct one) attains the minimum value of \(0\), but is clearly useless.
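The definition of (strict) properness can be checked numerically for a binary outcome. The sketch below is an illustration under stated assumptions rather than a proof: it assumes \(Y \sim \text{Bernoulli}(q)\) with a hypothetical true probability `q = 0.7`, searches a grid of candidate predictions, and reports which predictions minimise the expected loss. All function names are introduced here for illustration.

```python
import math


def expected_loss(loss, p1: float, q: float) -> float:
    """E[L(p, Y)] for Y ~ Bernoulli(q) when the predicted probability of event 1 is p1."""
    return q * loss(p1, 1) + (1 - q) * loss(p1, 0)


def brier(p1: float, y: int) -> float:
    return (y - p1) ** 2


def log_loss(p1: float, y: int) -> float:
    # Negative log of the probability assigned to the observed outcome.
    p_obs = p1 if y == 1 else 1 - p1
    return -math.log(p_obs)


def misclassification(p1: float, y: int) -> float:
    # Thresholded loss: 1 if the discretised prediction differs from the outcome.
    return float((p1 >= 0.5) != bool(y))


q = 0.7  # hypothetical true probability of event 1
grid = [i / 100 for i in range(1, 100)]  # exclude 0 and 1 so the log is finite


def minimisers(loss):
    """All grid points attaining the smallest expected loss."""
    values = [expected_loss(loss, p, q) for p in grid]
    best = min(values)
    return [p for p, v in zip(grid, values) if math.isclose(v, best, abs_tol=1e-12)]


print(minimisers(brier))                   # a single minimiser at the true probability
print(minimisers(log_loss))                # a single minimiser at the true probability
print(len(minimisers(misclassification)))  # many minimisers: proper but not strictly proper
```

The thresholded loss is minimised by the true probability, but also by every other prediction on the same side of 0.5, which is why it is proper without being strictly proper.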

The two most widely used losses for classification are the Brier score (Brier 1950) and log loss (Good 1952), defined respectively by

\[ L_{brier}(\hat{p}, y) = (y - \hat{p}(y))^2 \]

and

\[ L_{logloss}(\hat{p}, y) = -\log \hat{p}(y) \]

These losses are visualised in Figure 8.1, which highlights that both losses are strictly proper (A. Philip Dawid and Musio 2014) as they are minimised when the correct prediction is made, and the loss converges to this minimum as predictions improve.

TODO
Figure 8.1: Brier and log loss scoring rules for a binary outcome and varying probabilistic predictions. x-axis is a probabilistic prediction in \([0,1]\), y-axis is Brier score (left) and log loss (right). Blue lines are varying Brier score/log loss over different predicted probabilities when the true outcome is 1. Red lines are varying Brier score/log loss over different predicted probabilities when the true outcome is 0. Both losses are minimised with the correct prediction, i.e. if \(\zeta.p(1) = 1\) when \(y = 1\) and \(\zeta.p(1) = 0\) when \(y = 0\) for a predicted discrete distribution \(\zeta\).

8.2 Survival Losses

We are now ready to list common scoring rules in survival analysis and discuss some of their properties. As with other chapters, this list is likely not exhaustive but will cover commonly used losses.

8.2.1 Integrated Graf Score

The Integrated Graf Score (IGS) was introduced by Graf (Graf and Schumacher 1995; Graf et al. 1999) as an analogue to the integrated Brier score (IBS) in regression. The loss is defined by

\[ \begin{split} L_{IGS}(\hat{S}, t, \delta|\hat{G}_{KM}) = \int^{\tau^*}_0 \frac{\hat{S}^2(\tau) \mathbb{I}(t \leq \tau, \delta=1)}{\hat{G}_{KM}(t)} + \frac{\hat{F}^2(\tau) \mathbb{I}(t > \tau)}{\hat{G}_{KM}(\tau)} \ d\tau \end{split} \tag{8.1}\]

where \(\hat{S}^2(\tau) = (\hat{S}(\tau))^2\) and \(\hat{F}^2(\tau) = (1 - \hat{S}(\tau))^2\), \(\tau^* \in \mathbb{R}_{\geq 0}\) is an upper threshold to compute the loss up to, and \(\hat{G}_{KM}\) is the Kaplan-Meier estimator fitted to the censoring distribution, which is used for inverse probability of censoring weighting (IPCW).

🪧 Learn more about IPCW

See Section 6.1 to learn more about IPCW.

To understand this loss, let’s break it down and look at the computations at a single time-point, \(\tau\). At \(\tau\) the loss will either be:

  1. \(\frac{\hat{S}^2(\tau)}{\hat{G}_{KM}(t)}\) - If the observation experiences the event at or before \(\tau\)
  2. 0 - If the observation is censored at or before \(\tau\)
  3. \(\frac{\hat{F}^2(\tau)}{\hat{G}_{KM}(\tau)}\) - If the observation’s outcome is after \(\tau\)

As we have no information about the true survival time of censored observations, it is sensible not to attempt a meaningful score once they are censored, so their contribution is \(0\). For observations that are known to have experienced the event by \(\tau\), we would expect their predicted survival probability to be zero, as the event has occurred (and they cannot continue to survive), hence they contribute \(\hat{S}^2(\tau)\). Dividing by \(\hat{G}_{KM}(t)\) fixes the weight of this contribution at the observed event time, and the weight is larger the more censoring has occurred by that time (i.e., the smaller \(\hat{G}_{KM}(t)\) is); whilst the observation was still alive (\(t > \tau\)), its contributions were instead weighted at the current time-point. Finally, for observations that are still alive at \(\tau\), we’d expect their survival probability to be as close to 1 as possible, hence they contribute \(\hat{F}^2(\tau)\) with inverse weighting at the current time-point, \(\hat{G}_{KM}(\tau)\). As \(\tau \rightarrow \infty\), \(\hat{G}_{KM}(\tau) \rightarrow 0\) as the number of observations remaining in the dataset decreases, hence this weighting ensures that observations that are still in the data can contribute as if all observations were still in the data.
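The three cases can be sketched in Python for a single observation. This is a minimal illustration, not a reference implementation: the predicted survival function `surv` and the censoring survival function `cens_surv` are passed in as plain callables, the exponential curves used below are hypothetical stand-ins for a fitted model and a fitted Kaplan-Meier estimator, and the integral is approximated with a simple midpoint rule.

```python
import math


def graf_score_at(tau, t, delta, surv, cens_surv):
    """Contribution of one observation (t, delta) to the IGS at a single time tau."""
    if t <= tau:
        if delta == 1:
            # Event at or before tau: squared survival probability, weighted at the event time.
            return surv(tau) ** 2 / cens_surv(t)
        # Censored at or before tau: no contribution.
        return 0.0
    # Outcome after tau: squared predicted CDF, weighted at the current time.
    return (1.0 - surv(tau)) ** 2 / cens_surv(tau)


def integrated_graf_score(t, delta, surv, cens_surv, tau_star, n_steps=1000):
    """Midpoint-rule approximation of the integral over [0, tau_star]."""
    width = tau_star / n_steps
    total = 0.0
    for i in range(n_steps):
        tau = (i + 0.5) * width
        total += graf_score_at(tau, t, delta, surv, cens_surv)
    return total * width


# Hypothetical predicted survival curve and censoring survival curve.
surv = lambda tau: math.exp(-0.2 * tau)
cens_surv = lambda tau: math.exp(-0.05 * tau)

print(integrated_graf_score(t=3.0, delta=1, surv=surv, cens_surv=cens_surv, tau_star=10.0))
print(integrated_graf_score(t=3.0, delta=0, surv=surv, cens_surv=cens_surv, tau_star=10.0))
```

The censored observation stops contributing after its censoring time, so its integrated score is smaller than that of an otherwise identical observation that experienced the event.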

When censoring is uninformative, the IGS consistently estimates the mean square error \(L(t, S|\tau^*) = \int^{\tau^*}_0 [\mathbb{I}(t > \tau) - S(\tau)]^2 d\tau\), where \(S\) is the correctly specified survival function (Gerds and Schumacher 2006). However, despite these promising properties, the IGS is improper and must therefore be used with care (Rindt et al. 2022; Sonabend 2022).

The reweighted IGS (RIGS) is a strictly proper outcome-independent loss (Sonabend 2022) that reweights the IGS by removing censored observations and reweighting the denominator:

\[ L_{RIGS}(\hat{S}, t, \delta|\hat{G}_{KM}) = \frac{\delta \int_{\mathcal{T}} (\mathbb{I}(t \leq \tau) - \hat{F}(\tau))^2 \ d\tau}{\hat{G}_{KM}(t)} \]

This loss removes all censored observations, which can be problematic if the proportion of censoring is high. For uncensored observations, we expect the predicted survival probability to be \(1\) before the outcome is observed and \(0\) afterwards, which follows the integrated Brier score more closely. Changing the weighting slightly changes the interpretation of contributions at each time-point. In the original IGS, we may think of the weighting as “inverse weighting for as long as the observation remains in the data”, which means the weight of a contribution will differ across observations and across time-points. For the RIGS, on the other hand, we weight by the outcome time of each observation, which remains the same over time. Hence we instead inflate scores for observations whose outcomes occur later in the dataset; this is intuitive as it essentially places more importance on observations that are representative of being alive at those time-points.
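A sketch of the reweighted loss for a single observation is given below, under the same kind of assumptions as before: `surv` and `cens_surv` are hypothetical callables for the predicted survival function and the censoring survival function, and the integral over \(\mathcal{T}\) is truncated at an assumed upper limit `tau_star` and approximated with a midpoint rule.

```python
import math


def rigs(t, delta, surv, cens_surv, tau_star, n_steps=1000):
    """Reweighted integrated Graf score for one observation (midpoint-rule approximation)."""
    if delta == 0:
        # Censored observations are removed from the loss entirely.
        return 0.0
    width = tau_star / n_steps
    total = 0.0
    for i in range(n_steps):
        tau = (i + 0.5) * width
        event_indicator = 1.0 if t <= tau else 0.0  # I(t <= tau)
        cdf = 1.0 - surv(tau)                       # predicted F(tau)
        total += (event_indicator - cdf) ** 2
    # A single weight per observation, fixed at its outcome time.
    return total * width / cens_surv(t)


# Hypothetical censoring survival curve and two candidate predictions for an event at t = 2.
cens_surv = lambda tau: math.exp(-0.05 * tau)
perfect = lambda tau: 1.0 if tau < 2.0 else 0.0  # step function dropping at the event time
vague = lambda tau: math.exp(-0.2 * tau)

print(rigs(2.0, 1, perfect, cens_surv, tau_star=5.0))
print(rigs(2.0, 1, vague, cens_surv, tau_star=5.0))
```

The step-function prediction that drops from 1 to 0 exactly at the event time attains a loss of zero, whereas the smoother curve is penalised.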

As the loss is strictly proper, it may be ‘safer’ to use than the IGS in automated experiments; however, this does come at the expense of removing censored observations.

8.2.2 Integrated Survival Log Loss

The integrated survival log loss (ISLL) was also proposed by Graf et al. (1999) and is defined by

\[ L_{ISLL}(\hat{S},t,\delta|\hat{G}_{KM}) = -\int^{\tau^*}_0 \frac{\log[\hat{F}(\tau)] \mathbb{I}(t \leq \tau, \delta=1)}{\hat{G}_{KM}(t)} + \frac{\log[\hat{S}(\tau)] \mathbb{I}(t > \tau)}{\hat{G}_{KM}(\tau)} \ d\tau \]

where \(\tau^* \in \mathcal{T}\) is an upper threshold to compute the loss up to.

Similarly to the IGS, there are three ways to contribute to the loss depending on whether an observation is censored, has experienced the event, or is still alive at \(\tau\). Whilst the IGS is routinely used in practice, there is no evidence that the ISLL is used, and moreover there are no proofs (or claims) that it is proper.

The reweighted ISLL (RISLL) follows similarly to the RIGS and is also outcome-independent strictly proper (Sonabend 2022).

\[ L_{RISLL}(\hat{S}, t, \delta|\hat{G}_{KM}) = -\frac{\delta \int_{\mathcal{T}} \mathbb{I}(t \leq \tau)\log[\hat{F}(\tau)] + \mathbb{I}(t > \tau)\log[\hat{S}(\tau)] \ d\tau}{\hat{G}_{KM}(t)} \]

8.2.3 Survival Density Log Loss

Another outcome-independent strictly proper scoring rule is the survival density log loss (SDLL) (Sonabend 2022), which is given by

\[ L_{SDLL}(\hat{f}, t, \delta|\hat{G}_{KM}) = - \frac{\delta \log[\hat{f}(t)]}{\hat{G}_{KM}(t)} \]

where \(\hat{f}\) is the predicted probability density function. This loss is essentially the classification log loss (\(-\log \hat{p}(t)\)) with added IPCW. Whilst the classification log loss has beneficial properties such as being differentiable, this is more complex for the SDLL, which is also only an approximate loss. A useful alternative to the SDLL, which can be readily used in automated procedures, is the right-censored log loss.

8.2.4 Right-Censored Log Loss

The right-censored log loss (RCLL) is an outcome-independent strictly proper scoring rule (Avati et al. 2020) that does not make use of IPCW and is thus not considered to be an approximate loss. The RCLL is defined by

\[ L_{RCLL}(\hat{S}, t, \delta) = -\log[\delta\hat{f}(t) + (1-\delta)\hat{S}(t)] \]

This loss is easily interpretable when we break it down into its two halves:

  1. If an observation is censored at \(t\) then all the information we have is that they did not experience the event by that time, so they must be ‘alive’; hence the optimal value is \(\hat{S}(t) = 1\) (which gives a loss of \(-\log(1) = 0\)).
  2. If an observation experiences the event then the ‘best’ prediction is for the probability of the event at that time to be maximised; as pdfs are not upper-bounded, this means \(\hat{f}(t) \rightarrow \infty\) (and the loss \(-\log \hat{f}(t) \rightarrow -\infty\) as \(\hat{f}(t) \rightarrow \infty\)).
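The two halves translate directly into code. The following is a minimal sketch for a single observation, assuming the predicted density and survival function are available as callables (`dens` and `surv`, names introduced here); the exponential distribution with rate 0.2 is a hypothetical prediction chosen only because both functions have closed forms.

```python
import math


def rcll(t, delta, dens, surv):
    """Right-censored log loss for one observation (t, delta)."""
    if delta == 1:
        # Event observed: score the predicted density at the event time.
        return -math.log(dens(t))
    # Censored: score the predicted probability of surviving beyond t.
    return -math.log(surv(t))


# Hypothetical exponential prediction with rate 0.2.
rate = 0.2
dens = lambda t: rate * math.exp(-rate * t)
surv = lambda t: math.exp(-rate * t)

print(rcll(5.0, 1, dens, surv))  # event at t = 5
print(rcll(5.0, 0, dens, surv))  # censored at t = 5
```

No censoring distribution is needed as an input, which reflects the fact that the loss does not use IPCW.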

8.2.5 Absolute Survival Loss

The absolute survival loss (ASL), developed over time by Schemper and Henderson (2000) and Schmid et al. (2011), is based on the mean absolute error and is very similar to the IGS but removes the squared term:

\[ L_{ASL}(\hat{S}, t, \delta|\hat{G}_{KM}) = \int^{\tau^*}_0 \frac{\hat{S}(\tau)\mathbb{I}(t \leq \tau, \delta = 1)}{\hat{G}_{KM}(t)} + \frac{\hat{F}(\tau)\mathbb{I}(t > \tau)}{\hat{G}_{KM}(\tau)} \ d\tau \]

where \(\hat{G}_{KM}\) and \(\tau^*\) are as defined above. Analogously to the IGS, the ASL consistently estimates the mean absolute error when censoring is uninformative (Schmid et al. 2011), but there are also no proofs or claims of properness. The ASL and IGS tend to yield similar results (Schmid et al. 2011), but in practice there is no evidence of the ASL being widely used.

8.3 Prediction Error Curves

As well as evaluating probabilistic outcomes with integrated scoring rules, non-integrated scoring rules can be utilised for evaluating distributions at a single point. For example, instead of evaluating a probabilistic prediction with the IGS over \(\mathbb{R}_{\geq 0}\), one could compute the IGS at a single time-point, \(\tau \in \mathbb{R}_{\geq 0}\), only. Plotting these for varying values of \(\tau\) results in ‘prediction error curves’ (PECs), which provide a simple visualisation of how predictions vary over the outcome. PECs are especially useful for survival predictions as they can visualise the prediction ‘over time’. PECs should only be used as a graphical guide and never for model comparison as they only provide information at a limited number of points. An example is provided in Figure 8.2 for the IGS, where the Cox PH consistently outperforms the SVM.
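The values behind such a curve can be computed by averaging the pointwise Graf score over a test set at each time-point. The sketch below is illustrative only: the test data, the per-observation survival predictions (`survs`) and the censoring survival function (`cens_surv`) are all hypothetical, and plotting is left out so that the example stays self-contained.

```python
import math


def graf_score_at(tau, t, delta, surv, cens_surv):
    """Pointwise (non-integrated) Graf score of one observation at time tau."""
    if t <= tau:
        return surv(tau) ** 2 / cens_surv(t) if delta == 1 else 0.0
    return (1.0 - surv(tau)) ** 2 / cens_surv(tau)


def prediction_error_curve(taus, times, deltas, survs, cens_surv):
    """Mean pointwise Graf score over the test set at each tau in taus."""
    curve = []
    for tau in taus:
        scores = [
            graf_score_at(tau, t, d, s, cens_surv)
            for t, d, s in zip(times, deltas, survs)
        ]
        curve.append(sum(scores) / len(scores))
    return curve


# Hypothetical test set: outcome times, event indicators and one predicted
# survival function per observation (exponential curves with different rates).
times = [1.0, 2.5, 4.0]
deltas = [1, 0, 1]
survs = [lambda tau, r=r: math.exp(-r * tau) for r in (0.8, 0.3, 0.2)]
cens_surv = lambda tau: math.exp(-0.05 * tau)

taus = [0.5, 1.5, 3.0, 5.0]
for tau, err in zip(taus, prediction_error_curve(taus, times, deltas, survs, cens_surv)):
    print(f"tau={tau:.1f}: mean Graf score={err:.3f}")
```

Plotting the returned values against `taus` for two or more models gives curves of the kind described above.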

TODO
Figure 8.2: Prediction error curves for the CPH and SVM models from Chapter 7. x-axis is time and y-axis is the IGS computed at different time-points. The CPH (red) performs better than the SVM (blue) as it scores consistently lower. Trained and tested on randomly simulated data from \(\textbf{mlr3proba}\).

8.4 Baselines and ERV

A common criticism of scoring rules is a lack of interpretability. For example, an IGS of 0.5 or 0.0005 has no meaning by itself, so below we present two methods to help overcome this problem.

The first method is to make use of baselines for model comparison, which are models or values that can be utilised to provide a reference for a loss; they provide a universal method to judge all models of the same class by (Gressmann et al. 2018). In classification, it is possible to derive analytical baseline values. For example, a Brier score is considered ‘good’ if it is below 0.25, and a log loss if it is below 0.693 (Figure 8.1); this is because these are the values obtained if you always predicted probabilities as \(0.5\), which is a reasonable baseline guess in a binary classification problem. In survival analysis, simple analytical expressions are not possible as losses are dependent on the unknown distributions of both the survival and censoring time. Therefore all experiments in survival analysis must include a baseline model that can produce a reference value in order to derive meaningful results. A suitable baseline model is the Kaplan-Meier estimator (Graf and Schumacher 1995; Lawless and Yuan 2010; Royston and Altman 2013), which is the simplest model that can consistently estimate the true survival function.
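Both kinds of baseline can be illustrated briefly. The sketch below first reproduces the analytical classification baselines for a constant prediction of 0.5, then gives a plain-Python Kaplan-Meier estimator of the kind that could serve as a survival baseline. The function name `kaplan_meier` and the toy data are assumptions made for illustration; in practice an established survival library would normally be used.

```python
import math

# Analytical classification baselines: always predict a probability of 0.5.
brier_baseline = (1 - 0.5) ** 2     # identical whether y = 0 or y = 1
log_loss_baseline = -math.log(0.5)  # equals log(2)
print(brier_baseline, round(log_loss_baseline, 3))


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability) pairs."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for u in event_times:
        at_risk = sum(1 for t in times if t >= u)
        deaths = sum(1 for t, e in zip(times, events) if t == u and e == 1)
        # Multiply by the conditional probability of surviving past u.
        surv *= 1 - deaths / at_risk
        curve.append((u, surv))
    return curve


# Toy data: the observation at time 2 is censored (event indicator 0).
print(kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1]))
```

The Kaplan-Meier curve ignores all covariates, so any model that uses covariates sensibly would be expected to achieve a lower loss than this baseline.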

As well as directly comparing losses from a ‘sophisticated’ model to a baseline, one can also compute the percentage increase in performance between the sophisticated and baseline models, which produces a measure of explained residual variation (ERV) (Edward L. Korn and Simon 1990; Edward L. Korn and Simon 1991). For any survival loss \(L\), the ERV is

\[ R_L(S, B) = 1 - \frac{L|S}{L|B} \]

where \(L|S\) and \(L|B\) are the losses computed with respect to predictions from the sophisticated and baseline models respectively.

The ERV interpretation makes reporting of scoring rules easier within and between experiments. For example, say in experiment A we have \(L|S = 0.004\) and \(L|B = 0.006\), and in experiment B we have \(L|S = 4\) and \(L|B = 6\). The sophisticated model may appear worse at first glance in experiment A (as the losses are very close) but when considering the ERV we see that the performance increase is identical (both \(R_L = 33\%\)), thus providing a clearer way to compare models.
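The arithmetic in this example is simple enough to check directly. The helper below is a minimal sketch (the name `erv` is introduced here) that applies the formula to the loss values from experiments A and B.

```python
def erv(loss_model: float, loss_baseline: float) -> float:
    """Explained residual variation: 1 - (L|S) / (L|B)."""
    if loss_baseline == 0:
        raise ValueError("The baseline loss must be non-zero.")
    return 1 - loss_model / loss_baseline


erv_a = erv(0.004, 0.006)  # experiment A
erv_b = erv(4, 6)          # experiment B
print(f"{erv_a:.1%}, {erv_b:.1%}")
```

Both experiments give an ERV of roughly one third, even though the raw losses differ by three orders of magnitude.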