3 Statistical Learning

Abstract

TODO (150-200 WORDS)

Major changes expected!

This page is a work in progress and major changes will be made over time.

TODO

\(\mathcal{D}\) - labeled data set with \(n\) observation
Observations: \((\mathbf{x}^{(i)}, y^{(i)})\) where \(\mathbf{x}^{(i)} \in \mathcal{X}\) is a \(p\)-dimensional feature vector and \(y^{(i)} \in \mathcal{Y}\) is a label
Data set: \(\mathcal{D}= ((\mathbf{x}^{(1)}, y^{(1)}) , . . . , (\mathbf{x}^{(n)}, y^{(n)}))\)
Assume \(\mathcal{D}\stackrel{i.i.d.}\sim(\mathbb{P}_{xy})^n\) (unknown distribution)
ML model: \(f : \mathcal{X}\rightarrow \mathbb{R}^g\) assigning a prediction in \(\mathbb{R}^g\)
ML learners, \(\mathcal{I}\), configured by hyperparameters \(\lambda \in \Lambda\), are \(\mathcal{I}: \mathbb{D} \times \Lambda \rightarrow \mathcal{H}, (\mathcal{D}, \lambda) \mapsto \hat{f}\) where \(\mathcal{H}\) is the function space to which a model belongs and \(\hat{f}\) is a trained model with learned hyperparameters \(\hat{\theta} \in \mathcal{H}\).
Models are evaluated with loss functions which measure the discrepancy between predictions and true values \(L: \mathcal{Y}\times \mathbb{R}^g \rightarrow \mathbb{R}, (\hat{f}(\mathbf{x}), \mathbf{y}) \mapsto l(\hat{f}(\mathbf{x}), \mathbf{y})\)
To prevent overfitting, models are evaluated on unseen test data to ensure unbiased performance estimation and the generalization error \(GE(\mathcal{I}, \lambda, n_{train}, l) := \lim_{n_{test} \rightarrow \infty} \mathbb{E}_{\mathcal{D}_{train},\mathcal{D}_{test}\sim\mathbb{P}_{xy}}[l(\mathbf{y}_{test}, \mathbf{F}_{\mathcal{D}_{test},\mathcal{I}(\mathcal{D}_{train},\lambda)})]\) where \(\mathbf{F}_{\mathcal{D}_{test},\mathcal{I}(\mathcal{D}_{train},\lambda)}\) is the matrix of predictions from a model trained on \(\mathcal{D}_{train}\) and making predictions on \(\mathcal{D}_{test}\).

3.1 Machine Learning

This section begins with a very brief introduction to machine learning and a focus on regression and classification; the survival machine learning task is then introduced (Section 6.2). Of the many fields within machine learning (ML), the scope of this book is narrowed to supervised learning. Supervised learning is the sub-field of ML in which predictions are made for outcomes based on data with observed dependent and independent variables. For example predicting someone’s height is a supervised learning problem as data can be collected for features (independent variables) such as age and sex, and outcome (dependent variable), which is height. Predictive survival analysis problems fall naturally in the supervised learning framework as there are identifiable features and (multiple types of) outcomes.

3.1.1 Terminology and Methods

Common supervised learning methods are discussed in a simplified setting with features \(X \ t.v.i. \ \mathcal{X}\) and outcomes \(Y \ t.v.i. \ \mathcal{Y}\); usually outcomes are referred to as ‘targets’ (a ‘target for prediction’). Let \(\mathcal{D}_0 = \{(X_1,Y_1),...,(X_n,Y_n)\}\) be a (training) dataset where \((X_i, Y_i) \stackrel{i.i.d.}\sim(X, Y)\). The methods below extend naturally to the survival setting.

Strategies and Models

In order to clearly separate between similar objects, several terms for machine learning are now introduced and clearly distinguished.

Let \(g: \mathcal{X}\rightarrow \mathcal{Y}\) be the true (but unknown) mapping from the features to outcomes, referred to as the true prediction functional. Let \(\mathcal{G}\) be the set of prediction functionals such that \(\forall \Upsilon \in \mathcal{G}, \Upsilon: \mathcal{X}\rightarrow \mathcal{Y}\). A learning or fitting algorithm is defined to be any function of the form \(\mathcal{A}: \mathcal{X}^n \times \mathcal{Y}^n \rightarrow \mathcal{G}\). The goal of supervised learning is to learn \(g\) with a learning algorithm fit on (i.e. the input to the algorithm is) training data, \(\hat{g}:= \mathcal{A}(\mathcal{D}_{train}) \in \mathcal{G}\). Note that \(\hat{g}\) may take hyper-parameters that can be set or tuned (see below). The learning algorithm is ‘good’ if \(\hat{g}(X) \approx g(X)\) (see ‘Evaluation’ below).

The learning algorithm is determined by the chosen learning strategy and model, where a model is a complete specification of a learning strategy including hyper-parameters. These terms are more clearly illustrated by example:

Learning strategy – simple linear regression
Model – \(y = \beta_0 + \beta_1 x\) where \(x \in \mathbb{R}\) is a single covariate, \(y \in \mathbb{R}\) is the target, and \(\beta_0,\beta_1 \in \mathbb{R}\) are model coefficients.
Learning algorithm (model fitting) – Minimise the residual sum of squares: \((\hat{\beta_0}, \hat{\beta_1}) := \operatorname{argmin}_{\beta_0,\beta_1} \{\sum^n_{i=1} (y_i - \beta_0 - \beta_1 x_i)^2\}\) for \((x_i,y_i) \in \mathcal{D}_{train}, i = 1,...,n\).
Prediction functional – \(\hat{g}(x) = \hat{\beta_0} + \hat{\beta_1}x\)

To further illustrate the difference between learning strategy and model, note that the same learning strategy ‘simple linear regression’ could either utilise the model above or instead a model without intercept, \(y = \beta x\), in which case the learning algorithm and prediction functional would also be modified.

The model in (ii) is called unfitted as the model coefficients are unknown and the model cannot be used for prediction. After step (iii) the model is said to be fit to the training data and therefore the model is fitted. It is common to refer to the learning algorithm (and associated hyper-parameters) as the unfitted model and to refer to the prediction functional (and associated hyper-parameters) as the fitted model.

Evaluation

Models are evaluated by evaluation measures called losses or scores, \(L: \mathcal{Y}\times \mathcal{Y}\rightarrow \bar{\mathbb{R}}\). Let \((X^*, Y^*) \sim (X,Y)\) be test data (i.e. independent of \(\mathcal{D}_{train}\)) and let \(\hat{g}: \mathcal{X}\rightarrow \mathcal{Y}\) be a prediction functional fit on \(\mathcal{D}_{train}\), then these evaluation measures determine how closely predictions, \(\hat{g}(X^*)\), relate to the truth, \(Y^*\), thereby providing a method for determining if a model is ‘good’.

Task

A machine learning task is a simple mechanism to outline the problem of interest by providing: i) the data specification; ii) the definition of learning; iii) the definition of success (when is a prediction ‘good’?) (Király et al. 2021). All tasks in this paper have the same definitions of learning and success. For (ii), the aim is to learn the true prediction functional, \(g\), by fitting the learning algorithm on training data, \(\hat{g}:= \mathcal{A}(\mathcal{D}_0)\). For (iii), a predicted functional is considered ‘good’ if the expected generalization error, \(\mathbb{E}[L(Y^*, \hat{g}(X^*))]\), is low, where \((X^*, Y^*) \sim (X,Y)\) is independent of the training data \(\mathcal{D}_0\), and \(L\) is some loss that is chosen according to the domain of interest (regression, classification, survival).

Resampling

Models are tested on their ability to make predictions. In order to avoid ‘optimism of training error’ (James et al. 2013) – overconfidence caused by testing the model on training data – models are tested on previously unseen or ‘held-out’ data. Resampling is the procedure of splitting one dataset into two or more for separated training and testing. In this paper only two resampling methods are utilised: holdout and cross-validation. Holdout is the process of splitting a primary dataset into training data for model fitting and testing data for model predicting. This is an efficient method but may not accurately estimate the expected generalisation error for future model performance, instead this is well-estimated by \(K\)-fold cross-validation (KCV) (Hastie, Tibshirani, and Friedman 2001). In KCV, data is split into \(K \in \mathbb{N}_{> 0}\) ‘folds’ such that \(K-1\) of the folds are used for model training and the final \(K\)th fold for testing. The testing fold is iterated over all \(K\) folds, so that each at some point is used for testing and then training (though never at the same time). In each iteration the model is fit on the training folds, and predictions are made and evaluated on the testing fold, giving a loss \(L_k := L(\hat{g}(X^k), Y^k)\), where \((X^k, Y^k)\) are data from the \(k\)th fold. A final loss is defined by, \(L^* := \frac{1}{K} \sum^K_{k = 1} L_k\). Commonly \(K = 5\) or \(K = 10\) (Breiman and Spector 1992; Kohavi 1995).

Model Performance Benchmarking

Whilst benchmarking often refers to speed tests, i.e. the time taken to complete an operation, it can also refer to any experiment in which objects (mathematical or computational) are compared. In this report, a benchmark experiment will either refer to the comparison of multiple models’ predictive abilities, or comparison of computational speeds and object sizes for model fitting; which of these will be clear from context.

Model Comparison

Models can be analytically compared on how well they make predictions for new data. Model comparison is a complex topic with many open questions (Demšar 2006; Dietterich 1998; Nadeau and Bengio 2003) and as such discussion is limited here. When models are compared on multiple datasets, there is more of a consensus in how to evaluate models (Demšar 2006) and this is expanded on further in (Sonabend 2021). Throughout this book there are small simulation experiments for model comparison on single datasets however as these are primarily intended to aid exposition and not to generalise results, it suffices to compare models with the conservative method of constructing confidence intervals around the sample mean and standard error of the loss when available (Nadeau and Bengio 2003).

Hyper-Parameters and Tuning

A hyper-parameter is a model parameter that can be set by the user, as opposed to coefficients that are estimated as part of model fitting. A hyper-parameter can be set before training, or it can be tuned. Tuning is the process of choosing the optimal hyper-parameter value via automation. In the simplest setting, tuning is performed by selecting a range of values for the hyper-parameter(s) and treating each choice (combination) as a different model. For example if tuning the number of trees in a random forest (Chapter 15), \(m_r\), then a range of values, say \(100, 200, 500\) are chosen, and three models \(m_{r100}, m_{r200}, m_{r500}\) are benchmarked. The optimal hyper-parameter is given by whichever model is the best performing. Nested resampling is a common method to prevent overfitting that could occur from using overlapping data for tuning, training, or testing. Nested resampling is the process of resampling the training set again for tuning.

3.1.2 Machine Learning in Classification and Regression

Before introducing machine learning for survival analysis, which is considered ‘non-classical’, the more standard classification and regression set-ups are provided; these are referenced throughout this book.

3.1.2.1 Classification

Classification problems make predictions about categorical (or discrete) events, these may be deterministic or probabilistic. Deterministic classification predicts which category an observation falls into, whereas probabilistic classification predicts the probability of an observation falling into each category. In this brief introduction only binary single-label classification is discussed, though the multi-label case is considered in \(\ref{sec:car_reduxes_r7_mlc}\). In binary classification, there are two possible categories an observation can fall into, usually referred to as the ‘positive’ and ‘negative’ class. For example predicting the probability of death due to a virus is a probabilistic classification task where the ‘positive’ event is death.

A probabilistic prediction is more informative than a deterministic one as it encodes uncertainty about the prediction. For example it is clearly more informative to predict a \(70%\) chance of rain tomorrow instead of simply ‘rain’. Moreover the latter prediction implicitly contains an erroneous assumption of certainty, e.g. ‘it will rain tomorrow’.

Classification Task

Box 3.1 Let \((X,Y)\) be random variables t.v.i. \(\mathcal{X}\times \mathcal{Y}\) where \(\mathcal{X}\subseteq \mathbb{R}^p\) and \(\mathcal{Y}= \{0, 1\}\). Then,

The probabilistic classification task is the problem of predicting the probability of a single event taking place and is specified by \(g: \mathcal{X}\rightarrow [0, 1]\).
The deterministic classification task is the problem of predicting if a single event takes place and is specified by \(g: \mathcal{X}\rightarrow \mathcal{Y}\).

The estimated prediction functional \(\hat{g}\) is fit on training data \\((X_1,Y_1),...,(X_n,Y_n) \stackrel{i.i.d.}\sim(X,Y)\) and is considered ‘good’ if \(\mathbb{E}[L(Y^*, \hat{g}(X^*))]\) is low, where \((X^*, Y^*) \sim (X, Y)\) is independent of \((X_1,Y_1),...,(X_n,Y_n)\) and \(\hat{g}\).

In the probabilistic case, the prediction \(\hat{g}\) maps to the estimated probability mass function \(\hat{p}_Y\) such that \(\hat{p}_Y(1) = 1 - \hat{p}_Y(0)\).

3.1.2.2 Regression

A regression prediction is one in which the goal is to predict a continuous outcome from a set of features. For example predicting the time until an event (without censoring) occurs, is a regression problem.

Regression Task

Box 3.2 Let \((X,Y)\) be random variables t.v.i. \(\mathcal{X}\times \mathcal{Y}\) where \(\mathcal{X}\subseteq \mathbb{R}^p\) and \(\mathcal{Y}\subseteq \mathbb{R}\). Let \(\mathcal{S}\subset \operatorname{Distr}(\mathcal{Y})\) be a convex set of distributions on \(\mathcal{Y}\). Then,

The probabilistic regression task is the problem of predicting a conditional distribution over the Reals and is specified by \(g : \mathcal{X}\rightarrow \mathcal{S}\).
The deterministic regression task is the problem of predicting a single continuous value in the Reals and is specified by \(g: \mathcal{X}\rightarrow \mathcal{Y}\).

Whilst regression can be either probabilistic or deterministic, the latter is much more common and therefore in this book ‘regression’ refers to the deterministic case unless otherwise stated.