16  Choosing Models

Abstract
This chapter provides practical guidance for selecting models in survival analysis, emphasizing that many principles from regression and classification extend naturally to time-to-event settings. In low-dimensional data, traditional statistical models often perform as well as, or better than, more complex machine learning approaches, while offering advantages in interpretability and computational efficiency. In higher-dimensional settings, machine learning methods become more competitive, and model selection typically focuses on four main classes: random forests, boosting methods, support vector machines, and neural networks. Random survival forests and boosting approaches are presented as strong general-purpose methods, with forests offering robustness and ease of tuning, and boosting providing greater flexibility at the cost of increased complexity. Support vector machines are generally not recommended due to limited empirical success. Neural networks are highly data-dependent and require substantial tuning, but may offer advantages when integrating non-tabular data. The chapter concludes with a short discussion of model interpretability, including the use of SHAP and LIME.

In contrast to measure selection, model selection is more straightforward: the same heuristics from regression and classification largely apply to survival analysis. For low-dimensional data, many experiments have demonstrated that machine learning may not improve upon standard statistical methods (Christodoulou et al. 2019), and the same holds for survival analysis (Burk et al. 2024). The costs that come with machine learning – lower interpretability, longer training times – are therefore unlikely to be offset by performance gains when a dataset has relatively few covariates. In settings where machine learning is more useful, the choice largely falls to the four model classes discussed in this book: random forests, support vector machines, boosting, and neural networks (deep learning). If you have access to sufficient computational resources, it is always worthwhile to include at least one model from each class in a benchmark experiment, as models perform differently depending on the data type. Without significant resources, the rules of thumb below can provide a starting point for smaller experiments.

Random survival forests and boosting methods are both good all-purpose methods that can handle different censoring types and competing risks settings. In single-event settings both have been shown to perform well on high-dimensional data, outperforming other model classes (Spooner et al. 2020). Forests require less tuning than boosting methods and the choice of hyperparameters is often more intuitive; we therefore generally recommend forests as the first choice for high-dimensional data. Given more resources, boosting methods such as XGBoost can substantially improve on the predictive performance of traditional survival models. Survival support vector machines do not appear to work well in practice, and to date we have not seen any real-world use of SSVMs; we therefore generally do not recommend SVMs without robust training and testing first.

Neural networks are highly data-dependent. Moreover, despite a huge increase in research in this area (Wiegrebe et al. 2024), there are no clear heuristics for when to use neural networks, nor for which particular algorithms to choose. With enough fine-tuning we have found that neural networks can perform well, but still without outperforming other methods. Where neural networks may shine is in going beyond tabular data to incorporate other modalities, but this area of research for survival analysis is still nascent.

Interpreting survival models

Interpreting models is increasingly important as practitioners rely on more complex ‘black-box’ models (Molnar 2019). Classic methods that test whether a model fits the data well, such as the AIC and BIC, have been extended to survival models; however, they are limited in application to the foundational survival models discussed in Chapter 11. As a more flexible alternative, any of the calibration measures in Chapter 7 can be used to evaluate a model’s fit to data. To assess algorithmic fairness, the majority of measures discussed in Part II can be used to detect bias in a survival context (Sonabend et al. 2022). Gold-standard interpretability methods such as SHAP and LIME (Molnar 2019) can be extended to survival analysis off-the-shelf (Langbein et al. 2024), and time-dependent extensions also exist to observe the impact of variables on the survival probability over time (Krzyziński et al. 2023; Langbein et al. 2024).
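SHAP and LIME themselves are best used via dedicated implementations, but the model-agnostic idea underlying them can be illustrated compactly. The sketch below computes permutation importance for a survival risk score, measuring the drop in Harrell's C-index when each feature is shuffled; the "model" is a synthetic stand-in for a fitted risk prediction function, and everything here uses only NumPy.

```python
import numpy as np

def cindex(event, time, risk):
    """Harrell's C: fraction of comparable pairs correctly ordered by risk."""
    conc = comp = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:            # pair (i, j) comparable only if i has an event
            continue
        for j in range(n):
            if time[i] < time[j]:
                comp += 1
                conc += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    return conc / comp

# Synthetic data: only feature 0 drives the hazard.
rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
t = rng.exponential(scale=np.exp(-2.0 * X[:, 0]))
c = rng.exponential(scale=2.0, size=n)
event, obs = t <= c, np.minimum(t, c)

model = lambda X: 2.0 * X[:, 0]     # stand-in for a fitted model's risk score
base = cindex(event, obs, model(X))

importance = {}
for j in range(p):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature's association
    importance[j] = base - cindex(event, obs, model(Xp))

print({j: round(v, 3) for j, v in importance.items()})
```

Shuffling the influential feature degrades discrimination markedly, while the irrelevant features have near-zero importance; SHAP and LIME refine this idea to local, per-observation attributions, and the time-dependent survival extensions cited above track such attributions across the follow-up period.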