16  Choosing Models

In contrast to measure selection, selecting models is more straightforward, and the same heuristics from regression and classification largely apply to survival analysis. First, for low-dimensional data, many experiments have demonstrated that machine learning may not improve upon more standard statistical methods (Christodoulou et al. 2019), and the same holds for survival analysis (Burk et al. 2026). Therefore, the costs of machine learning, namely lower interpretability and longer training times, are unlikely to be offset by performance gains when a dataset has relatively few covariates, especially when the sample size is also small. In settings where machine learning is more useful, the choice largely falls among the four model classes discussed in this book: random survival forests, survival support vector machines, gradient boosting machines, and neural networks (deep learning). In benchmark experiments with sufficient data and computational resources, it is sensible to include models of varying complexity, from featureless and linear models to models that capture non-linear effects and interactions. Without such resources, however, the rules of thumb below can provide a starting point for smaller experiments.
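As an illustration of the simplest end of that range, a featureless baseline can be obtained from the Kaplan-Meier estimator, which ignores all covariates and predicts the same survival curve for every observation. The sketch below is a minimal pure-Python version for right-censored data; the function names `kaplan_meier` and `predict_survival` are hypothetical and chosen for this example only, and a real benchmark would normally rely on an established implementation instead.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Kaplan-Meier estimate for right-censored data.

    times:  observed times (event or censoring time) per observation
    events: 1 if the event was observed, 0 if right-censored
    Returns a list of (t, S(t)) pairs, one per distinct event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # events and censorings both leave the risk set

    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


def predict_survival(curve, t):
    """Step-function lookup: estimated S(t), identical for every observation."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


# Toy data (made up for illustration): five observations, two censored
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve)
print(predict_survival(curve, 2.5))
```

Any model that cannot outperform this baseline on the chosen evaluation measures is not using the covariates in a useful way, which is why such a baseline is worth including even in small experiments.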

Random survival forests and boosting methods are strong all-purpose choices that can handle different censoring types and competing risks settings. In single-event settings, both have been shown to perform well on high-dimensional data, outperforming other model classes (Spooner et al. 2020). Forests often work well without hyperparameter tuning and may therefore be a sensible first choice for high-dimensional data.

Survival support vector machines have generally shown limited empirical success and appear to have seen little real-world adoption; moreover, their runtime can be substantial, taking hours to produce models that may offer little (if any) benefit over simpler models (Burk et al. 2026; Fouodo et al. 2018; Pölsterl et al. 2015). Consequently, they are not generally recommended as a first-choice method.

Among machine learning methods, deep learning has become the dominant paradigm across many domains, and current research trends point to a similar trajectory in survival analysis (Wiegrebe et al. 2024). Neural network performance remains highly data-dependent. There are clear situations in which they are preferred or required, for example when handling image data such as MRI scans, or when identifying patterns in very large datasets such as omics data, yet there are no firm heuristics for selecting one architecture over another.

While deep learning has traditionally required large amounts of data, this constraint is increasingly relaxed by transfer learning, in which large pre-trained models are fine-tuned to a specific context, making powerful architectures usable on smaller datasets, particularly for feature extraction. This is especially valuable in multi-modal settings, where image or text inputs are combined with tabular covariates. The development of foundation models opens further avenues, potentially leveraging general representations to benefit low-resource time-to-event problems. A recent example is prior-data fitted networks (PFNs), which learn to approximate Bayesian inference so that predictions can be obtained in a single forward pass. PFNs have been adapted to right-censored time-to-event data, enabling individualized survival prediction without dataset-specific training or tuning (Seletkov et al. 2026; Qi et al. 2026).

In practice, many real-world applications go beyond one-off analyses and require custom architectures developed with frameworks such as PyTorch (Paszke et al. 2017) or TensorFlow (Abadi et al. 2015) rather than off-the-shelf implementations.

Interpreting survival models

Interpreting models is increasingly important as practitioners rely on more complex black-box models (Molnar 2019). Classic methods for model comparison, such as the AIC and BIC, have been extended to survival models, though their application is limited to the core survival models (Chapter 11) only (Liang and Zou 2008; Volinsky and Raftery 2000). As a more flexible alternative, any of the calibration measures in Chapter 7 can be used to evaluate a model’s fit to data. To assess algorithmic fairness, the majority of measures discussed in Part II can be used to detect bias in a survival context (Sonabend et al. 2022). Gold-standard interpretability methods such as SHAP and LIME (Molnar 2019) can be extended to survival analysis off the shelf (Langbein et al. 2025), and time-dependent extensions also exist to interpret the impact of variables on the survival probability over time (Krzyziński et al. 2023; Langbein et al. 2025).
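For reference, the information criteria mentioned above take their usual form when applied to a likelihood-based survival model. The notation here is an assumption made for this sketch and is not defined elsewhere in the chapter: $\ell(\hat{\theta})$ is the maximized log-likelihood (or partial log-likelihood for a Cox model), $k$ the number of estimated parameters, $n$ the number of observations, and $d$ the number of uncensored events. The third line is our reading of the censoring adjustment proposed by Volinsky and Raftery (2000), which penalizes by the number of events rather than the sample size.

```latex
\begin{align*}
\mathrm{AIC}     &= -2\,\ell(\hat{\theta}) + 2k, \\
\mathrm{BIC}     &= -2\,\ell(\hat{\theta}) + k \log n, \\
\mathrm{BIC}_{d} &= -2\,\ell(\hat{\theta}) + k \log d.
\end{align*}
```

In each case lower values indicate a preferable model. Because $d \le n$, the event-based penalty is never larger than the standard one, so under heavy censoring it is less likely to exclude covariates with genuine effects.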