Machine Learning in Survival Analysis

Getting Started

by Raphael Sonabend and Andreas Bender

This book is a work in progress, the final work will be published by CRC Press. This electronic version (including pdf download) will always be free and open access (CC BY-NC-SA 4.0).

We will continue to update this version after publication to correct mistakes (big and small), as well as making minor and major additions, if you notice any mistakes please feel free to open an issue.

Licensing

This book is licensed under CC BY-NC-SA 4.0, so you can adapt and redistribute the contents however you like as long as you: i) do cite this book (information below); ii) do not use any material for commercial purposes; and iii) do use a CC BY-NC-SA 4.0 compatible license if you adapt the material.

If you have any questions about licensing just open an issue and we will help you out.

Citation Information

Whilst this book remains a work in progress you can cite it as

Sonabend. R, Bender. A. (2025). Machine Learning in Survival Analysis.
https://www.mlsabook.com.

@book{MLSA2025,
    title = {Machine Learning in Survival Analysis},
    editor = {Raphael Sonabend, Andreas Bender},
    url = {https://www.mlsabook.com},
    year = {2025}
}

Contributing to this book

We welcome contributions to the electronic copy of the book, whether they’re issues picking up on typos, requests for additional content, or pull requests with contributions from any size. All contributions will be acknowledged in the preface of the book.

Before you contribute please make sure you have read our code of conduct.

Biographies

Raphael Sonabend is the CEO and Co-Founder of OSPO Now, a company providing virtual open-source program offices as a service. They are also a Visiting Researcher at Imperial College London. Raphael holds a PhD in statistics, specializing in machine learning applications for survival analysis. They created the R packages mlr3proba, survivalmodels, and the Julia package SurvivalAnalysis.jl. Raphael co-edited and co-authored Applied Machine Learning Using mlr3 in R (Bischl et al. 2024).

Andreas Bender is FIXME.

Preface

“Everything happens to everybody sooner or later if there is time enough” - George Bernard Shaw

“…but in this world nothing can be said to be certain, except death and taxes.” - Benjamin Franklin

A logical consequence of Bernard Shaw’s quote is that if there is time enough, then everybody will have experienced a given event at some point. This is one of the central assumptions to survival analysis (specifically to single-event analysis, but we’ll get to that later). As nothing can be certain (except death and taxes), machine learning can be used to predict the probability people will experience the event and when. This is exactly the problem that this book tackles.

With immortality only being a theoretical concept, there is never ‘time enough’, hence survival analysis assumes that the event of interest is guaranteed to occur within an object’s lifetime. This event could be a patient entering remission after a cancer diagnosis, the lifetime of a lightbulb after manufacturing, the time taken to finish a race, or any other event that is observed over time. Survival analysis differs from other fields of Statistics in that uncertainty is explicit encoded in the survival problem; this uncertainty is known as ‘censoring’. For example, say a model is being built to predict when a marathon runner will finish a race and to learn this information the model is fed data from every marathon over the past five years. Across this period, there will be many runners who never finish their race and are never observed to experience the event of interest (finishing the race). Instead, these runners are said to be ‘censored’, this means that the model uses all information up until the point of censoring (dropping out the race), and learns that they ran for at least as long as their censoring time (they time they dropped out). Censoring is unique to survival analysis and without the presence of censoring, survival analysis is mathematically equivalent to regression.

This book covers survival analysis in the most common right-censoring setting for independent censoring, as well as discussing competing risk frameworks for dependent censoring - these terms will all be covered in the introduction of the book.

A note from Raphael: I wrote my PhD thesis about machine learning applications to survival analysis as I was interested in understanding why more researchers were not using machine learning models for survival analysis. Since then I’ve had the pleasure to work with, and advise, researchers across different sectors, including pharmaceutical companies, governmental agencies, funding organisations, and research institutions. I hope that this book continues to help researchers discover machine learning survival analysis and to navigate the nuances and complexities it presents.

A note from Andreas: FIXME.

Overview to the book

This textbook is intended to fill a gap in the literature by providing a comprehensive introduction to machine learning in the survival setting. If you are interested in machine learning or survival analysis separately then you might consider James et al. (2013), Hastie, Tibshirani, and Friedman (2001), Bishop (2006) for machine learning and Collett (2014), Kalbfleisch and Prentice (1973) for survival analysis. This book serves as a complement to the above examples and introduces common machine learning terminology from simpler settings such as regression and classification, but without diving into the detail found in other sources, instead focusing on extension to the survival analysis setting.

This book may be useful for Masters or PhD students who are specialising in machine learning in survival analysis, machine learning practitioners looking to work in the survival setting, or statisticians who are familiar with survival analysis but less so with machine learning. The book could be read cover-to-cover, but this is not advised. Instead it may be preferable to dip into sections of the book as required and use the ‘signposts’ that direct the reader to sections of the book that are relevant to each other.

The book is split into five parts:

Part I: Survival Analysis and Machine Learning
The book begins by introducing the basics of survival analysis and machine learning and unifying terminology between the two to enable meaningful description of ‘machine learning in survival analysis’ (MLSA). In particular, the survival analysis ‘task’ and survival ‘prediction types’ are defined.

Part II: Evaluation
The second part of the book discusses one of the most important parts of the machine learning workflow, model evaluation. In the simplest case, without evaluation there is no way to know if predictions from a trained machine learning model are any good. Whether one uses a Kaplan-Meier estimator, a complex neural network, or anything in between, there is no guarantee any of these methods will actually make useful predictions for a given dataset. This could be because the dataset is inherently difficult for any model to be trained on, perhaps because it is very ‘noisy’, or because a model is simply ill-suited to the task, for example using a Cox Proportional Hazards model when its key assumptions are violated. Evaluation is therefore crucial to trusting any predictions made from a model.

The measures in Part II are presented in different classes that reflect the prediction types identified in Part I. In-sample measures, which evaluate the quality of a model’s ‘fit’ to data, are not included as this book primarily focuses on external validation of predictive machine learning models. Readers who are interested in this are are directed to Collett (2014) and Hosmer Jr, Lemeshow, and May (2011) for discussion on residuals; Choodari-Oskooei, Royston, and Parmar (2012) and Royston and Sauerbrei (2004) for \(R^2\) type measures; and Volinsky and Raftery (2000), Hurvich and Tsai (1989), and Liang and Zou (2008) for information criterion measures.

In each chapter, the measure class is introduced, particular metrics are listed, and commentary is provided on how and when to use the measures. Recommendations for choosing measures are discussed in 20 FAQs and Outlook.

Part III: Models
Part III is a deep dive into machine learning models for solving survival analysis problems. This begins with ‘classical’ models that may not be considered ‘machine learning’ and then continues by exploring different classes of machine learning models including random forests, support vector machines, gradient boosting machines, neural networks, and other less common classes. Each model class is introduced in the simpler regression setting and then extensions to survival analysis are discussed. Differences between model implementations are not discussed, instead the focus is on understanding how these models are built for survival analysis - in this way readers are well-equipped to independently follow individual papers introducing specific implementations.

Part IV: Reduction Techniques
The next part of the book introduces reduction techniques in survival analysis, which is the process of solving the survival analysis task by using methods from other fields. In particular, chapters focus on demonstrating how any survival model can be used in the competing risks setting, discrete time modelling, Poisson methods, pseudovalues (reduction to regression), and other advanced modelling methods.

Part V: Extensions and Outlook
The final part of the book provides some miscellaneous chapters that may be of use to readers. The first chapter lists common practical problems that occur when running survival analysis experiments and solutions that we have found useful. The next lists open-source software at the time of writing for running machine learning survival analysis experiments. The final chapter is our outlook on survival analysis and where the field may be heading.

Acknowledgments

We would like to gratefully acknowledge our colleagues that reviewed the content of this book, including: Lukas Burk, Cesaire Fouodo.