Machine Learning in Survival Analysis
Getting Started
by Raphael Sonabend and Andreas Bender
This book is a work in progress, the final work will be published by CRC Press in January 2025. This electronic version (including pdf download) will always be free and open access (CC BY-NC-SA 4.0).
We will continue to update this version after publication to correct mistakes (big and small), as well as making minor and major additions, if you notice any mistakes please feel free to open an issue.
Licensing
This book is licensed under CC BY-NC-SA 4.0, so you can adapt and redistribute the contents however you like as long as you: i) do cite this book (information below); ii) do not use any material for commercial purposes; and iii) do use a CC BY-NC-SA 4.0 compatible license if you adapt the material.
If you have any questions about licensing just open an issue and we will help you out.
Citation Information
Whilst this book remains a work in progress you can cite it as
Sonabend. R, Bender. A. (2024). Machine Learning in Survival Analysis.
https://www.mlsabook.com.
@book{MLSA2024
title = Machine Learning in Survival Analysis,
editor = {Raphael Sonabend, Andreas Bender},
url = {https://www.mlsabook.com},
year = {2024}
}
Contributing to this book
We welcome contributions to the electronic copy of the book, whether they’re issues picking up on typos, requests for additional content, or pull requests with contributions from any size. All contributions will be acknowledged in the preface of the book.
Before you contribute please make sure you have read our code of conduct.
Biographies
Raphael Sonabend is the CEO and Co-Founder of OSPO Now, a company providing virtual open-source program offices as a service. They are also a Visiting Researcher at Imperial College London. Raphael holds a PhD in statistics, specializing in machine learning applications for survival analysis. They created the R packages mlr3proba
, survivalmodels
, and the Julia package SurvivalAnalysis.jl
. Raphael co-edited and co-authored Applied Machine Learning Using mlr3 in R (Bischl et al. 2024).
Andreas Bender is FIXME.
Preface
“Everything happens to everybody sooner or later if there is time enough” - George Bernard Shaw
“…but in this world nothing can be said to be certain, except death and taxes.” - Benjamin Franklin
A logical consequence of Bernard Shaw’s quote is that if there is time enough, then everybody will have experienced a given event at some point. This is one of the central assumptions to survival analysis (specifically to single-event analysis, but we’ll get to that later). As nothing can be certain (except death and taxes), machine learning can be used to predict the probability people will experience the event and when. This is exactly the problem that this book tackles.
With immortality only being a theoretical concept, there is never ‘time enough’, hence survival analysis assumes that the event of interest is guaranteed to occur within an object’s lifetime. This event could be a patient entering remission after a cancer diagnosis, the lifetime of a lightbulb after manufacturing, the time taken to finish a race, or any other event that is observed over time. Survival analysis differs from other fields of Statistics in that uncertainty is explicit encoded in the survival problem; this uncertainty is known as ‘censoring’. For example, say a model is being built to predict when a marathon runner will finish a race and to learn this information the model is fed data from every marathon over the past five years. Across this period, there will be many runners who never finish their race and are never observed to experience the event of interest (finishing the race). Instead, these runners are said to be ‘censored’, this means that the model uses all information up until the point of censoring (dropping out the race), and learns that they ran for at least as long as their censoring time (they time they dropped out). Censoring is unique to survival analysis and without the presence of censoring, survival analysis is mathematically equivalent to regression.
This book covers survival analysis in the most common right-censoring setting for independent censoring, as well as discussing competing risk frameworks for dependent censoring - these terms will all be covered in the introduction of the book.
A note from Raphael: I wrote my PhD thesis about machine learning applications to survival analysis as I was interested in understanding why more researchers were not using machine learning models for survival analysis. Since then I’ve had the pleasure to work with, and advise, researchers across different sectors, including pharmaceutical companies, governmental agencies, funding organisations, and research institutions. I hope that this book continues to help researchers discover machine learning survival analysis and to navigate the nuances and complexities it presents.
A note from Andreas: FIXME.
Overview to the book
This textbook is intended to fill a gap in the literature by providing a comprehensive introduction to machine learning in the survival setting. If you are interested in machine learning or survival analysis separately then you might consider James et al. (2013), Hastie, Tibshirani, and Friedman (2001), Bishop (2006) for machine learning and Collett (2014), Kalbfleisch and Prentice (1973) for survival analysis. This book serves as a complement to the above examples and introduces common machine learning terminology from simpler settings such as regression and classification, but without diving into the detail found in other sources, instead focusing on extension to the survival analysis setting.
This book may be useful for Masters or PhD students who are specialising in machine learning in survival analysis, machine learning practitioners looking to work in the survival setting, or statisticians who are familiar with survival analysis but less so with machine learning. The book could be read cover-to-cover, but this is not advised. Instead it may be preferable to dip into sections of the book as required and use the ‘signposts’ that direct the reader to sections of the book that are relevant to each other.
The book is split into five parts:
Part I: Survival Analysis and Machine Learning
The book begins by introducing the basics of survival analysis and machine learning and unifying terminology between the two to enable meaningful description of ‘machine learning in survival analysis’ (MLSA). In particular, the survival analysis ‘task’ and survival ‘prediction types’ are defined.
Part II: Evaluation
The second part of the book discusses measures for evaluating survival models. These are presented in different classes that reflect the prediction types identified in Part I. In each chapter, the measure class is introduced, particular metrics are listed, and commentary is provided on how and when to use the measures. The final chapter of Part II briefly discusses when to use a given measure class and provides recommendations for model comparison.
Part III: Models
Part III is a deep dive into machine learning models for solving survival analysis problems. This begins with ‘classical’ models that may not be considered ‘machine learning’ and then continues by exploring different classes of machine learning models including random forests, support vector machines, gradient boosting machines, neural networks, and other less common classes. Each model class is introduced in the simpler regression setting and then extensions to survival analysis are discussed. Differences between model implementations are not discussed, instead the focus is on understanding how these models are built for survival analysis - in this way readers are well-equipped to independently follow individual papers introducing specific implementations.
Part IV: Reduction Techniques
The next part of the book introduces reduction techniques in survival analysis, which is the process of solving the survival analysis task by using methods from other fields. In particular, chapters focus on demonstrating how any survival model can be used in the competing risks setting, discrete time modelling, Poisson methods, pseudovalues (reduction to regression), and other advanced modelling methods.
Part V: Extensions and Outlook
The final part of the book provides some miscellaneous chapters that may be of use to readers. The first chapter lists common practical problems that occur when running survival analysis experiments and solutions that we have found useful. The next lists open-source software at the time of writing for running machine learning survival analysis experiments. The final chapter is our outlook on survival analysis and where the field may be heading.
Exercises are provided at the end of the book so you can test yourself as you go along.