20  Partition-Based Reductions

Abstract
This chapter introduces partition-based reductions, a family of techniques that transform survival data into a format amenable to standard regression and classification methods by partitioning the follow-up time into discrete intervals. The key idea is that survival data can be restructured into a long-format dataset, with one row per subject per interval at risk, and interval-specific event indicators as targets. Three related reductions are presented. The discrete-time approach treats survival as a sequence of binary classification problems, yielding an estimate of the discrete hazard in each interval. Survival stacking is a special case where interval boundaries are defined by the observed event times, mirroring the construction of the Kaplan-Meier estimator and estimation of the Cox model. The piecewise exponential model instead frames the problem as Poisson regression with an offset, preserving information about the exact event times within intervals and yielding continuous-time hazard estimates. All three approaches support arbitrary machine learning methods as base learners and naturally accommodate time-varying covariates. Practical guidance on the choice of interval boundaries, interval conventions, and time representation is provided.
WarningMinor changes expected!

This page is a work in progress and minor changes will be made over time.

In contrast to the previously introduced reductions that focus on the estimation of a specific quantity of interest at one or few time points, partition-based reductions aim to estimate the entire distribution of the event times. The general idea is to partition the time axis into \(J\) disjoint intervals \([a_0,a_1),\ldots,[a_{J-1},a_J)\) and estimate a discrete or continuous hazard rate for each interval; intervals do not need to be of the same size.

For illustration, consider the partitioning of the follow up in Figure 20.1. Let \(I_j:=[a_{j-1},a_{j})\) be the \(j\)-th interval and let \(J_i\) denote the index of the interval in which the observed time \(t_i\) of subject \(i\) falls, that is \(t_i \in I_{J_i} = [a_{J_i-1},a_{J_i})\).

Partitioning the follow-up time into discrete intervals.
Figure 20.1: Partitioning of the follow-up time into \(J\) discrete intervals \([a_0,a_1), \ldots, [a_{J-1},a_J)\). This partitioning forms the basis for the reduction techniques discussed in this chapter.

Formally, define the partitioning of the follow-up times via the set of interval boundary points \(\{a_{j-1}, a_j\}\) for \(j=1,\ldots,J\) as \[ \mathcal{A} = \{a_0, a_1, \ldots, a_J\}, \quad a_0 < a_1 < \ldots < a_J, \tag{20.1}\]

where \(a_0\) will often be zero and \(a_J\) some time point larger than the last observed event time. This partitioning is used to create a transformed dataset, based on which standard regression or classification methods can be used in order to estimate (discrete) hazards in each interval. Section 20.1 introduces the required data transformation in more detail. Then Section 20.2, Section 20.3 and Section 20.4 define specific reductions based on this data transformation, provide theoretical justification why models applied to such transformed data are valid approaches for estimation in time-to-event analysis and provide illustrative examples for such models. Finally, Section 20.5 discusses computational aspects and modeling choices, including the choice of interval boundaries \(a_j\) and the convention for left-closed vs. left-open intervals.

20.1 Data Transformation

In order to apply the reduction techniques introduced in the upcoming sections, standard time-to-event data needs to be transformed into a specific format. The transformation is based on the partitioning of the follow-up as illustrated in Figure 20.1 and given boundaries as defined in (20.1). Figure 20.2 illustrates the resulting structure for three hypothetical subjects with interval boundaries \(\mathcal{A} = \{0, 1, 1.5, 3\}\): the left-hand side shows the original time-to-event format (one row per subject), the right-hand side shows the transformed dataset \(\mathcal{D}_{\mathcal{A}}\) that the rest of this section formalizes (all notation will also be defined).

Partitioning the follow-up time into discrete intervals.
Figure 20.2: Illustration of the data transformation needed for partition-based reductions. Left: Data in standard time-to-event format. Right: Transformed data using interval boundaries \(\mathcal{A} = \{0, 1, 1.5, 3\}\).

The new dataset contains \(J_i\) rows for each subject \(i\), that is one row for each interval in which subject \(i\) was at risk of experiencing an event. For each subject \(i\), we record their interval-specific event indicator \[ \delta_{ij} = \begin{cases} 1, & t_i \in [a_{j-1}, a_j) \wedge \delta_i = 1 \\ 0, & \text{otherwise} \end{cases},\quad j=1,\ldots,J_i. \tag{20.2}\]

\(\delta_{ij}=0\) for all intervals except the last interval of subject \(i\), where \(\delta_{iJ_i} = \delta_i\). Thus the total number of events is the same in both the original and transformed data, and given the interval boundaries \(\mathcal{A}\), \(\delta_{ij}\) is a deterministic function of the original data.

Further, let \(t_{ij}\) be the time that subject \(i\) was at risk for the event in interval \(j\), \[ t_{ij} = \begin{cases} a_j - a_{j-1}, & \text{ if } t_i > a_j \\ t_i - a_{j-1}, & \text{ if } t_i \in [a_{J_i-1}, a_{J_i})\\ \end{cases},\quad j=1,\ldots,J_i, \tag{20.3}\] or in words, if the event was experience by subject \(i\) in interval \(j\) then \(t_{ij}\) is the time taken within that interval until the event occurred, otherwise, \(t_{ij}\) is simply the interval length. Let \(t_j\) be a scalar representation of interval \(j\), often taken as the interval midpoint \[ t_j = \frac{a_{j-1} + a_j}{2}. \tag{20.4}\]

See Section 20.5 for specific choices of the interval and how they affect modelling.

The transformation is completed by recording the subject’s (potentially time-dependent) features \(\mathbf{x}_{ij}\) with \(\mathbf{x}_{ij} = \mathbf{x}_{i}\) for all \(j\) if features are constant over time.

Thus, given the partition \(\mathcal{A}\), the standard time-to-event data \(\mathcal{D}= \{(t_i, \delta_i, \mathbf{x}_i)\}_{i=1,\ldots,n}\) is transformed to the dataset (Figure 20.2), \[ \mathcal{D}_{\mathcal{A}} = \{(i, j, \delta_{ij}, t_{ij}, t_j, \mathbf{x}_{ij})\}_{i=1,\ldots,n;\ j=1,\ldots,J_i}, \tag{20.5}\] where \(i\) and \(j\) indices are indices not used in modeling and \(t_{ij}\) is only required for the piecewise constant hazards approach (Section 20.4). \(t_j\) is the only feature that is shared across all subjects within an interval and is used to estimate the baseline hazard, in many software implementations it is included as another column of \(\mathbf{x}_{ij}\), though it is kept separate in this chapter to emphasize its role.

20.2 Discrete-Time Survival Analysis

Recall from Section 3.1.2 that \(\tilde{Y}\) denotes discretized time, such that \(\tilde{Y}=j \Leftrightarrow Y \in [a_{j-1}, a_j)\). Now consider the partitioning of the follow-up (20.1) and assume we are only interested in whether the event occurred within an interval rather than the exact event time. The likelihood contribution of the \(i\)th observation is given by \(P(Y_i \in [a_{J_i-1}, a_{J_i})) = P(\tilde{Y}_i = J_i)\) for subjects who experienced an event (\(\delta_i = 1\)) and \(P(Y_i \geq a_{J_i}) = P(\tilde{Y}_i > J_i)\) for censored observations (\(\delta_i = 0\)).

Recall from Section 3.1.2 the discrete-time survival and event-probability identities, \[ S_{\tilde{Y}}(j) = \prod_{k \leq j}(1 - h_{\tilde{Y}}(k)) \qquad\text{and}\qquad P(\tilde{Y} = j) = S_{\tilde{Y}}(j-1)\, h_{\tilde{Y}}(j). \] Applying these to the likelihood contribution of subject \(i\), and using the interval-specific event indicators \(\delta_{ij}\) from (20.2) (in particular \(\delta_{iJ_i} = \delta_i\) and \(\delta_{ij}=0\) for \(j<J_i\)), gives \[\begin{aligned} L_i &= P(Y_i \in [a_{J_i-1}, a_{J_i}))^{\delta_i}\, P(Y_i \geq a_{J_i})^{1-\delta_i}\\ &= \left[S_{\tilde{Y}}(J_i-1)\, h_{\tilde{Y}}(J_i)\right]^{\delta_i} \left[S_{\tilde{Y}}(J_i)\right]^{1-\delta_i}\\ &= \left[\prod_{j=1}^{J_i-1} (1-h_{\tilde{Y}}(j)) \cdot h_{\tilde{Y}}(J_i)\right]^{\delta_i} \left[\prod_{j=1}^{J_i} (1-h_{\tilde{Y}}(j))\right]^{1-\delta_i}\\ &= \prod_{j=1}^{J_i} (1-h_{\tilde{Y}}(j))^{1-\delta_{ij}}\, h_{\tilde{Y}}(j)^{\delta_{ij}}. \end{aligned} \tag{20.6}\]

For a random variable \(Z \sim \text{Bernoulli}(\pi)\) with \(\pi = P(Z = 1)\), the likelihood contribution is \(P(Z = z) = \pi^z (1-\pi)^{1-z}\), which is exactly the form of each factor in (20.6) under the identification \(\pi = h_{\tilde{Y}}(j)\). We therefore recognize that the interval-specific event indicators \(\delta_{ij}\) can be modelled as independent Bernoulli random variables, \[ \Delta_{ij} \stackrel{iid}{\sim} \text{Bernoulli}(\pi_j), \qquad j = 1, \ldots, J_i, \] where the success probability \(\pi_j\) is the discrete-time hazard, \[ \pi_j = P(\Delta_{ij} = 1 \mid t_j) = P(Y_i \in [a_{j-1}, a_j) \mid Y_i \geq a_{j-1}) = h_{\tilde{Y}}(j). \tag{20.7}\]

This implies that we can estimate the discrete-time hazards \(h_{\tilde{Y}}(j)\) for each interval \(j\) by fitting any binary classification model to the transformed dataset

\[ \mathcal{D}_{\mathcal{A}} = \{(\delta_{ij}, t_{j})\}_{i=1,\ldots,n;\quad j=1,\ldots,J_i}, \tag{20.8}\]

where \(\delta_{ij}\) are the targets and \(t_j\) is a covariate used to estimate different hazards/probabilities in different intervals.

In the presence of features \(\mathbf{x}_{i}\), it follows

\[ \Delta_{ij}| \mathbf{x}_{i}, t_j \stackrel{iid}{\sim} \text{Bernoulli}(\pi_{ij}), \]

where \(\pi_{ij} = P(\Delta_{ij} = 1| \mathbf{x}_{i}, t_j) = h_{\tilde{Y}}(j|\mathbf{x}_{i})\) is the discrete hazard rate for interval \(j\) given features \(\mathbf{x}_{i}\) and is estimated by applying binary classifiers to

\[ \mathcal{D}_{\mathcal{A}} = \{(\delta_{ij}, \mathbf{x}_{i}, t_{j})\}_{i=1,\ldots,n;\quad j=1,\ldots,J_i}. \]

To summarize, when fitting a classification model with \(\delta_{ij}\) as the binary target and \((\mathbf{x}_i, t_j)\) as covariates, the predicted class probabilities \(\hat{\pi}_{ij} = \hat{P}(\Delta_{ij}=1 \mid \mathbf{x}_i, t_j)\) are estimates of the conditional discrete-time hazard \(\hat{h}_{\tilde{Y}}(j \mid \mathbf{x}_{i})\).

20.2.1 Example: Logistic Regression

To illustrate the discrete-time reduction approach, consider the tumor dataset introduced in Table 3.1. The follow-up time is partitioned into \(J = 100\) equidistant intervals and the data are transformed according to the procedure in Section 20.1. Define \(x_{i1} = \mathbb{I}(\text{complications}_i = \text{``yes''})\) and fit the logistic regression model , \[\log\frac{\pi_{ij}}{1-\pi_{ij}} = \beta_{0j} + \beta_1 x_{i1}, \tag{20.9}\] to the transformed dataset, where \(j\) denotes the interval index, \(\beta_{0j}\) are the interval-specific intercepts and \(\beta_1\) is the common effect of complications on the discrete hazard rate (across all intervals).

Predicted discrete hazards are obtained by applying the sigmoid to the linear predictor: \[ \hat{h}_{\tilde{Y}}(j \mid \mathbf{x}_i) = \hat{\pi}_{ij} = \sigma(\hat{\beta}_{0j} + \hat{\beta}_1 x_{i1}), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \] where the equality \(\pi_{ij} = h_{\tilde{Y}}(j \mid \mathbf{x}_i)\) was established in (20.7). The estimated survival probabilities can then be obtained by calculating

\[ \hat{S}_{\tilde{Y}}(j|x_{1}) = \prod_{k=1}^j (1-\hat{h}_{\tilde{Y}}(k|x_{1})), \]

for each interval \(j\) and complication group.

Figure 20.3 shows the estimated survival probabilities from the discrete-time model together with the Kaplan-Meier estimates for comparison, with the model encoded by line color and complications status by line type. The model specified in (20.9) is a proportional odds model (Section 11.4), where the baseline hazard is the same for both complication groups, thus the shape of the hazard and therefore the survival probability curve is the same for both complication groups, shifted by the common effect of complications. The figure indicates that (20.9) does not describe the data too well as in reality the hazards are different in the two groups.

Single plot showing survival curves: Kaplan-Meier and discrete-time model estimates differentiated by line color, complications status (yes/no) differentiated by line type.
Figure 20.3: Comparison of survival probability estimates using the discrete-time model (logistic regression) and the Kaplan-Meier estimator, for patients with and without complications.

To estimate separate discrete baseline hazards for each group, replace \(\beta_{j}\) for \(\beta_{1j}\) in (20.9), which effectively introduces an interaction term between the interval and complications variables, \[\log\frac{\pi_{ij}}{1-\pi_{ij}} = \beta_{0j} + \beta_{1j} x_{i1}, \tag{20.10}\] where \(\beta_{0j}\) are the interval-specific intercepts for the reference group (no complications), \(\beta_{1j}\) are the deviations from the reference group for the complications group (technically this is fit as an interaction model using reference coded categorical features for interval and complications). This specification allows the shape of the hazard and therefore the survival probability curve to be different for the two groups.

Figure 20.4 shows the estimated survival probabilities from the interaction model together with the Kaplan-Meier estimates for comparison, with the model encoded by line color and complications status by line type. The interaction model provides a much better fit to the data compared to the proportional odds model, as it allows for separate baseline hazards for each complications group. The difference to the Kaplan-Meier estimates is barely detectable.

Single plot showing survival curves: Kaplan-Meier and discrete-time interaction model estimates differentiated by line color, complications status (yes/no) differentiated by line type.
Figure 20.4: Comparison of survival probability estimates using the discrete-time model with interaction (logistic regression) and the Kaplan-Meier estimator, for patients with and without complications.

While this is a simple example with only one feature, it shows that the discrete time reduction approach provides a good approximation of the event time distribution when the number of intervals is large enough, despite the fact that we ignore the information about the exact event time. Importantly, the estimated survival probabilities are discrete, representing survival probabilities at the interval endpoints. These discrete estimations can be joined to a continuous survival function using a variety of approaches, most simply linear interpolation between the interval endpoints.

The example also raises the question about the choice of the number of intervals \(J\) and the placement of interval boundaries \(\mathcal{A}\). These questions will be discussed in more detail in Section 20.5, which is relevant for both the discrete-time reduction approach and the piecewise constant hazards approach discussed in Section 20.4.

20.3 Survival Stacking

Survival stacking (Craig et al. 2025) casts survival tasks to classification tasks similarly to the discrete-time method described in Section 20.2. It can be viewed as a special case of the discrete time approach, where the interval boundaries are defined by the unique, observed, ordered event times, such that \[ \mathcal{A} = \{t_{(1)}, t_{(2)}, \ldots, t_{(m+1)}\}, \quad t_{(1)} < t_{(2)} < \ldots < t_{(m+1)}, \tag{20.11}\] resulting in intervals \([t_{(1)}, t_{(2)}), \ldots, [t_{(m-1)}, t_{(m)}), [t_{(m)}, t_{(m+1)})\), where \(t_{(m+1)}\) is some time point larger than the last event time. Equivalently, survival stacking can be motivated analogous to the construction of the Kaplan-Meier estimator (Section 3.5.2.1): at each observed event time \(t_{(j)}\), we consider all subjects in the risk set \(\mathcal{R}_{t_{(j)}}\) and record whether or not they experience an event at this time.

For illustration, consider once again the example from Figure 20.2. The adapted data-transformation (using \(t_{(m+1)} = 3\)) for survival stacking is shown in Figure 20.5 (dropping columns that are not meaningful here). At the first event time \(t_{(1)} = 0.5\), all subjects are still at risk for the event. At time \(t_{(2)} = 2.7\), however, only subject \(3\) is still at risk for the event.

Partitioning the follow-up time into discrete intervals.
Figure 20.5: Illustration of the data transformation needed for survival stacking. Left: Data in standard time-to-event format. Right: Transformed data using unique, observed event times \(t_{(1)} < \ldots < t_{(m)}\).

Note that in contrast to Figure 20.2, here subject \(1\) has only one row and subject \(3\) has two rather than three rows. This could suggest that the data transformation for survival stacking creates smaller datasets, however, we need to create a row for each subject for each observed event time \(t_{(j)}, j=1,\ldots,m\) at which the subject is at risk, whereas for the discrete-time approach in Section 20.2 one can freely choose the number and placement of interval boundaries \(\mathcal{A}\), such that the discrete-time approach will have fewer rows if \(J < m\).

Survival stacking yields data \[\mathcal{D}_{\mathcal{A}} = \{(\delta_{ij}, t_{j}, \mathbf{x}_{ij})\}_{i=1,\ldots,n;\ j=1,\ldots,m+1:\ i \in \mathcal{R}_{t_{(j)}}}, \tag{20.12}\] where the \(\delta_{ij}\) can be used as targets for binary classification, \(t_{(j)}\) serves as the time representation feature (here the interval startpoint \(t_j = t_{(j)}\)), and as before, any algorithm that returns class probabilities can be used for estimation.

20.4 Piecewise Constant Hazards

The piecewise constant hazards approach, also known as the piecewise exponential model (PEM), assumes the continuous-time hazard function is a step function: constant within each interval but allowed to vary between intervals. In contrast to the discrete-time approach (Section 20.2), which only models whether an event occurs within each interval, PEM uses the exact within-interval event time and yields a continuous-time hazard estimate. Intuitively, any underlying continuous hazard function can be approximated arbitrarily well given enough intervals (and events). The name comes from the equivalent assumption that within each interval, event times are exponentially distributed, which implies constant hazards within the interval.

Consider the partitioning of the follow-up as defined in (20.1) and assume that within each interval \(I_j = [a_{j-1}, a_j)\), the hazard function is constant, \[ h(\tau) = h_j,\ \forall\ \tau \in I_j. \tag{20.13}\] Note that left-closed, right-open intervals are used here for consistency with the discrete-time approach in Section 20.2. Usually in the PEM literature, intervals are defined as left-open, right-closed (see also Section 20.5).

First we derive the likelihood contribution of subject \(i\) under the piecewise-exponential assumption (20.13). Recall from Section 3.1.1 the right-censored likelihood and survival-hazard identity, \[ \mathcal{L}_i = h(t_i)^{\delta_i}\, S(t_i), \qquad S(t) = \exp\!\left(-\int_0^t h(u)\,\,\mathrm{d}u\right). \] Combining these, \[\begin{aligned} \mathcal{L}_{i} & = h(t_i)^{\delta_{i}}S(t_i)\\ & = h(t_i)^{\delta_{i}}\exp\left(-\int_{0}^{t_i} h(u) \ \,\mathrm{d}u\right)\\ & = h_{J_i}^{\delta_{i}}\exp\left(-\left[\sum_{j=1}^{J_i-1} (a_j - a_{j-1})h_j + (t_i - a_{J_i-1})h_{J_i}\right]\right)\\ & = h_{J_i}^{\delta_{i}}\exp\left(-\sum_{j=1}^{J_i} h_j t_{ij}\right) = h_{J_i}^{\delta_{i}}\prod_{j=1}^{J_i}\exp\left(- h_j t_{ij}\right)\\ & = \prod_{j=1}^{J_i} h_j^{\delta_{ij}} \exp(-h_j t_{ij}), \end{aligned} \tag{20.14}\] where the third and fourth equalities follow from the assumption the hazard is a step function (20.13) and the definition of time at risk (20.3), and the last equality uses the property \(\delta_{iJ_i} = \delta_i\) with \(\delta_{ij}=0\) for \(j<J_i\) from (20.2).

Now assume that the interval-specific event indicators \(\delta_{ij}\) are realizations of random variables \(\Delta_{ij} \stackrel{iid}\sim \text{Poisson}(\mu_{ij} : = h_j t_{ij})\) and recall that \(Z\sim \text{Poisson}(\mu)\) implies \(P(Z = z) = (\mu^z \exp(-\mu))/z!\). Thus, the likelihood contribution of subject \(i\) can be written as \[\begin{aligned} \mathcal{L}_{\text{Poisson},i} & = \prod_{j=1}^{J_i} P(\Delta_{ij} = \delta_{ij}) \\ &= \prod_{j=1}^{J_i} \frac{(h_j t_{ij})^{\delta_{ij}} \exp(-h_j t_{ij})}{\delta_{ij}!}\\ & = \prod_{j=1}^{J_i} h_j^{\delta_{ij}} t_{ij}^{\delta_{ij}}\exp(-h_j t_{ij}), \end{aligned} \tag{20.15}\] where the last equality follows from \(\delta_{ij} \in \{0,1\}\) and \(0!=1!=1\).

Note that \(\mathcal{L}_{\text{Poisson}, i} \propto \mathcal{L}_i\) from equation (20.14), since \(t_{ij}\) is a constant and does not depend on the parameters of interest (here \(h_j\)). This implies that we can estimate a model with piecewise constant hazards by optimizing the Poisson likelihood (20.15). The \(t_{ij}\) term enters as an offset \(\log(t_{ij})\) in the Poisson regression model.

More generally, in the presence of features \(\mathbf{x}_i\), assume \(\delta_{ij}|\mathbf{x}_i \stackrel{iid}{\sim} \text{Poisson}(\mu_{ij})\), where \(\mu_{ij} = h_{ij} t_{ij}\) is the expected number of events in interval \(j\) given features \(\mathbf{x}_i\) and \(h_{ij} = g(t_j, \mathbf{x}_i)\) is the hazard rate in interval \(j\) given features \(\mathbf{x}_i\). This can be estimated by fitting a Poisson regression model with log-link to the transformed dataset \[\mathcal{D}_{\mathcal{A}} = \{(\delta_{ij}, t_{ij}, t_j, \mathbf{x}_{ij})\}_{i=1,\ldots,n; j=1,\ldots,J_i}, \tag{20.16}\] where the model specification is \[\log(E[\Delta_{ij}]) = \log(h_{ij}) + \log(t_{ij}), \tag{20.17}\] where \(\log(h_{ij}) = g(\mathbf{x}_i, t_j)\) is the interval and feature specific hazard rate, \(g\) is a function learned by the model of choice and \(\log(t_{ij})\) is included as an offset term in the Poisson likelihood/Loss function.

Importantly, in contrast to the discrete-time reductions in Section 20.2, the piecewise exponential model likelihood has no information loss regarding the exact time-to-event by including the \(t_{ij}\) terms (20.3) and estimates continuous-time hazards rather than discrete-time hazards.

The fitted model returns interval-specific hazard estimates \(\hat h_j(\mathbf{x})\). Survival probabilities follow from the cumulative hazard via \(\hat S(\tau \mid \mathbf{x}) = \exp\left(-\int_0^\tau \hat h(u \mid \mathbf{x})\,\,\mathrm{d}u\right)\). Because the hazard is piecewise constant, the integral reduces to a sum of areas: \[ \hat S(\tau \mid \mathbf{x}) = \exp\!\left(-\left[\sum_{j: a_j \leq \tau} \hat h_j(\mathbf{x})\,(a_j - a_{j-1}) \;+\; \hat h_{J_\tau}(\mathbf{x})\,(\tau - a_{J_\tau-1})\right]\right), \tag{20.18}\] where \(J_\tau\) is the index of the interval containing \(\tau\). The first sum accumulates the hazard over completed intervals before \(\tau\), and the trailing term adds the partial-interval contribution from \(a_{J_\tau-1}\) up to \(\tau\).

Why PEM?

One benefit of PEM is that it preserves the exact within-interval event time through the offset \(\log(t_{ij})\), so no information is lost by discretizing; the fitted model returns continuous-time hazards, from which survival probabilities at any \(\tau\) are obtained without interpolation. Another advantage is PEM combines naturally with smooth functions of time (for example, penalized splines or tree-based smoothers; see Section 20.5), which makes the approach scalable to very fine partitions without the proliferation of interval-specific parameters seen in the discrete-time interaction model. However, in contrast to other methods, PEM requires a Poisson regression with offset, which fewer software implementations support directly compared to vanilla binary classification. Another drawback is that fitted hazards are continuous rates rather than per-interval probabilities, which some practitioners may find less intuitive than modelling \(h_{\tilde{Y}}(j)\).

20.4.1 Example: Poisson Regression

To illustrate the piecewise constant hazards approach, we once again consider the tumor dataset. We partition the follow-up time into \(J = 100\) equidistant intervals. We then fit a Poisson regression model with interaction between interval and complications to allow for separate baseline hazards for each complications group:

\[ \log(\mu_{ij}) = \log(E[\Delta_{ij}]) = \underbrace{\beta_{0j} + \beta_{1j}x_{i1}}_{\log(h_{ij})} + \log(t_{ij}), \tag{20.19}\] where \(x_{i1} = \mathbb{I}(\text{complications}_i = \text{``yes''})\), \(\beta_{0j}\) are the interval-specific intercepts for the reference group (no complications), \(\beta_{1j}\) are the deviations from the reference group for the complications group (as in (20.10), this is estimated as an interaction model using reference coded categorical features for interval and complications), and \(\log(t_{ij})\) enters as a known offset (fixed by the data, not estimated). Concretely, fitting the model on \(\mathcal{D}_{\mathcal{A}}\) uses \(\delta_{ij}\) as the response, \(\mathbf{x}_{ij}\) and \(t_j\) as features. Estimation can be performed by any method that can handle the Poisson likelihood (with offset) as objective function.

Figure 20.6 shows the estimated survival probabilities from the piecewise exponential model together with the Kaplan-Meier estimates for comparison, with the model encoded by line color and complications status by line type. The piecewise exponential model provides an excellent fit to the data, closely approximating the Kaplan-Meier estimates. This demonstrates that the piecewise constant hazards approach can effectively estimate the survival distribution when using a sufficient number of intervals.

Single plot showing survival curves: Kaplan-Meier and piecewise exponential model estimates differentiated by line color, complications status (yes/no) differentiated by line type.
Figure 20.6: Comparison of survival probability estimates using the piecewise exponential model (Poisson regression) and the Kaplan-Meier estimator, for patients with and without complications.

20.5 Choice of Interval Boundaries

The reduction approaches introduced in the previous sections all depend on the choice of the partition \(\mathcal{A}\) (20.1). This section discusses the practical considerations involved in selecting interval boundaries, including the number and placement of cut points, the conventions and considerations with respect to left-closed vs. left-open intervals, the representation of time within intervals, and implications for left-truncated data.

20.5.1 Number and placement of cut points

The choice of the number of intervals \(J\) and the placement of the interval boundaries \(a_j\) involves a trade-off between flexibility, robustness, and computational cost. More intervals allow for a finer approximation of the underlying hazard function, but at the cost of less robust estimates (fewer events per interval) and a larger transformed dataset. Fewer intervals yield more stable estimates but may miss important variation in the hazard.

The simplest strategy is to use equidistant intervals, which does not require any knowledge about the data but also does not adapt to it. A more data-adaptive strategy is to use quantile-based intervals, which place more cut points in regions with higher event density and fewer in regions with sparse events.

A natural alternative is to use the unique, observed event times as cut points, as done in survival stacking (Section 20.3). This mirrors the construction of the Kaplan-Meier estimator and ensures that interval boundaries are dense where events are frequent and sparse where they are rare. The same strategy is the default in software packages for PEMs like pammtools (Bender and Scheipl 2018). However, using all unique event times can lead to very large transformed datasets: since each of \(n\) subjects can contribute up to \(m\) rows (where \(m\) is the number of unique event times), the augmented dataset has \(\tilde{n} = \sum_{i=1}^n J_i\) rows. Bender et al. (2021) showed that using a random subsample of event times as cut points yields comparable predictive performance while keeping the increase in dataset size linear in the number of events, making this a practical strategy when \(m\) is large.

20.5.2 Left-closed vs. left-open intervals

Throughout this chapter, we have used left-closed, right-open intervals \([a_{j-1}, a_j)\), consistent with the discrete-time definitions in Section 3.1.2. This is the natural convention for discrete-time/risk-set-based approaches (Kaplan-Meier, Nelson-Aalen, survival stacking), because all observations before the event time are excluded from the risk set in the interval \([t_{(k)}, t_{(k+1)})\), leading to the equivalence of the discrete-time hazard and risk-set-based estimation (Tutz and Schmid 2016).

In the PEM literature, however, left-open, right-closed intervals \((a_{j-1}, a_j]\) are more commonly used with the event times as cut points with \(t_j = a_j\) (the right endpoint) (Bender and Scheipl 2018).

In most practical situations, the choice of convention has negligible impact on the results: when intervals are small, both conventions produce nearly identical estimates. The distinction matters primarily when interval boundaries coincide with event times, as is the case in survival stacking or when event times are used as cut points for PEM. In this situation, with left-closed intervals \([a_{j-1}, a_j)\) and the first cut point equal to the first event time, the interval \([0, t_{(1)})\) contains no events. Hazard estimates in such event-free intervals are unstable or undefined, and these intervals should be excluded from modeling (the hazard is zero and survival probability is one by definition in this interval). With left-open, right-closed intervals, the first event time \(t_{(1)}\) is included in the first interval \((0, t_{(1)}]\), avoiding this issue. Additionally, the definition of the last interval is somewhat awkward when using left-open intervals since there is no natural endpoint; for example \([t_{(m)}, t_{(m)}+1)\) and \([t_{(m)}, \infty)\) will yield the same number of zeros and ones in the last interval and thus the same hazard rate estimate.

20.5.3 Representation of Time Within Intervals

The time representation feature \(t_j\) serves as the “time coordinate” for interval \(j\) in the model. As introduced in Section 20.1, one common choice is the interval midpoint \(t_j = (a_{j-1} + a_j)/2\), but other choices include the interval start point \(t_j = a_{j-1}\) or the endpoint \(t_j = a_j\).

The choice of \(t_j\) interacts with the interval convention. With left-open, right-closed intervals \((a_{j-1}, a_j]\), the right endpoint \(a_j\) is a natural representative because it is included in the interval. With left-closed, right-open intervals \([a_{j-1}, a_j)\), the left endpoint \(a_{j-1}\) is more natural, particularly when a smooth function of \(t_j\) is estimated: the right endpoint \(a_J\) of the last interval is not contained in that interval and the midpoint is arbitrary because there is no natural endpoint for the last interval. For survival stacking (Section 20.3), the interval start point \(t_j = t_{(j)}\) is used, since the event time itself defines the interval.

A related modeling choice is whether \(t_j\) enters the model as a categorical (factor) or continuous variable:

  • When \(t_j\) is treated as a factor variable, the specific choice of \(t_j\) (midpoint, endpoint, etc.) doesn’t matter at all. Each interval receives its own parameter (e.g., an interval-specific intercept \(\beta_{0j}\)), as in the examples in Section 20.2 and Section 20.4. This is the classical PEM approach and provides maximal flexibility, but assumes that the hazard in each interval is estimated independently of neighboring intervals. Without penalization, this can lead to implausible hazard estimates, especially when some intervals contain few events. The number of baseline hazard parameters equals \(J\), which becomes prohibitive when \(J\) is large (e.g., when using all unique event times as cut points).
  • When \(t_j\) enters as a continuous variable, the specific choice of \(t_j\) is also not too important but for the last interval and depending on the interval convention. For modeling, a functional relationship between time and the (log-)hazard is assumed. A linear relationship \(\log(h_j) = \beta_0 + \beta_1 t_j\) implies that the log-hazard changes at a constant rate over time, which is overly restrictive for most applications. Instead, the function should be estimated as a flexible non-linear function \(\log(h_j) = f(t_j)\), for example using penalized splines (Bender, Scheipl, et al. 2018; Kopper et al. 2022) or tree-based methods (Bender et al. 2021). Here, the number of parameters for the baseline hazard depends only on the number of basis functions (not the number of intervals), making this approach scalable to fine partitions. Penalization also ensures that hazard estimates in neighboring intervals are similar unless the data provides strong evidence for rapid changes.

In practice, the continuous approach with some form of penalization or regularized breakpoint detection (as in gradient boosted trees) combines the flexibility of a fine partition with the stability of regularization and is recommended when the number of intervals is moderate to large.

20.5.4 Left-truncation

When data are subject to left-truncation (delayed entry; Section 3.4.1), the choice of interval boundaries has additional implications.

The simplest case is when each subject’s left-truncation time \(t_i^L\) coincides with an interval boundary. Then \(\delta_{ij}\) is defined only for intervals \(j\) with \(a_{j-1} \geq t_i^L\), and left-truncation is incorporated by simply excluding the earlier rows from the transformed dataset \(\mathcal{D}_{\mathcal{A}}\).

If \(t_i^L\) does not coincide with an interval boundary, two options exist:

  • Include only intervals for which subject \(i\) has been at risk from the start, i.e. \(a_{j-1} \geq t_i^L\). This is the most consistent with the risk-set-based approach.
  • Use the left-truncation times as additional cut points so no subject enters mid-interval. This guarantees correctness but enlarges the dataset and can introduce empty intervals; it is therefore only practical with a regularised baseline-hazard estimator.

20.6 Conclusion

WarningKey takeaways
  • Partition-based reductions transform survival data into a long-format dataset with one row per subject per interval at risk, enabling standard regression and classification methods to be applied to time-to-event data.
  • The discrete-time reduction and survival stacking estimate interval-specific discrete hazards via binary classification; the piecewise exponential model estimates continuous-time hazards via Poisson regression with an offset.
  • Any regression or classification method — including gradient boosting, neural networks, and random forests — can serve as the base learner, requiring no modification to the learning algorithm itself.
  • The choice of interval boundaries \(\mathcal{A}\) involves a trade-off between flexibility and computational cost; using unique event times as cut points is data-adaptive but can produce large transformed datasets. A subsample of event times is often sufficient.
  • Modeling the baseline hazard as a smooth function of time (e.g., via penalized splines) is more scalable than treating each interval independently and is recommended when the number of intervals is large.
ImportantLimitations
  • The transformed dataset can be much larger than the original, with up to \(O(nm)\) rows where \(m\) is the number of unique event times, which increases memory and computation requirements.
  • The discrete-time reduction discards information about the exact event time within each interval; prediction quality improves as intervals become finer, but at the cost of a larger dataset.
  • The reductions as presented apply to right-censored data and can handle left-truncation. Left-censored and interval-censored data cannot be handled directly, as the data transformation and likelihood derivations rely on the right-censoring structure. Additional adaptations are required for these settings, which will often preclude the use of off-shelf implementations.
  • Left-truncation requires careful handling during the data transformation step; naive exclusion of truncated intervals can introduce bias.
  • Treating time as a categorical variable (interval-specific intercepts) leads to a large number of baseline parameters that become impractical for fine partitions without penalization.
TipFurther reading
  • Friedman (1982) introduced the piecewise exponential model and established its consistency properties.
  • Tutz and Schmid (2016) provide a comprehensive treatment of discrete-time survival analysis, covering model specification, estimation, and interpretation in depth.
  • Argyropoulos and Unruh (2015) and Bender, Groll, et al. (2018) show how penalized additive models (PAMs) can be used as the base learner in the PEM framework, with the pammtools package (Bender and Scheipl 2018) providing a convenient implementation.
  • Bender et al. (2021) define a general PEM-based machine learning framework and study the effect of the number and placement of cut points on predictive performance.
  • Craig et al. (2025) provide a review of survival stacking and its connections to pooled logistic regression and the Cox model.