The Role of the Time Machine in Adaptive Platform Trials
Platform trials enroll multiple experimental arms simultaneously, typically randomized against a common control, with arms entering and leaving the study over the course of months to years. This evolution of the platform over time creates a rich history of past data, both on past experimental arms and on a continuing control. Can we use this past data to make better decisions in the present? If an active arm enrolled in the years 2022-2024, and the trial has been running since 2010, do we make better decisions by looking only at 2022-2024, or at the totality of the evidence going back to 2010?
Though it may not seem comparable at first, this relates to a time-honored question in sports. How do we compare players from different eras, for example Hank Aaron and Babe Ruth? Taking raw numbers (home runs, batting averages, etc.) may be quite misleading, given differences between the eras in which they played. However, while the two never played at the same time, many players overlapped with Hank Aaron earlier in his career. Those players overlapped with still earlier players, and so on, until we reach players who overlapped with Babe Ruth. While we never saw player A compared to B directly, we did see A and C together, then C and D together, then D and B together, and so on. This applies to thousands of players going back in time.
In the 1990s, Scott Berry and colleagues (https://www.tandfonline.com/doi/abs/10.1080/01621459.1999.10474163) proposed a model to formalize these comparisons in sports (hockey, baseball, and golf). This model specified era effects going back in time and used the large number of overlapping players to estimate these era effects and compare players. Essentially, the model gives an estimate of how Hank Aaron would have performed in 1920 (or Babe Ruth in 1970): the player can be moved through time via “the time machine”.
Moving forward to the 2010s, platform trials create a seemingly different but mathematically similar structure. Therapies move in and out of trials like players move in and out of major sports, and we may use their overlapping times to increase the precision of our estimates, under certain assumptions.
Why use a time machine?
The goal behind any use of past data is to increase precision. In a platform trial, patients who were concurrently randomized to control and treatment are clearly relevant for estimating the treatment effect. However, we typically also have control data extending back in time, prior to the activation of the current arm of interest, as well as data on other arms. This information may yield more precise estimates of the control arm, and better decisions (though it carries a risk of bias, discussed below). In many ways the platform situation is easier than the sports example above. Treatments may have different effectiveness over time, but we usually expect those differences to be smaller than the aging process of a major sports player. More importantly, the control is often present throughout the course of the study (imagine a single player staying in baseball throughout the history of the league).
The time machine and related methods use a model to incorporate this past information, increasing the precision of the treatment effect subject to a key assumption – that treatment effects between arms are constant over time. When this assumption is satisfied, substantial gains in precision can be obtained. This allows for quicker and/or better decisions. When the assumption is not satisfied (the assumption may be checked by testing for a time by treatment interaction), biases can occur that negatively impact decision making, in the form of either inflated type 1 error or reduced power.
What does this look like in practice?
When analyzing a platform trial, we divide time into “bins”. Often these bins are defined by calendar time, for example every 3-month period might be a bin. Statistically, a new bin should begin whenever the randomization changes, for example when an arm enters or exits the trial. Thus, if interims occur every 3 months and arms may be stopped or added around those interims, then this is a natural choice for the bins.
We then fit a model of the form:
Outcome = Intercept + Arm effects + Time effects + error
with the details depending on the outcome type and potentially including other covariates as desired. Thus, each time bin has its own effect, indicating how outcomes in that bin might be higher or lower than in other bins. These effects might reflect changes in the underlying disease, for example shifts in the patient population over time.
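As a minimal sketch of what this looks like for a continuous outcome (the data are simulated; the arm labels, bin counts, and effect sizes below are invented for illustration), the model with fully separate time-bin effects can be fit as an ordinary linear model with dummy-coded arm and bin terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical platform: control plus arms A and B, all enrolling across
# four time bins, 20 patients per arm per bin (labels and sizes invented).
arms = ["ctrl", "A", "B"]
bins = [0, 1, 2, 3]
records = [(a, t) for t in bins for a in arms for _ in range(20)]

def design_row(arm, t):
    # Dummy coding: intercept, arm effects (vs control), bin effects (vs bin 0)
    return ([1.0]
            + [1.0 if arm == a else 0.0 for a in arms[1:]]
            + [1.0 if t == b else 0.0 for b in bins[1:]])

X = np.array([design_row(a, t) for a, t in records])

# Simulate outcomes: true arm effects plus a drifting time trend
true_arm = {"ctrl": 0.0, "A": 0.5, "B": 0.3}
true_bin = {0: 0.0, 1: 0.2, 2: 0.4, 3: 0.3}
y = np.array([true_arm[a] + true_bin[t] for a, t in records])
y = y + rng.normal(0.0, 1.0, size=len(y))

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated effect of arm A vs control:", round(beta[1], 3))
print("estimated effect of arm B vs control:", round(beta[2], 3))
```

Because each bin carries its own effect, the arm estimates are recovered even though outcomes drift over time.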
We might fit this model with fully separate effects in each time bin (called the “time categorical” model in some papers). The model usually called “the time machine” smooths the time effects via a normal dynamic linear model, a model akin to a spline in terms of smoothing, reflecting an assumption that changes over time should occur gradually. The choice of separate versus smoothed estimates is a key user choice. In many situations, there may be little difference in actual performance (the time categorical model is easier to communicate). However, if many bins are used, the time machine smoothing may result in improved performance, like using a dose response model when multiple doses are considered in a dose finding trial.
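The normal dynamic linear model itself is beyond a short sketch, but its smoothing behavior can be roughly mimicked (this is my analogy, not the published model) by penalizing first differences between adjacent bin effects, a random-walk-style penalty implemented here as a ridge-type fit on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: control plus one arm over 8 bins, 10 patients per
# arm per bin, with a smoothly drifting time trend (all values invented).
n_bins, n_per = 8, 10
true_bin = 0.3 * np.sin(np.arange(n_bins) / 2.0)
rows, y = [], []
for t in range(n_bins):
    for arm in (0, 1):            # 0 = control, 1 = treatment
        for _ in range(n_per):
            rows.append((arm, t))
            y.append(0.4 * arm + true_bin[t] + rng.normal(0.0, 1.0))
y = np.array(y)

# Design: intercept, treatment indicator, bin dummies (bin 0 = reference)
X = np.zeros((len(rows), 2 + n_bins - 1))
for i, (arm, t) in enumerate(rows):
    X[i, 0] = 1.0
    X[i, 1] = float(arm)
    if t > 0:
        X[i, 1 + t] = 1.0

# Rough analogue of the NDLM's smoothing: penalize first differences
# between adjacent bin effects so changes over time occur gradually.
D = np.zeros((n_bins - 2, X.shape[1]))
for j in range(n_bins - 2):
    D[j, 2 + j], D[j, 3 + j] = -1.0, 1.0

lam = 5.0  # smoothing strength; the Bayesian model estimates this from data
beta_smooth = np.linalg.solve(X.T @ X + lam * (D.T @ D), X.T @ y)
beta_cat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("treatment effect, smoothed bins:   ", round(beta_smooth[1], 3))
print("treatment effect, categorical bins:", round(beta_cat[1], 3))
```

With many bins and few patients per bin, the smoothed bin effects are less noisy than the fully separate ones, which is where the time machine can outperform the time categorical model.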
At heart, then, this model is essentially a standard linear model which blocks over time. We can compute the amount of weight given both to concurrent data (time periods when control and the arm of interest are both enrolling) and to data going back in time. As would be hoped, the concurrent data receives the most weight, with past data receiving less and less weight the farther back in time you go. The speed at which these weights decrease depends on the amount of overlap between the arms.
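Because the fit is linear, these weights can be read off directly: the arm-vs-control estimate is c'(X'X)^{-1}X'y for a contrast vector c, so the row vector c'(X'X)^{-1}X' holds the weight each observation receives. A sketch under a hypothetical enrollment pattern (invented for illustration), in which a bridging arm A links the arm of interest E back to earlier control data:

```python
import numpy as np

# Hypothetical pattern: control enrolls in bins 0-3, a bridging arm A in
# bins 0-2, and the arm of interest E only in bins 2-3 (so bins 0-1 hold
# nonconcurrent control data relative to E).
cells = ([("ctrl", t) for t in range(4)]
         + [("A", t) for t in range(3)]
         + [("E", t) for t in (2, 3)])
n_per = 25
rows = [(a, t) for a, t in cells for _ in range(n_per)]

X = np.zeros((len(rows), 1 + 2 + 3))   # intercept, A, E, bins 1-3
for i, (a, t) in enumerate(rows):
    X[i, 0] = 1.0
    if a == "A":
        X[i, 1] = 1.0
    if a == "E":
        X[i, 2] = 1.0
    if t > 0:
        X[i, 2 + t] = 1.0

# Per-observation weights in the E-vs-control estimate
c = np.zeros(X.shape[1])
c[2] = 1.0
w = c @ np.linalg.inv(X.T @ X) @ X.T

for t in range(4):
    idx = [i for i, (a, tb) in enumerate(rows) if a == "ctrl" and tb == t]
    print(f"control data, bin {t}: total |weight| = {np.abs(w[idx]).sum():.3f}")
```

The concurrent control bins (2 and 3) receive by far the most weight, while the nonconcurrent bins receive a smaller, nonzero weight that flows indirectly through the bridging arm A.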
Note the assumption here allows for changes over time but not changes in treatment effects over time (there is no time by treatment interaction in the model). It is perfectly acceptable for outcomes to get better or worse over time, as long as the differences between arms remain constant. This assumption may also be satisfied after including desired covariates in the model, for example covariates which explain differences in the population over time. The prototypical violation of this assumption is infectious disease, where treatments might be differentially effective against variants of a disease, working well for some and poorly for others. Time by treatment interactions are simply very difficult problems to address, with or without modeling. If we truly expect a treatment might be effective in 2023 and 2025, but not 2024 (or simply differentially effective), what is our basis for approval and labelling of a therapy going forward? Meta-analyses make similar assumptions of equal treatment effects (in fact the time machine can be shown to be analogous to a meta-analysis over the individual time bins), and most current protocols refer to “the treatment effect” repeatedly, implicitly assuming it is singular and unchanging.
Quantifying the benefits
The value of the time machine depends on the number of overlapping arms and the sample sizes within those overlaps. As with all statistical methods, larger sample sizes result in increased precision.
Overlap refers to the number of arms that continue enrolling across consecutive time bins. For example, suppose four consecutive time bins contain
(Ctrl, A, B, C), (Ctrl, B, C, D), (Ctrl, B, C, D), (Ctrl, C, D, E)
From time bin to time bin, multiple arms continue. Many comparisons, for example between arms C and D, occur in multiple time bins. If we were analyzing arm E, there are direct comparisons to arms C and D, and one-step indirect comparisons to arms A and B. This is a higher amount of overlap than if the four time bins contained
(Ctrl, A, B, C), (Ctrl, C, D, E), (Ctrl, E, F, G), (Ctrl, G, H, I)
In this second example the trial is almost fully resetting with different active arms in each time bin. The net result is that the weight assigned to past data in the former example, with high overlap, is greater and results in larger efficiency gains.
Depending on the sample sizes and degree of overlap, the time machine can achieve 20-50% effective sample size increases, allowing for smaller trials with similar accuracy (again when the required assumption is satisfied).
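These gains can be made concrete with a small calculation (the enrollment pattern below is the high-overlap example above, with invented cell sizes): compare the variance of the E-vs-control contrast when the model uses all bins versus only the bin(s) in which E enrolled. The variance ratio is the effective sample size multiplier:

```python
import numpy as np

def contrast_var(cells, n_per, arm):
    """Variance (in sigma^2 units) of the arm-vs-control contrast under
    the time-categorical linear model, for a given set of (arm, bin) cells."""
    arms = sorted({a for a, _ in cells if a != "ctrl"})
    bins = sorted({t for _, t in cells})
    rows = [(a, t) for a, t in cells for _ in range(n_per)]
    p = 1 + len(arms) + len(bins) - 1
    X = np.zeros((len(rows), p))
    for i, (a, t) in enumerate(rows):
        X[i, 0] = 1.0
        if a != "ctrl":
            X[i, 1 + arms.index(a)] = 1.0
        if t != bins[0]:
            X[i, len(arms) + bins.index(t)] = 1.0
    c = np.zeros(p)
    c[1 + arms.index(arm)] = 1.0
    return c @ np.linalg.inv(X.T @ X) @ c

# High-overlap pattern from the text; arm E enters only in the last bin.
high = [("ctrl", 0), ("A", 0), ("B", 0), ("C", 0),
        ("ctrl", 1), ("B", 1), ("C", 1), ("D", 1),
        ("ctrl", 2), ("B", 2), ("C", 2), ("D", 2),
        ("ctrl", 3), ("C", 3), ("D", 3), ("E", 3)]
concurrent_only = [cell for cell in high if cell[1] == 3]

v_full = contrast_var(high, 25, "E")
v_conc = contrast_var(concurrent_only, 25, "E")
print(f"variance, full model:        {v_full:.4f}")
print(f"variance, concurrent only:   {v_conc:.4f}")
print(f"effective sample size ratio: {v_conc / v_full:.2f}")
```

With this pattern the ratio comes out around 1.3, i.e. roughly a 30% effective sample size gain for arm E, squarely in the 20-50% range; the low-overlap pattern would yield less.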
What are the risks?
Fundamentally, inferences based on a time machine do not rest on a fully randomized comparison. While the concurrent period is randomized, the past data will contain the control arm but not the treatment arm of interest. Patients were randomized to different arms, typically met the same inclusion/exclusion criteria, and were investigated within the same protocol at the same sites, but the past data do not form a direct randomized comparison. As such, there are risks of bias if the modeling assumption (constant treatment effects) is violated.
When the assumption is violated, we expect estimates of the control arm to be biased. Similarly, when making inferences, we may see either inflated type 1 error or reduced power, depending on the direction of the interaction. This is analogous to the effect of drift in historical borrowing. The magnitude of the bias depends on the size of the interaction. Therapies which work incredibly well in some time periods but are “nulls” in others will produce large biases, while smaller interactions will produce smaller negative effects.
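The bias can be computed exactly, since the expected estimate is just the observation weights applied to the true means. A sketch (same hypothetical enrollment pattern as earlier, with an invented interaction): a bridging arm whose true effect changes over time contaminates the estimate for the arm of interest, even though that arm's own effect is constant.

```python
import numpy as np

# Hypothetical pattern (invented): control in bins 0-3, bridging arm A in
# bins 0-2, arm of interest E in bins 2-3, 30 patients per cell.
cells = ([("ctrl", t) for t in range(4)]
         + [("A", t) for t in range(3)]
         + [("E", t) for t in (2, 3)])
rows = [(a, t) for a, t in cells for _ in range(30)]

X = np.zeros((len(rows), 1 + 2 + 3))   # intercept, A, E, bins 1-3
for i, (a, t) in enumerate(rows):
    X[i, 0] = 1.0
    if a == "A":
        X[i, 1] = 1.0
    if a == "E":
        X[i, 2] = 1.0
    if t > 0:
        X[i, 2 + t] = 1.0

c = np.zeros(X.shape[1])
c[2] = 1.0                              # E-vs-control contrast
w = c @ np.linalg.inv(X.T @ X) @ X.T    # per-observation weights

# True means that VIOLATE the assumption: A's effect drops from 1.0 to 0.0
# over time (a time-by-treatment interaction), while E's true effect is a
# constant 0.3 and outcomes drift upward by 0.1 per bin.
a_eff = {0: 1.0, 1: 1.0, 2: 0.0}
mu = np.array([0.1 * t
               + (a_eff[t] if a == "A" else 0.0)
               + (0.3 if a == "E" else 0.0) for a, t in rows])

print("true effect of E:              0.3")
print("expected estimate under model:", round(w @ mu, 3))
```

The benign drift in outcomes (0.1 per bin) is absorbed by the bin effects and causes no bias; the time-by-treatment interaction on the bridging arm is what biases the estimate away from 0.3.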
Thus, any use of nonconcurrent data should examine the potential for time by treatment interactions and include a “concurrent controls only” analysis as a sensitivity analysis. We would also recommend sponsors consider choosing randomization ratios that provide reasonable power for these sensitivity analyses. In other words, instead of trying to use all the increased efficiency of a platform to reduce sample size, some of that increased efficiency should be used to support robust sensitivity analyses.
Summary
The time machine and similar models offer the potential to use the full dataset in a platform trial to increase precision of estimates and produce better decisions. This efficiency is gained through dividing time into bins and then estimating time effects going back in time. When there are overlapping arms going back in time, precision may be increased by 20-50% depending on the degree of overlap. This efficiency comes with a key assumption. While arms can change over time, the model requires that differences between arms remain constant. If this assumption is violated, biases and degraded inferences can result. This assumption should be monitored, and the time machine should be supplemented with “concurrent only” sensitivity analyses.