The Revised FDA Draft Guidance on Master Protocols
By Kert Viele, Ph.D.
The FDA released a new version of its draft guidance on master protocols, “Master Protocols for Drug and Biological Product Development”: https://www.fda.gov/media/174976/download
I’m going to focus here on what I saw as the most important new content (note that I worked from a document comparison, which may have missed some changes). I think the most important changes are the inclusion of basket trials and more detail on control groups. I like the substance of what was added on basket trials, but think that in the long run basket trials and platform trials should just be separate topics. The new guidance changes also heavily reference the FDA draft Bayesian guidance: https://www.fda.gov/media/190505/download
Inclusion of Basket Trials
Master protocols typically refer to two, somewhat opposite, designs. In a platform or umbrella trial, we explore multiple therapies for a single indication. A basket trial does the “reverse”, investigating a single therapy in multiple indications. Some trials do both, exploring multiple therapies in multiple indications.
Both paradigms are called “master protocols” because they involve doing similar experiments in different settings. In a platform trial, once you have procedures for exploring one therapy, you try to copy these procedures as much as possible to other therapies. These common procedures are described in a single overarching document, the master protocol, with appendices for each individual therapy describing any (hopefully limited) deviations from those common procedures. Similarly, for basket trials, these common procedures apply to different indications rather than different therapies.
The additional content on basket trials looks good. Note that unlike platform trials with multiple therapies, we cannot randomize among the multiple indications in a basket trial. Patients arrive with a fixed indication. Thus, basket trials avoid much of the complexity, but also miss many of the opportunities, that arise from randomizing across multiple therapies in platform trials.
Basket trial analyses also differ greatly from their platform trial counterparts. In a platform trial, inferential benefit is achieved through sharing of controls (along with adaptive stopping and investing saved resources into new therapies). In a basket trial, controls are usually not poolable (different indications simply behave differently). However, basket trials typically involve a set of indications with a common mechanism, such as a therapy targeting a common biomarker that is expressed in different indications. Thus, unlike in platforms, we often feel comfortable sharing information on the treatment effect across indications. If a therapy works in 8 of 9 indications sharing a common biomarker, we are reasonably more optimistic that the therapy will work in the 9th. Conversely, if the therapy does not work in 8 of 9 indications, we are more pessimistic about the 9th.
Much of the additional guidance material discusses the sharing of treatment effect information in basket trials. While one can analyze the baskets (indications) separately, this generally limits the benefits of the basket trial to operational, rather than inferential, gains. These operational gains are meaningful, but in my opinion separate analysis leaves something on the table and does not fully use the available scientific information.
In contrast, borrowing across subgroups, as discussed in the draft Bayesian guidance and referenced in this revised master protocols draft guidance, formulates a model which addresses the scientific intuition. This is often some form of Bayesian hierarchical model, which places the treatment effects for each indication within a common distribution. The model estimates the similarity of the treatment effects across indications, and borrows information in response to that similarity (or lack thereof). This model does not pool; it estimates a separate treatment effect for each indication. Generally, though, the treatment effects are “shrunk together”, meaning the estimates within the model are closer to one another than the estimates you would get by estimating all indications separately. If the observed treatment effects are similar, the model borrows strongly, increasing precision for each of the individual treatment effects. If the observed treatment effects are quite different, the model borrows minimally, and the resulting estimates may be quite similar to the separate estimates for each indication.
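The shrinkage behavior described above can be sketched with a deliberately simple normal-normal hierarchical model. This is an illustration under stated assumptions, not the model in the guidance: the per-indication estimates, the standard errors, the flat prior on the common mean, and the uniform grid prior on the between-indication standard deviation `tau` are all hypothetical choices made for the example.

```python
import math

def shrunken_estimates(estimates, std_errors, tau_grid):
    """Posterior means of per-indication effects under a normal-normal
    hierarchical model. Assumes a flat prior on the common mean and a
    uniform prior over the values in tau_grid (between-indication SD)."""
    log_weights = []
    conditional_means = []
    for tau in tau_grid:
        variances = [se ** 2 + tau ** 2 for se in std_errors]
        precisions = [1.0 / v for v in variances]
        total_precision = sum(precisions)
        # Precision-weighted estimate of the common mean for this tau.
        mu_hat = sum(p * y for p, y in zip(precisions, estimates)) / total_precision
        # Log marginal posterior of tau (up to an additive constant).
        log_weights.append(
            -0.5 * math.log(total_precision)
            - 0.5 * sum(math.log(v) for v in variances)
            - 0.5 * sum(p * (y - mu_hat) ** 2 for p, y in zip(precisions, estimates))
        )
        # For this tau, each effect is pulled toward mu_hat by fraction b.
        means = []
        for y, se in zip(estimates, std_errors):
            b = se ** 2 / (se ** 2 + tau ** 2)
            means.append(b * mu_hat + (1.0 - b) * y)
        conditional_means.append(means)
    # Average the conditional means over the posterior of tau.
    max_log = max(log_weights)
    weights = [math.exp(lw - max_log) for lw in log_weights]
    total = sum(weights)
    return [
        sum(w * m[i] for w, m in zip(weights, conditional_means)) / total
        for i in range(len(estimates))
    ]

def spread(values):
    return max(values) - min(values)

# Hypothetical observed effects in five indications, equal standard errors.
similar = [0.28, 0.32, 0.30, 0.35, 0.25]
dissimilar = [-0.35, 0.05, 0.30, 0.60, 0.90]
std_errors = [0.10] * 5
tau_grid = [0.01 * k for k in range(1, 101)]

similar_shrunk = shrunken_estimates(similar, std_errors, tau_grid)
dissimilar_shrunk = shrunken_estimates(dissimilar, std_errors, tau_grid)

# Fraction of the original spread that survives shrinkage in each scenario.
ratio_similar = spread(similar_shrunk) / spread(similar)
ratio_dissimilar = spread(dissimilar_shrunk) / spread(dissimilar)
print(f"similar effects keep {ratio_similar:.2f} of their spread")
print(f"dissimilar effects keep {ratio_dissimilar:.2f} of their spread")
```

With similar observed effects, the estimates are pulled strongly toward each other. With dissimilar effects, they stay close to the separate estimates. Neither case is pooled into a single value. A real analysis would typically fit the model by MCMC and choose the prior on `tau` with care, since that prior drives how much borrowing occurs.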
Note these models can get complex, with more “state of the art” models including clustering aspects that allow individual outlying groups (e.g. with 9 indications, the model can conclude 8 of them are quite similar but a 9th is distinctly different). The models here are quite similar to those used for historical borrowing, and anyone interested in either topic will find it worthwhile to keep up with the literature in both.
These models come with benefits and risks as compared to estimating each indication separately. When the true treatment effects are similar, the model significantly increases precision of the treatment effect estimates, increases power, and lowers type 1 error (these gains can be traded for sample size decreases or other inferential goals). When the true treatment effects are spread out, the model produces few benefits and also limited risks (generally the estimates match those from estimating each group separately). The greatest risk occurs when there are several indications where the therapy is quite effective, and several indications where the therapy is ineffective. In this situation the model can estimate all treatment effects close to the middle, decreasing power in the effective indications and increasing type 1 error in the ineffective indications. The clustering models are designed to mitigate this effect, but it should be noted that you need sufficient information to estimate the clustering. Thus, for small sample sizes, these advanced methods may produce little gain over standard hierarchical models. Noting this benefit-risk profile, the draft guidance lays out conditions where borrowing may be desirable, such as having a common mechanism of action.
While I agree with this new content, I feel that platform and basket trials should eventually be separated in guidance and methodology. While they share the “process” similarity of a master protocol, the randomization and inferential differences present qualitatively different opportunities and challenges, and I think they will need their own separate guidances as these fields develop further. Indeed, it may make more sense to cover basket trials more fully in the Bayesian guidance, given the similarity of statistical methodology.
Control Groups
The guidance continues the recommendation to use concurrent controls by default, but expands this in important ways. The central goal is to describe a broader class of potential biases and how to avoid them. To avoid bias, we need comparable control and treatment arms. Nonconcurrent controls are controls enrolled at a time when the treatment arm of interest is not enrolling. When nonconcurrent controls are used, we will have patients from the nonconcurrent time in the control arm, but by definition cannot have such patients in the treatment arm (it wasn’t enrolling). As such, differences due to time may create systematic differences between the control and treatment arms. Restricting to concurrent controls avoids this bias as the same time period is included in both arms.
This is not the only route to such biases. If there are treatment-specific exclusion criteria, then even among concurrent controls we may have patients who can be assigned to the control arm but who cannot be assigned to a specific treatment arm. Again, if there are differences between the eligible and ineligible patients, this can bias treatment comparisons. Such differences could also arise simply from certain therapies not being available at certain sites, whether due to a site not wishing to include a therapy, temporary supply issues, or other reasons.
At heart, all these differences can be handled through a more general rule than concurrent controls. When examining an arm of interest, we avoid biases by restricting the control group to patients that could have been randomized to that arm, simultaneously addressing all the issues above. Nonconcurrent controls are eliminated, as are ineligible patients or patients at sites that were not giving the arm of interest. This can have implications for trials which allow rerandomization, as rerandomization is typically restricted to arms the patient has not previously been given.
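The "could have been randomized to that arm" rule can be sketched as a simple filter. The `Patient` record and its `randomizable_arms` field are hypothetical names for this illustration. The sketch assumes the trial records, for each patient at the time of randomization, the set of arms that were open to them after accounting for calendar time, site availability, eligibility criteria, and prior assignments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Patient:
    patient_id: str
    assigned_arm: str
    # Arms this patient could have been randomized to (hypothetical field).
    randomizable_arms: frozenset

def comparison_population(patients, arm, control_arm="control"):
    """Return (treated, controls) for comparing `arm` against control,
    keeping only controls who could have been randomized to `arm`."""
    treated = [p for p in patients if p.assigned_arm == arm]
    controls = [
        p for p in patients
        if p.assigned_arm == control_arm and arm in p.randomizable_arms
    ]
    return treated, controls

patients = [
    # Enrolled before arm A opened (a nonconcurrent control for A).
    Patient("p1", "control", frozenset({"control", "B"})),
    # Concurrent and eligible for every arm.
    Patient("p2", "control", frozenset({"control", "A", "B"})),
    # Concurrent, but met an exclusion criterion specific to arm A.
    Patient("p3", "control", frozenset({"control", "B"})),
    # Concurrent, but enrolled at a site that did not offer arm A.
    Patient("p4", "control", frozenset({"control", "B"})),
    Patient("p5", "A", frozenset({"control", "A", "B"})),
    # Rerandomized patient who previously received B, so B is excluded.
    Patient("p6", "control", frozenset({"control", "A"})),
]

treated, controls = comparison_population(patients, "A")
control_ids = [p.patient_id for p in controls]
print("controls comparable to arm A:", control_ids)
```

One design point worth noting: the filter never needs to know why a patient was ineligible for the arm. Nonconcurrent enrollment, exclusion criteria, and site availability all reduce to the same membership test.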
Note that this restriction is necessary for comparability but is not sufficient. If allocation ratios are varied throughout the trial, we may still obtain systematic differences. Instead of one group of patients exclusively entering the control arm and not the treatment, we will obtain different proportions of patients in each arm. Biases here will be smaller, but still should be avoided. As such, the guidance recommendation includes some form of modeling to account for these differences.
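A small worked example shows how changing allocation ratios can bias an unadjusted comparison even when every control was eligible for the treatment arm. All numbers are hypothetical, and stratifying by period is only one simple form of adjustment, used here as a stand-in for whatever modeling a given trial adopts.

```python
# Hypothetical summary data: (period, arm) -> (number of patients, mean outcome).
# The true treatment effect is 2 in both periods; outcomes drift upward by 4
# between periods, and the allocation ratio changes from 1:1 to 1:3.
cells = {
    ("early", "control"): (100, 10.0),
    ("early", "treatment"): (100, 12.0),
    ("late", "control"): (50, 14.0),
    ("late", "treatment"): (150, 16.0),
}
periods = ("early", "late")

def pooled_mean(arm):
    """Mean outcome for an arm, ignoring enrollment period."""
    total_n = sum(n for (_, a), (n, _) in cells.items() if a == arm)
    total_y = sum(n * m for (_, a), (n, m) in cells.items() if a == arm)
    return total_y / total_n

# Unadjusted comparison: treatment patients are disproportionately late
# enrollees, so the time drift leaks into the estimated effect.
unadjusted = pooled_mean("treatment") - pooled_mean("control")

# Adjusted comparison: estimate the effect within each period, then
# average the period-specific effects weighted by period size.
period_effect = {
    p: cells[(p, "treatment")][1] - cells[(p, "control")][1] for p in periods
}
period_size = {
    p: cells[(p, "treatment")][0] + cells[(p, "control")][0] for p in periods
}
adjusted = sum(period_effect[p] * period_size[p] for p in periods) / sum(
    period_size.values()
)

print(f"unadjusted effect: {unadjusted:.3f}")
print(f"period-adjusted effect: {adjusted:.3f}")
```

The unadjusted difference overstates the effect because the treatment arm draws three quarters of late enrollment while control draws only a quarter. Comparing within periods removes that imbalance.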
Borrowing Nonconcurrent Data Versus Modeling
While using nonconcurrent controls may introduce biases, in some contexts we have sufficient consistency going into the past that nonconcurrent data may produce inferential benefits. The guidance notes the need to discuss the use of nonconcurrent controls early with the FDA and to be aware of the required assumptions, but leaves the door open to such use when it can be justified.
There are two usual mechanisms for including nonconcurrent controls. The first is to view the nonconcurrent controls as historical data and use historical borrowing methods (referencing the draft Bayesian guidance). Platform trials present a particularly attractive use case for historical borrowing, as many of the usual concerns about historical data are minimized. The nonconcurrent controls were randomized under the same protocol, many at very similar times, often at the same sites, and so on. Platform trials may evolve, so these issues may not disappear, but they are generally far less concerning for platforms than for historical borrowing in general.
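To make the first mechanism concrete, here is a minimal sketch using a fixed-weight power prior for a binary endpoint. This is one of several historical borrowing approaches and is chosen here only for its simplicity. The event counts, the Beta(1, 1) baseline prior, and the discount weights `a0` are all hypothetical. Many methods used in practice choose the discount dynamically from the data rather than fixing it in advance.

```python
def borrowed_posterior(events, n, hist_events, hist_n, a0,
                       prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for the control event rate, where the
    nonconcurrent (historical) data are down-weighted by a0 in [0, 1]."""
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must be between 0 and 1")
    a = prior_a + events + a0 * hist_events
    b = prior_b + (n - events) + a0 * (hist_n - hist_events)
    return a, b

def beta_mean(a, b):
    return a / (a + b)

# Hypothetical data: 20 events in 50 concurrent controls,
# 30 events in 100 nonconcurrent controls.
concurrent = (20, 50)
nonconcurrent = (30, 100)

posterior_means = {}
for a0 in (0.0, 0.5, 1.0):
    a, b = borrowed_posterior(*concurrent, *nonconcurrent, a0=a0)
    posterior_means[a0] = beta_mean(a, b)
    # Prior-free effective number of control patients behind the estimate.
    effective_n = concurrent[1] + a0 * nonconcurrent[1]
    print(f"a0={a0:.1f}: posterior mean {posterior_means[a0]:.3f}, "
          f"effective n {effective_n:.0f}")
```

Setting `a0` to 0 ignores the nonconcurrent controls and setting it to 1 pools them fully. Intermediate values gain precision at the price of pulling the estimate toward the nonconcurrent rate, which is harmless only if that rate has not drifted.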
The second methodology is to model the nonconcurrent controls over time, typically by binning patients by time period and fitting a model of the form:
Y = Intercept + Arm effects + Time Bin effects
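A minimal sketch of fitting a model of this form follows. It uses ordinary least squares on a continuous outcome with noise-free, hypothetical data, so the fitted effects can be checked by hand. The function names and the data are assumptions made for illustration. An actual trial would use an estimation approach and link function suited to its endpoint.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_time_bin_model(records, arms, n_bins):
    """Least squares fit of Y = intercept + arm effect + time bin effect.
    records are (arm, time_bin, outcome); arms[0] and bin 0 are references."""
    def design_row(arm, time_bin):
        arm_cols = [1.0 if arm == a else 0.0 for a in arms[1:]]
        bin_cols = [1.0 if time_bin == b else 0.0 for b in range(1, n_bins)]
        return [1.0] + arm_cols + bin_cols

    rows = [design_row(arm, t) for arm, t, _ in records]
    y = [outcome for _, _, outcome in records]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    names = (["intercept"] + [f"arm_{a}" for a in arms[1:]]
             + [f"bin_{b}" for b in range(1, n_bins)])
    return dict(zip(names, beta))

# Hypothetical noise-free data. Arm A enrolls only in bin 1; control and
# arm B enroll in both bins. True effects: B = 1, A = 2, drift in bin 1 = 3.
records = [
    ("control", 0, 10.0), ("B", 0, 11.0),
    ("control", 1, 13.0), ("B", 1, 14.0), ("A", 1, 15.0),
]
fit = fit_time_bin_model(records, arms=["control", "B", "A"], n_bins=2)

# Naive alternative: compare A to all controls, ignoring the time drift.
control_outcomes = [y for arm, _, y in records if arm == "control"]
naive_effect = 15.0 - sum(control_outcomes) / len(control_outcomes)

print(f"model estimate of A effect: {fit['arm_A']:.2f}")
print(f"naive pooled estimate: {naive_effect:.2f}")
```

The naive comparison absorbs part of the time drift into the apparent effect of arm A, while the time bin term in the model accounts for it. Because the data here are noise-free, the sketch illustrates only the bias correction, not the precision gained from the extra controls.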
There are a few things to consider when choosing a method (these are hinted at in the new draft guidance, but not really explored, so to be clear these are my, hopefully somewhat informed, opinions):
- The time modeling approaches can only be used when there are multiple overlapping arms going back in time. If we are interested in comparing arm A to control, we need another arm B that spans the concurrent and nonconcurrent time periods. This is typically satisfied in platform trials. These models have similarities to meta-analyses, viewing each time bin as a trial unit and then combining the information. In meta-analysis terms, these trial units must be “connected”, with greater connectivity resulting in greater inferential benefits.
- Historical borrowing requires stronger assumptions than time modeling. Most historical borrowing methods require stability in the absolute parameters. For example, if the endpoint is mortality, we need the mortality rate for the concurrent controls to be similar to the mortality rate for the nonconcurrent controls (if not, at best the historical borrowing will discount the historical data and provide little benefit). This assumption may be satisfied in some therapeutic areas, but certainly not all. In contrast, time modeling requires an assumption of constant treatment effects over time. In the mortality example, the mortality rates of the arms can change over time (perhaps as the trial enrolls healthier or sicker patients, the indication itself changes over time, etc.), but the differences between arms must remain equal, on whatever scale you are modeling. This assumption again may or may not be satisfied, but it is commonly made in clinical trials (see your average protocol and its use of “the treatment effect” as opposed to a nonconstant treatment function over time). I do not think it is a guarantee, but I think it’s more likely to be true than the historical borrowing assumption of no drift in the absolute values.
- Note that both methods discount past data relative to concurrent controls, with both methods usually weighting concurrent data far more heavily than nonconcurrent data. These are augmentation methods, not replacements for the concurrent data (note I would personally almost never recommend reducing control allocation in a platform, so long as the comparison to control is still of interest).
I tend to prefer time modeling for its more limited assumptions, but it’s worth looking at both possibilities and seeing which is most scientifically appropriate and statistically efficient.