Comments on the draft FDA master protocol guidance
Blog by: Kert Viele
The FDA draft master protocol guidance is out
Comments are due by February 22. Here are some preliminary thoughts and an explainer on some of the issues involved. Note that I’m much more of a statistical designer than a regulatory or operations expert, so my comments will be lighter on those sections, and I’m hoping someone with more expertise will fill in those gaps. Working through it in the order it is written…
Section 1 (Introduction)
I’m glad to see “master protocols” split more clearly into umbrella/platform trials and basket trials, with only umbrella/platform trials considered here. While there are some similarities (the notion of common procedures and documents), I think the statistical and clinical issues in platforms are simply different from those in baskets, and the two should be treated separately (in a platform you can randomize patients among multiple arms, while in a basket trial you have to take each participant in whatever basket they are in). Looking forward to a separate document on baskets later.
Section 2 (Background)
I don’t see major issues here. I am reminded of the continuing evolution of adaptive trial guidance, which is at least a “generation” ahead of platforms. Guidances on novel methods typically contain lots of “this might go wrong, that might go wrong”, lots of concerns and warnings and so forth. As the guidances evolve, these become “method A can fail in the following circumstances for the following reasons, here is where you should use method A, here is where you shouldn’t, and here are possible tests and sensitivities you can use to diagnose issues and mitigate potential problems.” If I have an overall comment, it is that we have research on a number of potential problems and solutions that can arise in platforms, and I would like to see more detailed discussion of those issues.
Section 3A (randomization)
I don’t find this section controversial, but it’s a little down in the weeds. They make an important point that with multiple arms there is a statistical incentive to enroll more on the control arm than each active arm (the common control is in every pairwise comparison to control, so it makes sense to have more precision on control). I think the sqrt(k) allocation is a reasonable choice, but we may yet find that allocation depends strongly on goals (are we looking for a best arm, or identifying all effective arms, how much we value drawing conclusions quickly, etc.)
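To see where sqrt(k) comes from, here is a minimal sketch. It rests on illustrative assumptions that are not taken from the guidance: a fixed total sample size, equal outcome variance on every arm, k active arms each compared to one shared control, a control:active ratio of r:1, and a goal of minimizing the variance of each pairwise comparison. The function names are hypothetical.

```python
import math

def pairwise_variance(r, k, total_n=1.0, sigma2=1.0):
    """Variance of (active mean - control mean) when control:active is r:1
    for each of k active arms and the total sample size is fixed."""
    n_active = total_n / (r + k)      # participants on each active arm
    n_control = r * n_active          # participants on the shared control
    return sigma2 * (1.0 / n_control + 1.0 / n_active)

def best_ratio(k, step=0.001, max_ratio=10.0):
    """Grid-search the control:active ratio r that minimises the pairwise variance."""
    candidates = [step * i for i in range(1, int(max_ratio / step) + 1)]
    return min(candidates, key=lambda r: pairwise_variance(r, k))

# Compare the grid-search optimum with sqrt(k) for a few platform sizes.
for k in (1, 2, 4, 9):
    print(k, round(best_ratio(k), 2), round(math.sqrt(k), 2))
```

Under these assumptions the variance is proportional to r + k + 1 + k/r, which is smallest at r = sqrt(k), so the grid search should land on the same value. A different goal (for example, picking the single best arm) changes the objective and possibly the answer.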
A key point in this section is made very quickly. They note that the sqrt(k) allocation changes the allocation ratio over time (k changes as arms enter and leave the platform), and thus we need some form of time adjustment in the analysis model. Why? “Constant allocation ratio” here means we are always enrolling X controls for every Y treatment participants (on average). If we change the allocation ratio, that means that at certain points in time we are enrolling a larger proportion of controls, and at certain times a smaller proportion of controls. When we compare control to treatment, the groups differ by time. If time has an effect on response, then we can create biases with a naïve analysis. Thus, the FDA recommends including time in the model to correct for this difference. The sqrt(k) allocation in this section changes the allocation ratio over time, as would other methods such as response adaptive randomization.
If you want to guarantee comparable control and treatment groups you need to maintain constant allocation ratios. Interestingly, similar issues arise when there are eligibility differences among patients (tangentially noted in this draft guidance). I discuss this in a recent paper:
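The arithmetic behind this can be sketched with a made-up two-period trial. Everything here is an illustrative assumption: the sample sizes, the size of the time drift, the additive treatment effect of 0.5, and the function names. The calculation works on expected values only, so there is no sampling noise.

```python
# Hypothetical two-period trial; all numbers are illustrative assumptions.
# Each period: (n_control, n_active, expected control response).
changing_ratio = [
    (100, 100, 0.0),   # period 1: 1:1 allocation
    (200, 100, 1.0),   # period 2: 2:1 allocation, responses drifted up by 1.0
]
constant_ratio = [
    (100, 100, 0.0),   # period 1: 1:1 allocation
    (200, 200, 1.0),   # period 2: still 1:1, same drift
]
TRUE_EFFECT = 0.5      # additive: the same treatment effect in both periods

def naive_difference(periods, effect):
    """Pool everyone while ignoring time, then compare expected arm means."""
    n_c = sum(nc for nc, _, _ in periods)
    n_a = sum(na for _, na, _ in periods)
    mean_c = sum(nc * mu for nc, _, mu in periods) / n_c
    mean_a = sum(na * (mu + effect) for _, na, mu in periods) / n_a
    return mean_a - mean_c

def time_stratified_difference(periods, effect):
    """Compare arms within each period, then combine with precision weights."""
    num = den = 0.0
    for nc, na, mu in periods:
        diff = (mu + effect) - mu                 # within-period expected difference
        weight = 1.0 / (1.0 / nc + 1.0 / na)      # inverse variance, unit outcome variance assumed
        num += weight * diff
        den += weight
    return num / den

print(naive_difference(changing_ratio, TRUE_EFFECT))            # differs from 0.5
print(time_stratified_difference(changing_ratio, TRUE_EFFECT))  # recovers 0.5
print(naive_difference(constant_ratio, TRUE_EFFECT))            # recovers 0.5
```

With a changing ratio, the pooled controls are weighted toward the later (higher-responding) period more heavily than the pooled actives are, so the naive difference understates the effect in this toy setup. Comparing within period, or holding the ratio constant, removes the imbalance.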
https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9750
Section 3B (Control group)
The central issue in this section is non-concurrent controls. I’ve written a bunch of stuff on this topic on X, summarizing several recent papers in this area.
https://twitter.com/KertViele/status/1572753399698919425
Non-concurrent controls present the same time issue from the last section, but to a larger degree because the allocation is 0 for active during the non-concurrent time period. Clearly you should never pool the non-concurrent controls since pooling is susceptible to any time trend in the data. Thus, all modern treatments and controversies in this area involve time modeling, and the assumptions involved in those time models.
The simplest time models include “bins” over time, with a separate additive time effect for each bin. Other authors have included parametric forms (linear trends over time, etc.) and have proposed models which smooth the “binned” time results (for example the “Bayesian Time Machine”). Whether these models are effective depends on the exact form of the time trend. Thus, they account for some time trends well and other time trends poorly.
The central assumption is whether the time trends are “additive”, meaning that treatment effects are constant over time. With additive time trends, responses can vary over time, but the differences between arms must be constant. If you have ever written an estimand or protocol that talks about “the treatment effect” you are implicitly making this assumption, since “the” implies a singular treatment effect, as opposed to a treatment effect that varies from day to day or month to month. This additivity assumption must hold on the scale used for modeling, but can be obtained after covariate adjustment as long as the required covariates are included in the model. In contrast, when this assumption is violated, we might refer to “time varying treatment effects”, “interactions between time and treatment”, or similar language.
The papers referenced in the threads on X above show that time modeling can adjust for additive time trends, but not time varying treatment effects.
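A toy version of the “binned” time model shows both behaviors. This is a sketch under assumptions that are mine for illustration and do not come from the referenced papers: two time bins, a control arm and arm A enrolling in both bins, arm B entering only in bin 2, equal-sized cells, and noiseless cell means. The model is ordinary least squares with an intercept, a bin-2 effect, and one effect per active arm; the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design columns: intercept, bin 2 indicator, arm A indicator, arm B indicator.
DESIGN = {
    ("control", 1): [1, 0, 0, 0],   # non-concurrent with arm B
    ("control", 2): [1, 1, 0, 0],   # concurrent with arm B
    ("A", 1):       [1, 0, 1, 0],
    ("A", 2):       [1, 1, 1, 0],
    ("B", 2):       [1, 1, 0, 1],   # arm B enters in bin 2 only
}

def fit_arm_b_effect(cell_means):
    """Least-squares estimate of the arm B effect from equal-sized cell means."""
    rows = [DESIGN[cell] for cell in cell_means]
    y = [cell_means[cell] for cell in cell_means]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(4)]
    return solve(xtx, xty)[3]

# Additive trend: everything drifts up by 1.0 in bin 2; A effect 0.3, B effect 0.5.
additive = {("control", 1): 0.0, ("control", 2): 1.0,
            ("A", 1): 0.3, ("A", 2): 1.3, ("B", 2): 1.5}
# Time-varying effect: arm A's effect grows from 0.3 to 0.7; B is unchanged.
time_varying = dict(additive)
time_varying[("A", 2)] = 1.7

print(fit_arm_b_effect(additive))      # matches the true B effect of 0.5
print(fit_arm_b_effect(time_varying))  # no longer 0.5
```

In the second scenario arm B still beats its concurrent control by exactly 0.5, but the model reads the change in arm A as part of the time trend and shifts the non-concurrent controls by the wrong amount, so the estimate for B moves away from 0.5.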
A few other things to note here:
1) These models do not weight the non-concurrent controls equally. All non-concurrent controls are implicitly “downweighted” relative to concurrent controls, and the most recent non-concurrent controls carry much more weight than “far past” non-concurrent controls. This is somewhat intuitive, and has implications for the dissemination section of this guidance.
2) Restricting to concurrent controls does not always solve the “time varying treatment effects” issue. For example, the correction mentioned for sqrt(k) allocation in the previous section also requires an additive time trend. This is one reason why sponsors may choose equal allocation over the sqrt(k) allocation.
3) When reading the literature, you often hear “time trends are problematic”. The above discussion and references indicate that the details of the time trends are important (additive or not). Some time trends can be modeled, some can’t. Pet peeve on details….
4) I think there is an interesting regulatory conundrum here. There are lots of situations where non-randomized data appear in guidances, for example on real world evidence, historical borrowing of clinical trial data from other trials, and meta-analyses, with varying degrees of acceptability. It’s hard to see rejecting non-concurrent controls within a single platform, while accepting any of these others. The non-concurrent controls are enrolled in the same setting, at near the same time, and we typically have multiple overlapping arms that can be used to assess assumptions. The assumptions for meta-analyses are virtually identical to the non-concurrent controls assumptions (see the Marschner and Schou paper in the X threads), and the needed assumptions for historical data and real world evidence are stronger.
Sections 3C and 3D (informed consent and blinding)
I’m lumping these sections together because at heart they deal with the same issue, the poolability/shareability of the controls in a platform. Both sections involve ways that informed consent and blinding can impact that shareability. I think these sections will be the most controversial, given there is a conflict between “perfect statistics” and practical trial needs.
While the guidance goes through the following, I’m going to partially repeat it here to add some details. Platforms typically enroll, consent, and randomize patients through one of two paradigms.
Option 1 – Let’s call this the ideal world
Each participant is consented to all the arms in the study prior to any randomization. Randomization then occurs between all arms in the study, including control as an option, and everyone is fully blinded to everything. This option is ideal because every arm has a clean randomized comparison to control, with full blinding to avoid operational biases.
Option 2 – Perhaps not ideal, but often preferred/required for practical reasons
As shown in Figure (A) on page 20, each participant is first consented to the platform generally. They are then randomized to “cohorts” consisting of an active arm and its matched control (so if arms A, B, and C are in the platform, a cohort would be “active B” and “control B” together). A second consent is then performed, explaining the specific arms in the cohort to the participant, and then randomization occurs between active and control. Thus, a participant will know their cohort, but not whether they are on active or control within that cohort. Blinding occurs at the cohort level.
Why is this important? Platforms typically want to share controls. In option (1) all the controls are equal so this is fine. In option (2) we would be sharing the controls, pooling together “control A”, “control B”, etc., in any analysis. The concern is that any systematic differences between those controls will create biases in estimation or other inferential problems such as inflated type 1 error. If for some reason “control A” and “control B” respond worse than “control C”, then a comparison of “active C” to “pooled controls” may be biased in favor of “active C”. First, with the two-tier consent process, we may find patients differentially enroll at the second stage. For example, a patient may enroll hoping to get in cohort B or C, and when they find out they are in cohort A they may just leave. This creates differing populations in the multiple control arms. This can be an issue, although in some platforms very few people leave at that stage of the consent process. In other platforms we may see large deviations in second tier consent rates which may indicate issues (note that the differing populations do not imply differing control responses, but would certainly invite exploration of that issue in the analyses). Second, if blinding only occurs within a cohort, not across all arms, we may see differing placebo responses (people might psychologically expect an injectable to perform better than an oral medication, or many other mechanisms).
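The pooled-control bias described above is easy to put numbers on. The cohort sizes and response rates below are invented purely for illustration, and active C is assumed to have no effect at all relative to its own matched control.

```python
# Hypothetical cohort controls: (number of participants, expected response rate).
controls = {"A": (100, 0.20), "B": (100, 0.20), "C": (100, 0.30)}
ACTIVE_C_RATE = 0.30   # assumption: active C performs exactly like control C

def pooled_control_rate(controls):
    """Expected response rate when all cohort controls are pooled together."""
    total_n = sum(n for n, _ in controls.values())
    return sum(n * rate for n, rate in controls.values()) / total_n

# Comparison against the matched control only, and against the pooled controls.
matched_difference = ACTIVE_C_RATE - controls["C"][1]
pooled_difference = ACTIVE_C_RATE - pooled_control_rate(controls)

print(matched_difference)  # no apparent effect
print(pooled_difference)   # a spurious apparent benefit for active C
```

The size of the spurious benefit depends on how different the control groups really are and how much of the pooled control comes from the other cohorts, which is why diagnostics comparing the cohort-specific controls are a natural sensitivity analysis.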
Thus, where feasible, option 1 is preferred (as noted in the guidance). In practice, many trial sponsors either choose or are forced into option 2. With multiple arms in a study, there are concerns that consenting all patients to all arms may be confusing for trial participants (I’m not a consent expert, so I’m not sure where the lines are drawn on this issue). Blinding to multiple arms (“multiple dummy”) can prove burdensome. If one treatment is given in the morning, another at lunch, and another in the evening, then participants will have to participate in control arms for all three times. Similarly, it may be burdensome for patients to require both oral and injectable controls, or multiple pill sets, etc. This can limit compliance by making the trial too complicated for a participant to follow. Additionally, with multiple-dummy blinding the control arms begin to look less and less like standard of care (for example a psychological study where multiple dummy blinding may require EXTENSIVE patient interaction with providers simply to administer multiple placebo-like options). In some cases, one treatment in a platform may require specific lab tests for safety not needed for the other treatments, and it can be burdensome to require unnecessary tests for the other arms.
Thus, there is a complicated tradeoff between ideal inferential procedures and practical concerns. It will be interesting going forward if we limit platforms only to situations where option 1 can be applied, or obtain comfort with diagnostics and sensitivity analyses that assess whether shareability appears valid within the option 2 paradigm.
As a final note, these practical issues can be amplified when a platform involves multiple sponsors. Suppose several therapies are already enrolled in a trial and a new therapy enters with some specific requirements (for example the safety lab above). Adding the new therapy and employing option 1 requires changing the procedures for all the therapies currently enrolling in the trial to accommodate the new requirements, requiring negotiation with all existing sponsors. This is often difficult, while option 2 allows new therapies to enter the trial in a more modular fashion.
Section 3E (Adaptive Design)
I think this section just says the FDA adaptive guidance applies to platforms, which seems uncontroversial. They do make an important point that a platform may have more opportunities for data leakage and operational bias, and so care should be taken to avoid this.
Section 3F (Multiplicity)
This section involves a basic question. If a platform investigating 4 different drugs is replacing 4 separate trials, AND we wouldn’t place a multiplicity penalty on the 4 separate trials (generally we don’t), should we require a multiplicity penalty on the platform? The FDA answers no (your mileage may vary at other regulatory agencies). Whatever multiplicity would be applied for the separate trials applies to the platform.
There are a couple of interesting sidebars to this topic. First is a reasonable (my opinion) note that this general statement has exceptions for specific kinds of arms. This covers issues where multiple arms might be very similar (and in my opinion a situation where I would also be skeptical of separate trials with no adjustment). The second is a statistical concern. With a common control, the results for different arms become correlated (how much depends on overlap in the trial, etc.). While the expected number of type 1 errors is identical between the platform and separate trials, when the results are correlated the platform will have an increased chance of ZERO type 1 errors, and an increased chance of multiple type 1 errors (note the magnitude of these differences is often modest or small). There may be specific situations where this correlation is a concern, but this guidance focuses on the overall rate and thus no adjustment is needed for this issue.
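The correlation point can be checked numerically. The sketch below makes several simplifying assumptions that are not in the guidance: k ineffective arms, each compared to one fully shared control with 1:1 allocation, normally distributed test statistics with known variance, and a one-sided test at alpha = 0.025 for each arm. Conditional on the control-arm mean the arms are independent, so the function integrates over that mean with a trapezoid rule. The function name and defaults are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def error_count_distribution(k, alpha=0.025, step=0.001, limit=8.0):
    """Return P(0 errors), P(2 or more errors), and E[number of errors]
    for k null arms that share one control arm (1:1 allocation)."""
    crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    p_zero = p_one = expected = 0.0
    n_steps = int(round(2 * limit / step))
    for i in range(n_steps + 1):
        x = -limit + i * step                 # standardised control-arm mean
        w = step * STD_NORMAL.pdf(x)
        if i in (0, n_steps):
            w *= 0.5                          # trapezoid end points
        # P(a given arm rejects | control mean x); arms are independent given x.
        p = 1.0 - STD_NORMAL.cdf(crit * 2 ** 0.5 + x)
        p_zero += w * (1.0 - p) ** k
        p_one += w * k * p * (1.0 - p) ** (k - 1)
        expected += w * k * p
    return p_zero, 1.0 - p_zero - p_one, expected

K, ALPHA = 4, 0.025
shared_zero, shared_multi, shared_expected = error_count_distribution(K, ALPHA)

# The same quantities for K separate trials with independent controls.
indep_zero = (1.0 - ALPHA) ** K
indep_multi = 1.0 - indep_zero - K * ALPHA * (1.0 - ALPHA) ** (K - 1)

print("expected errors:", shared_expected, "vs", K * ALPHA)
print("P(zero errors): ", shared_zero, "vs", indep_zero)
print("P(2+ errors):   ", shared_multi, "vs", indep_multi)
```

In this setup the expected count should agree with k times alpha, while both the chance of zero errors and the chance of two or more should come out higher with the shared control. With partial overlap between arms the correlation is weaker and the gap narrows.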
Section 3G (Comparisons between drugs)
FDA notes that comparisons between active arms are possible and should be prespecified if an important part of the platform decision process. They note an odd example where you can get nontransitivity of results (A superior to B, B superior to C, C superior to A) which is theoretically possible, but I believe is rare in practice.
Section 3H (Safety)
This section notes that standard requirements for safety must be obeyed in the platform. They make an important point that the safety population may not match the efficacy population, particularly in the ability to share a control arm (for example, as noted above, if only some controls are tested for a particular adverse event, then only those patients will be in the safety analysis). As such, care should be taken to make sure safety analyses have adequate sample sizes for the indication under study.
Section 4 (Trial Oversight, Data Sharing, and Dissemination of Information)
FDA recommends central IRBs review the master protocol and DMCs be formed for monitoring. They note that care must be taken to avoid any “data leakage” issues, including all the issues that occur in non-platform trials as well as interesting cases where the platform may have particular issues. They discuss an example where therapies are enrolling simultaneously and a time-to-event analysis is employed that is based on a specific number of observed events. They note that knowledge that one arm finishes first implies the other arms have fewer events.
I completely agree with the point that data leakage should be avoided, although I think people react incorrectly, almost randomly, to data leakage. Take the example of Oncothyreon, where people made assumptions about what the timing of a press release meant for the results.
https://seekingalpha.com/article/280439-oncothyreon-possible-pharma-blockbuster
In actuality, the delay in release of information had nothing to do with the event counts, and many investors were disappointed when this “sure bet” didn’t pan out.
In my experience, the two “standard” issues in dissemination for platforms are 1) the release of data that is used in a completed analysis, but is still relevant for a yet to be completed arm, and 2) when multiple independent sponsors are involved and the analysis for each therapy depends on the data in other active arms (either through time modeling, or common modeling of covariate effects, or some other mechanism). Somewhat mitigating the first concern is that the older data is, the less weight that data typically has in future analyses. In many trials, by the time a full writeup, review, and publication cycle is complete, the disclosed data may only have a small effect on future analyses (this must be considered on a case-by-case basis). In the latter, it is all the more important to have a data sharing plan that allows each sponsor to fulfill their own goals as well as satisfy all regulatory and scientific requirements for transparency.
Section 5 (Regulatory Issues)
I don’t consider myself a regulatory expert, so I’m going to skip this section. I didn’t notice anything controversial here, but it’s very possible I missed something.
Conclusions
In any case, this has been long, but there is a lot of substance to this guidance. I hope this has been helpful, and at least it gave me an opportunity to organize my thoughts. I encourage everyone impacted by this draft guidance to give it some thought and make comments to improve it. These guidance documents have a dramatic impact on review and conduct of platform trials!
