Berry Consultants Provides Comments on the Draft ICH E20 Harmonised Guideline
With the ICH E20 Draft Guideline currently open for comments, Berry Consultants shares its current thoughts, general comments, and specific suggestions regarding the document in this blog post.
Contact information:
For questions, or to provide comments on this blog post and the proposed document, please contact any or all of the personnel below, ideally before October 29, 2025.
Roger J. Lewis, MD, PhD
Senior Medical Scientist
Berry Consultants, LLC
Email: roger@berryconsultants.net
Kert Viele, PhD
Director & Senior Statistical Scientist
Berry Consultants, LLC
Email: kert@berryconsultants.net
Scott M. Berry, PhD
President & Senior Statistical Scientist
Berry Consultants, LLC
Email: scott@berryconsultants.net
Mike Krams, MD
Senior Medical Scientist
Berry Consultants, LLC
Email: mikekrams@berryconsultants.net
Introduction and Overview
The draft ICH E20 Harmonised Guideline entitled “Adaptive Designs for Clinical Trials” addresses a substantial breadth of considerations in the evaluation of a confirmatory clinical trial. We applaud the tremendous effort towards clarifying the considerations surrounding the design and implementation of adaptive clinical trials in the confirmatory setting. While the draft Guideline addresses these considerations in the context of the design, implementation, and analysis of adaptive clinical trials, many of these considerations apply similarly to traditional frequentist, non-adaptive confirmatory trials.
This document has value in supporting the appropriate, productive, and careful application of adaptive clinical trial design. However, the ability to maintain trial integrity with concomitant improvements in successfully identifying safe and effective treatments will be limited by:
(1) Insufficient focus on the consistent and equal application of objective criteria to the evaluation of both adaptive and non-adaptive trials, and a corresponding failure to judge non-adaptive approaches against equivalent benchmarks. For instance, there is little or no acknowledgement of the limitations of, and risks associated with, traditional, non-adaptive approaches to trial design (e.g., the risks of continuing a trial longer than necessary without interim analyses and prespecified futility rules, or of conducting a large simple trial to estimate an average treatment effect in a population in which heterogeneity of treatment effect is likely, resulting in a highly precise estimate of a treatment effect that applies to no one);
(2) A general representation throughout the document that adaptive designs are at risk of providing less information than non-adaptive clinical trial designs. While this may be true if the adaptive design is poorly designed—any approach can be done poorly—the document should take the opportunity both to clarify the characteristics of high-quality adaptive design and to compare adaptive and non-adaptive approaches under the assumption that both are well designed and executed. The motivation of adaptive designs is to strive for the right amount of data to meet pre-defined user requirements or desired operating characteristics. This applies to stopping at just the right time (including increasing the sample size to provide additional information in light of the accruing data), response-adaptive randomization to maximize the information on the arm found to be of greatest interest, and other adaptations;
(3) A missed opportunity to emphasize the importance of quantifying threats to trial validity (e.g., bias in estimation), together with a consistent emphasis on certain threats (e.g., bias, type I error, risks to trial integrity) without equivalent consideration of other risks (e.g., variance in estimation, failing to consider valid external evidence, reduced power or ability to identify the most effective dose or adverse safety signals). This is accompanied by another missed opportunity, namely to discuss the advantages of some estimation strategies (e.g., hierarchical models) and trial designs (e.g., basket trials) that demonstrate improved performance on important metrics. One specific omission is a discussion of the biases associated with common, traditional approaches, e.g., the upward bias seen in the largest among many raw estimates;
(4) An implied equivalence between the use of Bayesian strategies and the borrowing of external information. In contrast, there are non-Bayesian approaches to borrowing external information, Bayesian methods are most commonly used without substantial external information, and the topic of borrowing deserves a stand-alone guideline;
(5) Inconsistent emphasis on the importance of adhering to prespecified rules for adaptations in maintaining the defined operating characteristics of an adaptive trial, e.g., use of the term “anticipated” rule (e.g., on lines 127, 145-154, 318, 329, 388, 394, 414-418, and following) or suggesting a routine role for an IDMC in determining whether or how a prespecified rule is applied as part of the design (e.g., see lines 145-148). This results in a lack of clarity as to whether the design is thoroughly prespecified or simply represents a range of possible options regarding trial implementation. There are also unacknowledged costs associated with the strategy of maintaining flexibility in the application of adaptive rules: maintaining appropriate operating characteristics, such as type I error control, may require analysis strategies so conservative that the desired efficiency of the adaptive design is lost. This tradeoff should be discussed;
(6) There is a missed opportunity to acknowledge the ethical imperatives (i) to optimize statistical efficiency and expose as few participants as possible to experimental treatments in determining safety and efficacy; and (ii) to minimize the risk of failing to correctly identify treatments that are safe and effective, i.e., type II errors. The focus on type I error, while critically important, fails to address regulatory agencies’ broader remit to optimize population health, which is not limited to simply preventing ineffective treatments from reaching the market; and
(7) There is a pervasive implication that adaptive designs result in smaller enrolled populations and are less informative—a shortcut. There are many examples where, in practice, adaptive designs and advanced modeling provide more information and better precision regarding parameters that are critical to regulatory decision making. Three simple examples: (i) if forced to use a fixed sample size in a 1:1 confirmatory trial, a sponsor generally selects a smaller sample size, while the use of interim analyses to test for superiority and futility allows the trial to be larger than the typical fixed trial when needed, but only when needed; adaptive-sample-size phase 3 trials are almost always more powerful than fixed-sample-size phase 3 trials (a minimal simulation sketch of this example follows this list); (ii) a seamless 2/3 trial that selects dose(s) to continue to a confirmatory stage will be larger, because the data used for dose selection are included in the confirmatory analysis; this tends to lead to larger phase 2 stages with better dose characterization; and (iii) a basket or enrichment trial allows better characterization of heterogeneity of the treatment effect or safety profile, resulting in a better selected population. In practice, the alternative non-adaptive strategy is virtually never a wide-ranging and adequately sized phase 2 trial of the broader population; it is the selection of a single population, with the associated risks to both the population and the sponsor, followed by the conduct of a non-adaptive large phase 3 trial.
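To illustrate example (i), the following minimal Python sketch (our own illustration; the effect size, sample sizes, and known-variance simplification are chosen purely for demonstration) compares the power of a smaller fixed-sample trial against a two-look group-sequential design using the standard two-look O'Brien-Fleming bounds for a one-sided alpha of 0.025. The adaptive design is larger when needed, but stops early when the evidence is strong:

```python
import numpy as np

rng = np.random.default_rng(1)
delta, sigma, n_sims = 0.2, 1.0, 400_000   # illustrative effect size and SD
n_fixed = 250                  # per-arm N a sponsor might choose for a fixed trial
n1, n2 = 200, 400              # per-arm cumulative N at the two looks
b1, b2 = 2.797, 1.977          # two-look O'Brien-Fleming bounds, one-sided alpha 0.025

se = lambda n: sigma * np.sqrt(2 / n)                # SE of the effect estimate (sigma known)
d1 = rng.normal(delta, se(n1), n_sims)               # estimate at the interim look
d_extra = rng.normal(delta, se(n2 - n1), n_sims)     # estimate from post-interim data
d2 = (n1 * d1 + (n2 - n1) * d_extra) / n2            # cumulative estimate at the final look

stop1 = d1 / se(n1) > b1                             # early stop for efficacy
win_gs = stop1 | (d2 / se(n2) > b2)
win_fixed = rng.normal(delta, se(n_fixed), n_sims) / se(n_fixed) > 1.96

print(f"power, fixed trial (N={n_fixed}/arm)     : {win_fixed.mean():.3f}")
print(f"power, group-sequential (max {n2}/arm): {win_gs.mean():.3f}")
print(f"expected per-arm N, group-sequential : {np.where(stop1, n1, n2).mean():.0f}")
```

With these illustrative values, the group-sequential design is substantially more powerful than the smaller fixed design a sponsor might otherwise select, while stopping early in a meaningful fraction of trials.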
Rather than retaining its current perspective, which has the potential, perhaps paradoxically, to decrease the safety and informativeness of trials conducted to inform product development and regulatory decision making, the draft Guideline has an opportunity to discuss the potential value of appropriately used adaptive design elements to substantially improve clinical drug development and confirmatory clinical trials.
General Comments
In Section 5.3, the Bayesian approach is largely equated with the use of informative prior distributions for borrowing of information and, perhaps, even with a static approach to borrowing of information. However, in the vast majority of applications of Bayesian methods in adaptive clinical trial design, relatively non-informative prior distributions are used, priors that are rapidly overwhelmed by accumulating data. The benefits of the Bayesian approach in this setting include the coherent inferential framework and the interpretability of adaptive rules based on posterior probability distributions or, unique to Bayesian inference, predictive probabilities.
Many modern implementations of Bayesian borrowing of information utilize a dynamic approach in which the degree of borrowing depends on the observed consistency in treatment effect between the borrowed and newly acquired data. This approach can largely mitigate risks associated with static borrowing approaches and should be explicitly mentioned as a strategy that is worthy of consideration. Such an approach often requires careful selection of the prior variance on the distribution of treatment effects, a design choice that should generally be supported by simulation studies.
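As one concrete illustration of dynamic borrowing, the following minimal sketch (ours, not the draft Guideline's; all numerical values are illustrative) uses a robust two-component mixture prior: an informative component summarizing hypothetical external control data plus a weakly informative component. The posterior weight on the informative component, and hence the degree of borrowing, falls automatically when the current data conflict with the external information; hierarchical models behave analogously:

```python
import numpy as np
from scipy.stats import norm

def mixture_posterior(ybar, se, priors, weights):
    """Conjugate normal update of a mixture-of-normals prior.
    priors: list of (mean, sd); weights: prior mixture weights."""
    post, w_new = [], []
    for (m, s), w in zip(priors, weights):
        # the marginal likelihood of ybar under each component drives reweighting
        w_new.append(w * norm.pdf(ybar, m, np.sqrt(s**2 + se**2)))
        v = 1.0 / (1.0 / s**2 + 1.0 / se**2)
        post.append((v * (m / s**2 + ybar / se**2), np.sqrt(v)))
    w_new = np.array(w_new) / np.sum(w_new)
    return post, w_new

# informative component summarizing external controls, plus a vague component
priors = [(0.0, 0.1), (0.0, 1.0)]
weights = [0.8, 0.2]

for ybar in (0.05, 0.60):   # current-trial estimate: consistent, then conflicting
    post, w = mixture_posterior(ybar, se=0.15, priors=priors, weights=weights)
    post_mean = sum(wk * m for wk, (m, s) in zip(w, post))
    print(f"observed {ybar:+.2f}: weight on informative component = {w[0]:.2f}, "
          f"posterior mean = {post_mean:+.2f}")
```

When the new estimate is consistent with the external information, the informative component dominates and substantial borrowing occurs; when the data conflict, the weight on the informative component collapses and the posterior is driven by the current trial.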
The discussion of type I error control in the setting of the use of external information will likely mislead many readers. When appropriate information is borrowed, this is inferentially equivalent to the “seeding” of a traditional standalone trial with an initial set of participants. If those data are consistent with efficacy of the experimental treatment, then the probability of a positive conclusion, if the new data arise from a population in which there is no treatment benefit, will exceed, and should not be required to be controlled at, the usual type I error rate. The increase in the “nominal” type I error rate reflects, in essence, the information and value of the external data. Maintaining traditional type I error control would require making the criteria for demonstrating efficacy sufficiently stringent to neutralize the positive effect of the promising borrowed information, an approach that would defeat the intended purpose of the design.
As currently written, the draft Guideline could be interpreted as suggesting the need for nominal type I error control, e.g., 0.025 one-tailed, in the current trial with the inclusion of borrowing. We agree that the probability of a positive trial result if all new data arise from a population experiencing no treatment benefit should be quantified, e.g., through simulation, and understood; however, the acceptable rate of a positive trial result in this context should depend on the details of the clinical and regulatory setting.
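The following minimal simulation sketch (our own, with illustrative values) quantifies this point: with an informative prior encoding promising external information, the probability of a “positive” Bayesian conclusion when all new data arise under the null exceeds the conventional 0.025, reflecting the information content of the prior rather than a flaw in the design:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m0, s0 = 0.15, 0.10     # informative prior encoding promising external data
se = 0.10               # standard error of the new trial's effect estimate
n_sims = 400_000

ybar = rng.normal(0.0, se, n_sims)          # new data arise under the null
v = 1.0 / (1.0 / s0**2 + 1.0 / se**2)       # conjugate normal posterior variance
post_mean = v * (m0 / s0**2 + ybar / se**2)
p_benefit = 1.0 - norm.cdf(0.0, post_mean, np.sqrt(v))
print(f"P(declare success | no true effect) = {np.mean(p_benefit > 0.975):.3f}")
```

With these values the probability of a positive result under the null is roughly 0.10, far above 0.025; whether that is acceptable should depend on the strength and relevance of the external information and the clinical and regulatory setting, as argued above.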
The discussion of Bayesian approaches to adaptive design needs to be separate from the discussion of borrowing of information; while Bayesian approaches are well suited for sophisticated approaches to borrowing (e.g., dynamic borrowing), these are separate concepts and frequentist methods of borrowing should also be acknowledged.
The draft Guideline focuses on trial design in the confirmatory setting and, in that context, the inclusion of Section 5.5 on exploratory trials seems out of place and confusing. The role of adaptive design in the exploratory or learn-phase setting is well established, and the balances among the needs for flexibility, efficiency, control of type I error, and integration of efficacy and safety considerations (e.g., in dose selection) are all quantitatively and qualitatively different than in a confirmatory setting. The draft Guideline risks a false equivalence, implying that exploratory trials should adhere to the requirements for confirmatory trials. While many of the general points about steps to ensure trial validity apply across these two settings, those considerations are generally not specific to adaptive designs and are covered elsewhere in ICH Guidelines and other regulatory guidance documents. We strongly recommend that the discussions of exploratory trials be removed from this Guideline, with an explicit statement that the associated considerations are discussed elsewhere.
Specific Comments
[Lines 40-43] Confirmatory randomized trials are typically complex—multicenter, multiregional, with centers that differ in clinical practice, language, and other factors—and the planning for such trials, whether adaptive or non-adaptive, requires care to maintain confidentiality and trial integrity, while simultaneously monitoring safety with the associated need for access to unblinded information by safety monitors and IDMCs. In practice, the additional logistical complexity associated with the implementation of an adaptive design is relatively minor. The need for access to, and interim analysis of, unblinded information is present in both adaptive and modern non-adaptive trials to ensure ongoing scientific validity and ethical balance. To be complete, the Guideline should acknowledge the need to consider all aspects of the design, including non-adaptive monitoring requirements, in ensuring appropriate confidentiality and trial integrity.
[Lines 52-53] The phrase “Therefore, special analysis methods for hypothesis testing and estimation that account for the adaptive design usually need to be used”, in combination with later text, suggests that the usual estimators result in significant bias. While bias often exists, it is also often of insufficient magnitude to represent a meaningful threat to the validity of the conclusions to be drawn from a trial. The Guideline should communicate the importance of quantifying bias, e.g., through simulation, and determining whether the magnitude of bias represents a meaningful threat to the validity of the clinical trial. We suggest consideration of text to read: “Bias and type I error under the proposed adaptive design should be quantified and compared to similar non-adaptive approaches, and special analysis methods for hypothesis testing and estimation that account for the adaptive design may need to be considered if the bias is found to be substantial.”
[Lines 79-83] A primary motivation for the use of adaptive designs is the limitations of non-adaptive designs. This text, which states
“This justification should discuss how the proposed design addresses inherent needs of the clinical setting and should provide an evaluation of advantages and limitations as compared to alternative designs (including non-adaptive designs), including a comparison of important trial operating characteristics (e.g., power, expected sample size, reliability of adaptation decisions) between candidate designs.”
should be modified to explicitly include the specific limitations of the non-adaptive design(s) that the adaptive design mitigates or addresses, e.g., quantitative assessments of limitations in power or in the selection of an optimal dose associated with a non-adaptive approach. The motivation for the adaptive design is inadequately communicated without identifying the limitations of a non-adaptive design in the proposed setting.
[Lines 101-102] There is an opportunity to improve clarity in the sentence that reads “The number and complexity of adaptations at the confirmatory stage should generally be limited.” As it stands, this statement is overly broad. What is overly complex, and what number of adaptations is too many, is a function of the particular clinical setting, the details of the proposed trial design, and the needs of the development program. Lines 108-118 provide an example design that is, in fact, a good design in one context and perceived to be inadequate in another. Determining whether a design is too complex or requires too many interim analyses is highly dependent on the context. Instead, please revise to state: “The complexity of proposed adaptations and the number of interim analyses should be thoroughly investigated and appropriate for the context of the trial.” Otherwise, the Guideline risks being used to support a contention that a proposed trial design cannot be considered confirmatory because it has too many interims, a position that is unsupportable out of context.
[Lines 108-118] A seamless phase 2/3 study, integrating dose finding with evaluation of possibly more than one dose in the second stage, may well have better operating characteristics and probability of selecting the optimal dose than the combination of a traditional dose-ranging study followed by the evaluation of a single dose in phase 3. The implication that a seamless design is generally inferior is overly broad and often incorrect, as the relative performance of the two strategies depends on the details of each trial design. For example, a seamless 2/3 design may naturally enroll more patients in dose finding, with improved dose selection. This section should likely be removed, and this can be done without loss of continuity of the document. Further, the statement that “(a)n adaptive design should generally not serve as a replacement for a proper dose-ranging trial” implies that an adaptive trial cannot perform equivalently or better than a traditional dose-ranging trial, which is not true; again, the relative performance depends on the specifics of each trial’s design.
[Line 127] The phrase “…anticipated rule governing the adaptation decision…” suggests that a precise rule is not required in an adaptive design, which is inconsistent with other sections of the Guideline and with requirements for well-defined operating characteristics. For example, we would not say the use of 1:1 randomization is an “anticipated” feature of a simple non-adaptive trial, and a change from this approach would be considered a major deviation from the intended design. Similarly, if a prespecified rule is not followed, e.g., due to safety data that were not a prespecified component of a dose selection rule, then that represents a deviation from the original design. Here and elsewhere, the Guideline should emphasize the critical role an IDMC has in determining when a deviation from the prespecified adaptive design is appropriate, without implying that the IDMC should have a role in determining how and when an adaptive rule is applied as an integral part of the prespecified design. The word “anticipated” appears 25 times in the draft Guideline, mostly in text that implies that prespecified rules are to be considered flexibly and in a non-binding manner, at least in the sense that modifications of these rules are allowed “within” the adaptive trial design. An IDMC is charged with safety and trial integrity, and thus must be free to recommend changes when prespecified rules clash with those charges. However, the IDMC must include in this assessment the threat to trial validity resulting from any change that invalidates the trial’s operating characteristics.
[Lines 143-148] The statement that adequate planning and pretrial discussion “…ensures the IDMC is prepared to review interim results and make adaptation recommendations during the trial while also protecting individual trial participants’ safety.” further conveys the concept that the IDMC decides, based on information not captured by the prespecified adaptive rule, whether or not specific adaptations are implemented. This should be modified to read “…ensures the IDMC is prepared to review interim results and confirm that the application of the prespecified adaptive rules remains scientifically and ethically appropriate, and that no deviation from the prespecified design is necessary to protect individual trial participants’ safety.” This allows the pivotal role the IDMC plays in maintaining scientific appropriateness and protection of participants to be emphasized, without undermining the importance of prespecification of the adaptive rules in maintaining the designed operating characteristics.
[Lines 150-154] The text:
“The extent to which the anticipated rule governing the adaptation decision needs to be adhered to at an interim analysis, however, can vary depending on the type of adaptation and the statistical inferential methods being used. It is generally recommended to use analysis methods that provide valid inference while allowing flexibility to deviate from the anticipated adaptation rule based on the overall benefit-risk assessment at an interim analysis.”
may be interpreted to suggest that highly conservative analysis methods that maintain type I error control over a wide range of possible interim decisions be used, without addressing the attendant loss of statistical efficiency and potential ethical implications (e.g., requiring a larger number of participants than would otherwise be required). The choice regarding an analysis method that allows this flexibility without compromising type I error control should be based on quantitative or semiquantitative assessment of the relative risks of the two approaches, rather than one approach being “generally” preferred.
[Lines 163-166] The phrase “If the planned statistical methods instead require strict adherence to the rule governing the interim decision to ensure valid inference (e.g., Type I error probability control), the importance of adhering to the rule should be documented in the trial protocol.” is critically important but could be misconstrued as suggesting that strict adherence is not usually required when, for a confirmatory trial, it generally is.
[Lines 168-174] There is no mention in this section on “erroneous conclusions” of type II error. This should also be discussed, as such errors also compromise population health, and to be consistent with existing text on lines 196-198.
[Lines 181-185] This discussion on frequentist vs. Bayesian feels out of place in this section and removing it would improve the logical flow of the draft Guideline.
[Lines 186-198] This section erroneously implies that an adaptive design results in less data or information. A trial with early stopping or sample size re-estimation may result in less data than a trial with full follow-up or one that proceeds to its maximum N. However, it is often true that if the adaptive trial were not selected, the sponsor would have chosen a smaller or shorter-duration trial. Thus, the inclusion of adaptations can often be accurately reframed as “going longer” or increasing sample size—but only when necessary—relative to the non-adaptive trial that would have been conducted. Adaptive trials with multiple doses often result in more data on the target dose. Additionally, this section focuses on safety and secondary endpoints. It is important in general to ensure adequate power for all endpoints and to limit type II error. This section should be expanded to discuss type II error in general and to acknowledge that a well-designed adaptive trial may result in greater information, precision, and power in evaluating outcomes associated with the arm(s) that ultimately turn out to be of greatest regulatory importance.
[Lines 206-208] The text “In the trade-off between bias and variance, the expectation is generally for limited to no bias in the primary estimate of the treatment effect.” suggests that bias should be minimized without regard to the effect on variance and without any quantification of the relative contributions of bias and variance to the validity of the treatment effect estimate or the probability of the trial reaching the correct overall conclusion regarding treatment efficacy. In fact, excess variance, with the associated type II error, may be a much greater risk than the effect of bias on type I error, and only by quantifying the relative risks can an optimal design be determined.
[Lines 218-220] The text “Adaptive design proposals should therefore evaluate bias and variability of treatment effect estimates and provide support of their reliability.” should be strengthened to emphasize the importance of quantification of bias and variance in making design decisions.
[Lines 221-222] The statement “For some designs, specific estimation methods have been derived with improved reliability, and these should be used.” is too broad and should be qualified to state that these methods “should be used when they meaningfully improve the accuracy of estimates of treatment effect.”
[Line 254] Please consider rewording the text that currently reads “A fundamental aspect of many types of adaptive designs is the need for some level of access to unblinded interim results.” as “A fundamental aspect of many types of adaptive and non-adaptive designs, e.g., when monitored by an IDMC, is the need for some level of access to unblinded interim results.” to more accurately convey the general need for access to such efficacy and safety data.
[Lines 260-262] The phrase “the IDMC can have an additional role of reviewing interim data for the purpose of implementing the planned adaptations”, although unclear in meaning, may be read as implying the IDMC should decide whether a planned adaptation should be implemented as an integral part of the prespecified design. Please consider rewording as “the IDMC can have an additional role of reviewing interim data for the purpose of verifying the continued scientific appropriateness and safety of implementing the planned adaptations” to better characterize the IDMC as a safety check rather than deciding the trial design based on knowledge of interim results.
[Lines 275-278] Adaptive designs should be chosen to ensure that the design needs are met and operating characteristics optimized through careful quantification of the benefit and risks of the design and its alternatives. Suggesting that adaptive rules should be selected to limit the ability of back calculating the effect size of the trial if an interim decision is known may potentially require the use of a design that is otherwise suboptimal. To maintain balance, the tradeoff associated with taking the approach mentioned should be described. For example, the document could state:
“However, limiting the information communicated by knowledge of interim decisions may require compromise of other design goals, such as avoiding the enrollment of either smaller or larger populations than needed, or avoiding exposure to treatment arms or doses that appear to be less effective.”
Even with a traditional, frequentist group-sequential trial, knowledge that a trial has continued after a planned interim analysis provides substantial information about the observed interim treatment effect; however, not including an interim analysis results in risks to both participants and trial sponsors that are generally of greater concern.
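A minimal sketch (ours, with illustrative bounds and sample sizes) of how much information continuation alone conveys: knowing that the trial passed its first interim implies the interim z-statistic fell in the continuation region, a substantial truncation of the distribution of the interim treatment-effect estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, sigma, delta = 200, 1.0, 0.2        # per-arm interim N, SD, true effect (illustrative)
b_fut, b_eff = 0.5, 2.797               # futility and efficacy bounds at the interim
se1 = sigma * np.sqrt(2 / n1)

est = rng.normal(delta, se1, 500_000)   # interim treatment-effect estimates
z = est / se1
cont = (z > b_fut) & (z < b_eff)        # trials that continue past the interim
print(f"unconditional mean interim estimate      : {est.mean():.3f}")
print(f"mean interim estimate given continuation : {est[cont].mean():.3f}")
print(f"95% range given continuation             : [{np.quantile(est[cont], 0.025):.3f}, "
      f"{np.quantile(est[cont], 0.975):.3f}]")
```

The point is not that this leakage is unimportant, but that it is intrinsic to any monitored trial, adaptive or not, and must be weighed against the risks of forgoing interim analyses altogether.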
[Lines 311-313] The following discussion is general and could be applied to both Bayesian and frequentist methods. Suggest rewording as “While the discussion focuses on designs using frequentist approaches for statistical analysis, many of the considerations apply to both Bayesian and frequentist methods.”
[Lines 331-334] The text that reads
“In addition, methods for calculating the primary treatment effect estimate and associated confidence interval that adjust for the interim analyses should be planned to limit bias and improve performance on measures such as the mean squared error (Section 3.4)”
should directly acknowledge and address the bias-variance tradeoff. The focus should be on maximizing the probability that the trial result is correct overall, a combination of minimizing false-positive and false-negative results. In many contexts, the risk of a false-negative result is a substantial component of the risk of “getting the wrong answer,” and the goal of minimizing this risk may appropriately motivate the choice of estimation procedures that result in some bias. The key is that the tradeoff between different measures of validity in estimation is transparent, quantified, and justified in terms of maximizing the trial’s potential to support improvements in treatment and health of the affected population.
[Lines 335-338] The text, stating
“A trial that is stopped early for efficacy will provide less information (e.g., because of a smaller sample size and/or shorter duration of follow-up) for the evaluation of safety, important secondary efficacy endpoints, and relevant patient subgroups, which are important for the overall benefit-risk assessment”
has important underlying assumptions and, if these are not communicated clearly, risks being misinterpreted in an overly broad manner. For example, a smaller adaptive trial that allocates a larger fraction of the total population to the arm or dose that ultimately is of the greatest interest may, in fact, provide more information of importance than a larger non-adaptive trial that allocates participants equally across arms. Please consider rewording and expanding as “For any particular trial design, adaptive or non-adaptive, a trial that is stopped early for efficacy will generally provide less information than one that proceeds to its maximum N.” The next existing sentence then follows naturally, “Therefore, the timing of interim analyses should be selected such that the sample size is large enough and the duration of follow-up is long enough to ensure sufficient information is available for decision-making” and no modification of it is needed.
[Lines 345-349] There are multiple important motivations for using adaptive designs with early stopping for efficacy, e.g., there is no effective therapy for a condition that is not life-threatening but causes substantial morbidity or suffering. To avoid an overly narrow interpretation of the point being made, please consider revising this sentence to read “Interim analyses with the potential for early stopping are often considered in circumstances where there are compelling ethical reasons (e.g., the primary endpoint is survival), and efficacy stopping rules typically require highly persuasive results in terms of both the magnitude of the estimated treatment effect and the strength of evidence of an effect.” Deleting “Furthermore” and “more” makes this a statement about one utilization of these approaches rather than suggesting the use is limited to this setting.
[Lines 350-359] In the setting of “overrun,” with the arrival of additional outcome data after an interim decision to stop a trial for efficacy, we agree wholeheartedly with the recommendation that all data be completely and transparently reported; it is also important, however, that the prespecified design define which dataset—the dataset that resulted in the decision to stop or the complete dataset—is to be considered the primary trial result. Either choice is defensible: (i) if the dataset that motivated the stopping decision is considered primary, then some regression to the mean should be expected in the final data and a “loss” of nominal statistical significance should not alter the overall conclusion; and (ii) if the complete dataset is considered primary, then there is a non-zero but usually small chance that the final result may not meet the original stopping threshold and the trial must be interpreted as negative. In either case, realistic simulations—implementing various design decisions—can be used to understand the magnitude and likelihood of these occurrences and their effects on error rates, and to support the ultimate design choice.
Specifically, we suggest that the following be inserted in place of the sentence currently on lines 356-359:
“When such “overrun” is possible or occurs, it is critically important that all data be completely and transparently reported. Moreover, during the design of the trial, it is important that the prespecified design defines which dataset—the dataset that resulted in the decision to stop or the complete dataset—is to be considered the primary trial result. Either choice is defensible: (i) if the dataset that motivated the stopping decision is considered primary, then some regression to the mean should be expected in the final data and a “loss” of nominal statistical significance should not alter the overall conclusion; and (ii) if the complete dataset is considered primary, then there is a non-zero chance that the final result may not meet the original stopping threshold and the trial must be interpreted as negative. In either case, realistic simulations—implementing various design decisions—can be used to understand the magnitude and likelihood of these occurrences and their effects on error rates, and to support the ultimate design choice.”
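As an illustration of the kind of simulation we have in mind, the following minimal sketch (assumptions and numerical values ours) simulates trials stopped early for efficacy, quantifies the regression to the mean in the estimate after overrun data arrive, and estimates how often the complete dataset falls below either the nominal significance threshold or the original interim bound:

```python
import numpy as np

rng = np.random.default_rng(4)
delta, sigma, n_sims = 0.25, 1.0, 400_000
n1, n2 = 200, 240            # per-arm N at the stopping decision and after overrun
b1 = 2.797                   # interim efficacy bound

se = lambda n: sigma * np.sqrt(2 / n)
d1 = rng.normal(delta, se(n1), n_sims)               # estimate at the interim
d_over = rng.normal(delta, se(n2 - n1), n_sims)      # estimate from overrun data
d2 = (n1 * d1 + (n2 - n1) * d_over) / n2             # estimate on the complete data
stop = d1 / se(n1) > b1                              # trials stopped early for efficacy

print(f"P(stop early)                          : {stop.mean():.3f}")
print(f"mean estimate at stop / complete data  : "
      f"{d1[stop].mean():.3f} / {d2[stop].mean():.3f}")
print(f"P(complete z < 1.96 | stopped)         : {np.mean(d2[stop] / se(n2) < 1.96):.4f}")
print(f"P(complete z < interim bound | stopped): {np.mean(d2[stop] / se(n2) < b1):.3f}")
```

With these values, regression to the mean is visible in the complete-data estimate, the chance of losing nominal significance is very small, and the chance of falling back below the original interim bound is modest but non-negligible, exactly the quantities a prespecified design choice should be informed by.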
[Lines 392-394] The current text misses the opportunity to discuss the risks of using nuisance parameter estimates from data aggregated across treatment groups (“blinded” data). While this may minimize the risk to trial integrity—an advantage that should be explicitly stated—it may also substantially increase the risk of making an erroneous interim decision, because estimates of the nuisance parameters based on pooled data are influenced by the treatment effect, which is not accounted for. For example, the sample size may be increased unnecessarily when there is a larger treatment effect, because the pooled estimate of the variance is inflated by the treatment effect.
Please consider replacing the word “should” on line 392 with “may” and inserting the following text on line 394, between the two existing sentences:
“While this approach may facilitate protecting the integrity of the trial, it risks introducing bias in the sample size re-estimation, because estimates of the nuisance parameters based on pooled data are influenced by the treatment effect, which is not accounted for. For example, the sample size may be increased unnecessarily when there is a larger treatment effect, because the pooled estimate of the variance is inflated by the treatment effect.”
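The inflation is easy to quantify. For 1:1 allocation with a treatment effect delta and within-arm standard deviation sigma, the pooled (“blinded”) variance is approximately sigma^2 + delta^2/4, so a blinded re-estimation is pushed upward precisely when the treatment works. A minimal numerical check (ours, with illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, delta, n = 1.0, 0.5, 200_000            # illustrative values
pooled = np.concatenate([rng.normal(0.0, sigma, n),      # control outcomes
                         rng.normal(delta, sigma, n)])   # treatment outcomes
print(f"within-arm variance: {sigma**2:.3f}")
print(f"pooled variance    : {pooled.var():.3f} "
      f"(theory: sigma^2 + delta^2/4 = {sigma**2 + delta**2 / 4:.3f})")
```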
[Line 457] The enrichment discussion primarily involves experiments with two population subgroups (such as biomarker positive and negative). This section should also include a discussion of basket trials, where multiple indications or subgroups might be considered, e.g., in rare oncologic disease settings. Within this setting, the use of Bayesian models or similar frequentist strategies should be discussed, particularly as the usual discussion of bias becomes problematic. For example, it is well known that, even when raw estimates are unbiased in isolation, the highest raw estimate from a group of raw estimates is biased upward, so the use of individually unbiased estimators does not guarantee unbiased estimates after selection of the population of interest. Hierarchical models are intended to address this form of bias and produce superior estimates.
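A minimal simulation sketch (ours, with illustrative values) of the selection bias described here: with several individually unbiased basket-level estimates, the largest raw estimate is biased upward, while a simple empirical-Bayes hierarchical shrinkage estimator (a stand-in for the fuller Bayesian hierarchical models mentioned above) substantially reduces the bias of the estimate for the selected basket:

```python
import numpy as np

rng = np.random.default_rng(6)
K, se, n_sims = 8, 0.15, 200_000
theta = np.zeros(K)                           # true basket effects: all zero

raw = rng.normal(theta, se, (n_sims, K))      # individually unbiased raw estimates
best = raw.argmax(1)                          # basket selected as most promising

# method-of-moments empirical Bayes: shrink each estimate toward the overall mean
tau2 = np.maximum(raw.var(1, ddof=1) - se**2, 0.0)   # estimated between-basket variance
shrink = (tau2 / (tau2 + se**2))[:, None]
eb = raw.mean(1, keepdims=True) + shrink * (raw - raw.mean(1, keepdims=True))

rows = np.arange(n_sims)
print("true effect of the selected basket       :  0.000")
print(f"mean raw estimate of the selected basket : {raw[rows, best].mean():+.3f}")
print(f"mean shrunken estimate, selected basket  : {eb[rows, best].mean():+.3f}")
```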
[Lines 472-473] Please see comment above re lines 331-334. Please consider rewording to read:
“…and estimates that reduce mean-square error or bias should be considered if the evaluation of the conventional treatment effect estimates demonstrates that the likely magnitude of the bias is sufficient to risk compromising the interpretation of the trial.”
[Lines 480-483] In the setting of an adaptive “enrichment” trial with prespecified population selection criteria, the requirement that the adaptive trial design must ensure “…that the trial will provide adequate information on the benefit-risk profile in the complementary subpopulation” is overly burdensome and far beyond what is required of a traditional non-adaptive trial. If a positive trial, after the selection of a smaller target subpopulation, is intended only to support use of the therapy in that subpopulation, determining the benefit-risk profile in a different population is unnecessary. A sponsor that runs a non-adaptive confirmatory trial to support regulatory approval in one population does not have to (and rarely does) identify the risk-benefit of the treatment in a complementary population for which the treatment is not intended.
[Lines 505-507] The adverse consequences on trial efficiency associated with allowing flexibility in adherence to adaptation rules need to be enumerated and discussed; the current text is silent on these issues. Since allowing flexibility while maintaining desired operating characteristics, e.g., type I error control, generally requires more stringent thresholds for declaring efficacy than if the prespecified adaptation rules can be assumed to be followed, the flexibility can result in a requirement for a larger sample size or reduced power. A design that requires adherence to prespecified adaptation rules also means that, if a decision is made to deviate from those rules, e.g., in response to data patterns in secondary or safety endpoints, then the designed operating characteristics may not be preserved. Thus, the decision to allow—or not allow—such flexibility within the prespecified design should be motivated by an explicit consideration of the advantages and disadvantages of both approaches, rather than assuming allowing flexibility is uniformly preferable.
[Lines 529-534] The concerns regarding the use of response-adaptive randomization (RAR) detailed here assume a particularly naïve implementation of RAR that would be inconsistent with current best practices in a setting in which a change in overall prognosis over time is plausible. It should go without saying that any clinical trial design strategy can be implemented poorly, with attendant compromise of the validity of the trial result; it is not a valid criticism of a technique that it can be done poorly. Line 529 could be revised to state “…valid statistical inference. If poorly or naively implemented, RAR designs…”.
[Lines 591-592] Not all adaptations require IDMC review and approval, e.g., routine updates to RAR proportions within specified bounds, and this possibility should be acknowledged within the draft Guideline. Please consider adding, between lines 590 and 591, a new paragraph to read:
“Many, but not all, adaptations to be implemented based on prespecified decision rules should be reviewed by an IDMC or similar body prior to implementation, to ensure that the prespecified rule remains scientifically and ethically appropriate. In this context, the IDMC should be aware that deviations from the prespecified rules may compromise the integrity and operating characteristics of the trial, so should only occur when necessary. There may be some more routine adaptations, e.g., routine updates to RAR proportions within specified bounds, that do not require IDMC review prior to implementation.”
[Lines 613-629] An important use of simulation that is missing from this overview is the ability to quantify bias in the estimation of treatment effects and, specifically, to determine whether the bias, if any, is of sufficient magnitude to require the use of alternative estimation methodology or the alteration of a design feature, e.g., the timing of a first interim analysis. This use is mentioned briefly on line 638 and in more detail on lines 646-647; however, it should be included in the introductory summary of the uses of simulation because of its importance.
Please consider adding the following text on line 625, between the two existing sentences:
“An important use of simulations is to quantify bias in the estimation of treatment effects and, specifically, to determine whether the bias, if any, is of sufficient magnitude to require the use of alternative estimation methodology or the alteration of a design feature, e.g., the timing of a first interim analysis.”
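As an illustration of this use of simulation, the following minimal sketch (assumptions and numerical values ours) estimates the bias of the conventional treatment-effect estimate under early efficacy stopping, as a function of the timing of the first interim analysis. With these illustrative values the bias is real but modest at every timing, exactly the kind of quantitative finding that should inform whether alternative estimation methods are needed:

```python
import numpy as np

rng = np.random.default_rng(7)
delta, sigma, n_max, b1 = 0.2, 1.0, 400, 2.797   # illustrative design values
n_sims = 400_000
se = lambda n: sigma * np.sqrt(2 / n)

for frac in (0.25, 0.50, 0.75):              # information fraction at the interim
    n1 = int(frac * n_max)
    d1 = rng.normal(delta, se(n1), n_sims)               # interim estimate
    d_rest = rng.normal(delta, se(n_max - n1), n_sims)   # post-interim data
    d_full = (n1 * d1 + (n_max - n1) * d_rest) / n_max
    stop = d1 / se(n1) > b1                              # early efficacy stop
    reported = np.where(stop, d1, d_full)                # conventional estimate at trial end
    print(f"interim at {frac:.0%} information: "
          f"bias of conventional estimate = {reported.mean() - delta:+.4f}")
```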
[Lines 635-642] The preceding lines discuss the need for simulation to adequately explore the chosen design and compare its performance to relevant, potentially simpler designs. However, the text on lines 635-642 could be interpreted as expanding this discussion to suggest including in a regulatory submission the performance of other designs of potentially equal complexity. While we wholeheartedly endorse such simulations in the planning phase of a trial, we recommend excluding such non-selected designs from a submission to regulators. The selection of a design among equally complex options is typically governed by sponsor-specific criteria (cost of interim analyses versus efficiency, expected time to completion relative to the competitive landscape, etc.). If the selected design meets the criteria outlined in the document and is superior to alternative, simpler designs, it is overly burdensome to both prepare and review the full, often months-long process of design selection among equally complex options. We suggest removing all text suggesting a need to submit simulation results for other designs of potentially equal (or greater) complexity that are not being proposed.
[Lines 677-678] For many adaptive designs that require simulation for determination of type I error risk, the direction of the effects of nuisance parameters and other factors on type I error is easily known. This allows the determination of the “worst case” type I error risk within the plausible range of these parameters. Thus, the “additional uncertainty” mentioned here may or may not exist and it would be more accurate to write “Thus, there may be additional uncertainty for designs…” and “When additional uncertainty exists, additional justification…”.
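A minimal sketch (ours; the design and parameter range are illustrative) of the kind of worst-case evaluation we mean, here locating the maximum type I error of a simple one-sided Wald test for proportions over a plausible range of the nuisance parameter, the control response rate:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 150                                   # per-arm sample size (illustrative)
n_sims = 200_000

def type1(p0):
    """Simulated type I error of a one-sided Wald test when both arms share rate p0."""
    xt = rng.binomial(n, p0, n_sims) / n
    xc = rng.binomial(n, p0, n_sims) / n
    se = np.sqrt(xt * (1 - xt) / n + xc * (1 - xc) / n)
    se[se == 0] = np.inf                  # guard against degenerate samples
    return np.mean((xt - xc) / se > 1.96)

grid = np.linspace(0.1, 0.5, 9)           # plausible range for the control rate
t1 = [type1(p) for p in grid]
worst = int(np.argmax(t1))
print(f"worst-case type I error {t1[worst]:.4f} at control rate {grid[worst]:.2f}")
```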
[Lines 758-759] Borrowing of information, and the expected reduction in mean-square error associated with some approaches, is not unique to Bayesian estimation. The improvement in mean-square error with James-Stein estimation is long established in a frequentist context, as is the improvement obtained with frequentist hierarchical random-effects models. The point made here, that borrowing information that is not “fit for purpose,” e.g., is not representative of the likely true treatment effect, will increase uncertainty or bias in estimated treatment effects, is true in both Bayesian and frequentist contexts. Both statistical paradigms are vulnerable to non-representative data, whether those data are analyzed alone or incorporated indirectly through borrowing. Please consider rewording the end of the paragraph, beginning on line 756, to read: “Ensuring that a prior accurately reflects complete and relevant available information is critical to ensuring valid inference.”
[Lines 767-770] The statement that “(p)atient-level data are generally expected” when external information is used is overly simplistic and limiting. As noted, when such data are readily available, they may be of tremendous value; however, the advantages of using aggregated data for which, e.g., only summary statistics are available, may outweigh the disadvantages. In many cases, such data are down-weighted, and this partially mitigates the risks associated with the inability to adjust for patient-level covariates in the analysis.
Please consider revising the end of the paragraph, beginning on line 767, to read:
“Patient-level data, if available, are generally of the greatest value because they allow a thorough evaluation at the planning stage of the relevance of the external information and may facilitate strategies to address potential conflict between the prior and current trial data at the assessment stage. However, using aggregated data, e.g., for which only summary statistics are available, may also be advantageous compared to omitting relevant external information altogether. In many cases, such data are down-weighted, to mitigate the risks associated with the inability to adjust for patient-level covariates in the analysis.”
[Lines 771-772] This text could be misinterpreted as uniformly suggesting a static approach to borrowing. It is often a poor decision to pre-specify a fixed, static “amount of borrowing from the external data”. Instead, the recommendation should be that the sponsors pre-specify “the exact and quantitative approach to borrowing, including whether borrowing is static or dynamic, the structure of the inferential model, and all prior probability distributions” or something similar. It would also be useful to explicitly state that the choice of prior distributions in hierarchical models, e.g., used for dynamic borrowing, should generally be supported by simulations evaluating operating characteristics.
[Lines 808-824] In discussing the considerations when a patient may contribute information both before and after an interim analysis, the draft Guideline should explicitly mention the potential value of simulation of this data structure and timing to quantify the effects, if any, on operating characteristics including type I error control. In many cases, the quantitative effects—while real—are of an insufficient magnitude to constitute a meaningful threat to trial validity. This could be added as a new paragraph after the paragraph that ends on line 824. Suggested text for that paragraph:
“Alternatively, there may be value in simulating the proposed trial and associated data structure, including the data from participants who contribute information both before and after an interim analysis, to quantify the effects, if any, on operating characteristics including type I error control. In many cases, the quantitative effects—while real—are of an insufficient magnitude to constitute a meaningful threat to trial validity.”
[Lines 858-859] The description of possibly complex adaptive elements in an informed consent document may be confusing and even misleading to prospective trial participants, so their inclusion should be considered on a case-by-case basis by the appropriate ethics committees or equivalent bodies. While it is critically important that individual prospective participants be informed regarding the goals of the trial and the current state of knowledge, what may happen later in the trial may be largely irrelevant to their own benefit-risk evaluation. For example, for a prospective participant considering enrollment in the first stage of an adaptive trial with population enrichment, it is important that they be informed that it is unknown whether there will be benefit, but it may not be useful to know that the inclusion/exclusion criteria may be changed, possibly years after their involvement is completed. In some cases, it may be appropriate or necessary to modify the informed consent document after an adaptation, e.g., if an active arm is dropped from a multi-arm trial, but the possibility of that adaptation may not be information that is useful to prospective participants.