ICH E20 Reactions: Group Sequential Designs
By Kert Viele
Note for this series – The ICH E20 draft guidance on adaptive designs is out. Its mere existence is evidence of the growth of adaptive designs in the past decades, with patients and sponsors benefitting from efficient designs that answer modern research questions. As a guidance for sponsors submitting adaptive designs to regulators, one of its main purposes is to identify potential points of contention, preferred paths to pursue or avoid, and other potential problems. It’s not a guidance on if or when to choose an adaptive design, nor an explainer.
In this series I want to focus on several different adaptive designs and our experience in how and why sponsors choose (or don’t choose!) an adaptive design over a non-adaptive design, explain some of the issues discussed in the ICH E20 draft in more detail, and provide my quick reactions to the draft. In many (most?) places I agree with much of the draft. In other places I may disagree a bit, but that disagreement is down in the weeds. And in a couple of places, I think the draft could be greatly improved.
I’ve tried to maintain a common format to these entries, focusing on the motivation for the adaptive design, a simple case study to show how the design is implemented, quantification of the benefits of the design, a discussion of the risks, and a discussion of the corresponding ICH E20 draft text. I’ve tried to italicize the main points in the explainer, and place the ICH E20 comments together at the end.
Today’s blog, the first in the series, is on group sequential designs.
Why use a group sequential?
Group sequential designs (GSDs) are among the oldest adaptive designs. They are motivated by two related ideas.
First, during design we have limited information about the treatment effect, often pilot or phase 2 data with considerable uncertainty. Suppose your phase 2 was promising, with a meaningful 5-point effect on some scale and a confidence interval of -1 to 11. You determine that 700 patients will provide 90% power for a 5-point effect. But the effect isn’t known to be 5. Suppose 3 is marketable, and 7 is also possible. While a 5-point effect needs 700 patients, a 3-point effect would need almost 1950, and a 7-point effect would only need 350. These are massive differences, all for plausible effects given your past data. If you choose a nonadaptive design with a fixed sample size, you run a large risk of being either underpowered or overpowered: either missing a meaningful effect or expending vastly more resources and time than necessary.
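As a rough check on those numbers, here is a minimal sketch using the standard two-sample normal-approximation sample size formula. The common standard deviation of about 20 points is my assumption, chosen only because it approximately reproduces the sample sizes quoted above; the exact numbers depend on the endpoint and test actually used.

```python
from scipy.stats import norm

def total_sample_size(delta, sd, alpha=0.025, power=0.90):
    """Total N (both arms, 1:1 randomization) for a two-sample comparison
    of means, normal approximation, one-sided alpha."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    n_per_arm = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return 2 * n_per_arm

# An assumed common SD of 20 points roughly reproduces the numbers quoted above.
for delta in (3, 5, 7):
    print(f"effect = {delta}: total N ~ {total_sample_size(delta, sd=20):.0f}")
```

The required sample size scales with 1/delta squared, which is why a plausible range of effects translates into such a wide range of sample sizes.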
Second, when we design a trial for 90% power, we are buying an expensive insurance policy against bad luck. Suppose we run a 90% powered trial and observe exactly the effect used for powering. What is the p-value? It’s not a one-sided 0.025, it’s a one-sided 0.0006. Why so small? When we aim for 90% power, much of that 90% is there to account for the times our observed effect is smaller than the truth. But at least half the time, our observed effect is not smaller than the truth, and then we don’t need that large a sample to obtain convincing evidence.
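The 0.0006 figure follows directly from the usual power calculation: a trial with one-sided alpha = 0.025 and 90% power is sized so that the assumed effect sits (z_alpha + z_beta), about 3.24, standard errors from zero. Observing exactly that effect therefore yields z of about 3.24. A quick check:

```python
from scipy.stats import norm

alpha, power = 0.025, 0.90
# A 90%-powered design places the assumed effect (z_alpha + z_beta) standard errors from zero.
z_observed = norm.ppf(1 - alpha) + norm.ppf(power)   # ~3.24
p_one_sided = 1 - norm.cdf(z_observed)
print(round(z_observed, 2), round(p_one_sided, 4))   # ~3.24, ~0.0006
```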
Together, these suggest using flexible sample sizes, with interim analyses allowing the trial to stop early when the research question has been answered.
What does this look like in practice?
Suppose we are running a trial with a dichotomous endpoint (responses are good). We believe the control rate is about 30% and hope to increase that to 50% with our new therapy. If we were to run an N=200 study (100 per arm), we would achieve about 83% power.
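That power figure is easy to reproduce with the normal approximation for a two-proportion comparison (one-sided alpha = 0.025, unpooled variance); exact or simulation-based calculations will differ slightly.

```python
from scipy.stats import norm

p_ctrl, p_trt, n_per_arm, alpha = 0.30, 0.50, 100, 0.025
# Standard error of the difference in proportions under the alternative.
se = (p_ctrl * (1 - p_ctrl) / n_per_arm + p_trt * (1 - p_trt) / n_per_arm) ** 0.5
z_alpha = norm.ppf(1 - alpha)
power = norm.cdf((p_trt - p_ctrl) / se - z_alpha)
print(f"approximate power: {power:.3f}")   # roughly 0.83-0.84
```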
With a group sequential design, we could place interim analyses at N=100, 120, 140, 160, 180, and 200, with a final analysis at N=220. Some immediate questions. Why did the maximal sample size go up? Generally, group sequential designs with the same maximal sample size have slightly less power than a fixed trial with that maximum. We can either accept the slight power loss (often 1-3%) or slightly increase the maximum N, as here. Why so many interims? They may not all be needed. Generally, more interims are better statistically, but there are diminishing statistical returns for each additional interim, and there are operational costs. In practice we design trials with many interims and later see if we can remove some with minimal statistical cost. We may find that 100, 140, 180, 220 does almost as well (or 100, 160, 220, etc.). Final designs often have 2-5 interims, but there is no required minimum or maximum. “You should have two interims” is an old myth. Maybe that is the right answer, maybe not. Pick the right number of interims for your trial.
At each interim analysis, we see whether the data are sufficiently compelling to stop, or whether the trial should continue. The table below shows the p-value required for success at each interim (you could also be Bayesian and use posterior probabilities). We have chosen O’Brien-Fleming bounds for success, a common choice that maintains most of the power of a fixed trial while providing substantial sample size savings. Importantly, these bounds account for the multiplicity of looking at the data multiple times. Even though we are looking at the data repeatedly, our chance of making a type 1 error, in total, is still a one-sided 2.5% (the per-interim values need not add to 2.5% because the interim results are correlated).
[Table: one-sided p-value thresholds (O’Brien-Fleming success bounds) at each interim analysis]
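To illustrate why bounds of this type control the overall error even though the per-look thresholds do not add to 2.5%, here is a small Monte Carlo sketch. It uses the canonical joint distribution of interim z-statistics and O’Brien-Fleming-type bounds of the form z_k = c / sqrt(t_k), with c calibrated by simulation. The resulting thresholds are illustrative only and will not exactly match the bounds of the design described here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
looks = np.array([100, 120, 140, 160, 180, 200, 220])   # total N at each analysis
t = looks / looks[-1]                                    # information fractions
n_sims = 200_000

# Under the null, interim z-statistics follow the canonical joint distribution:
# z_k = S(n_k) / sqrt(n_k), where S is a sum of independent unit-variance increments.
gaps = np.diff(np.concatenate(([0], looks)))             # new information between looks
S = (rng.standard_normal((n_sims, len(looks))) * np.sqrt(gaps)).cumsum(axis=1)
z = S / np.sqrt(looks)

# O'Brien-Fleming-type bounds z_k = c / sqrt(t_k): very strict early, close to c at the end.
# Scan c to find the value keeping the overall chance of ever crossing near 2.5%.
for c in (2.00, 2.05, 2.10, 2.15, 2.20):
    bounds = c / np.sqrt(t)
    overall = (z > bounds).any(axis=1).mean()
    first_look_p = 1 - norm.cdf(bounds[0])
    print(f"c={c:.2f}: first-look p threshold={first_look_p:.4f}, "
          f"overall type 1 error={overall:.4f}")
```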
We do not recommend success stopping alone. Many investigational therapies don’t work, and stopping for futility is, in my opinion, the single most important adaptation, allowing patients and resources to be reallocated to more promising investigations. For futility, we use the predictive probability of success at each interim (frequentist methods are also available). Using Bayesian methods, we compute the predictive probability that our trial will be successful in the future, given our current data. If that predictive probability falls below 5%, we stop the trial for futility. The 5% is a sponsor choice. More aggressive futility rules are better at stopping “bad” therapies, but they can mistakenly catch some good therapies with unlucky early data, reducing power. Sponsors need to find a balance that works for them (often 1-10%).
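Here is a minimal sketch of how a predictive probability of success can be computed at an interim, assuming Beta(1,1) priors on each arm and a simple one-sided z-test at the final N=220 analysis. The interim data in the example (15/50 control vs 20/50 treatment responders) are made-up numbers for illustration, not the rule used in the design above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def predictive_prob_success(x_c, n_c, x_t, n_t, n_final_per_arm=110,
                            alpha=0.025, n_draws=20_000):
    """Predictive probability that the final one-sided z-test succeeds,
    given interim data and Beta(1,1) priors on each arm's response rate."""
    m_c = n_final_per_arm - n_c            # control patients still to enroll
    m_t = n_final_per_arm - n_t            # treatment patients still to enroll
    # Draw each arm's response rate from its posterior ...
    p_c = rng.beta(1 + x_c, 1 + n_c - x_c, n_draws)
    p_t = rng.beta(1 + x_t, 1 + n_t - x_t, n_draws)
    # ... then simulate the remaining patients from the posterior predictive.
    fut_c = rng.binomial(m_c, p_c)
    fut_t = rng.binomial(m_t, p_t)
    pc_hat = (x_c + fut_c) / n_final_per_arm
    pt_hat = (x_t + fut_t) / n_final_per_arm
    se = np.sqrt(pc_hat * (1 - pc_hat) / n_final_per_arm
                 + pt_hat * (1 - pt_hat) / n_final_per_arm)
    z = (pt_hat - pc_hat) / np.maximum(se, 1e-12)
    return (z > norm.ppf(1 - alpha)).mean()

# Example: at the N=100 interim, 15/50 control vs 20/50 treatment responders.
print(predictive_prob_success(x_c=15, n_c=50, x_t=20, n_t=50))
```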
To conduct the trial, at each interim analysis we have a third party, separate from the sponsor to maintain blinding and operational secrecy, look at the data to determine if the trial should stop for success or futility. This result is typically reviewed by a DSMB and the sponsor only receives the recommendation “continue” or “stop”. This continues until the trial stops or reaches its maximal sample size.
Quantifying the benefits
As with any trial, we can compute the trial’s power (the probability of declaring efficacy if the therapy works) and type 1 error (the probability of declaring efficacy if the therapy is a null). In an adaptive design the sample size is also random, so we can compute the probabilities the trial will stop for success or futility at any given sample size, and the expected (average) sample size of the trial.
The table below gives the probability of success and futility at each interim, for the rules described above, when the therapy is a null (control rate 30%, therapy rate 30%). For example, in the table we see there is a 0.3% chance the trial stops for success at the N=100 interim, while there is a 58.9% chance the trial stops for futility at the N=100 interim.
[Table: probability of stopping for success or futility at each interim under the null scenario (control 30%, therapy 30%)]
When the therapy is a null, the type 1 error is controlled under 2.5%, and almost 80% of trials stop at one of the first three interims (N=100, 120, 140). The average sample size of the trial is 123 (no trial stops at exactly 123; this is an average over trials stopping at 100, 120, 140, etc.). While the maximal sample size is N=220 (as opposed to N=200 for our fixed trial), we only reach the maximal sample size 3.6% of the time. In repeated use, for null therapies this GSD will save almost 40% of resources compared to repeatedly running a fixed nonadaptive trial.
The next table gives the probability of success and futility at each interim when the therapy is effective (control rate 30%, therapy rate 50%).
[Table: probability of stopping for success or futility at each interim under the alternative scenario (control 30%, therapy 50%)]
The overall power of the trial is 83.2%, slightly higher than but essentially identical to the fixed trial. The trial has a meaningful probability of stopping at each interim analysis, but only a 10.6% chance of needing the maximal N=220 sample size. The average sample size under the alternative hypothesis is just under 150 patients. This GSD saves the sponsor 25% of the patients with repeated use on effective therapies. Note that we stop early for futility about 11% of the time, even in this scenario where the therapy is effective. Most of these stops, but not all, reflect situations where the trial would not ultimately have been successful (i.e., the question is not whether the trial is successful, but whether it is unsuccessful at a smaller sample size or at a larger one).
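Tables like the two above are typically produced by simulating the full design many times under each scenario. Here is a minimal sketch of that kind of simulation. It uses the illustrative O’Brien-Fleming-type success bounds from the earlier sketch and a crude “stop if the treatment is not trending better” rule in place of the predictive-probability futility rule, so its numbers will not reproduce the tables above; it only shows the mechanics of estimating power and expected sample size.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
looks_per_arm = np.array([50, 60, 70, 80, 90, 100, 110])    # N = 100, ..., 220 total
t = looks_per_arm / looks_per_arm[-1]
success_z = 2.10 / np.sqrt(t)   # illustrative O'Brien-Fleming-type bounds (see earlier sketch)
futility_z = 0.0                # crude stand-in for the predictive-probability futility rule

def operating_characteristics(p_ctrl, p_trt, n_sims=50_000):
    n_max = looks_per_arm[-1]
    ctrl = rng.binomial(1, p_ctrl, (n_sims, n_max)).cumsum(axis=1)
    trt  = rng.binomial(1, p_trt,  (n_sims, n_max)).cumsum(axis=1)
    success = np.zeros(n_sims, dtype=bool)
    stop_n  = np.zeros(n_sims, dtype=int)
    active  = np.ones(n_sims, dtype=bool)
    for j, n in enumerate(looks_per_arm):
        pc, pt = ctrl[:, n - 1] / n, trt[:, n - 1] / n
        se = np.sqrt(pc * (1 - pc) / n + pt * (1 - pt) / n)
        z = (pt - pc) / np.maximum(se, 1e-12)
        last = (j == len(looks_per_arm) - 1)
        win = active & (z > success_z[j])                    # cross the success bound
        stop = active & (win | (z < futility_z) | last)      # success, futility, or final look
        success[win] = True
        stop_n[stop] = 2 * n
        active[stop] = False
    return success.mean(), stop_n.mean()

for label, p_trt in (("null (30% vs 30%)", 0.30), ("alternative (30% vs 50%)", 0.50)):
    prob, avg_n = operating_characteristics(0.30, p_trt)
    print(f"{label}: P(success) = {prob:.3f}, average total N = {avg_n:.0f}")
```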
In total, the group sequential design provides the same power and type 1 error as the nonadaptive trial. While the adaptive trial has a small probability of requiring the larger N=220 sample size, repeated use of group sequential designs will save significant resources over nonadaptive trials. With savings of 25-40%, imagine a granting agency currently funding nonadaptive trials. Switching to group sequential trials, when appropriate, would allow the agency to fund 1 additional trial for every 2 it currently funds (NNF = Number Needed to Fund??).
Note if the effect is larger than anticipated, or the therapy is harmful, the trial will stop even earlier. This behavior addresses both the motivations above. If our treatment effect is uncertain, we can design our trial with a large maximal sample size but can be confident that the trial will stop early if a larger effect is present, or simply if we avoid bad luck with our anticipated effect. Similarly, futility stopping allows us to stop null or harmful drugs quickly, allowing patients and sponsor resources to be allocated to more promising therapies.
What are the risks?
The general concerns about group sequential designs center on the reduced information acquired for other, non-primary endpoints, and on the accuracy of the estimated treatment effects.
Group sequential designs typically have lower sample sizes than comparable fixed trials. In the above example, it could be that the data at N=100 provide sufficient evidence to conclude efficacy but are insufficient to establish safety. In that case it may be necessary to place the first interim analysis later to mitigate this risk, or to include safety in the interim analysis rules, requiring demonstrations of both efficacy and safety to stop early. Similarly, if certain subgroups need to be powered, that should be considered as well. We note that safety and subgroups are often underpowered even in phase 3 trials (rare safety events often require post-marketing surveillance). At N=150 (the average sample size for the alternative scenario above), confidence intervals are only 15% larger than confidence intervals at N=200, so care needs to be taken that going to the larger sample size truly provides information that will benefit patients. For example, is knowing the increased risk of a therapy is X% plus or minus 5% meaningfully better than knowing it is X% plus or minus 5.8%? These issues should always be considered; they should be quantified and balanced against the benefits of the design.
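The 15% figure is just the 1/sqrt(N) scaling of standard errors; a quick check:

```python
# Standard errors scale like 1/sqrt(N), so moving from N=200 down to N=150
# widens confidence intervals by a factor of sqrt(200/150).
ratio = (200 / 150) ** 0.5
print(round(ratio, 3), round(5 * ratio, 2))   # ~1.155, so +/-5% becomes roughly +/-5.8%
```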
There are also concerns about biased point estimates arising from group sequential trials. Note that bias is complex to define for group sequential trials. We typically define bias as the difference between the average value of an estimator and its true value, with the average taken over the full distribution of the estimator. In group sequential designs, we may be interested in a narrower question such as “given the trial stops for success at the first interim, what is the bias of the estimator?” This question is clearly relevant, but it also conditions on a severely cherry-picked subset of outcomes. Early interims in a group sequential design require very strong results (in the example we needed p<0.0023 at the first interim). For mediocre true effects, these strong observed results are indeed biased upward when they occur, but they also do not occur very often.
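To make that concrete, here is a small Monte Carlo sketch under made-up numbers (a mediocre 2-point true effect, SD 10, 100 patients per arm at the first look, and the p<0.0023 threshold quoted above). Conditional on an early success stop, the average estimate is well above the 2-point truth, but only a small fraction of trials stop early in the first place.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Hypothetical numbers: a mediocre 2-point true effect, SD 10, 100 patients per arm at the look.
true_effect, sd, n_per_arm = 2.0, 10.0, 100
se = sd * np.sqrt(2 / n_per_arm)                 # standard error of the effect estimate
z_bound = norm.ppf(1 - 0.0023)                   # the strict first-interim threshold quoted above
n_sims = 200_000

est = rng.normal(true_effect, se, n_sims)        # interim effect estimates across trials
stopped = est / se > z_bound                     # trials stopping early for success

print(f"P(stop early for success) = {stopped.mean():.1%}")
print(f"mean estimate given an early stop = {est[stopped].mean():.2f} "
      f"(true effect = {true_effect})")
```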
There are frequentist methods for adjusting estimates from group sequential designs for bias. We tend to explore these designs from a Bayesian perspective, in which case the correct point estimate comes from the posterior distribution (which typically shrinks extreme estimates naturally). We also recommend that care be taken in the placement of the first interim. If the resulting design will only stop at implausible values of the parameter, we recommend delaying early stopping until that is no longer true (early futility may still be applicable).
From an operational perspective, interim analyses require third parties to be available to conduct them, and sufficient expertise to maintain secrecy of the results while the trial is ongoing. We recommend working with partners and DSMB members with experience in group sequential designs.
Finally, group sequential designs may be inappropriate for long-delayed endpoints. With delayed endpoints, we may have information on few patients at the interim analyses, lowering or eliminating the benefits of the group sequential design. There are methods for handling delayed endpoints, both within a frequentist paradigm and from a Bayesian perspective (often called “Goldilocks trials”). Regardless of your inferential paradigm, these methods benefit when early endpoints, predictive of the final endpoint, are observed for each patient.
Summary
Group sequential designs mitigate risks related to pretrial uncertainty in the treatment effect and can minimize the required number of patients in a trial by stopping the trial when the research question is answered. This is accomplished through periodic interim analyses, with thresholds chosen to maintain type 1 error control, ideally combined with good futility rules. Compared to fixed, nonadaptive trials, group sequential designs can save 20-40% of the required sample size of a trial while achieving equivalent power and type 1 error. Care must be taken to provide adequate power for safety and secondary questions, to accurately estimate treatment effects and other parameters, and to maintain operational integrity of the trial.
Quick reactions to ICH E20 draft
Group sequential designs are covered in section 4.1 of the draft. Generally, we agree with the text.
The discussion of bias appears to be focused entirely on frequentist methods. Bayesian methods estimate parameters from the posterior distribution, and the text would be improved by discussing this option. Bayesians should take care to avoid situations with early interims where large “prior/data conflict” can occur. If an effect of 2 or more points on a scale is unlikely, having an interim that only stops for effects of 5 or more will generate conclusions that are more “prior driven” than “data driven”. In my opinion this is to be avoided by only including interims that stop at plausible values of the treatment effect.
On a particularly technical note, the text suggests employing non-binding futility rules, which sponsors need not follow during trial implementation. This contrasts with binding futility rules, which must be followed and which may be used in the calculation of type 1 error for the trial (thus allowing slightly less stringent success bounds). The text notes that the flexibility of non-binding rules “…is important because decision-making about whether to stop for futility or continue is usually not an algorithmic process and may need to incorporate additional information beyond the primary efficacy endpoint, such as safety or other efficacy data”. It is unclear to us how a sponsor could actually utilize this flexibility without obtaining information that would run afoul of the operational integrity principles. Can the sponsor (presumably a firewalled subset of the sponsor) view unblinded safety or other efficacy data?