Kert Viele
(With thanks to Jeff Wetherington, Helen Zhou, Anna McGlothlin, and Liz Krachey)
You are starting a development program. While currently unknown to you, your novel therapy is effective, at least at the right dose. In front of you is a phase 1 to find the maximally tolerated dose, a phase 2 proof of concept, and a phase 3 confirmatory trial.
What is the chance your program will succeed? You won’t like the answer….
Our program is simple, with dichotomous outcomes for both safety and efficacy. We’re going to keep the same endpoints throughout the program and assume approval can happen from a single phase 3 trial. We will also assume that the mechanism of action requires efficacy to increase or plateau with dose. Finally, we have good information that the placebo efficacy rate is around 30%, and we would like to increase that to 55%.
A standard program in this setting might be:
- Run an n=30 phase 1 trial to find the maximally tolerated dose (MTD), defined as the highest dose with an adverse event rate below 33%. We will assume we use mTPI here, but it could be a 3+3 (worse) or a CRM (potentially better but harder to calibrate).
- After finding the MTD in phase 1, continue with an n=30 phase 2 proof of concept trial using the MTD. This phase 2 trial has three possible outcomes:
  - If the observed efficacy rate is more than 45% and the observed adverse event rate is less than 33%, we will go to phase 3.
  - If the observed efficacy rate is below 45%, we will terminate the program (call this “no 3”).
  - If the adverse event rate is over 33%, we will also terminate the program for safety reasons. In theory we might run another trial at another dose, but regardless this is a negative outcome (call this “unsafe 2”).
- If we run phase 3, we employ a trial with n=93 per arm (which gives 90% power for our desired improvement from 30% to 55%). This trial also has three possible outcomes:
  - If p<0.025 for efficacy and the adverse event rate is less than 33%, we win phase 3 and get approval (call this “win 3”).
  - If p<0.025 for efficacy but the adverse event rate is over 33%, this is a mixed result of good efficacy with safety concerns (call this “mixed 3”).
  - If p>0.025, we lose the phase 3 trial and terminate the program (call this “lose 3”).
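The decision rules above can be written down as two small functions. This is only a sketch under stated assumptions: the function names are hypothetical, the phase 3 p-value is taken as an input because the text does not name the test, efficacy is checked before safety when both fail (the text does not say which label wins), and a rate landing exactly on a cutoff is treated as failing.

```python
def phase2_outcome(n_eff, n_ae, n=30, eff_cut=0.45, ae_cut=0.33):
    """Classify a single-arm phase 2 result from response and adverse event counts."""
    if n_eff / n <= eff_cut:      # assumption: efficacy is checked first
        return "no 3"
    if n_ae / n >= ae_cut:        # assumption: a rate on the cutoff counts as unsafe
        return "unsafe 2"
    return "go to phase 3"


def phase3_outcome(p_value, n_ae, n=93, alpha=0.025, ae_cut=0.33):
    """Classify a phase 3 result; n_ae counts adverse events on the treatment arm."""
    if p_value >= alpha:
        return "lose 3"
    if n_ae / n >= ae_cut:
        return "mixed 3"
    return "win 3"


print(phase2_outcome(n_eff=14, n_ae=9))     # 46.7% efficacy, 30% adverse events
print(phase3_outcome(p_value=0.01, n_ae=20))
```

With n=30, the efficacy bar of 45% means at least 14 responders, and the safety bar of 33% means at most 9 patients with an adverse event.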
Again, this is a somewhat simplified program, but it is sufficiently complicated to show our main points. We will revisit some of the omitted complexity at the end. Most of that complexity makes it even harder to get our therapy to market.
How well does this standard program work?
Suppose we have a good therapy, with Table 1 showing the efficacy and adverse event (DLT) rates for each of 7 doses. The DLT rate increases with dose from 10% to 45%. Thus dose 7 is unsafe, while dose 6 is borderline, with doses 1-5 comfortably below 33%. For efficacy, doses 4-7 achieve our desired 55%.
Any dose between 4-6 would in truth be acceptable, but the ideal is dose 4, achieving maximal efficacy at a low DLT rate. Doses 1-3 do not achieve adequate efficacy, while dose 7 is toxic.
Dose | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
DLT rate | 0.10 | 0.10 | 0.10 | 0.15 | 0.20 | 0.30 | 0.45 |
Eff Rate | 0.30 | 0.35 | 0.45 | 0.55 | 0.55 | 0.55 | 0.55 |
Assessment | Ineffective | Ineffective | Ineffective | Great | Good | OK | Toxic |
Table 1. True parameters.
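As a quick check on the classification in Table 1, the true rates can be screened against the two targets from the text: a DLT rate below 33% and an efficacy rate of 55%. The variable names below are illustrative only, and taking the lowest acceptable dose as the ideal is an assumption that follows from efficacy being flat across doses 4-7 while toxicity keeps rising.

```python
# True parameters from Table 1 (doses 1..7)
DOSES = range(1, 8)
DLT = [0.10, 0.10, 0.10, 0.15, 0.20, 0.30, 0.45]
EFF = [0.30, 0.35, 0.45, 0.55, 0.55, 0.55, 0.55]

TOX_LIMIT = 0.33    # MTD definition: adverse event rate below 33%
TARGET_EFF = 0.55   # desired efficacy rate

# A dose is acceptable when it is both safe and fully effective
acceptable = [d for d, tox, eff in zip(DOSES, DLT, EFF)
              if tox < TOX_LIMIT and eff >= TARGET_EFF]

# Assumption: on an efficacy plateau, the lowest acceptable dose is the ideal one
ideal = min(acceptable)

print("acceptable doses:", acceptable)
print("ideal dose:", ideal)
```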
We estimate the MTD with our n=30 phase 1 mTPI trial. Table 2 shows the selection probability for each dose after phase 1, based on the results of 1000 simulated trials. Note that 15 of the trials concluded all doses were toxic (the first cohort incorrectly showed toxicity) and the program ended. Note also that 674/1000 trials selected one of the acceptable doses 4-6, although the ideal dose 4 was only selected about 16% of the time. Our MTD is farther out on the plateau of efficacy than needed.
Our program can potentially enroll 246 patients. After n=30 of them, over 30% of programs have already locked in a bad dose, with limited chance of recovery.
Dose | None | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
DLT rate | - | 0.10 | 0.10 | 0.10 | 0.15 | 0.20 | 0.30 | 0.45 | - |
Eff Rate | - | 0.30 | 0.35 | 0.45 | 0.55 | 0.55 | 0.55 | 0.55 | - |
Assessment | - | Ineff | Ineff | Ineff | Good | Good | OK | Toxic | - |
Pr(MTD) | 0.015 | 0.022 | 0.036 | 0.106 | 0.164 | 0.214 | 0.296 | 0.146 | 1.000 |
Table 2. Selection of MTD (Pr(MTD) gives the chance of selecting each of the seven doses).
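The headline numbers around Table 2 are simple sums, and it can help to see them spelled out. The sketch below uses the Pr(MTD) row as printed (the rounded entries add up to 0.999 rather than exactly 1) and the trial sizes from the program description; the dictionary and variable names are illustrative.

```python
# Pr(MTD) row of Table 2, keyed by selected dose ("None" = all doses declared toxic)
pr_mtd = {"None": 0.015, 1: 0.022, 2: 0.036, 3: 0.106,
          4: 0.164, 5: 0.214, 6: 0.296, 7: 0.146}

acceptable = sum(pr_mtd[d] for d in (4, 5, 6))             # doses 4-6
bad = sum(p for d, p in pr_mtd.items() if d not in (4, 5, 6))

# Maximum enrollment: phase 1 + phase 2 + two phase 3 arms
max_enrolled = 30 + 30 + 2 * 93

print(f"acceptable dose selected: {acceptable:.3f}")
print(f"bad or no dose selected:  {bad:.3f}")
print("maximum patients enrolled:", max_enrolled)
```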
Assuming an MTD was identified, we run our n=30 single arm phase 2, and again look at safety and efficacy. Many programs stop at this stage, either from safety concerns, particularly if dose 7 was chosen, or from lack of efficacy. Lack of efficacy often occurs when too low an MTD was selected, but can also happen by random chance even if one of doses 4-6 was selected. Many programs are heading to phase 3, but we can count the programs that stopped for lack of efficacy (“no3”) or for safety concerns (“unsafe2”). To avoid creating another category, we lump the programs that declared all doses toxic into an existing one: Tables 3 and 4 count them under “unsafe2”, while Table 5 counts them under “no3”.
Dose | None | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
DLT rate | - | 0.10 | 0.10 | 0.10 | 0.15 | 0.20 | 0.30 | 0.45 | - |
Eff Rate | - | 0.30 | 0.35 | 0.45 | 0.55 | 0.55 | 0.55 | 0.55 | - |
Assessment | - | Ineff | Ineff | Ineff | Good | Good | OK | Toxic | - |
Pr(MTD) | 0.015 | 0.022 | 0.036 | 0.106 | 0.164 | 0.214 | 0.296 | 0.146 | 1.000 |
No3 | 0.000 | 0.020 | 0.031 | 0.053 | 0.015 | 0.027 | 0.048 | 0.012 | 0.206 |
Unsafe2 | 0.015 | 0.000 | 0.000 | 0.000 | 0.000 | 0.006 | 0.069 | 0.116 | 0.206 |
Table 3. Results after phase 2 proof of concept.
Coincidentally, our program failed for lack of efficacy 20.6% of the time and failed for safety 20.6% of the time. This should be disturbing, since we started off with a good treatment and couldn’t even get to phase 3 about 41% of the time. Half of the failures are attributable to poor MTD selection, with the remainder bad luck. We could potentially improve this using different rules for going to phase 3, but with n=30 there are no perfect rules. Making it easier to go to phase 3 will result in more failed phase 3 trials.
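To see why n=30 leaves no perfect rule, one can compute exact binomial probabilities for each phase 2 outcome, conditional on the dose carried forward. This is a rough sketch, not the simulation behind Table 3. It assumes efficacy and safety outcomes are independent, that only the 30 phase 2 patients feed the decision, that efficacy is checked before safety, and that the thresholds translate to at least 14 responders and at most 9 adverse events. All of these are assumptions, so the output need not match the table.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def phase2_probs(eff, dlt, n=30, min_eff=14, max_ae=9):
    """Return (P(no3), P(unsafe2), P(go to phase 3)) for one dose."""
    p_no3 = binom_cdf(min_eff - 1, n, eff)       # too few responders
    p_safe = binom_cdf(max_ae, n, dlt)           # few enough adverse events
    p_unsafe2 = (1 - p_no3) * (1 - p_safe)       # assumption: independence
    p_go = (1 - p_no3) * p_safe
    return p_no3, p_unsafe2, p_go


DLT = [0.10, 0.10, 0.10, 0.15, 0.20, 0.30, 0.45]
EFF = [0.30, 0.35, 0.45, 0.55, 0.55, 0.55, 0.55]

for dose, (eff, dlt) in enumerate(zip(EFF, DLT), start=1):
    no3, unsafe2, go = phase2_probs(eff, dlt)
    print(f"dose {dose}: no3={no3:.3f} unsafe2={unsafe2:.3f} go={go:.3f}")
```

Moving `min_eff` down lets more truly effective doses through, but it also lets more of dose 3 through, which is the trade-off described above.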
If we get to phase 3, we have three more outcomes. We can win the phase 3 trial (“win 3”, significant efficacy improvement and no safety issues), we can lose phase 3 on efficacy (“lose 3”) or we can get a significant result on efficacy but have safety issues (“mixed 3”).
Dose | None | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
DLT rate | - | 0.10 | 0.10 | 0.10 | 0.15 | 0.20 | 0.30 | 0.45 | - |
Eff Rate | - | 0.30 | 0.35 | 0.45 | 0.55 | 0.55 | 0.55 | 0.55 | - |
Assessment | - | Ineff | Ineff | Ineff | Good | Good | OK | Toxic | - |
Pr(MTD) | 0.015 | 0.022 | 0.036 | 0.106 | 0.164 | 0.214 | 0.296 | 0.146 | 1.000 |
No3 | 0.000 | 0.020 | 0.031 | 0.053 | 0.015 | 0.027 | 0.048 | 0.012 | 0.206 |
Unsafe2 | 0.015 | 0.000 | 0.000 | 0.000 | 0.000 | 0.006 | 0.069 | 0.116 | 0.206 |
Lose3 | 0.000 | 0.002 | 0.005 | 0.029 | 0.013 | 0.018 | 0.013 | 0.001 | 0.081 |
Mixed3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.043 | 0.017 | 0.061 |
Win3 | 0.000 | 0.000 | 0.000 | 0.024 | 0.136 | 0.162 | 0.123 | 0.000 | 0.445 |
Table 4. Final results after phase 3.
If we get to phase 3, we have a good chance to win. If we picked dose 6 or 7, there are some safety issues. A few of the dose 3 selections that made it through phase 2 fail here, but for the most part we run phase 3 on an effective dose and get the benefit of the 90% power.
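The phase 3 power can be approximated with a normal approximation for the difference of two proportions. The text does not say which test produced the 90% figure, so the unpooled Wald form below is an assumption, and `approx_power` is a hypothetical helper. This approximation tends to be optimistic compared with exact or continuity-corrected tests, so it comes out a little above the quoted 90% rather than matching it.

```python
from statistics import NormalDist


def approx_power(p_ctrl, p_trt, n_per_arm, alpha=0.025):
    """Approximate one-sided power for a two-arm comparison of proportions."""
    z = NormalDist()
    # Standard error of the difference in observed rates (unpooled variance)
    se = ((p_ctrl * (1 - p_ctrl) + p_trt * (1 - p_trt)) / n_per_arm) ** 0.5
    return z.cdf((p_trt - p_ctrl) / se - z.inv_cdf(1 - alpha))


# Doses on the efficacy plateau (55%) versus dose 3 (45%), both against 30% placebo
print(f"power at 55% efficacy: {approx_power(0.30, 0.55, 93):.3f}")
print(f"power at 45% efficacy: {approx_power(0.30, 0.45, 93):.3f}")
```

The second line shows why dose 3 selections that slip through phase 2 often lose phase 3: the trial is sized for a 25 point improvement, not a 15 point one.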
At the end of the day, we started with a good therapy, but only have a 44-45% chance of getting through phase 3. The majority of mistakes we make are not a consequence of the phase 3 trial, but of the n=30 phase 1 trial choosing a bad dose (it doesn’t help that dose 6 is so close to the toxicity boundary).
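The Total column of Table 4 can be rearranged to show where programs are lost. The short calculation below uses only those five totals (which sum to 0.999 because of rounding); the variable names are illustrative.

```python
# Total column of Table 4
totals = {"no3": 0.206, "unsafe2": 0.206,
          "lose3": 0.081, "mixed3": 0.061, "win3": 0.445}

stopped_before_phase3 = totals["no3"] + totals["unsafe2"]
reached_phase3 = totals["lose3"] + totals["mixed3"] + totals["win3"]
win_given_phase3 = totals["win3"] / reached_phase3

# Share of all unsuccessful programs that never reached phase 3
failures = stopped_before_phase3 + totals["lose3"] + totals["mixed3"]
early_share = stopped_before_phase3 / failures

print(f"stopped before phase 3: {stopped_before_phase3:.3f}")
print(f"reached phase 3:        {reached_phase3:.3f}")
print(f"win | reached phase 3:  {win_given_phase3:.3f}")
print(f"share of failures before phase 3: {early_share:.3f}")
```

Roughly three quarters of the programs that reach phase 3 win it, and roughly three quarters of the failures happen before phase 3 starts.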
How do we do better?
I’m sure everyone will see this and have good ideas on how to prevent these issues. That’s the point! This kind of program is pretty standard, for example in rarer oncology settings, where a single phase 3 might make sense. The example is simplistic in many ways. For example, early phase trials often use different endpoints than phase 3. A more sensitive early endpoint might help select a dose, but it might also make dose selection worse if it is not sufficiently predictive of the phase 3 endpoint. Many programs also lack the certainty that efficacy increases with dose, or expect efficacy at doses much smaller than the MTD. Those programs would require a dose finding trial, which itself might make mistakes.
Regardless, the main point remains: small early phase trials make mistakes. These mistakes are typically locked into future trials, with permanent negative consequences for the program. Thus, it’s vital that early phase trials be designed properly, and prudent to reevaluate conclusions from those early trials throughout development.
This can be done with minimal, or even no, increase in resources. Consider the following program, which requires the same resources, makes the same go/no-go decision after n=60 patients, and has the same phase 3 design and rules.
Instead of using an n=30 phase 1 and locking in the result, we combine escalation and proof of concept.
- At any given time, we have three patients in escalation, allocated by mTPI rules. Whenever one of those patients completes the study they will be replaced.
- If at any given time we have 3 patients in escalation AND a dose lower than the current escalation dose has promising efficacy (we define this as 1 efficacy response), then a patient can be assigned to an expansion cohort. These patients must be on doses lower than the current escalation dose and are assigned on the basis of a dose response model and response adaptive randomization on efficacy (so doses with higher observed efficacy get more patients).
- This process continues until n=60 (the size of our phase 1 and phase 2 combined in the simple program).
The point of this is to continue escalation for more than n=30 patients. We use the safety data from the expansion cohorts in assigning escalation subjects. If we get to dose 7, there is more time to recognize dose 7 is toxic. If the escalation is slow, there is more time to recognize doses 1-3 are quite safe and escalate into the therapeutic range.
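The allocation logic in the list above can be sketched as follows. This is a loose illustration rather than the design that produced the results: the mTPI escalation step is not implemented, and the dose response model is replaced by a simple stand-in (a Beta(1,1) posterior mean per dose) purely to show response adaptive weighting. The function names and the example counts are hypothetical.

```python
import random


def expansion_candidates(escalation_dose, n_in_escalation, responses, slots=3):
    """Doses (1-based) an expansion patient may receive, or [] if none yet."""
    lower = list(range(1, escalation_dose))
    if n_in_escalation < slots:
        return []                      # escalation slots are filled first
    if not any(responses[d - 1] >= 1 for d in lower):
        return []                      # no lower dose has shown promise yet
    return lower


def rar_weights(candidates, responses, treated):
    """Allocation weights proportional to a Beta(1,1) posterior mean (stand-in model)."""
    means = [(responses[d - 1] + 1) / (treated[d - 1] + 2) for d in candidates]
    total = sum(means)
    return [m / total for m in means]


# Hypothetical interim data: escalation is at dose 4 with all 3 slots occupied
responses = [0, 1, 3, 0, 0, 0, 0]      # efficacy responders per dose
treated = [3, 4, 5, 3, 0, 0, 0]        # patients treated per dose

cands = expansion_candidates(4, 3, responses)
weights = rar_weights(cands, responses, treated)
next_dose = random.choices(cands, weights=weights, k=1)[0]
print(cands, [round(w, 3) for w in weights], next_dose)
```

Doses with higher observed efficacy receive more weight, while every dose below the current escalation dose keeps a nonzero chance of being assigned.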
Dose | None | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
DLT rate | - | 0.10 | 0.10 | 0.10 | 0.15 | 0.20 | 0.30 | 0.45 | - |
Eff Rate | - | 0.30 | 0.35 | 0.45 | 0.55 | 0.55 | 0.55 | 0.55 | - |
Assessment | - | Ineff | Ineff | Ineff | Good | Good | OK | Toxic | - |
Pr(MTD) | 0.081 | 0.006 | 0.015 | 0.065 | 0.279 | 0.418 | 0.131 | 0.005 | 1.000 |
No3 | 0.081 | 0.002 | 0.003 | 0.015 | 0.034 | 0.061 | 0.015 | 0.001 | 0.212 |
Unsafe2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Lose3 | 0.000 | 0.004 | 0.012 | 0.023 | 0.031 | 0.030 | 0.007 | 0.001 | 0.108 |
Mixed3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.002 | 0.029 | 0.003 | 0.034 |
Win3 | 0.000 | 0.000 | 0.000 | 0.027 | 0.214 | 0.325 | 0.080 | 0.000 | 0.646 |
Table 5. Program results after revising the first n=60 patients of the program.
With the same resources, we have increased the chance of a successful program from 44.5% to 64.6% simply by changing how we allocate our first 60 patients. These gains are largely obtained through significantly improved dose selection. The toxic dose 7 is only chosen 0.5% of the time (compared to 14.6% before). We choose a therapeutic dose 82.8% of the time, compared to 67.4% before.
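One way to probe where the gain comes from is to put the Pr(MTD) and Win3 rows of Tables 4 and 5 side by side and compute, for each dose, the chance of winning given that the dose was selected. The sketch below does only that arithmetic on the published rows; the variable names are illustrative, and with 1000 simulated programs the per-dose ratios are noisy.

```python
# Rows for doses 1..7 from Table 4 (standard) and Table 5 (revised)
standard = {"pr_mtd": [0.022, 0.036, 0.106, 0.164, 0.214, 0.296, 0.146],
            "win3":   [0.000, 0.000, 0.024, 0.136, 0.162, 0.123, 0.000]}
revised = {"pr_mtd": [0.006, 0.015, 0.065, 0.279, 0.418, 0.131, 0.005],
           "win3":   [0.000, 0.000, 0.027, 0.214, 0.325, 0.080, 0.000]}

for name, prog in (("standard", standard), ("revised", revised)):
    therapeutic = sum(prog["pr_mtd"][3:6])          # doses 4-6
    win_total = sum(prog["win3"])
    # Chance of winning phase 3 given each dose was the one selected
    win_given_dose = [w / p for w, p in zip(prog["win3"], prog["pr_mtd"])]
    print(f"{name}: therapeutic dose {therapeutic:.3f}, win {win_total:.3f}")
    print("  win | dose:", [round(x, 2) for x in win_given_dose])
```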
These changes produce great value for patients and sponsors, and dramatically increase the chance of a good therapy making it to approval. Another question: can we do even better by increasing the resources in phases 1 and 2, and paying for them with more phase 3 successes and fewer failed phase 3 trials?