**Blog by: Kert Viele**

Bayesians draw conclusions based on posterior probabilities and utilities. Once an experiment is complete, Bayesians draw the same conclusions from the same data, regardless of whether those data came from a trial with 0 interim analyses or 100, whether we stopped when a specific sample size was reached or a specific number of responders was obtained, or whether we just stopped because it was dinnertime.

This raises the question: How should a Bayesian design a trial? Does it matter whether the trial has 1 interim or 100? If a decision rule is appropriate at the end of the trial, can I apply that same rule to every interim analysis?

In this post I want to talk about three things:

1) There is a key difference in perspective between design and analysis. Bayesians condition on what they know and integrate over the uncertainty in things they don’t know. After an experiment is complete, the data are known. Prior to the experiment, they aren’t. This mathematically creates different formulas for valuing designs, as opposed to results. For the trial designer, these have overlap with standard frequentist concepts.

2) Different experimental designs can result in very different Bayesian expected utilities. Whether we are valuing patient outcomes or revenue, designs are vitally important. While some interim analyses may increase expected utility, even some intuitively reasonable analyses can be detrimental, often severely so, particularly when many are employed.

3) While computationally difficult, there are methods for constructing optimal designs. We illustrate this for a simple example. In more complex situations, we often have to rely on heuristic searches and simulations, but a good exploration of possible designs can dramatically improve trial performance over some standard choices.

**A Bayesian analyzing a completed trial**

After an experiment, the likelihood of the data and the prior distribution combine to form a posterior distribution, which drives all Bayesian inference. Identical data, even from different experiments, typically produce proportional likelihoods. Since a posterior distribution is a ratio, the experimental design contribution effectively “cancels out”. We draw the same conclusions regardless of how these data came to be. As noted by others, there is no consideration of possible data that might have occurred but didn’t.

Trial design is different. The data haven’t been observed yet. They are still random, and this changes our perspective. Here is an analogy. Jordan and Marcus wake up one morning, each with a magic coin. Each will flip their coin and, depending on the result, win a prize. Jordan’s coin gives her equal odds of winning $100 or $10,000,000. Marcus’s coin gives him equal odds of winning $100 or $0. They each flip their coin, and each wins $100.

After the flips have occurred, we would be equally willing to swap places with either of them. While they might feel differently about what might have been, their available actions after the flips depend only on their $100 prize.

Prior to the experiment, we view the coins very differently. We would clearly prefer to be Jordan prior to the flips, with the possibility of winning $10,000,000.

Here is a simplified example. Assume we know patient outcomes on a therapy are normally distributed with unknown mean μ and known standard deviation σ=10. We also know, for simplicity, that either μ=0 (ineffective therapy) or μ=1 (effective therapy), and have a prior probability p=Pr(μ=1)=20%. At the conclusion of our trial, we must either give (approve) the therapy or not. Within the Bayesian paradigm, we assign utilities to each combination of decision and state of the world. These might be

| Decision | Therapy effective (μ=1) | Therapy ineffective (μ=0) |
|---|---|---|
| Give the therapy | 1 | -a |
| Don’t give the therapy | -b | 0 |

These are sufficiently general for this setting, with 0 and 1 anchoring the scale while -a and -b represent penalties for incorrect decisions. These penalties might involve patient outcomes, for example adverse events, or they might involve the societal harm of an incorrect decision. Mistakenly approving an ineffective therapy could result in future research pursuing an unpromising mechanism of action, for example. These could also reflect a public policy utility in maintaining faith in medical advice by limiting the number of errors in that advice.

The values of -a and -b may vary by scenario and stakeholder, but for illustration we will assume a=9 and b=0. These reflect a high aversion to approving ineffective therapies, but limited aversion to missing effective ones. Not giving an effective therapy to a patient, while missing an opportunity, should result in similar patient outcomes to not giving an ineffective therapy. In either case they didn’t receive anything. The choices of -a and -b will alter what follows numerically, but not the overall qualitative conclusions. In a more general setting, for example with a continuum of values for μ, our utility would consist of two functions over μ, one for giving the therapy and one for not giving the therapy.

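Once our experiment is complete, we have data and our only decision is whether to give the therapy or not. We compute our posterior probability π=Pr(μ=1 | data) and pick the decision with the highest expected utility. These expected utilities are

$$E[U(\text{give})] = \pi \cdot 1 + (1-\pi)(-a) = \pi - a(1-\pi)$$

$$E[U(\text{don't})] = \pi(-b) + (1-\pi) \cdot 0 = -b\pi$$

and we will give the therapy whenever E[U(give)] > E[U(don’t)], which requires

$$\pi > \frac{a}{1+a+b}$$

With a=9 and b=0, this threshold is 9/10 = 90%.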

Clearly no frequentist ideas have emerged so far. There is no requirement of a 2.5% type 1 error rate, for example. This rule, “give if the posterior probability exceeds 90%”, applies at the completion of any experimental design (including trials with different sample sizes).
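The analysis step for this example can be sketched in a few lines of Python. This is a minimal illustration that assumes the two-point prior (μ=0 or μ=1), the known σ=10, and the utility layout described above; the function names `posterior_effective` and `give_therapy` are hypothetical, not from any library. Note that the posterior depends only on the observed sample mean and sample size, so the stopping rule that produced them never enters.

```python
from math import exp, log

def posterior_effective(xbar, n, p=0.2, sigma=10.0, mu1=1.0):
    """Posterior Pr(mu = mu1) given a sample mean xbar from n patients.

    Assumes outcomes are N(mu, sigma^2) with mu either 0 or mu1,
    and prior probability p that mu = mu1.
    """
    # Log likelihood ratio of mu = mu1 versus mu = 0
    log_lr = n * (mu1 * xbar - mu1 ** 2 / 2) / sigma ** 2
    log_odds = log(p / (1 - p)) + log_lr
    return 1 / (1 + exp(-log_odds))

def give_therapy(posterior, a=9.0, b=0.0):
    """Give when E[U(give)] > E[U(don't)], i.e. posterior > a / (1 + a + b)."""
    return posterior > a / (1 + a + b)

# A sample mean of 0.5 is equally likely under both hypotheses,
# so the posterior stays at the prior.
print(posterior_effective(0.5, 1000))
print(give_therapy(posterior_effective(1.0, 1000)))
```

Working on the log-odds scale keeps the update additive: each new block of data simply adds its log likelihood ratio to the running total.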

**How do we value an experimental design, prior to data collection?**

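Prior to an experiment, the data are unknown, and our expected utility must integrate over the data uncertainty. Every combination of effective/ineffective and give/don’t give is possible. Additionally, we must consider the cost of our experiment (if data were free, our optimal design would always be to collect as much data as possible). For a design D that gives the therapy with probability α when it is ineffective and with probability β when it is effective, the expected utility is

$$E[U(D)] = p\,[\beta - b(1-\beta)] - (1-p)\,a\,\alpha - \text{cost}(D)$$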

While our decision with known data contained limited connection to frequentist inference, here more frequentist ideas appear, with α and β being the usual type 1 error and power computed in most frequentist trial designs. While the Bayesian utility in no way requires α=2.5%, both the Bayesian and the frequentist desire low α and high β. Instead of fixing α and maximizing β, the Bayesian will maximize a linear combination of both quantities, weighted by the priors and utilities (and adjusted for the cost of the trial). If we had a continuum of values for the parameter μ, as opposed to simply effective or ineffective, the utility above would involve integrals over the standard frequentist power function, again weighted by the priors and utilities.
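For a fixed-sample design these quantities have closed forms. The sketch below is one way to compute them, assuming the two-point example (μ=0 or μ=1, σ=10, p=0.2, a=9, b=0), a linear cost per patient, and a final rule that gives the therapy when the posterior exceeds a/(1+a+b). The function name `fixed_design_utility` and the `cost_per_patient` parameter are illustrative choices, and the default cost value is an assumption of this sketch.

```python
from math import log, sqrt
from statistics import NormalDist

def fixed_design_utility(n, p=0.2, a=9.0, b=0.0, sigma=10.0, mu1=1.0,
                         cost_per_patient=0.00005):
    """Type 1 error, power, and expected utility of a single analysis at n."""
    cut = a / (1 + a + b)                      # posterior needed to give
    # Log likelihood ratio needed to move the prior odds to the cutoff odds
    log_lr_needed = log(cut / (1 - cut)) - log(p / (1 - p))
    # Equivalent cutoff on the sample mean
    xbar_cut = mu1 / 2 + sigma ** 2 * log_lr_needed / (n * mu1)
    se = sigma / sqrt(n)
    z = NormalDist()
    alpha = 1 - z.cdf(xbar_cut / se)           # Pr(give | ineffective)
    beta = 1 - z.cdf((xbar_cut - mu1) / se)    # Pr(give | effective)
    utility = (p * (beta - b * (1 - beta))
               - (1 - p) * a * alpha
               - cost_per_patient * n)
    return alpha, beta, utility

alpha, beta, utility = fixed_design_utility(1000)
print(round(alpha, 5), round(beta, 4), round(utility, 4))
```

Because the utility is linear in α and β, the prior and the penalties act as exchange rates between type 1 error and power, which is the sense in which the Bayesian maximizes a weighted combination rather than fixing α.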

I think acknowledging these similarities is important for the adoption of Bayesian methods. Don Berry once told us “when you are far from someone you yell. When you walk beside them you talk. To get anything done you need to talk”. Asking frequentists to throw out ideas they have extensive experience with and intuition for is difficult. Having similar quantities underlying both methodologies, while weighted differently, leaves a smaller gap to cross, with more common ground. We should acknowledge common ground whenever possible.

This also allows for conversations in frequentist terms such as “when/why might a Bayesian be comfortable with high type 1 error” for a future experiment. If a good Bayesian design (read “high utility”) has a large type 1 error rate, then the Bayesian must either have a low prior probability the null is true, or place minimal negative utility on committing a type 1 error, or simply place so high a cost on obtaining data that high error rates are unavoidable. Again, these relationships facilitate communication with statisticians or clinicians more versed in frequentist, rather than Bayesian, methods.

**Some sample designs and their expected utilities**

To explore, let’s consider a few different designs with differing decisions and number of interim analyses. In our example (where a=9 and b=0) we know that upon completion of the trial we will give the therapy if the posterior probability exceeds 90%. Since this would apply regardless of the sample size, we might consider using this rule at interim analyses.

Success only designs – Perform interim analyses at equally spaced intervals, stop and give the therapy whenever the posterior probability exceeds 90%. Otherwise continue until N=1000, at which point give the therapy if and only if the posterior probability exceeds 90%.

Success and Futility designs – Success handled identically to the success only designs (give the therapy whenever the posterior probability exceeds 90%) but also at each interim stop the trial for futility (i.e. stop and don’t give the therapy) if the posterior probability is below 10%.

We will consider designs with interim analyses every 1000 patients (i.e. a fixed trial with only one possible analysis at N=1000), every 500 patients, and every 200, 100, 50, 20, 10, and 1 patient(s). The expected utilities for these experiments are in the table below. For these calculations we assume cost(D) = 0.00005 n, where n is the sample size. In other words, each patient costs 0.00005 of the utility of giving an effective therapy. Utilities are often expressed in terms of patient outcomes, but in revenue terms this could mean that giving an effective therapy is worth $1 billion while each patient costs $50,000 (so 1000 patients would cost $50,000,000).

| Interims conducted every N patients, where N is | Expected utility for “Success only” (stop and give if posterior > 90%; no futility stopping) | Expected utility for “Success and Futility” (stop and give if posterior > 90%; stop and don’t give if posterior < 10%) |
|---|---|---|
| 1000 (only a final analysis) | 0.0603 | 0.0603 |
| 500 (one interim) | 0.0472 | 0.0622 |
| 200 | 0.0339 | 0.0549 |
| 100 | 0.0190 | 0.0412 |
| 50 | 0.0069 | 0.0301 |
| 20 | -0.0086 | 0.0189 |
| 10 | -0.0161 | 0.0132 |
| 1 (1000 interim analyses) | -0.0324 | 0.0015 |

Our first observation is that the design has a large effect on expected utility, with some designs having negative utility. If giving an effective therapy (utility=1) were equivalent to saving 10,000 lives, then a utility of 0.0622 (corresponding to success and futility analyses every 500 patients) would mean we expect to save 622 lives, while with a utility of 0.0015 (success and futility analyses after every patient) we would expect to save only 15 lives. For a more revenue-specific comparison, with U=1 equivalent to $1 billion, our utility of U=0.0622 would be $62,200,000, as opposed to U=0.0015 corresponding to $1,500,000.

Note that futility is extremely valuable here, which is intuitive given our aversion to incorrectly giving an ineffective therapy combined with our low prior probability the therapy is effective.

Clearly the addition of many interims, with these rules, is detrimental. The reason is that our success rule (stop and give if the posterior probability is greater than 90%) is only optimal when we are done collecting data, and thus the only possible decisions are give or don’t give the therapy. When a third option, continue the trial, is available, our decision space changes and so do our optimal rules.
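Expected utilities for designs like these can be approximated by simulation. The sketch below is a Monte Carlo illustration, not the calculation used for the table: it assumes the two-point example (p=0.2, σ=10, a=9, b=0), a cost of 0.00005 per patient, equally spaced looks that divide the maximum sample size evenly, and NumPy. The function name `simulate_design` and its parameters are hypothetical. Its output carries simulation error, so it should land near the table values rather than match them digit for digit.

```python
import numpy as np

def simulate_design(look_every, futility=False, n_max=1000, p=0.2,
                    a=9.0, b=0.0, sigma=10.0, mu1=1.0,
                    cost_per_patient=0.00005, success_cut=0.9,
                    futility_cut=0.1, n_sims=200_000, seed=0):
    """Monte Carlo estimate of the expected utility of a sequential design."""
    rng = np.random.default_rng(seed)
    effective = rng.random(n_sims) < p           # draw the truth from the prior
    mu = np.where(effective, mu1, 0.0)
    log_odds = np.full(n_sims, np.log(p / (1 - p)))
    active = np.ones(n_sims, dtype=bool)         # trials still enrolling
    give = np.zeros(n_sims, dtype=bool)
    n_used = np.full(n_sims, n_max)

    for n in range(look_every, n_max + 1, look_every):
        idx = np.flatnonzero(active)
        # Sum of outcomes for the next block of patients
        block = rng.normal(look_every * mu[idx], sigma * np.sqrt(look_every))
        log_odds[idx] += (mu1 * block - look_every * mu1 ** 2 / 2) / sigma ** 2
        post = 1 / (1 + np.exp(-log_odds[idx]))
        success = post > success_cut
        stop = success.copy()
        if futility and n < n_max:
            stop |= post < futility_cut
        if n == n_max:
            stop[:] = True                       # everyone stops at the final
        give[idx[success]] = True
        n_used[idx[stop]] = n
        active[idx[stop]] = False

    utility = np.where(give,
                       np.where(effective, 1.0, -a),
                       np.where(effective, -b, 0.0))
    return float(np.mean(utility - cost_per_patient * n_used))

for k in (1000, 500, 100):
    print(k, simulate_design(k), simulate_design(k, futility=True))
```

Simulating block sums rather than individual patients keeps the cost proportional to the number of looks, so even the look-after-every-patient design runs quickly.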

**How do we find an optimal design?**

In this simplified setting, we can (approximately) derive the optimal design. In a more general setting, the calculations below can become infeasible, in which case more heuristic searches over designs, with simulations to compute design metrics, are necessary. In those situations a good, but perhaps not optimal, design can typically be found.

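In our simplified setting, at each interim analysis we have three potential decisions. We can stop and give the therapy, stop and not give the therapy, or continue. As with all such decisions, we should pick the decision with the highest expected utility. These are

$$E[U(\text{stop and give})] = \pi_n - a(1-\pi_n)$$

$$E[U(\text{stop and don't give})] = -b\,\pi_n$$

$$E[U(\text{continue})] = \pi_n\,[\beta_c - b(1-\beta_c)] - (1-\pi_n)\,a\,\alpha_c - \text{cost}_c$$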

Here π_n is the current posterior probability; α_c is the conditional power, given the current trial state, when the therapy is ineffective (recall that power for an ineffective therapy is the type 1 error rate); β_c is the conditional power when the therapy is effective; and cost_c is the expected cost of the remainder of the trial, if the trial continues. As before, we encounter ideas similar to frequentist designs, here in the form of conditional power.

Typically, Bayes optimal designs are computed starting at the end and working backwards through the interims. We find the optimal decision for the final analysis, then given that information find the optimal decision for the last interim, then given that optimum find the optimal decision for the second-to-last interim, and so on. Because this is computationally intensive, we only consider the case of an interim every 100 patients.

The posterior probabilities required for optimal decisions at each interim are in the table below.

| Interim | Stop and give if posterior probability is greater than | Stop and don’t give if posterior probability is less than |
|---|---|---|
| N=100 | 0.9972 | 0.0417 |
| N=200 | 0.9970 | 0.0477 |
| N=300 | 0.9967 | 0.0562 |
| N=400 | 0.9963 | 0.0684 |
| N=500 | 0.9957 | 0.0868 |
| N=600 | 0.9948 | 0.1160 |
| N=700 | 0.9933 | 0.1680 |
| N=800 | 0.9902 | 0.2661 |
| N=900 | 0.9821 | 0.4741 |
| N=1000 | 0.9000 | 0.9000 |

This experiment has an expected utility of 0.0833, compared with the maximum of 0.0622 in the previous set of designs, a 34% increase in expected utility in return for our computational work. This increase might be measured in hundreds or thousands of lives saved or in tens of millions more in expected return. Performing more interims, as long as thresholds are optimally chosen, will increase the expected utility yet further (as seen in the prior table, suboptimal interim analyses can reduce expected utility), but the marginal increase from additional interim analyses may be small.
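The backward induction can be sketched numerically. The code below is one possible implementation, not necessarily the one used for the results above. It assumes the two-point example (p=0.2, σ=10, a=9, b=0, cost 0.00005 per patient) and tracks the posterior log-odds on a grid, using the fact that over a block of k patients the log-odds move by a normal increment with standard deviation s=√k/σ (for μ=1) and mean +s²/2 if the therapy is effective or -s²/2 if it is not. The function name `optimal_thresholds` and the grid settings are illustrative, and the thresholds are only resolved to the grid spacing.

```python
import numpy as np

def optimal_thresholds(p=0.2, a=9.0, b=0.0, sigma=10.0, mu1=1.0,
                       look_every=100, n_max=1000, cost_per_patient=0.00005,
                       grid_lo=-12.0, grid_hi=16.0, grid_step=0.02):
    """Backward induction over interim looks on a grid of posterior log-odds."""
    grid = np.arange(grid_lo, grid_hi, grid_step)
    post = 1.0 / (1.0 + np.exp(-grid))       # posterior Pr(effective)
    u_give = post - a * (1.0 - post)         # stop and give
    u_dont = -b * post                       # stop and don't give

    s = mu1 * np.sqrt(look_every) / sigma    # sd of the log-odds increment
    delta = grid[None, :] - grid[:, None]    # move from row state to column state

    def kernel(mean):
        dens = np.exp(-0.5 * ((delta - mean) / s) ** 2) / (s * np.sqrt(2 * np.pi))
        return dens * grid_step              # quadrature weights on the grid

    k_eff, k_ineff = kernel(0.5 * s ** 2), kernel(-0.5 * s ** 2)

    def continue_value(v_next):
        # Pay for one more block, then average the next-look value over the
        # predictive distribution (mixture of effective and ineffective).
        return (-cost_per_patient * look_every
                + post * (k_eff @ v_next) + (1.0 - post) * (k_ineff @ v_next))

    looks = list(range(look_every, n_max + 1, look_every))
    final_cut = a / (1.0 + a + b)
    thresholds = {looks[-1]: (final_cut, final_cut)}
    v = np.maximum(u_give, u_dont)           # value at the final analysis
    for n in reversed(looks[:-1]):
        u_cont = continue_value(v)
        give_best = (u_give >= u_cont) & (u_give >= u_dont)
        dont_best = (u_dont >= u_cont) & (u_dont > u_give)
        thresholds[n] = (float(post[give_best].min()),
                         float(post[dont_best].max()))
        v = np.maximum(np.maximum(u_give, u_dont), u_cont)

    # Before any data are seen, the design enrolls the first block.
    start = np.log(p / (1.0 - p))
    expected_utility = float(np.interp(start, grid, continue_value(v)))
    return thresholds, expected_utility

thresholds, expected_utility = optimal_thresholds()
for n in sorted(thresholds):
    print(n, thresholds[n])
print(expected_utility)
```

Using log-odds as the state makes the transition kernel the same at every look, so the two matrices are built once and reused through the whole recursion. A finer grid sharpens the thresholds at the price of larger matrices.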

Here again we see some similarities to frequentist ideas. While the thresholds above do not correspond exactly to a standard group sequential alpha spending function, they do match the idea in one way. A spending function like O’Brien-Fleming saves sample size while maintaining most of the inferential performance of a fixed (read “always go to the maximal N”) design, such as an equivalent overall type 1 error rate and only modestly decreased power. The inferential pieces of the expected utility for a fixed N=1000 design (which maximizes the utility of those inferential pieces) have α=0.00332 and β=0.67290. For the optimal “look every 100” design, we find α=0.00313 and β=0.64571, while the expected sample sizes are 377 when μ=0 and 817 when μ=1.
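As a consistency check, these operating characteristics can be plugged into the design-stage expected utility, written here as $p[\beta - b(1-\beta)] - (1-p)\,a\,\alpha - \text{cost}$ with p=0.2, a=9, b=0 and a cost of 0.00005 per patient. The expected sample size is taken as the prior-weighted average of the two conditional values, which is an assumption of this worked example. Because the inputs are rounded, the results agree with the reported utilities only approximately.

$$E[N] = 0.8 \times 377 + 0.2 \times 817 = 465$$

$$E[U(\text{optimal})] \approx 0.2 \times 0.64571 - 0.8 \times 9 \times 0.00313 - 0.00005 \times 465 = 0.1291 - 0.0225 - 0.0233 \approx 0.0834$$

$$E[U(\text{fixed } N=1000)] \approx 0.2 \times 0.67290 - 0.8 \times 9 \times 0.00332 - 0.00005 \times 1000 = 0.1346 - 0.0239 - 0.0500 \approx 0.0607$$

The optimal design gives up a little power (0.1291 versus 0.1346 in the first term) but spends less than half as much on patients, which is where most of its advantage comes from.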

**Summary**

After an experiment is completed, the experimental design is in the past and all that matters is the observed data (with its likelihood), your priors, and your utilities. Prior to an experiment, we need to integrate over the uncertainty in the future data, and a carefully crafted experiment can have substantially higher expected utility than other options, even if those other options seem intuitively reasonable. In terms of interim analyses, optimally designed interims will only increase expected utility, but suboptimally chosen interim analyses can lower utility. As can be seen in the optimal “interims every 100” design, these optimally chosen thresholds are fairly conservative at first, requiring posterior probabilities well in excess of 99% to stop very early for success.