Guide to the Draft FDA Bayesian Guidance 2026
By Kert Viele, Ph.D.
The new FDA draft Bayesian guidance is a wonderful document, formalizing progress made over the past two decades and providing direction for future advancement. It is both welcoming of innovation and scientifically rigorous, providing appropriate caution without dampening enthusiasm.
We appreciate the years of effort spent both in writing this document and in the many reviews and approvals of Bayesian designs over the decades. We applaud the thoroughness of the document.
Here we provide an overview of the document with explanations of the key concepts and the motivations for the guidance recommendations. We also include some comments suggesting revisions to the document. No doubt further thought will be helpful in refining these comments into actionable responses to the draft, which Berry intends to provide to the FDA as part of the standard comment process.
Section I, Introduction
We note that the guidance applies to CDER/CBER, with CDRH having a separate guidance on Bayesian methods. In many cases these guidances and practices are in agreement.
Sections II.A and II.B, Concepts and Definitions.
These sections cover definitions and concepts in Bayesian inference that will be familiar to most readers with Bayesian training (priors, posteriors, likelihoods, Bayes theorem).
Comments: We would suggest a minor addition covering predictive probabilities, which are largely absent from the document and are the remaining fundamental Bayesian concept used in clinical trials. Predictive probabilities are often used mid-trial to govern adaptations, or between phases of development to guide the initiation of new trials (e.g. run phase 3 if the predictive probability of phase 3 success is high enough). While there is less emphasis in the remainder of the document on predictive probabilities (this document focuses more on borrowing than adaptive trials), predictive probabilities would complete the Bayesian definitions and be useful to have in one place.
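As a concrete illustration of the concept, the sketch below (our own, not from the guidance) computes a predictive probability of success at an interim analysis for a hypothetical single-arm trial with a binary endpoint; the interim data, final sample size, null rate, and success threshold are all assumptions chosen for illustration only.

```python
# Minimal sketch: predictive probability of final success at an interim look.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

x_interim, n_interim = 12, 30     # responders / patients observed at the interim (hypothetical)
n_final = 60                      # planned final sample size (hypothetical)
p_null = 0.30                     # null response rate (hypothetical)
a0, b0 = 1.0, 1.0                 # Beta(1,1) noninformative prior

n_sims = 100_000
# Draw the response rate from its current posterior, simulate the remaining
# patients, then check the final success criterion Pr(p > p_null | data) > 0.975.
p_draws = rng.beta(a0 + x_interim, b0 + n_interim - x_interim, n_sims)
x_future = rng.binomial(n_final - n_interim, p_draws)
a_post = a0 + x_interim + x_future
b_post = b0 + n_final - (x_interim + x_future)
final_success = stats.beta.sf(p_null, a_post, b_post) > 0.975

print("predictive probability of success:", final_success.mean())
```

The same quantity, computed at the end of phase 2 with the planned phase 3 design plugged in, is what governs the "run phase 3 if the predictive probability of success is high enough" decision mentioned above.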
Section III, Situations where Bayesian methods have been used
This section provides many different forms of clinical investigation, in all phases of development, where Bayesian methods have been utilized and contributed to drug and biologic approvals. These include, with examples, borrowing of historical or external information in the form of an informative prior, using nonconcurrent control data in a platform trial, extrapolating adult data to pediatrics, borrowing across disease subtypes, borrowing across subgroups, and dose finding trials in oncology. Several of these are revisited elsewhere in the document.
This section provides a strong argument that the FDA is not new to Bayes, and that the guidance reflects extensive experience with real world clinical trials from conception to approval.
Many of these areas overlap in terms of goals and methodology. The first three (borrowing historical or external data, nonconcurrent controls, and adult to pediatric extrapolation) all attempt to use data from outside a randomized comparison to inform the results. The simplest to describe is the first, borrowing historical data, where we typically assume there is a parameter for each trial under consideration and perform a combined analysis across all trials, often with a hierarchical and/or mixture model. The central benefit/risk tradeoff in such analyses hinges on the degree of similarity among the trials being borrowed. When they match, we make better inferences; when they do not match, we make worse inferences. Much of the current research in this area aims to maximize the magnitude and range of benefit (e.g. achieving benefit while requiring less agreement between the trials). Adult to pediatric borrowing is similar in principle but requires special attention to the differences between pediatric and adult patients (ongoing growth, etc.). Pediatric borrowing must also consider that not all children are alike, and thus we expect greater similarity between teenagers and adults than between infants and adults.
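To make the "one parameter per trial" idea concrete, here is a minimal hierarchical-model sketch (our illustration, written with the PyMC library; the data, priors, and hyperparameters are hypothetical). Each trial's control response rate gets its own parameter, and the between-trial standard deviation governs how much the current trial borrows from the historical trials.

```python
# Minimal sketch: hierarchical borrowing of historical control data.
import numpy as np
import pymc as pm

# Responders and sample sizes: current trial first, then three historical trials
x = np.array([18, 45, 52, 40])
n = np.array([50, 120, 140, 100])

with pm.Model() as borrow_model:
    mu = pm.Normal("mu", 0.0, 2.0)                      # overall mean response (logit scale)
    tau = pm.HalfNormal("tau", 1.0)                     # between-trial standard deviation
    theta = pm.Normal("theta", mu, tau, shape=len(x))   # one parameter per trial
    pm.Binomial("y", n=n, p=pm.math.invlogit(theta), observed=x)
    trace = pm.sample(2000, tune=1000, target_accept=0.9)

# The posterior for theta[0] (the current trial) borrows heavily when the trials
# agree (small tau) and much less when they diverge.
print(trace.posterior["theta"].mean(dim=("chain", "draw")).values)
```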
Nonconcurrent controls in platform trials are used for the same purpose, to bring more information into the primary analysis, although there are greater opportunities given the structure of a platform trial. Nonconcurrent controls are enrolled prior to the treatment under consideration entering the trial. Unlike data completely external to the trial, nonconcurrent controls are enrolled under the same process, providing less risk of differences due to site or protocol, but retaining the possibility of differences across time. While they could be used identically to any other historical data, the platform trial structure allows greater methodological options (one such option is referenced in this section). Platform trials enroll multiple arms simultaneously, allowing greater ability to estimate time trends. Thus we typically estimate treatment effects over time in the platform setting, as opposed to the “one parameter per trial” approach to borrowing historical data.
Borrowing across different disease types and borrowing across different subgroups overlap significantly. Both are often handled with hierarchical models. With different disease types (often called a basket trial), we use a hierarchical model to borrow strength between different related diseases. The model includes the possibilities of strong or weak agreement between the disease subtypes, and borrows more or less depending on the agreement in the observed data. For subgroups, we often employ identical methodology. At times the border between “subtypes” and “subgroups” is unclear, and the distinction is often irrelevant given the overlap in methodology.
Dose finding is typically handled through different methods, using a continuous function to relate dose strength to safety or other endpoints.
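As one concrete example of such a continuous dose-toxicity function, the sketch below is in the spirit of the continual reassessment method, with a one-parameter curve linking dose to toxicity risk and a grid-based posterior update; the skeleton, prior, and data are all hypothetical assumptions of ours.

```python
# Minimal sketch: model-based dose finding with a one-parameter toxicity curve.
import numpy as np

skeleton = np.array([0.05, 0.12, 0.25, 0.40])   # prior guesses of toxicity by dose (hypothetical)
target = 0.25                                    # target toxicity probability

# Observed data so far: number treated and number with toxicity at each dose (hypothetical)
n_treated = np.array([3, 6, 3, 0])
n_tox = np.array([0, 1, 2, 0])

theta = np.linspace(-4, 4, 2001)                 # grid for the model parameter
prior = np.exp(-0.5 * (theta / 1.34) ** 2)       # N(0, 1.34^2) prior, unnormalized

# Power model: P(toxicity at dose j) = skeleton[j] ** exp(theta)
p_tox = skeleton[None, :] ** np.exp(theta)[:, None]
loglik = (n_tox[None, :] * np.log(p_tox)
          + (n_treated - n_tox)[None, :] * np.log1p(-p_tox)).sum(axis=1)

post = prior * np.exp(loglik - loglik.max())
post /= post.sum()

# Posterior mean toxicity curve, and the dose whose estimated risk is closest to the target
post_curve = (post[:, None] * p_tox).sum(axis=0)
print("posterior toxicity estimates:", np.round(post_curve, 3))
print("recommended dose index:", int(np.argmin(np.abs(post_curve - target))))
```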
Comments: This section provides a nice review of previous work, but would be enhanced by a strong statement that Bayesian methods can be used in other settings, including standard simple trials. The guidance includes the statement “Bayesian methods can also be considered in other settings” (and certainly they have been), but we have encountered “unless you are using informative priors, you should not do Bayesian” from stakeholders often enough that a stronger soundbite statement would be greatly appreciated.
Section IV.A Success Criteria
In a standard frequentist trial, efficacy is typically claimed with p<0.025. This section discusses Bayesian methods for claiming efficacy, both in terms of form and justification. The section primarily focuses on criteria of the form Pr(parameter > a) > threshold (the guidance uses the notation Pr(d>a) > c). While not applying to all trials, these posterior probability criteria are by far the most commonly used decision rules. Note that if the parameter of interest is a treatment effect, we would typically choose a=0 for testing superiority, and a equal to the non-inferiority margin (NIM) for a non-inferiority trial. This typically leaves the choice of threshold as the remaining piece.
Trials have traditionally been evaluated from a frequentist standpoint, requiring type 1 error rates of 2.5% (one sided). The standard “p<0.025” typically achieves this by construction. Section IV.A.1 discusses applying this same standard to Bayesian designs, choosing the threshold so that the trial has a 2.5% chance of falsely declaring efficacy when the null hypothesis is true. Such a trial is type 1 error controlled in the same sense as any other frequentist trial, but is driven by Bayesian machinery. Essentially, the trial is both Bayesian and frequentist. This framework has been used by most Bayesian trials at FDA to date. With simple trials and noninformative priors, this creates a threshold that looks similar to frequentist criteria, replacing p<0.025 with Pr(parameter > 0) > 0.975 (0.975 = 1-0.025). With more complex trials the threshold needed to achieve type 1 error control must be obtained through clinical trial simulation (discussed in great detail in the adaptive design guidance).
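As an illustration of how such a threshold might be calibrated by simulation, here is a minimal sketch (ours, not the guidance's) for a hypothetical two-arm trial with a binary endpoint and flat priors; a real adaptive design would need to simulate every interim decision rule as well.

```python
# Minimal sketch: calibrating a posterior-probability success threshold to a
# one-sided 2.5% type 1 error rate by simulating trials under the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_arm, p_null, n_sims = 100, 0.30, 10_000

# Simulate trials under the null hypothesis (treatment rate = control rate = p_null)
x_c = rng.binomial(n_per_arm, p_null, n_sims)
x_t = rng.binomial(n_per_arm, p_null, n_sims)

# Posterior probability that the treatment rate exceeds the control rate under
# independent Beta(1,1) priors, computed by numerical integration on a grid.
grid = np.linspace(0.001, 0.999, 400)
dgrid = grid[1] - grid[0]
pdf_t = stats.beta.pdf(grid[None, :], 1 + x_t[:, None], 1 + n_per_arm - x_t[:, None])
cdf_c = stats.beta.cdf(grid[None, :], 1 + x_c[:, None], 1 + n_per_arm - x_c[:, None])
post_prob = (pdf_t * cdf_c).sum(axis=1) * dgrid

# Choose the success threshold as the 97.5th percentile of the null distribution,
# so that roughly 2.5% of null trials (falsely) declare efficacy.
threshold = np.quantile(post_prob, 0.975)
print("calibrated success threshold:", round(float(threshold), 3))
print("simulated type 1 error at that threshold:", (post_prob > threshold).mean())
```

For this simple design with flat priors the calibrated threshold lands near 0.975, matching the correspondence with p<0.025 described above; with interim analyses or informative priors the calibrated value can differ substantially.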
This framework of type 1 error is problematic with informative priors. When borrowing, for example, we noted previously that inferences are better when the historical data and current trial match, and worse when they diverge. We do not achieve a single type 1 error rate, but rather a function which varies based on that agreement. When the historical data is most applicable, type 1 error rates can be reduced from 2.5% by borrowing, while large amounts of divergence can inflate the type 1 error rate greatly, depending on the direction of the divergence.
Standard frequentist definitions of type 1 error rate require us to look at the maximal value of this type 1 error function, which is typically (1) large, and (2) only occurs for parameter values the borrowed data already suggests are unlikely. As the guidance later states, this use of type 1 error control is inconsistent with the philosophy of borrowing.
Recognizing this, borrowing is typically allowed with some degree of type 1 error inflation, based on a negotiation between the sponsor and the FDA. The degree of inflation is a function of the indication as well as the quality and relevance of the borrowed data. At heart, we are relying on the totality of the evidence to drive conclusions. Our current trial, providing only a portion of the total data, will have a reduced burden to meet by itself.
In sections IV.A.2 and IV.A.3, we abandon the type 1 error framework and consider direct interpretation of the posterior probability, choosing a threshold for success that corresponds to a “sufficient” probability of efficacy, or a sufficient benefit/risk profile. This framework completely avoids the issues with type 1 error above, but requires sponsors and FDA to agree (1) that the prior distribution accurately reflects what is known or unknown about the therapy, and (2) on the quantification of the benefits and risks under consideration. Within a Bayesian framework, these benefits and risks are often simply referred to as a loss or a utility function. These utilities can include efficacy as a benefit and adverse events as a risk, with an agreed upon balancing between them.
This is essentially a “purely Bayesian analysis”, and is easy to implement once the pieces are agreed on. The difficulty is obtaining this agreement. There are some precedents here that may be useful. Most borrowing examples have included a discussion of type 1 error control (or the magnitude of allowed inflation), but that discussion is usually accompanied by discussion of the prior, so this is not an entirely new issue (allowing the discussion of the prior to be the focus, as opposed to type 1 error issues, is new). There are also examples of utility functions in phase 3 clinical trials. For example, the AWARD-5 trial of dulaglutide (Trulicity) employed a utility function incorporating several endpoints (efficacy, safety, and weight loss). Thus, again, there is some experience.
The guidance is vague on the process for obtaining this agreement, but this process will need to ensure fairness (how do we ensure in different reviews that sponsor A and sponsor B have similar requirements for priors, given that different experts can have different beliefs) and consistency over time. Reviewers can change, and sponsors need assurance that the prior they agreed to will not change without intervening data (the reporting of other related trials may necessitate changes in belief).
Comment: Sections IV.A.2 and IV.A.3 are very closely related. The subsections are described as different approaches, but it is unclear how one can implement IV.A.2 without bringing in the considerations of benefit and risk in IV.A.3. Thus section IV.A.2 does not seem like an approach on its own and might be better simply combined with section IV.A.3.
Comment: Many online commentators have focused on the minutiae of the language, such as a prior summarizing the “state of belief”. I sympathize with these points philosophically, but it’s unclear that such quibbles would change the central points of “what data is needed to reach drug approval” or change patients’ lives. We thus avoid such philosophical distinctions or debates here.
Section IV.A.4 Additional considerations
This section primarily focuses on Bayesian sequential designs justified using type 1 error control, noting that the thresholds at each interim require adjustment to maintain overall type 1 error control. There is also a note that, within a type 1 error controlled framework, secondary endpoints will also require type 1 error calibration. This is an implication of type 1 error control in general (these comments would apply to frequentist designs). Within the decision analysis framework, these considerations may be unnecessary if agreement can be found as noted above.
Section IV.B Operating Characteristics
Operating characteristics broadly refer to any statistical quantification of the behavior of a trial. From a frequentist perspective, these include type 1 error and power, but also expected sample size, probability of picking the right dose in a dose finding trial, and anything else of importance.
Whether you are a Bayesian or a frequentist, these quantities can be computed conditional on any fixed set of parameters. We can ask the question “given the mean difference between the arms is 2, what is the probability the trial claims efficacy (i.e., the power)?”.
From a frequentist perspective, we are usually interested in looking at these conditional operating characteristics at each individual parameter value. For example, type 1 error is the probability of declaring efficacy when the null is true. If the null is complex, this may involve a set of parameters. In a dichotomous trial testing whether a control and treatment arm have the same rate (e.g. p0=p1), we may need to look at the type 1 error rate when p0=p1=0.2, or when p0=p1=0.5, or any other value (usually the type 1 error rate is defined as the largest value across that set). The power can be computed conditional on many separate differences between arms, and expected sample sizes might be examined across a wide range of possible rates.
From a Bayesian perspective, we place a prior distribution on those rates, and this affects which parameter values a Bayesian should pay attention to. If we expect rates to be small (e.g. a trial with expected low mortality rates), then the behavior of the trial when p0=p1=0.8 is of little interest (as are any high rates). Thus, the Bayesian focuses their attention on the values of p0 and p1 that are likely to occur and integrates the operating characteristics over that prior distribution to obtain the overall expected behavior of the trial.
The guidance also notes that “false positive conclusion” can mean different things in frequentist and Bayesian contexts (the concepts are mathematically well defined for either, but the wording and emphasis may change). Frequentists often refer to false positive conclusions in the context of type 1 errors, meaning the probability of falsely claiming efficacy conditional on the null being true. Bayesians often refer to false positive conclusions in something closer to a false discovery rate context, meaning the probability that a claim of efficacy is false. Mathematically, the frequentist is computing Pr(claim efficacy given null is true), while the Bayesian is computing Pr(null is true given claim of efficacy). These are related through Bayes theorem, but certainly not equal to each other, and in fact can be quite different.
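The relationship between the two quantities can be written out explicitly; the numbers in the worked example that follows are purely hypothetical.

```latex
\[
\Pr(\text{null} \mid \text{claim efficacy})
  = \frac{\Pr(\text{claim efficacy} \mid \text{null})\,\Pr(\text{null})}
         {\Pr(\text{claim efficacy} \mid \text{null})\,\Pr(\text{null})
          + \Pr(\text{claim efficacy} \mid \text{alternative})\,\Pr(\text{alternative})}
\]
```

For example, with a 2.5% type 1 error rate, 80% power, and a 50% prior probability that the null is true, the probability that a claimed efficacy is false is 0.0125/(0.0125 + 0.40), roughly 3%; with a more pessimistic prior on the alternative the two quantities can diverge sharply.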
This difference in philosophy is immediately impactful when borrowing external information. When that external information indicates certain parameter values are unlikely (for example strong prior evidence a therapy is effective), the behavior of the trial at those unlikely parameter values is less important. Thus, with strong prior evidence a therapy is effective, we focus less on type 1 error.
Comment: Even in a Bayesian context, where the evaluation of a design depends on integration over the prior, there is value to examining the operating characteristics at particular points. For example, futility rules are often most valuable when parameters are close to the null hypothesis, while early success rules are most valuable for larger effects. Given that particular rules are more important for specific ranges of parameters, looking at individual points can be more efficient when constructing a trial design, even if the overall goal is integrated behavior. Essentially, if you want to maximize an integral, maximizing the function at each point will facilitate that goal.
Section IV.B.1 Trials calibrated to type 1 error rate
This section emphasizes that many of the trial operating characteristics that apply to frequentist trials apply to Bayesian trials as well. Given any set of decision rules, Bayesian or frequentist, we can evaluate the frequentist behavior of those rules. If the trial is justified by type 1 error control, then control needs to be demonstrated across the space of null hypotheses, as discussed above. Note the phrase “plausible range of assumptions”. This is important given that many trials may be controlled across a range of null hypotheses, but not all. In the example above, where low mortality rates might be expected, there is no need to justify type 1 error control for p0=p1=0.999, where many dichotomous testing procedures might inflate type 1 error. The scenario p0=p1=0.999 is not plausible.
Comment: Clearly trials justified on type 1 error rate must compute the type 1 error rate. However, it is unclear this requires computing all other operating characteristics in a frequentist manner. There may still be great use in computing Bayesian (e.g. integrated) behavior for power, average sample size, the probability a decision is correct, etc.
Section IV.B.2 Trials not calibrated to type 1 error rate
This section provides more detail on the basic idea above where operating characteristics are integrated over the prior. Sample operating characteristics include Bayesian power (the frequentist power curve integrated over the prior distribution, also called expected power or “assurance”), the probability of reaching a correct decision, as well as the expected bias and MSE of point estimates and the expected coverage of credible intervals. These are all essentially integrated versions of their frequentist counterparts, with the exception of the probability of making a correct decision, which again flips the conditional (similar to how positive and negative predictive value can be computed for a diagnostic test as opposed to sensitivity and specificity).
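To illustrate the integration, here is a minimal assurance sketch (our own, for a hypothetical fixed two-arm design with a normal endpoint and a made-up design prior on the treatment effect): the closed-form frequentist power is simply averaged over draws from the design prior.

```python
# Minimal sketch: "Bayesian power" / assurance as power averaged over a design prior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_arm, sigma, n_sims = 100, 1.0, 50_000

# Hypothetical design prior on the true mean difference (e.g., from earlier-phase data)
delta = rng.normal(0.25, 0.15, n_sims)

# For a fixed two-arm z-test, power given delta has a closed form;
# assurance is that power averaged over the design prior.
se = sigma * np.sqrt(2 / n_per_arm)
power_given_delta = stats.norm.sf(stats.norm.ppf(0.975) - delta / se)

print("assurance (expected power):", round(float(power_given_delta.mean()), 3))
print("power at the prior mean effect:",
      round(float(stats.norm.sf(stats.norm.ppf(0.975) - 0.25 / se)), 3))
```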
The guidance also defines the terms “analysis prior” and “design prior”. The analysis prior is the chosen prior used to make trial decisions. There is a single analysis prior, which requires agreement with the FDA (the section on prior distributions is later in the document). In contrast, we may ask the question “how does the trial behave for someone with a different prior?”. If, for example, we begin with an optimistic analysis prior we may choose a smaller sample size. Data that would be convincing evidence of efficacy for the analysis prior may not be convincing for someone with a different, more skeptical prior. It is valuable for a trial to be convincing to a range of possible prior beliefs. These alternative priors are called “design priors”. Note that, like type 1 error control, these may be limited to plausible alternative priors. For example, in a time to event trial it is completely implausible a therapy grants immortality, so considering a design prior where the hazard ratio is 0 with probability 1 would be unnecessary (as would many other priors). Clearly the variability in the design priors will be context specific, but should reflect the range of plausible opinions for the indication and treatment under study. Given that decision rules are optimized with respect to the analysis prior, we expect performance to differ when different design priors are used. The question is by how much, and with respect to which operating characteristics.
Comment: This section primarily involves statistical quantities as operating characteristics. This is consistent with a statistical guidance, but there is an extensive Bayesian literature focusing on optimizing the treatment of patients, rather than optimizing our estimation of parameters (these are clearly correlated, but not identical). For example, if our endpoint is mortality, we might ask the question “what decision rules save the most lives in the future” as opposed to “what decision rules provide the best estimate of the mortality rate”. These can lead to different decisions, for example in the size of the trial or in interim allocation rules. These should be included as possible operating characteristics, and arguably even as preferred operating characteristics. Better treatment of patients is the end goal; parameter testing and estimation are means to that end, not ends in themselves.
Section IV.B.3 Additional Considerations
This section notes that safety and other features are also important to consider when designing a trial.
Section V.A Prior Distributions Overview
This section notes that agreement on priors is a key part of any Bayesian analysis. Section V in general notes the difference between noninformative priors, which may require less justification, and informative priors, which will require considerably more justification.
Section V.B Noninformative priors
This section notes that noninformative (or minimally informative) priors are generally overwhelmed by the data quickly and thus conclusions are almost entirely driven by the observed data, resulting in less need to justify than informative priors.
There are two caveats to this statement which users should be aware of. The first is that some noninformative priors can have odd “edge cases” where they become very informative. For example, a Beta(epsilon,epsilon) prior on a rate is in some sense noninformative. Point estimates nearly exactly match standard frequentist estimates. However, in situations where the data consist of either all responses or all nonresponses, we can obtain overly informative posterior distributions. Suppose we observe 5 responders and 0 nonresponders. The posterior distribution is Beta(5+epsilon,epsilon), which has a 99.95% chance of exceeding 99.9% (essentially, it looks like a point mass near 100%). This is usually unwanted. This situation may be unlikely to arise in many settings, where larger sample sizes make it very likely to observe at least one response and at least one nonresponse, but it may be worrisome in others. For example, in a small trial with early interim analyses, such priors could result in premature conclusions of futility or efficacy based on artificially high posterior probabilities.
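A quick numerical check of this edge case (our own sketch, assuming epsilon = 1e-4; the exact probability depends on the epsilon chosen):

```python
# Minimal sketch: the Beta(epsilon, epsilon) prior after observing 5/5 responders.
from scipy import stats

eps = 1e-4
x, n = 5, 5                      # 5 responders, 0 nonresponders
posterior = stats.beta(eps + x, eps + n - x)

# Posterior probability that the response rate exceeds 99.9%: close to 1,
# i.e. the posterior behaves almost like a point mass at 100%.
print(posterior.sf(0.999))

# Contrast with a uniform Beta(1,1) prior, which is far less extreme here.
print(stats.beta(1 + x, 1 + n - x).sf(0.999))
```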
Similarly, noninformative priors on one scale may yield informative priors on another scale, and thus if interest centers on multiple scales, we need to be sure our priors are suitably noninformative on all important scales.
Both of these situations are typically understood by experienced practitioners and avoided.
Section V.C Skeptical (and enthusiastic) priors
This section describes a situation where considerable negative information exists (for example several related failed trials) and thus the prior distribution may result in great pessimism about a beneficial treatment effect. This section also discusses adaptive rules where skeptical priors might be used to obtain a desired type 1 error rate, or where enthusiastic priors (the opposite of skeptical, favoring the treatment) might be used for futility rules.
Comment: The guidance notes such skeptical priors have not typically been used in practice. It is difficult to imagine a sponsor voluntarily proposing any prior that results in more burden than standard frequentist methods, and it is difficult to imagine the FDA forcing a skeptical Bayesian prior on a sponsor proposing a frequentist method.
Comment: This section is very difficult to place in the context of the analysis and design priors discussed earlier, and is a bit inconsistent with the usage of skeptical and enthusiastic priors that I am familiar with. For example, a standard discussion setting is how an investigator should run a trial knowing that others have different prior beliefs. Here, one might consider running a trial long enough so that the investigator, a skeptic, and an enthusiast all converge to similar beliefs. In those settings the skeptic and the enthusiast are priors relative to the investigator, as opposed to an investigator who is negative because of considerable negative information. From that perspective, this material would integrate better into the section on design priors by describing skeptical and enthusiastic priors as likely design priors, whereas this section seems to be creating a different definition (skeptical just means the analysis prior is negative for the treatment). The second paragraph, with adaptive rules based on skeptical and enthusiastic priors, seems more consistent with the discussion above about reaching agreement. In the context of design and analysis priors, the argument here would be that using the analysis prior for all interim decisions may stop the trial prior to reaching agreement among all relevant parties. In effect, under certain skeptical or enthusiastic design priors, the trial will perform poorly by having a high likelihood of being inconclusive. This can be addressed by directly requiring the adaptive rules to use priors that do not match the analysis prior used in the primary analysis. All in all, I would recommend deleting this section and placing these ideas in the discussion of design and analysis priors.
V.D. Informative priors for borrowing historical information
Here we reach the most complex situation, where historical or external data must be synthesized and agreement with FDA will require the most discussion. Given this complexity, this section begins with the reasonable statement that the use of informative priors should deliver value, meaning that an easier-to-justify trial without borrowing would meaningfully sacrifice performance.
The guidance also notes that informative priors have been used mostly in pediatrics and rare diseases.
Comment: This sentence can be interpreted as a historical fact, or as a recommendation that informative priors are best suited to pediatrics and rare diseases, and it is important to be clear which is intended. There is no question the statement is true as a historical fact, largely because pediatrics and rare diseases have historically been areas where experimentation with design methods has been allowed. This does not necessarily mean those are the best areas for use of informative priors. Pediatrics and rare diseases do invite borrowing because borrowing always lowers variability, and the smaller sample sizes encountered in these areas provide ample room for lowering variability. On the other hand, common diseases typically have far greater amounts of high quality data (for example more randomized trials), still allowing for meaningful reductions in variability while also providing substantial ability to minimize bias and perform covariate adjustments. This sentence should be clarified. The sentence after it, “Additional cases can be considered on a case by case basis”, is typically read negatively by sponsors as indicating that the FDA’s default position will be to say no, regardless of the FDA’s actual intent.
This section continues noting that agreement on priors will require extensive discussion with FDA both about the methodology and the data being borrowed (further sections expand on the content of this discussion). This section encourages methods which address the possibility of “prior data conflict”, meaning the possibility that the observed data will be inconsistent with the prior distribution (e.g. you have a prior saying a parameter is likely between 3 and 5, and the data comes back indicating it is between 7 and 9). In such situations it is desirable to have a prior distribution that can result in posterior distributions away from the initial prior range (e.g. when the data indicates 7-9 the posterior can reflect larger probability in the 7-9 range and less in the initial 3-5 range). Typically this is achieved by using priors with “heavy tails”, often through hierarchical modeling or mixture priors.
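As one concrete way to build in heavy tails, the sketch below (our illustration, with hypothetical numbers) uses a robust mixture prior for a binomial control rate: an informative component derived from historical data mixed with a vague component. When the current data conflict with the historical component, the posterior weight shifts toward the vague component and borrowing is reduced.

```python
# Minimal sketch: robust mixture prior for a binomial rate, and how its
# posterior component weights respond to prior-data conflict.
import numpy as np
from scipy.special import betaln

def mixture_posterior(x, n, w, informative=(30.0, 70.0), vague=(1.0, 1.0)):
    """Posterior weights and Beta components for a two-component mixture prior."""
    comps = [informative, vague]
    # Marginal likelihood of the data under each component (the binomial
    # coefficient is common to both components and cancels in the weights).
    log_marg = np.array([betaln(a + x, b + n - x) - betaln(a, b) for a, b in comps])
    log_w = np.log([w, 1.0 - w]) + log_marg
    post_w = np.exp(log_w - log_w.max())
    post_w /= post_w.sum()
    post_comps = [(a + x, b + n - x) for a, b in comps]
    return post_w, post_comps

# Current data consistent with the ~30% historical rate: the informative
# component keeps most of the weight, so borrowing remains strong.
print(mixture_posterior(x=15, n=50, w=0.8)[0])

# Current data in conflict (~60% rate): the weight shifts to the vague
# component and borrowing is largely switched off.
print(mixture_posterior(x=30, n=50, w=0.8)[0])
```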
Comment: This section makes complete sense, but it is interesting that this is one of the few areas where CDRH practice differs from CDER/CBER: CDRH often prefers simpler borrowing methods (e.g. “static borrowing”) that do not adjust to prior data conflict. There is justification for this in that devices tend to move more incrementally, and thus we expect less prior data conflict, but this is not guaranteed.
Section V.D.2 Identification and Review of Available External Information
This section lays out criteria for data that may be successfully borrowed. This section notes (1) data quality and reliability, (2) the desire for statistical analysis of the historical information to be prespecified, (3) the need for comparability between historical information and the current trial in terms of inclusion/exclusion criteria, endpoint construction, time recency, and aspects of standard of care, (4) the design of historical studies (randomized comparisons are preferred), and (5) the availability of patient level data.
Comment: Patient level data is extremely valuable, but situations often arise where many relevant studies are available only without patient level data. Viewing that information as less valuable is certainly reasonable, but simply excluding information that satisfies the other criteria (particularly data with advantages on the other metrics, such as greater recency) may also be inadvisable.
Section V.D.3 Prior construction
This section begins by noting that all relevant information should be used in constructing a prior. In the extreme, cherry picking only favorable studies is clearly bad. While that would be downright nefarious, there are subtler forms of bias that can creep into selections. For example, suppose the historical literature reveals only a subset of patients are likely to benefit, and the current trial investigates that subset. Unless proper discounting is used to account for the subset selection, using the historical data both to select the subset and as prior information will be inappropriate. This section notes the similarities between best practices in borrowing information and best practices in other systematic reviews such as meta-analyses.
This section notes that different data sources may differ in relevance, and hence be weighted differently in constructing a prior. This may result in case by case discussions.
There is a discussion of exchangeability, including the formal requirement (exchangeability essentially says you are using a model where all the studies under consideration are treated equally, meaning that you could change the indexes in the mathematical equation without changing the resulting prior distribution). This is a fundamental property of hierarchical models. These models might be augmented by allowing for the possibility of covariate adjustment, which may correct for bias related to covariate balance differences between studies.
Comment: Exchangeability is often discussed in this context, but it is also often irrelevant. There is no practical way to “prove” exchangeability holds, and the benefits of external borrowing depend on closeness, not exchangeability. I would rather borrow from two studies that have a small systematic bias between them (hence not exchangeable) but low trial to trial variability, than from completely exchangeable studies with high study to study variability.
Comment: We also note here that mixture models are deliberately nonexchangeable. In these models we may use a hierarchical model for the external studies, but the current study is allowed an “out” where, with some prior probability, it is completely different than the others. This is an effective way to address prior data conflict, but does not result in exchangeability of the current study.
This section concludes with several other case by case examples of issues that have arisen in other situations, for example a trial combining two populations that had been previously studied separately and thus required an overall prior combining borrowing for the two populations.
Section V.D.4 Discounting
When borrowing information, we typically view historical or external patients as providing less information than patients within the current study. The historical patients may systematically differ and are not directly randomized against the current patients. Many methodologies exist that attempt to assign a weight to the historical patients. The simplest methods perform “static discounting”, where a weight is selected in advance and used regardless of the current data (e.g. each historical patient is assumed to carry one fourth of the information of a current patient). This is in contrast to dynamic discounting, where the weight given the historical data depends on the current data, with higher agreement between historical and current data resulting in higher weights, and lesser agreement resulting in lower weights.
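For the static case, a minimal sketch (ours, with hypothetical numbers) is the power prior for a binomial control rate, where each historical patient contributes a fixed, pre-specified fraction of a current patient's information:

```python
# Minimal sketch: static discounting via a power prior (fixed weight a0).
from scipy import stats

x_hist, n_hist = 60, 200       # historical control data (hypothetical)
x_curr, n_curr = 14, 50        # current control data (hypothetical)
a0 = 0.25                      # static weight: each historical patient counts as 1/4 patient

# Power prior: raise the historical likelihood to the power a0, then update
# with the current data (starting from a Beta(1,1) initial prior).
a_post = 1 + a0 * x_hist + x_curr
b_post = 1 + a0 * (n_hist - x_hist) + (n_curr - x_curr)
posterior = stats.beta(a_post, b_post)

print("posterior mean control rate:", round(float(posterior.mean()), 3))
print("patients' worth of information borrowed:", a0 * n_hist)
```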
Comment: My experience is that this is more commonly called “dynamic borrowing” rather than “dynamic discounting”. I’ve typically heard “discounting” used to refer to static weighting, making “dynamic discounting” an oxymoron (which of course it isn’t as used here). Others’ experience may vary. The intent here is clear.
Virtually all borrowing presents the same qualitative benefit/risk tradeoff, which depends on the agreement between the historical data and the current trial; the disagreement between them is often called “drift”. When the drift is small, borrowing presents benefits (better inferences with lower sample sizes). When the drift is large (for example borrowing from historical data with a 50% rate in a trial that actually has a 70% rate), we obtain biases and poor inferences. These poor inferences may be inflated type 1 error rates or reduced power, depending on the direction of the drift (upward or downward).
The range of drift where we achieve benefit depends on the methodology used. Typically weaker borrowing produces modest benefits over a larger range of drift, with modest risks outside of that range. Stronger borrowing produces greater benefits over a narrow range of drift, with greater risks outside of that range. Thus, when using stronger borrowing, you need to be quite confident that the drift will indeed be small, while weaker borrowing may still be useful under greater uncertainty.
This section describes a wide variety of methodologies used to navigate this tradeoff, which we do not separately summarize here. All the research in this area aims to maximize the benefits and minimize the risks, and to achieve benefit over as broad a range of drift as possible.
All dynamic borrowing methods have user selectable parameters which govern the degree of borrowing, and the values of these parameters should be commensurate with the data being borrowed and the possible degree of drift. Large amounts of highly relevant data may allow greater levels of borrowing, while weaker data may create unacceptable risks.
Comment: One class of methods discussed here has interesting properties. It defines the discount weight as a function of the observed discrepancy between the historical and current data (e.g. if history equals the current trial, perhaps the weight is 0.6, while if the difference is 1, the weight is 0.4; if it is 2, the weight is 0.2, etc.). This function may be selected by the user. One simple method for dynamic borrowing, typically not used, is “test then pool”, where a significance test is performed on whether history equals the current trial. If the null is not rejected, we use a weight of 1, pooling the historical data. If the null is rejected, we use a weight of 0. This is one simple such function.
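A minimal sketch of that test-then-pool weight function (ours, with hypothetical data and a hypothetical test level):

```python
# Minimal sketch: "test then pool" as a weight function of the observed discrepancy.
from scipy import stats

def test_then_pool_weight(x_hist, n_hist, x_curr, n_curr, alpha=0.10):
    """Weight 1 (pool) if historical and current rates are not significantly
    different, weight 0 (discard) otherwise."""
    table = [[x_hist, n_hist - x_hist], [x_curr, n_curr - x_curr]]
    _, p_value = stats.fisher_exact(table)
    return 1.0 if p_value > alpha else 0.0

print(test_then_pool_weight(60, 200, 14, 50))   # similar rates -> weight 1 (pool)
print(test_then_pool_weight(60, 200, 30, 50))   # conflicting rates -> weight 0 (discard)
```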
Such methods can lead to odd conclusions. Depending on the function selected, we can create situations where a specific dataset from our current trial would not claim efficacy, whereas a weaker dataset would. This can occur when the weaker dataset has greater agreement with the historical data, and the assumed function’s increased weight compensates for the reduced observed treatment effect. We have not seen this behavior with other methods. It would be quite awkward to put in a paper “we were unable to claim efficacy, but if our current trial results had been a little worse, we would have”.
Section V.E Quantifying the influence of the prior distribution
Statistical models for borrowing can be complex, and thus we typically explore several metrics to quantify the influence of the prior distribution. These metrics may include the difference between point estimates when borrowing is employed versus a model with no borrowing, or effective sample sizes borrowed.
Effective sample sizes are often quite straightforward with static borrowing. If we have a Beta(alpha,beta) prior and observe dichotomous data, we obtain a Beta(alpha+#responders, beta+n-#responders) posterior distribution. The sum of the parameters always increases by n, the sample size. Given the relationship between the sum of the parameters and the sample size, the prior effective sample size (e.g. prior to the data) can be quantified as alpha+beta. This creates a straightforward mathematical relationship
Posterior Effective Sample Size = Prior Effective Sample Size + n
There are similar relationships for other common distributions, especially those with conjugate priors. This equation holds regardless of the observed data in the current trial.
This relationship is much more complex with dynamic borrowing. The degree of borrowing depends on the observed data, with greater or lesser borrowing based on greater or less agreement between history and the current trial. Thus, there is no single effective sample size for the prior. Instead we must look at the posterior effective sample size as a function of the observed data. For each specific possible dataset, we find a posterior effective sample size and define the effective number of borrowed patients as the posterior effective sample size minus n. Typically we can plot this as a function of the sufficient statistics in the current trial. For example, if we are borrowing on the control arm for a dichotomous endpoint, we might plot the effective sample size borrowed versus the observed responses in the current control arm.
Looking at this function, we can see the maximal sample size borrowed, usually occurring when the historical and current values are equal, and the degree to which the sample size borrowed decreases as the historical and current values diverge.
Note that most definitions of effective sample size are heuristic (based on equating moments or other similar quantities). As such, effective sample size curves can have negative values or be greater than the sample size of the borrowed data. Thus, while useful, they should not be viewed as an absolute measure of the influence of the prior.
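As one illustration of the moment-matching heuristic, the sketch below (ours, reusing a hypothetical two-component mixture prior) computes the effective sample size borrowed as a function of the observed current-trial responses; consistent with the caveat above, the heuristic can produce negative values under strong conflict.

```python
# Minimal sketch: moment-matched effective sample size borrowed, as a function
# of the current-trial data, for a hypothetical mixture prior on a binomial rate.
import numpy as np
from scipy.special import betaln

def ess_borrowed(x, n, w=0.8, informative=(30.0, 70.0), vague=(1.0, 1.0)):
    comps = [informative, vague]
    # Posterior mixture weights (the binomial coefficient cancels between components)
    log_w = np.log([w, 1.0 - w]) + np.array(
        [betaln(a + x, b + n - x) - betaln(a, b) for a, b in comps])
    pw = np.exp(log_w - log_w.max())
    pw /= pw.sum()
    # Mean and variance of the posterior mixture of Beta(a + x, b + n - x) components
    a = np.array([c[0] + x for c in comps])
    b = np.array([c[1] + n - x for c in comps])
    m = a / (a + b)
    v = a * b / ((a + b) ** 2 * (a + b + 1))
    mean = np.dot(pw, m)
    var = np.dot(pw, v + m ** 2) - mean ** 2
    # Match the mixture to a single Beta; its "sample size" is mean*(1-mean)/var - 1,
    # and the amount borrowed is that quantity minus the current sample size n.
    return mean * (1 - mean) / var - 1 - n

# Borrowing shrinks (and can go negative) as the current data drift away from
# the historical ~30% rate.
for x in [10, 15, 20, 25, 30]:
    print(x, round(float(ess_borrowed(x, 50)), 1))
```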
This section also notes that type 1 error inflation has been used to measure the influence of the prior, but also discourages this use. As the guidance states “Evaluating the degree of borrowing based on the expected outcome when there is no effect is philosophically inconsistent given a prior which assumes a nonzero effect”.
Section V.F Sensitivity analyses
Similar to the notion of design priors and analysis priors earlier, the guidance suggests sensitivity analyses should be performed with alternative priors. One analysis we have used is a “tipping point” analysis, assessing what value of borrowing is necessary to maintain a successful result (of course, it is possible the result remains successful with no borrowing).
Section VI. Estimands and Missing Data in a Bayesian Setting
Missing data is an unfortunate reality in many datasets. When borrowing historical information, we often encounter missing data in the historical data as well as in the current trial. When we borrow historical information with missing data, it is important that missing data is handled similarly in both the historical and current datasets. Where possible, this can often be achieved by simply aligning the current trial to match the historical procedures, or by reanalyzing the historical trial using the methods in the current trial (noting that any selections here should not be cherry picked to obtain a favorable result). Unfortunately, this is not always possible; for example, needed elements may be absent from the historical database (e.g. followup data might not be available).
Section VII. Software and Computation
The guidance notes that FDA does not require the use of any specific software for statistical analysis; however, any software that is used should be reliable and adequately tested. In addition, this section notes that Markov Chain Monte Carlo (MCMC) techniques are often used in Bayesian analyses, and MCMC techniques require confirmation that they are “mixing properly”, meaning that sponsors are using algorithms which adequately explore the entirety of the posterior distribution.
Section VIII.A Documenting Plans for Bayesian Analysis
This section provides a guide for submissions to FDA describing clinical trial plans. At heart, this section revisits all the components in the prior sections and indicates that the documentation should adequately demonstrate that the guidance has been followed. This includes the prior specification and justification and, if borrowing, the methodology and data sources to be used for borrowing. Sponsors should also describe their decision criteria (including those at any interim analyses) and provide operating characteristics for the trial. If type 1 error control is being used to justify the Bayesian design, then that should be emphasized in submissions.
There is a greater emphasis on convergence diagnostics here than in prior guidances. Additionally, the guidance notes the potentially increased time to review a complex submission and encourages early discussion with FDA.
Section VIII.B Reporting Bayesian Analyses
Once the trial is complete, sponsors should report to FDA (1) the information in the original plan, with any amendments made over the course of the trial, (2) the results of the study, including the results with the analysis prior as well as with any design priors used for sensitivity analyses, (3) results of model checking and convergence diagnostics if applicable, (4) the software used for the analysis, and (5) a discussion of overall conclusions. The guidance notes that the more technical aspects of this reporting, for example the convergence diagnostics, can be included in an appendix.
Closing Comment: To come back to where we started, this guidance is wonderfully thought out and written. We applaud the effort of the many people who were involved. Our comments are mostly minor with suggestions for clarity, and we look forward to continuing our work in Bayesian clinical trial design under this more formal paradigm.