Hierarchical modeling is a powerful tool for making inferences about multiple groups in clinical trials, whether these groups are multiple indications or tumor types in an oncology study, multiple sites in a device clinical trial, or any other source of patient heterogeneity.
Hierarchical models that share information across groups form a backbone of modern Bayesian thinking. We currently employ hierarchical modeling most commonly in our oncology basket trials, see for example Berry et. al (2013)
We also employ hierarchical modeling in recent work on the ADAPT antibiotic platform trial, where we share information across multiple body sites
Hierarchical modeling promises to decrease sample sizes and increase the power of clinical trials but also requires care in implementation. Our goal here is to explain some of the intuition behind hierarchical modeling and how that relates to the advantages and disadvantages of inferences for multiple groups.
Suppose I’m running an oncology basket trial where success rates on standard of care run about 10%. These are trials that investigate a targeted agent against tumors in multiple body sites, under the assumption that if the targeted agent is effective against a particular mutation, then the agent should generally (but not always) be effective against tumors throughout the body regardless of location. These trials are also referred to as enrichment or umbrella trials in the literature, see for example Woodcock and LaVange (2016)
If I observe four tumor types, 20 patients each, and the data indicate 3/20, 4/20, 5/20, and 6/20 successes in the four groups, then what is my best guess of the success rate for the group with 6/20 successes? Is it greater than 10%?
The standard estimate for a binomial rate is to take the observed proportion 6/20=30%. If I had only collected data in that group, this estimate would be unbiased (on average it’s equal to the true rate for the group) and have the greatest precision. Statisticians have a fancy name for such an estimate, the minimum variance unbiased estimate (MVUE). And for one group, the MVUE is essentially the best estimate you can make, absent prior information. After all, being unbiased and minimum variance are really good things.
But I didn’t collect data from one tumor type, I collected data from 4 types and I also know that 6/20 is the highest result among the 4 types. Should that change my estimate? And if so, how?
It is well established that testing multiple hypotheses is prone to multiplicity errors. Multiple methods (Bonferroni corrections, false discovery rates, gatekeeping strategies) exist which emphasize that p=0.024 means something very different in a group of p-values as opposed to the single primary analysis from a study. See for example Yadav and Lewis (2017).
Within a basket trial, we do not need to control family wise type I error. A sponsor could run separate trials in each indication and would not need to adjust alpha. Nevertheless, a sponsor would be wise to interpret the result taking into account multiplicities, both negative, in that one spurious result may be a random high, and positive, in that repeated good results are highly suggestive of a real positive trend. For example, all four groups above have observed rates in excess of 10%.
Clearly, taking account of the extra information in multiple groups is fundamental to hypothesis testing. Similar phenomena apply to estimation and suggest our usual definitions of bias are limited when considering multiple groups. If I want to estimate the underlying true rates of the 4 groups from the data 3/20, 4/20, 5/20, and 6/20, the standard estimates are the sample proportions 15%, 20%, 25%, and 30%. Within each group, these estimates are unbiased and have the smallest possible variance.
In the 1950s, Charles Stein made a startling discovery that, once multiple groups were considered, biased estimates are superior in terms of accuracy (mean square error). In other words, while alternative estimates may be biased, they will tend to be closer to the right answer. Their reduction in variance outweighs the bias. Stein’s arguments were technical, but we do have intuition for the results and a motivation for why hierarchical models present a natural inferential framework.
One basic observation in statistics is that whenever you add noise to a system, variation increases. This seems obvious but has fundamental ramifications. Whenever we observe multiple groups, the highest observed value from the groups is likely biased high and the lower observed value from the groups is likely biased low. Returning to our example with 3/20, 4/20, 5/20, and 6/20 in the four groups, the 30% estimate for the highest group is likely an overestimate, and the 15% in the lowest group is likely an underestimate.
This phenomenon is commonly seen in sports, for example in baseball where we see batting averages from multiple players. Early in the season we often see several players batting over 0.400 and can observe some extreme slumps at the same time. Neither of these extremes is likely to be real. The highest batting averages are most likely to be good players, but players who are both good and lucky. If you want to estimate how good they really are, you need to remove the luck. For the players with the highest observed averages, their estimates should be lower than their observed rates.
Whenever we observe multiple groups, either at bats within batters or success rates within tumor types, the observed rates are likely farther apart than the underlying truth. If we want to estimate that underlying truth, it makes sense to move those observed rates together.
Suppose the true underlying rates were 19%, 21%, 24%, and 26% in the four groups. The true standard deviation for the four groups is 3.11%. If we observe 20 observations in each group, we might get 4/20 in each and no variation, but it’s more likely we get a broader range. In fact, with random sampling the expected observed standard deviation is 9.06%, almost triple the standard deviation of the underlying truth. On average, the largest of the 4 observed rates is 32.8%, far higher than the true highest 26%. The lowest of the 4 observed rates is, on average, 12.8%, much lower than the true lowest 19%.
In this example, it clearly makes sense to estimate the lowest group higher than the observed rate (after all, it’s clearly biased low) and estimate the highest group as lower than it’s observed rate (it’s clearly biased high). In effect, the estimates should be moved closer together. The resulting estimators are typically called “shrinkage estimators”.
But how far to shrink? With N=20 per group and true rates of 19%, 21%, 24%, and 26%, we found
–average observed SD 9.06%, almost triple SD of true rates 3.11%
–average observed high rate 32.8%, true high rate 26%
–average observed low rate 12.8%, true low rate 19%.
What about N=200 per group? The results are a lot closer
–average observed SD 4.0%, much closer to 3.11%
–average observed high 27.0%, much closer to 26%
–average observed low 18.1%, much closer to 19%
Thus, with a large sample size, we should “shrink” far less. Suppose our true rates were farther apart, with true rates of 5%, 20%, 35%, and 50%, and we observed 20 per group. The standard deviation of the true rates is 19.4%
–average observed SD 20.9%, close to true SD
–average observed high 51.4%, not far from 50%
–average observed low 4.7%, not far from 5%
In the later examples, we should clearly shrink less than in the earlier example given the observed results (standard deviation, minimum and maximum) are much closer to their true counterparts.
The key to determining the appropriate amount of shrinkage is the ratio of the true across group variance to the within group variation (the variation resulting from the sampling variation in observed N=20, or 200, observations per group). When the within group variability is much larger than the across group variability, as in our first example, we should shrink our estimates considerably. In the later examples, we should shrink less, either because the N=200 samples resulted in far less within group variation in the second example, or because the large discrepancy in underlying true rates in the third example results in large across group variation. To return to our baseball analogy, the within group variability is the “luck”, while the true across group variation determines whether groups are truly “good” or “bad”. We want to estimate away the random within group variability and leave the true across group variability.
Hierarchical models estimate the appropriate amount of shrinkage by expressly modeling the within and across group variability.
Part of the inferential structure describes the likelihood, where each group’s data arises from a Binomial distribution. A higher level of the prior (the levels of this structure motivate the term “hierarchical”) describes how the groups are related. If the underlying parameters in the G groups are theta_1,theta_2,…,theta_G (often these are transformed, for example to logodds), then we often model
theta_1,theta_2,..,theta_G ~ N(eta,tau^2)
The thetas are the true parameters, and tau is the across group variance. To finalize the model within a Bayesian perspective, we place priors on eta and tau and estimate them on the basis of the data. The resulting inference provides estimates of the within group and across group variation, and provides shrinkage estimates appropriate for the ratio between them.
The details of the pros and cons of this design are longer than this blog post will allow (we may revisit it in another). Generally speaking, shrinkage estimators perform admirably and decrease required sample sizes whenever the true parameters approximately obey the N(eta,tau^2) assumption. As with any model, remember “all models are wrong, but some are useful” and, thus, exact normality is not required. Results are particularly strong in the joint null, where a drug has no effect in any group. Shrinkage estimates act as automatic multiplicity adjustments, reducing type I error. When there are consistent strong effects, the use of across group estimates greatly increases power within each group, allowing repeated promising results in multiple groups to be combined into a conclusive narrative of effectiveness. When a large range of true effects is present, the large across group variation essentially eliminates shrinkage in favor of separate estimates per group, with no gain or loss relative to standard analyses. The model above will perform poorly when the N(eta,tau^2) assumption clearly does not fit, which occurs in what we call “nugget” situations, where all groups are null, save for a single group where the drug is effective, or vice versa where the drug is effective in all but a single null group. In such situations, outlying groups are not modeled well by the across group normal distribution and performance suffers, either in terms of reduced power for a single effective group, or inflated type I error for a single effective group. Balancing the pros and cons relative to the expected performance of the groups is a key part of the design process involving any clinical trial with a hierarchical model. The design can be calibrated to properly balance these risks.
(Many thanks to Anna McGlothlin and Barbara Wendelberger for helpful comments)