Blog by: Kert Viele
If you see a significant result in a manuscript, should you assume the result is biased? We often see discussions of biases in complex, potentially adaptive trials. Early stopping trials, for example, have been criticized for producing biased estimates of treatment effect.
There are potential bias issues in early stopping, but it’s worth spending some time on biases with the same underlying cause in fixed trials. Suppose someone enrolls 200 participants at random in a simple, non-adaptive, single-arm trial testing whether the proportion of responders is greater than 50%. I’m testing the null hypothesis that the true response rate p is at most 0.5 against the alternative that p is greater than 0.5.
If I see a trial publication with a successful result, should I take the result at face value, or worry the result is biased?
We are often taught simple trials produce unbiased results, and that is true if we consider all outcomes. If the true rate in the trial is 50%, then I might get any observed rate from 0/200 to 200/200 (most likely something closer to 50%). If I average all the possible outcomes, I will get 50%. This happens for any true rate, so we refer to the observed rate as an unbiased estimator.
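As a quick illustration, here is a minimal sketch in Python (using scipy, under the 200-participant setup above, and not part of the original post) showing that averaging the observed rate over every possible outcome recovers the true rate:

```python
# Sketch: with no selection, the observed response rate is unbiased.
# Average the rate k/200 over every possible outcome k of a binomial(200, 0.5).
from scipy.stats import binom

n, p = 200, 0.50
mean_rate = sum(k * binom.pmf(k, n, p) for k in range(n + 1)) / n
print(round(mean_rate, 6))  # 0.5, equal to the true rate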
However, most medical journals do not publish every result; they preferentially publish successful results. The single-arm trial above is successful (claims efficacy) if we observe 115/200 = 57.5% or more responders. Thus, a published result is typically one of these successful trials, not a draw from every possible outcome.
If we focus only on successful trials, the trials we are likely to see in the literature, we introduce bias. Suppose again the true rate is 50%. If I restrict attention to the successful results (115/200 or better), their average is not 50%, but 58.5% (all of them are at least 57.5%, after all).
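Here is a minimal sketch of that conditional average (Python with scipy), using the 115/200 success rule and a true rate of 50%:

```python
# Sketch: average observed rate among successful trials (>= 115/200 responders)
# when the true response rate is 50%.
from scipy.stats import binom

n, cutoff, p = 200, 115, 0.50
ks = range(cutoff, n + 1)                        # successful outcomes only
pmf = [binom.pmf(k, n, p) for k in ks]
p_success = sum(pmf)                             # about 0.02
cond_mean = sum(k * w for k, w in zip(ks, pmf)) / p_success / n
print(round(cond_mean, 3))                       # about 0.585
```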
This bias depends on the true rate. If the true rate is 60%, more of the outcomes are at or above 115/200. We are still cherry-picking the highest results, but less so than when the true rate is 50%. The average successful trial has an observed rate of 61.3% (bias 1.3%). If the true rate is 70%, virtually all observed results are successful, and the bias is <0.001%.
We can plot the bias as a function of the true rate.
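A sketch of that curve (same setup, sweeping the true rate over a few values):

```python
# Sketch: selection bias E[observed rate | success] - true rate as a function
# of the true rate, under the 115/200 success rule.
import numpy as np
from scipy.stats import binom

n, cutoff = 200, 115
ks = np.arange(cutoff, n + 1)

def bias(p):
    pmf = binom.pmf(ks, n, p)
    cond_mean = (ks * pmf).sum() / pmf.sum() / n
    return cond_mean - p

for p in [0.50, 0.55, 0.60, 0.65, 0.70]:
    print(f"true rate {p:.2f}: bias among successful trials {100 * bias(p):.2f}%")
# about 8.5% at a true rate of 50%, 1.3% at 60%, and essentially 0 by 70%
```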
True rates near or above our threshold of 57.5% have small biases compared to true rates less than 57.5%. Small true rates can have immense biases among successful trials.
Of course, we do not know the true rate, and thus we don’t know the bias. Unlike some standard problems in statistics where the bias is a known proportion of the quantity being estimated (one of the reasons we divide by n-1 in the sample variance rather than n), here the bias varies widely as the true rate varies.
How much you should worry about biased estimates depends on your beliefs about the true rate (yes, this is a Bayesian take on a frequentist issue). Clearly, if you are quite sure the true rate is between 40% and 50%, you should expect successful trials to be strongly biased. If you think the true rate is between 70% and 80%, you won’t worry about bias at all.
Of course, given this trial ran, we likely care most about the range from 50% to 70% (the trial is just shy of 80% power for p=0.6). Suppose I thought each value in that range was equally likely, producing biases between 8.5% and essentially 0%.
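That power figure is easy to check under the same success rule:

```python
# Sketch: power of the 115/200 rule when the true response rate is 60%.
from scipy.stats import binom

power = 1 - binom.cdf(114, 200, 0.60)   # P(at least 115 responders)
print(round(power, 3))                  # just shy of 0.80
```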
What is my “average” bias? Here we do not average the bias curve equally over the 50-70% range. While we believe each true rate is equally likely, a true rate of 70% is virtually guaranteed to generate a successful trial, while 50% has only about a 2% chance of doing so. Thus, when we focus on successful trials, they tend to come from the higher true rates, with the lower biases. If we average over the successful trials, the average bias is 0.93%. Not zero, but probably not meaningful in terms of drug development.
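A sketch of that weighted average (Python with scipy), putting a uniform prior on the true rate and weighting each value by its chance of producing a successful trial:

```python
# Sketch: average bias among successful trials when the true rate is uniform
# over [lo, hi]; each true rate is weighted by its probability of producing a
# successful trial, since only successful trials enter the average.
import numpy as np
from scipy.stats import binom

n, cutoff = 200, 115
ks = np.arange(cutoff, n + 1)

def avg_bias(lo, hi, grid=501):
    ps = np.linspace(lo, hi, grid)
    p_success = np.array([binom.pmf(ks, n, p).sum() for p in ps])
    cond_means = np.array([(ks * binom.pmf(ks, n, p)).sum() for p in ps]) / p_success / n
    return np.average(cond_means - ps, weights=p_success)

print(round(avg_bias(0.50, 0.70), 4))   # roughly 0.009, i.e. just under 1%
```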
Clearly, if I were more certain the true rate was (uniformly) between 50% and 55%, my average bias would increase. Successful trials require observed rates of at least 57.5%, which are implausible under this assumption. The average bias is 5.6%. If you told me the trial was successful, I’d tell you the estimate is likely too high.
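Rerunning the avg_bias sketch above with the narrower range shows the larger bias:

```python
print(round(avg_bias(0.50, 0.55), 4))   # roughly 0.056, i.e. about 5-6%
```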
All this discussion was for a simple, fixed trial. What about early stopping? The bias there is qualitatively no different (successful trials ending at implausible values), but it arises more often with early stopping because the early interims often require more extreme effects, potentially beyond the plausible scientific values for the parameter. This generates bias at those early stops, particularly the first one. Ironically, the later interims are somewhat “protected”: a random high in the data would often have triggered a stop at the first interim, so it rarely survives to inflate the estimate at a later one.
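To make the early-stopping version concrete, here is a small Monte Carlo sketch with hypothetical interims at 50, 100, and 200 participants and illustrative success thresholds of 65%, 60%, and 57.5% observed responders (these thresholds are invented for illustration, not taken from any real design). Trials that stop at the first, most demanding interim show the largest overestimates; later stops are comparatively protected.

```python
# Sketch: bias by stopping point in a hypothetical group-sequential single-arm
# trial. Interims at n = 50, 100, 200 with illustrative success thresholds of
# 65%, 60%, and 57.5% observed responders; true response rate 55%.
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.55
looks = [(50, 0.65), (100, 0.60), (200, 0.575)]

estimates = {n: [] for n, _ in looks}
for _ in range(100_000):
    responses = rng.random(200) < true_rate       # one simulated trial
    for n, threshold in looks:
        rate = responses[:n].mean()
        if rate >= threshold:                     # stop for success at this look
            estimates[n].append(rate)
            break
    # trials that never cross a threshold are unsuccessful and are not recorded

for n, _ in looks:
    est = np.array(estimates[n])
    print(f"stopped at n={n}: mean estimate {est.mean():.3f} vs true rate {true_rate}")
# the first, most demanding look shows the largest overestimate
```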
My personal opinion is that you shouldn’t have stopping rules that allow stopping at implausible values. This typically requires the first interim to be late enough that the stopping threshold is plausible. However, this does not mean you need to avoid early stopping altogether. Many trials have good reason to consider strong results plausible. The COVID-19 vaccine trials are an example. The earliest stops in those trials required around 80% vaccine efficacy, but that was plausible given the pre-clinical results. The trials were powered for much smaller effects, but there was no reason to be concerned about a 90% result, which indeed happened and was confirmed in other work.
In summary, to say “early stopping is biased” paints with too broad a brush. All statistical methods should be matched to the substantive context. There is such a thing as “stopping too early”, where the required observed values are too extreme for the context of the trial. There, we expect large biases and difficulty interpreting the trial. There are also stops which are plausible, for which we expect small biases, and which allow better stewardship of our limited clinical trial resources.