Why "Successful" Drug Trials Fail
Preclinical Research Matters, and P-values Are Only Part of the Equation
Like it or not, you can’t escape the p-value. It’s a crude, imperfect framework that necessarily involves fitting the square peg of a binary and discrete outcome into the round hole of complex and continuous problem states. But in a world where we need answers to critical questions like “should we allow doctors to put this chemical into sick people or not?” it shines a bit of light on otherwise dark, uncharted terrain. The upshot? If you’ve taken a medication in your lifetime, you’ve probably done so based on evidence from some incarnation of a p-value calculation (at a threshold of 0.05), which in turn informed regulators that there was less than about a five percent chance that the medication isn’t actually doing anything and that random fluctuations tricked us into observing that it did. Or, well, not exactly—it’s a bit more technical than that, in the sense that what we’re really doing is assuming a null hypothesis (i.e. that the drug has no effect), choosing a model, and finding the probability of observing a result at least as extreme as the one we saw, assuming the null hypothesis is true.
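To make that concrete, here’s a minimal sketch of the kind of calculation hiding behind the threshold. Everything about this toy trial (the effect size, the noise level, the sample sizes, the choice of a two-sample t-test) is an assumption made purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical trial: change in some symptom score, placebo arm vs. drug arm.
# The effect size, noise level, and sample sizes are made up for illustration.
placebo = rng.normal(loc=0.0, scale=1.0, size=200)
drug = rng.normal(loc=0.3, scale=1.0, size=200)

# Assume the null hypothesis (no difference between arms), pick a model (here,
# a two-sample t-test), and ask how probable a difference at least this large
# would be if the null were true.
result = stats.ttest_ind(drug, placebo)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")  # "significant" if p < 0.05
```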
The difference between rejecting the null hypothesis and finding the probability that a treatment effect is real may feel imaginary or pedantic, but it’s actually a critical distinction, and the gap between those two quantities can be enormous. And since we’re making decisions about health and safety that affect large swaths of people, we’d better be getting these things right. One critical element in understanding this difference is base rate neglect, which utterly confounds our interpretation of p-values when we look at them in a vacuum. There’s a commonly invoked example of base rate neglect that is actually extremely informative in this context, and it’s stated something like this:
Imagine we’ve developed a test for a disease that has a 5% false positive rate (and a zero false negative rate). You take the test and find out that you’ve tested positive—what’s the probability that you have the disease?
Seems pretty simple, right? After all, when the test was applied to, say, millions of petri dishes lacking the disease, it only told us erroneously that the disease was present 5% of the time, and it never missed a single petri dish where the disease was actually there! So then, should we be 95% sure that we’ve got the disease? Not even close (probably).
We’re missing a critical piece of information—the base rate of the disease, also sometimes referred to as the disease prevalence, or basically just how likely any given person is to have the disease in the first place. For, say, a common cold that’s moving through the population like wildfire and has 40% of everyone afflicted? In that case we’d be about 93% confident that we had it. But for a rare disease that only occurs in about one in every 100,000 people? An observed positive test in that case would be a false positive about 99.98% of the time! We really couldn’t be sure at all—it’s still far more likely that we actually don’t have the disease. So why is it so tempting to think of this test as “95% accurate” when in reality it could be anywhere from 95% to, well, 0% (in the case where, say, we’re testing people for a disease that can’t even infect humans)?
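We’ll formalize this with Bayes’ rule in a moment, but you can already check those two numbers just by counting, assuming the 5% false positive rate and perfect sensitivity described above:

```python
def share_of_positives_with_disease(prevalence, population=1_000_000):
    # Everyone with the disease tests positive (we assumed no false negatives)...
    true_positives = prevalence * population
    # ...and 5% of everyone without it tests positive anyway.
    false_positives = 0.05 * (1 - prevalence) * population
    return true_positives / (true_positives + false_positives)

print(share_of_positives_with_disease(prevalence=0.40))         # ~0.93
print(share_of_positives_with_disease(prevalence=1 / 100_000))  # ~0.0002, so ~99.98% of positives are false
```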
Unveiling the Base Rate
Think for a moment about how we might design an experiment to determine the “accuracy” of a test. I mentioned applying it to petri dishes, so let’s stick with that idea. In this thought experiment, we’ll assume that we have perfect information about what’s in the petri dish—either it has the target pathogen or it doesn’t. So if we want to get the true positive (and false negative) rate, we just have to apply the test to a lot of petri dishes with the pathogen; for the false positive (and true negative) rate, do the same thing but in petri dishes that don’t have it. In reality it will be more complex, of course (maybe a petri dish had the pathogen but only trace amounts that couldn’t be detected; or dishes got swapped; or maybe the test is sensitive to a different pathogen as well), but we’re just interested in the simplified case where things are nice and tidy. What do we notice about this setup? Well, the base rate is known exactly in each case: 100% of the dishes in the first batch have the pathogen, and 0% in the second. We’ve sort of implicitly controlled for it in the experiment design without really realizing that that’s what we’re doing. Why would we muddy our measurement of the true positive rate by mixing in some petri dishes that don’t have the pathogen (or vice versa)? Of course we wouldn’t! It makes no sense to do that.
Problem is, in the giant petri dish that is our messy world we generally don’t get to control for the base rate. When we apply the test to real people with the intent that it be used as a diagnostic, we’re basically pulling a human petri dish out of a mixture of pathogenic and non-pathogenic populations. If we knew which population the person was coming from, then we wouldn’t need the diagnostic!
So yes, something could reasonably be called “95% accurate” and yet provide little additional clarity about the probability of the very event that it intends to measure. It’s a subtle distinction, but it’s clearer when we think about it in terms of Bayes’ rule:

P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|¬A)P(¬A)]
Pretty clear! Ok, maybe not so much, so let’s go over what this equation means. First, A and B are simply two probabilistic events that we can define as we please. Here, we’ll say that A represents “has the disease” while B represents “tested positive.” P(A) is the probability of event A, and P(A|B) is the conditional probability of event A given that event B has occurred. So P(A|B) in plain English would be read as “the probability that we have the disease given that we’ve tested positive,” which is the thing that we actually want to know (assuming that we tested positive). The funny looking “¬” symbol is a negation, so ¬A is basically just the opposite of the event defined as A (and its probability is P(¬A) = 1 - P(A)). We expand the equation out to the version on the far right because that’s the one where we have all of the necessary information. It’s really just three terms:
P(A|B) is the probability that we have the disease given a positive test.
P(B|A)P(A) is the probability of getting a positive test given that we have the disease, multiplied by the base probability that we have the disease.
P(B|¬A)P(¬A) is the probability of getting a positive test given that we don’t have the disease, multiplied by the base probability that we don’t have the disease.
Circling back to our petri dish thought experiment, P(B|A) is the true positive probability and P(B|¬A) is the false positive probability (you know, the ones we measured very precisely when we had perfectly distinct populations of petri dishes). But notice how each one is being modified by its respective base probability—P(A) and P(¬A) are the base rates, or essentially the probability that we have the disease (or not) absent any qualifying information. They must sum to 1, which means that the more weight P(A) carries, the more meaningful (to us, at least) the outcome of the test becomes; which is to say, the more likely it is that an observed positive is a true positive.
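Written as code, that right-hand side is just those three ingredients combined; this is a generic sketch of the formula rather than anything specific to our imaginary test:

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Bayes' rule: P(A|B) from the true positive rate, the base rate, and the false positive rate."""
    numerator = p_b_given_a * p_a
    return numerator / (numerator + p_b_given_not_a * (1 - p_a))

# Same test, two different base rates -- reproducing the numbers from earlier:
print(posterior(p_b_given_a=1.0, p_a=0.40, p_b_given_not_a=0.05))  # ~0.93
print(posterior(p_b_given_a=1.0, p_a=1e-5, p_b_given_not_a=0.05))  # ~0.0002
```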
As a quick aside, all of this same logic applies to a negative test as well, where we just flip the events around and call A “does not have the disease” and B “tested negative.” We made the negative test considerations disappear by saying that we never get a false negative, which is of course ridiculous, but in the parts that follow we’re also a lot less concerned with false negatives anyway. I think it will become clear why, but if you’re interested just note that you could read this entire article again inverting all positive and negative outcomes/rates and it should still make sense.
Preclinical Research Matters
So what does this have to do with drug development? Well, let’s define A and B a little differently. Let’s say that A is the event that the drug improves the target condition, while B is the event that the clinical trial’s primary endpoint(s) are met. Again, reality is a bit more messy, but as a simple model this maps pretty well—the clinical trial looks like a diagnostic for a drug candidate, and the drug either improves a condition or doesn’t. And what we really want to know is P(A|B), or the probability that the drug improves the condition given a positive clinical trial outcome. Great. So where does our magic p < 0.05 fit into all of this?
Unfortunately, finding that the primary endpoint was met at p < 0.05 is definitely not the same thing as P(A|B) > 0.95, which I think should be pretty clear to everyone familiar with clinical trials. No matter how you slice it, succeeding on primary endpoints in any phase of a clinical trial confers nowhere near a 95% chance of approval. Now, FDA approval is not really the proper surrogate for P(A|B)—there are drugs that are approved but turn out not to have the apparent benefit they showed during trials—but getting information about approved drugs which were later withdrawn by the FDA specifically for efficacy reasons is not trivial. Withdrawals usually cite safety or registration issues, which can mask underlying efficacy problems, since safety is monitored far more closely after approval than efficacy is. Perhaps a better stand-in would be the probability of meeting endpoints in Phase 3 after meeting them in Phase 2, which tends to be estimated somewhere between 30% and 50%. In any case, it’s much lower than 95%.
What does achieving p < 0.05 on the trial’s primary endpoint actually represent in this context, then? It’s kind of like an observation of a positive test result, which is just defined here as event B. Measuring a pre-defined statistical test against a threshold (in general, 0.05) works to turn the endpoint into a binary event, which is helpful when we need to make a binary decision (approve or don’t approve; publish or don’t publish) but also a source of much frustration among scientists and statisticians—it doesn’t make much mathematical sense to try and discretize continuous statistical tests into binary outcomes. Nevertheless, that’s how it goes, and in some sense you can interpret the p-value as a rough proxy for the false positive rate of the diagnostic. The FDA is sort of saying “show us a positive test result which, under the chosen diagnostic, has less than a 5% false positive rate.”
Now comes the big reveal—we still need to identify a base rate! We’ve done all of this work, designed a diagnostic for our clinical trial, found that it met our arbitrary-but-important threshold and yet we’re still missing that critical piece of information that could pretty much make it all for naught. If our chances of finding a meaningful drug were only one in a million from the start, then we should hardly muster even a casual shrug at a clinical trial that showed, say, p = 0.01 on its endpoint. It’s a false positive, almost certainly.
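To put a rough number on that shrug, here’s the same shape of calculation with the trial playing the role of the diagnostic. The one-in-a-million prior and the assumed 80% power are illustrative guesses, and treating the p-value as the false positive rate is the simplification described above:

```python
def p_drug_works_given_positive_trial(prior, false_positive_rate, power=0.8):
    """P(drug improves the condition | trial met its endpoint), via Bayes' rule."""
    true_pos = power * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

# A one-in-a-million prior, an endpoint met at p = 0.01, and an assumed 80% power:
print(p_drug_works_given_positive_trial(prior=1e-6, false_positive_rate=0.01))
# ~8e-5 -- the "successful" trial is still almost certainly a false positive.
```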
So what frees us from the clutches of the smothering base rate? Preclinical research does.
Bayes, Rinse, Repeat
A key advantage of the Bayesian framework is that it gives us a method for updating our beliefs with more evidence. What I’ve spent most of this article describing is one step in the process of Bayesian inference, where we are calculating an updated probability after taking into account some new information—in this case a positive test with an approximate false positive rate (the p-value). When we undertake this process of updating our beliefs, we’re always fighting against the base rate. So if we have a new drug that we think treats, say, Alzheimer’s disease, then what should we take as our base rate? Well, out of all possible synthesizable compounds, what percentage do we think improves the condition of Alzheimer’s disease? I think we can all agree that it’s way, way less than 1%, and probably still a whole lot less than 0.01%. One in a million? If we’re lucky. The exact answer is unknowable, but for any drug it’s going to be incredibly tiny.
Ok, but if the base rate for all of these diseases is so small, how are we able to discover any drugs at all? Research, and lots of it. The unfathomable gap between picking a random molecule out of a hat and finding that it helps improve a condition is traversed by a series of powerful Bayesian updates. It’s these updates, generated by solid preclinical research, that take us from the abyss to the starting line of the clinical trial marathon. Identifying biological targets; verifying that the drug binds to said targets and modulates them; measuring physiological changes in laboratory animals; doing boring old Western blots and tissue analysis and the like—these are all absolutely necessary prerequisites for holding any strong belief that what we’re seeing from a clinical trial is meaningful. Take them away, and all that’s left is modern alchemy.
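As a cartoon of what that series of updates can do, imagine each stage of preclinical evidence as a likelihood ratio multiplying the prior odds. Every number below (the starting prior and each stage’s likelihood ratio) is invented purely for illustration:

```python
def update_odds(prior_prob, likelihood_ratios):
    """Apply a series of Bayesian updates in odds form; return the posterior probability."""
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Invented likelihood ratios for successive preclinical results: a validated
# target, confirmed binding/modulation, animal efficacy, supporting assays.
stages = [50, 20, 10, 5]

posterior = update_odds(prior_prob=1e-6, likelihood_ratios=stages)
print(f"1-in-a-million prior -> roughly {posterior:.3f} after preclinical evidence")
# ~0.048: still a long shot, but no longer indistinguishable from noise going
# into the clinical trial.
```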