I remarked in my previous article about Anavex Life Sciences that sowing confusion might have been “exactly the point” of their press release and subsequent conference presentation about the results of their latest trial in Alzheimer’s disease. And, well, there has definitely been ample confusion! This confusion was not eased when the CEO claimed that “statistics don’t work like that” (they do), nor by the conference call, where questions about the statistics went either unaddressed or unasked, despite a sell-side “analyst” declaring that “simplification of statistical calculations circulating on social media [are] inaccurate” (we’re still waiting for the analyst or Anavex to show us what’s inaccurate about them). I understand the frustration from investors in Anavex—who no doubt aren’t big fans of mine—but as I’ve cautioned with peer company Cassava Sciences, the company is the one with all of the data and the means to clarify or push back against legitimate questions and criticism.
In the meantime, I’d like to do what I can to clear up some sources of confusion. And look, I’m not infallible—there’s a perfectly reasonable chance that I’ve made a mistake or two in my analyses. But I’m not aware of one yet, and I’m trying to support as much of what I write as possible with clear-headed statistics and references to FDA guidance documents. There are also plenty of gray areas, which means that we have to get into the heads of potential FDA decision makers a bit, and use our judgment about what the proper statistical tests and techniques look like. So there’s room to agree to disagree, though I’d argue not a whole lot in this case. As always, everything I write here is in good faith with the goal of cutting through the noise and determining the reality of the situation. So without further ado, here are some items that I think could use some clarification.
The Endpoints
The source of confusion around the reporting of the primary endpoints arises directly from Anavex’s slide deck, the relevant exhibits being slides 19, 20, and 21. Before we turn to those, however, let’s take a look at what the actual filed endpoints are:
There are two co-primary endpoints, each measuring the relative rate of decline on one of two standard Alzheimer’s disease assessments. Let’s be clear about what these endpoints are not: they are not a responder analysis, where patients who do not “respond” to treatment are allowed to be dropped. They are a comparison of the score distributions of two tests, given at separate points in time for every patient. Yes, some patients will need to be dropped for various legitimate reasons, but arbitrary “responder” cutoffs are not one of those reasons. Slides 19 and 20, however, report responder analyses for ADAS-Cog and ADCS-ADL, respectively: they drop patients based on a cutoff and then compute an odds ratio, which is not how you compare a “reduction in decline” between two groups (the sketch below illustrates the difference). Slide 21, on the other hand, is the proper analysis of the ADAS-Cog endpoint that adheres to the trial design. There is no companion analysis for ADCS-ADL. Thus, they have reported only one of their two pre-specified primary endpoints.
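To make that distinction concrete, here is a minimal sketch of the two approaches. Every number in it is a placeholder (the group sizes, means, spreads, and cutoff are invented for illustration, not taken from Anavex), and the responder analysis shown is a generic dichotomize-at-a-cutoff version rather than a reconstruction of whatever was done on the slides.

```python
# Hypothetical illustration only: none of these numbers are Anavex's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Change-from-baseline scores, where a larger change = more decline.
placebo = rng.normal(loc=4.0, scale=6.0, size=150)
treated = rng.normal(loc=2.5, scale=6.0, size=150)

# The filed endpoint: compare the distributions of score changes directly.
t_stat, p_change = stats.ttest_ind(treated, placebo)
print(f"comparison of changes: t = {t_stat:.2f}, p = {p_change:.4f}")

# A responder analysis: dichotomize at an arbitrary cutoff, then test an odds ratio.
cutoff = 3.0
table = [[np.sum(treated < cutoff), np.sum(treated >= cutoff)],
         [np.sum(placebo < cutoff), np.sum(placebo >= cutoff)]]
odds_ratio, p_responder = stats.fisher_exact(table)
print(f"responder analysis: OR = {odds_ratio:.2f}, p = {p_responder:.4f}")
```

The two analyses answer different questions: the first compares how much each group declined on average, while the second throws away the size of every patient’s change in favor of a yes/no label at a threshold nobody pre-specified.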
Multiplicity
I think that there is some general understanding of the issues that arise from the multiple comparisons problem, also referred to simply as “multiplicity.” The FDA guidance is pretty clear about this one, noting that “the overall Type I error rate in favor of the drug nearly doubles when two independent endpoints are tested.” That doubling of the Type I error rate will not be acceptable to the FDA and requires correction. Keeping in mind that the second endpoint was not reported, a correction is still required for the ADAS-Cog endpoint. If we had any notions about the responder analysis helping out Anavex’s case, the FDA dispels them in the document:
Even when a single outcome variable is being assessed, if multiple facets of that outcome are analyzed (e.g., multiple dose groups, multiple time points, or multiple subject subgroups based on demographic or other characteristics) and if any one of the analyses is used to conclude that the drug has been shown to produce a beneficial effect, the multiplicity of analyses may cause inflation of the Type I error rate.
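To put a number on that “nearly doubles” claim: with two independent endpoints each tested at α = 0.05, where success on either one is counted as a win, the chance of at least one false positive is

```python
# Familywise Type I error when either of two independent endpoints can "win."
alpha = 0.05
print(1 - (1 - alpha) ** 2)  # 0.0975, nearly double the nominal 0.05
```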
Basically, the guidance is warning that a sponsor could just keep identifying “responder cutoffs” and trying again and again until it finds that magic cutoff that achieves statistical significance (the toy simulation at the end of this section shows just how easy that is). That won’t pass muster with the FDA, which is sort of the whole point of pre-specified endpoints in the first place. Again, looking at the document:
For controlling multiplicity, an important principle is to first prospectively specify all planned endpoints, time points, analysis populations, doses, and analyses; then, once these factors are specified, appropriate adjustments for multiple endpoints and analyses can be selected, prespecified, and applied, as appropriate.
To summarize, the FDA guidance states that you (1) clearly state your endpoints, (2) report them as stated, and (3) correct for multiplicity. Anavex did not.
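And here is that toy simulation of cutoff shopping. It is a sketch under assumed conditions, not a model of this trial: both arms are drawn from the same distribution (a drug with no effect at all), and a battery of responder cutoffs is tried on each simulated trial until one clears p < 0.05.

```python
# Toy simulation: with no treatment effect at all, shopping across responder
# cutoffs until one "works" inflates the false-positive rate past the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, hits = 2000, 0
for _ in range(n_trials):
    placebo = rng.normal(size=150)  # both arms drawn from the same
    treated = rng.normal(size=150)  # distribution: the "drug" does nothing
    for cutoff in np.linspace(-1.5, 1.5, 13):  # shop across 13 cutoffs
        table = [[np.sum(treated < cutoff), np.sum(treated >= cutoff)],
                 [np.sum(placebo < cutoff), np.sum(placebo >= cutoff)]]
        if stats.fisher_exact(table)[1] < 0.05:  # declare "significance"
            hits += 1
            break
print(f"false-positive rate with cutoff shopping: {hits / n_trials:.3f}")
```

Even though tests at neighboring cutoffs are highly correlated, the familywise false-positive rate comes out well above the nominal 5%, which is exactly the inflation the guidance warns about.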
The Number of Tails
An astute observer noticed that Anavex reported on slide 26 (AEs) a population of 161 in their placebo group and 301 in their treatment group. Good catch! I’ll assume, as this person did, that these are the actual populations reported in the ADAS-Cog endpoint. They also noticed that the tool I used to run the t-test has a selection box where you can choose your hypothesis: d = 0, d ≤ 0, or d ≥ 0. Here’s the test, for reference:
That hypothesis selection determines whether the t-test is one-tailed or two-tailed. A hypothesis of d = 0 yields a two-tailed test, while the other two options yield one-tailed tests (note the shading of the orange distribution, which represents the statistically significant region). There is a good explainer for what I’m talking about from the UCLA statistical consulting center, along with some basics on when and why to select a one-tailed versus a two-tailed t-test. The FDA has the following to say about thresholds for statistical significance in t-tests:
The most widely used values for α are 0.05 for two-sided tests and 0.025 for one-sided tests. In the case of two-sided tests, an α of 0.05 means that the probability of falsely concluding that the drug differs from the control in either direction (benefit or harm) when no difference exists is no more than 5%, or 1 chance in 20. In the case of one-sided tests, an α of 0.025 means that the probability of falsely concluding a beneficial effect of the drug when none exists is no more than 2.5%, or 1 chance in 40.
There are exceptions, but generally speaking the FDA is going to want to see p < 0.025 for a one-sided t-test. The European Medicines Agency echoes this in its guidance:
In order to demonstrate non-inferiority, the recommended approach is to pre-specify a margin of noninferiority in the protocol. After study completion, a two-sided 95% confidence interval (or one-sided 97.5% interval) for the true difference between the two agents will be constructed.
Of course, Anavex has yet to confirm that a one-sided test is in fact how they arrived at their conclusion of a statistically significant treatment effect, but I’ve yet to see an alternative explanation. And as I said last time, you can punch the trial numbers into a Tukey HSD test if you want to sidestep these issues (that test has only one tail by definition), and the result still falls short of statistical significance. And that’s before the multiplicity correction!
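For anyone who wants to see the one- versus two-tailed relationship directly, here is a sketch using summary statistics. Only the group sizes (161 placebo, 301 treated) come from the slides; the means and standard deviations below are placeholders, not Anavex’s reported values.

```python
# One- vs two-tailed p-values from summary statistics.
# Group sizes are from slide 26; the means and SDs are placeholders.
from scipy import stats

two_sided = stats.ttest_ind_from_stats(mean1=2.5, std1=6.0, nobs1=301,  # treated
                                       mean2=4.0, std2=6.0, nobs2=161,  # placebo
                                       alternative="two-sided")
one_sided = stats.ttest_ind_from_stats(mean1=2.5, std1=6.0, nobs1=301,
                                       mean2=4.0, std2=6.0, nobs2=161,
                                       alternative="less")  # H1: treated declined less
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

When the observed difference lands in the hypothesized direction, the one-sided p-value is exactly half the two-sided one. That halving is the entire game here, and it is why the FDA pairs one-sided tests with the stricter α of 0.025.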
Closing Thoughts
A certain level of confusion around trial results is always expected—by definition we’re dealing with noise and uncertainty—but it’s always unfortunate when sources of uncertainty that could be removed are instead left to fester. To be clear, these are not difficult questions to answer. Was the t-test one-tailed or two-tailed? How many patients were included? What happened to the other endpoint? As the age-old adage goes, the data you get to see is the data that looks best for the sponsor. There are rarely reasons to withhold data that is positive, just as there are rarely reasons to change endpoints that the sponsor is confident it can meet. I haven’t been following Anavex for long, but my impression is that these are recurring patterns in the company’s trials. As always, be vigilant, and ask for the data.
Thanks.