
Statistical Power in AB Testing: Why Your Test Might Not Be Able to Find What You’re Looking For
A null result doesn’t mean the test failed. It might mean the test was never capable of finding out.
Most e-commerce teams have a clear process for calling a winning test. The significance threshold hits 95%, the variant gets released, the result gets reported. What most teams don’t have is an equally clear process for the question that should be asked before any of that: is this test actually capable of detecting a real effect?
That question is what statistical power answers. And the vast majority of e-commerce experimentation programmes never ask it. The result is a testing pipeline full of inconclusive results that teams interpret as ‘the intervention didn’t work’, when the honest answer is ‘the test couldn’t tell us either way’. Those are not the same thing. And confusing them is one of the most expensive mistakes a lean digital team can make.
Statistical Significance and Statistical Power Are Not the Same Thing
Statistical significance and statistical power are two sides of the same coin, but they protect against different types of error.
Significance protects you from false positives: seeing an effect that isn’t really there.
At a 95% significance threshold, you’re accepting a 5% chance of incorrectly declaring a winner when the variant made no real difference.
Statistical power protects you from false negatives: missing an effect that genuinely exists.
At 80% statistical power, if a real lift of the size the test was planned for is present, your test has an 80% chance of detecting it. The flip side: a one-in-five chance of running a test on something that works and concluding it doesn’t.
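To make the two error rates concrete, here is a minimal sketch that estimates power for a given sample size. It assumes a two-sided two-proportion z-test and the textbook normal approximation; the function name `approx_power` and the traffic figures are illustrative, not taken from any particular testing tool.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation and ignores the negligible chance of a
    significant result in the wrong direction.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_pooled = (p_control + p_variant) / 2
    # Standard error the test assumes (no difference between the arms)
    se_null = sqrt(2 * p_pooled * (1 - p_pooled) / n_per_variant)
    # Standard error when the lift is real
    se_alt = sqrt(
        (p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_variant
    )
    lift = abs(p_variant - p_control)
    return std_normal.cdf((lift - z_alpha * se_null) / se_alt)


# Hypothetical page: 3% baseline, variant truly converts at 3.8%
for n in (2_000, 8_000, 20_000):
    power = approx_power(0.03, 0.038, n)
    print(f"{n:>6} sessions per variant: power {power:.0%}, "
          f"chance of missing the lift {1 - power:.0%}")
```

The significance threshold (`alpha`) stays fixed at 5% in every row. Only the chance of missing a real lift changes with traffic, which is why a team can be strict about significance and still be blind to false losses.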
Most e-commerce teams think carefully about significance and never think about power at all. The consequence is a testing programme that is reasonably good at avoiding false wins and structurally blind to false losses. For a lean team where every test consumes meaningful time and traffic, false losses are not a theoretical problem. They are a direct cost.
What 80% Power Actually Means in Practice
The convention in AB testing is to run experiments at 80% power. This is not a law. It is an accepted industry standard: a pragmatic trade-off between the sample size required and the detection reliability delivered.
What it means concretely: if your variant genuinely improves conversion by 2%, and the test was sized for 80% power at that effect size, it will detect the improvement eight times out of ten. Two times out of ten, the same test on the same variant will return an inconclusive result, not because the effect isn’t real, but because random variation in that particular sample hid it.
This is directly analogous to a scientific experiment. A researcher running a clinical trial with an insufficient sample size doesn’t conclude the drug doesn’t work when the result is inconclusive; they conclude the study wasn’t powered to answer the question. E-commerce teams draw exactly the wrong conclusion from underpowered tests constantly, shelving interventions that might have been commercially valuable because the test said nothing rather than something.
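The "eight times out of ten" claim can be checked by replaying the same test many times. The sketch below is a simulation under stated assumptions: a hypothetical page with a 3% baseline and a variant that truly converts at 3.8%, roughly 8,055 sessions per variant, and each arm's observed conversion rate drawn from a normal approximation rather than from individual sessions. The function name is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_detection_rate(p_control, p_variant, n_per_variant,
                            alpha=0.05, runs=5_000, seed=1):
    """Share of simulated tests in which a real lift reaches significance."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Sampling noise of each arm's observed conversion rate (normal approximation)
    sd_control = sqrt(p_control * (1 - p_control) / n_per_variant)
    sd_variant = sqrt(p_variant * (1 - p_variant) / n_per_variant)
    detected = 0
    for _ in range(runs):
        obs_control = rng.gauss(p_control, sd_control)
        obs_variant = rng.gauss(p_variant, sd_variant)
        pooled = (obs_control + obs_variant) / 2
        se = sqrt(2 * pooled * (1 - pooled) / n_per_variant)
        if (obs_variant - obs_control) / se > z_alpha:
            detected += 1
    return detected / runs


rate = simulate_detection_rate(0.03, 0.038, 8_055)
print(f"Real lift detected in {rate:.0%} of simulated tests")
```

Every simulated test here runs on a variant that genuinely works. The runs that fail to reach significance are the false negatives the 80% convention accepts.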
The Minimum Detectable Effect: The Number You Should Calculate Before Every Test
Power is most practically expressed through a single output: the minimum detectable effect, or MDE. This is the smallest lift your test can reliably detect at your chosen power level, given your available traffic and test duration.
The MDE is determined by four inputs: your baseline conversion rate, your traffic volume per variant, your significance threshold, and your power target. You don’t need to understand the formula; free calculators handle the maths (Evan Miller’s Sample Size Calculator is the standard reference). What you do need to understand is what the output tells you.
If your MDE comes back at 0.8 percentage points and you’re testing on a page with a 3% baseline conversion rate, your test can detect a lift from 3% to 3.8% or above, a relative improvement of about 27%. Anything smaller than that is invisible to the test at your available traffic level. The question then becomes commercial: is a lift smaller than that on this page worth detecting? Small lifts still carry real money: if revenue scales in line with conversion, a 0.5% relative improvement on £5M revenue is £25,000. If the answer is yes, you need more traffic, a longer test, or a different page to test on.
If your MDE comes back at 3 percentage points on a page with a 3% baseline conversion rate, your test can only detect an effect that doubles your conversion rate. That is almost certainly not a realistic expectation from a UX change, which means the test is not worth running in its current form. Most teams discover this four weeks in, after the result comes back inconclusive. Calculating the MDE before the test starts takes five minutes and prevents exactly that outcome.
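For readers who would rather see the calculation than trust a calculator, here is a rough sketch of how an MDE can be derived from the four inputs. It assumes a two-sided two-proportion z-test with the normal approximation and searches for the smallest absolute lift by bisection; online calculators may use slightly different formulas, so treat the outputs as approximate. The function name and the sample sizes are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable at the given power, or None."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)

    def shortfall(lift):
        # Positive while this lift is still too small for the sample size
        variant = baseline + lift
        pooled = (baseline + variant) / 2
        needed = (
            z_alpha * sqrt(2 * pooled * (1 - pooled))
            + z_power * sqrt(baseline * (1 - baseline) + variant * (1 - variant))
        ) / sqrt(n_per_variant)
        return needed - lift

    low, high = 0.0, 1.0 - baseline
    if shortfall(high) > 0:
        return None  # not even a jump to 100% conversion would be detectable
    for _ in range(60):  # bisection: halve the search interval each step
        mid = (low + high) / 2
        if shortfall(mid) > 0:
            low = mid
        else:
            high = mid
    return high


# Hypothetical traffic levels on a page with a 3% baseline
for n in (750, 8_055, 50_000):
    mde = minimum_detectable_effect(0.03, n)
    print(f"{n:>6} sessions per variant: MDE {mde * 100:.2f} points "
          f"({mde / 0.03:.0%} relative)")
```

Printing the MDE both in percentage points and as a relative lift helps avoid the common mix-up between the two when judging whether the number is commercially realistic.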
Why Low Traffic Makes Power the Binding Constraint for Lean E-Commerce Teams
For enterprise e-commerce teams with millions of monthly sessions, power is rarely the problem. The traffic volumes involved mean that even small effects are detectable within sensible test windows. The MDE at that scale is often fractions of a percentage point, which means almost any real effect can be found.
For lean e-commerce teams with tens of thousands of monthly sessions, power is almost always the binding constraint. The MDE at lower traffic volumes is substantially higher, which means tests can only reliably detect effects large enough to be visible through the noise of a smaller dataset. The commercial implication is direct: lean teams need to be more selective about what they test, more deliberate about where they test it, and more honest about what their traffic can and cannot tell them.
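The gap between the two kinds of team is easy to put in numbers. The sketch below estimates how long a test would need to run to detect a lift from 3% to 3.3% at 80% power. The monthly session counts are hypothetical, the formula is the standard normal-approximation sample size for two proportions, and it assumes all of the page's traffic enters the test and is split evenly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sessions_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Sessions needed in each arm to detect a move from baseline to target."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    pooled = (baseline + target) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_power * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


def weeks_to_run(monthly_sessions, baseline, target, variants=2):
    """Rough test duration, assuming all page traffic enters the test."""
    total_needed = variants * sessions_per_variant(baseline, target)
    weekly_sessions = monthly_sessions * 12 / 52
    return total_needed / weekly_sessions


# Hypothetical lean team vs enterprise team, same page, same target lift
for monthly in (40_000, 4_000_000):
    weeks = weeks_to_run(monthly, baseline=0.03, target=0.033)
    print(f"{monthly:>9,} sessions/month: about {weeks:.1f} weeks")
```

The same test that finishes within a day at enterprise scale takes months on the smaller site, which is the practical meaning of power being the binding constraint.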
The worst outcome is not an inconclusive test. It is an inconclusive test that gets interpreted as a negative result, causing the team to abandon a genuinely valuable intervention and move on to the next item in a best-practice library. Underpowered testing doesn’t just waste time; it actively misdirects the programme.
The Connection Between Power and Significance: Why Underpowered Significant Results Are the Most Dangerous Ones
Here is the counterintuitive part of the power story that most teams never encounter: an underpowered test that reaches statistical significance is not a lucky outcome. It is a warning sign.
When a test is underpowered, the only results that cross the significance threshold tend to be ones where random noise inflated the observed effect beyond its true size. The test was unlikely to detect a real 2% lift, but it did manage to detect what looked like a 5% lift, because an unusual cluster of sessions happened to fall in the variant at the right moment. The result is significant. The effect is overstated. And the variant gets released with expectations that production will never meet.
This is the practical version of what statisticians call the winner’s curse: underpowered significant results systematically overestimate true effect sizes. A lean team that releases on the back of them will consistently find that production performance falls short of test performance, and may never connect that pattern to the power problem that caused it.
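The winner’s curse is easiest to believe once you watch it happen. The sketch below simulates a deliberately underpowered test many times and looks only at the runs that reached significance. All figures are hypothetical (a 3% baseline, a true lift of 0.2 percentage points, 5,000 sessions per variant), and each arm's observed rate is drawn from a normal approximation rather than from individual sessions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_winners(p_control, p_variant, n_per_variant,
                     alpha=0.05, runs=20_000, seed=7):
    """Observed lifts from simulated tests where the variant won significantly."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_control = sqrt(p_control * (1 - p_control) / n_per_variant)
    sd_variant = sqrt(p_variant * (1 - p_variant) / n_per_variant)
    winning_lifts = []
    for _ in range(runs):
        obs_control = rng.gauss(p_control, sd_control)
        obs_variant = rng.gauss(p_variant, sd_variant)
        pooled = (obs_control + obs_variant) / 2
        se = sqrt(2 * pooled * (1 - pooled) / n_per_variant)
        lift = obs_variant - obs_control
        if lift / se > z_alpha:  # variant declared a significant winner
            winning_lifts.append(lift)
    return winning_lifts


true_lift = 0.002  # 3.0% -> 3.2%, in absolute terms
runs = 20_000
winners = simulate_winners(0.03, 0.03 + true_lift, 5_000, runs=runs)
print(f"Share of tests declared winners: {len(winners) / runs:.1%}")
print(f"True lift: {true_lift * 100:.2f} points")
print(f"Average lift reported by winners: {mean(winners) * 100:.2f} points")
```

With so little traffic, the only way to cross the significance threshold is to observe a lift several times larger than the true one, so every reported win overstates what production will deliver.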
Statistical Power in AB Testing: Ask the Question Before the Test Runs
Statistical power is not a complex statistical concept. It is a practical question with a practical answer: given my traffic, my baseline, and my test duration, what is the smallest effect this test can reliably detect? If that number is commercially meaningful and realistic, the test is worth running. If it isn’t, running it anyway produces noise, not knowledge.
The discipline is straightforward: calculate the MDE before committing to a test. If the MDE is too large to be useful, adjust the inputs: test on a higher-traffic page, extend the duration, or focus on a metric with lower variance. If none of those adjustments are available, don’t run the test.
A null result from an adequately powered test is genuinely informative: it tells you the intervention didn’t work at the magnitude you expected. A null result from an underpowered test tells you nothing at all. Knowing the difference between those two outcomes is what separates an experimentation programme that accumulates commercial knowledge from one that accumulates inconclusive results and calls it learning.
