Statistical Power in AB Testing: Why Your Test Might Not Be Able to Find What You’re Looking For

May 23, 2026

A null result doesn’t mean the test failed. It might mean the test was never capable of finding out.

Most experimentation teams have a clear process for calling a winning test. The significance threshold hits 95%, the variant gets released, the result gets reported.

What most teams don’t have is an equally clear process for the question that should be asked before any of that: is this test actually capable of detecting a real effect?

That question is what statistical power answers. And the vast majority of e-commerce experimentation programmes never ask it. The result is a testing pipeline full of inconclusive results that teams interpret as ‘the variation didn’t work’, when the honest answer is ‘the test couldn’t tell us either way’.

Those are not the same thing. And confusing them is one of the most expensive mistakes a lean digital team can make.

Statistical Significance and Statistical Power Are Not the Same Thing

Statistical significance and statistical power are two sides of the same coin, but they protect against different types of error.

Significance protects you from false positives: seeing an effect that isn’t really there. At 95% confidence, you’re accepting a 5% chance of incorrectly declaring a winner when the variant made no real difference.

Statistical power protects you from false negatives: missing an effect that genuinely exists. At 80% power, if a real lift of the size the test was designed for is present, your test has an 80% chance of detecting it. The flip side: a one in five chance of running a test on something that works and concluding it doesn’t.

Most e-commerce teams think carefully about significance and never think about power at all. The consequence is a testing programme that is reasonably good at avoiding false wins and structurally blind to false losses. For an experimentation team where every test consumes meaningful time and traffic, false losses are not a theoretical problem.

They are a direct cost.

What 80% Power Actually Means in Practice

The convention in AB testing is to run experiments at 80% power. This is an accepted industry standard, a pragmatic trade-off between the sample size required and the detection reliability delivered.

What it means is that if a test is sized for 80% power to detect a 2% improvement in conversion, and your variant genuinely delivers that improvement, the test will detect it about eight times out of ten. Two times out of ten, the same test on the same variant will return an inconclusive result, not because the effect isn’t real, but because the statistical conditions weren’t sufficient to surface it.
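The "eight times out of ten" behaviour can be seen by simulating the same test many times. The sketch below is illustrative only: the 3.0% baseline, the 3.5% true variant rate and the 19,700 visitors per variant are assumed figures, not taken from a real test, and conversion counts are drawn from a normal approximation to the binomial to keep it fast.

```python
import math
import random
from statistics import NormalDist


def simulate_detection_rate(p_control, p_variant, n_per_variant,
                            alpha=0.05, runs=5000, seed=1):
    """Share of simulated A/B tests that reach two-sided significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    def observed_rate(p):
        # Normal approximation to a binomial count of conversions.
        conversions = rng.gauss(n_per_variant * p,
                                math.sqrt(n_per_variant * p * (1 - p)))
        conversions = min(max(conversions, 0.0), n_per_variant)
        return conversions / n_per_variant

    detected = 0
    for _ in range(runs):
        rate_c = observed_rate(p_control)
        rate_v = observed_rate(p_variant)
        # Pooled two-proportion z-test with equal sample sizes.
        pooled = (rate_c + rate_v) / 2
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_variant)
        if se > 0 and abs(rate_v - rate_c) / se > z_crit:
            detected += 1
    return detected / runs


# Hypothetical test: 3.0% baseline, true variant rate 3.5%,
# about 19,700 visitors per variant (sized for roughly 80% power).
power = simulate_detection_rate(0.03, 0.035, 19_700)
false_positive_rate = simulate_detection_rate(0.03, 0.03, 19_700)
print(f"Detected the real lift in {power:.0%} of runs")
print(f"Declared a winner with no real lift in {false_positive_rate:.0%} of runs")
```

Under these assumptions the first figure should land close to 80% and the second close to 5%, which is the power and significance trade-off in miniature.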

This is directly analogous to a scientific experiment. A researcher running a clinical trial with an insufficient sample size doesn’t conclude the drug doesn’t work when the result is inconclusive. They conclude the study wasn’t powered to answer the question. E-commerce teams draw the wrong conclusion from underpowered tests all the time, shelving interventions that might have been commercially valuable because the test said nothing rather than something.

The Minimum Detectable Effect: The Number You Should Calculate Before Every Test

Power is most practically expressed through a single output: the minimum detectable effect, or MDE. This is the smallest lift your test can reliably detect at your chosen power level, given your available traffic and test duration.

The MDE is determined by four inputs: your baseline conversion rate, your traffic volume per variant, your significance threshold, and your power target. You don’t need to understand the formula; there are plenty of free calculators to handle the maths. What you do need to understand is what the output tells you.

If your MDE comes back at 0.8 percentage points and you’re testing on a page with a 3% baseline conversion rate, your test can detect a lift from 3% to 3.8% or above. Anything smaller than that is invisible to the test at your available traffic level.
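For readers who want to see what a calculator is doing, here is a minimal sketch of the four-input calculation. It uses the standard normal approximation for a two-sided test and, as a simplification, gives both variants the baseline variance, so treat the output as a planning estimate. The function name and the 7,100 visitors per variant are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_variant,
                              alpha=0.05, power=0.80):
    """Approximate absolute MDE (in conversion-rate units) for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # power target
    # Simplification: both variants are given the baseline variance.
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)
    return (z_alpha + z_power) * se


# Hypothetical inputs: 3% baseline, 7,100 visitors per variant.
mde = minimum_detectable_effect(baseline_rate=0.03, n_per_variant=7_100)
print(f"MDE: {mde * 100:.2f} percentage points "
      f"({mde / 0.03:.0%} relative to baseline)")
```

With these assumed inputs the MDE comes out at roughly 0.8 percentage points. Dropping the traffic to around 500 visitors per variant pushes it to about 3 percentage points, which on a 3% baseline means only a doubling of conversion would be detectable.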

The question then becomes commercial: is a lift smaller than 0.8 percentage points on this page worth detecting? On £5M revenue, even a 0.5% relative improvement in conversion is worth about £25,000, assuming revenue scales in line with conversion. If the answer is yes, you need more traffic, a longer test, or a different page to test on.
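How much more traffic, or how much longer, can be estimated by turning the calculation around: fix the lift you care about and solve for the sample size. The sketch below uses the usual normal-approximation formula for two proportions. The target lift (3.0% to 3.5%) and the weekly traffic figure are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect baseline_rate -> target_rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + target_rate * (1 - target_rate))
    lift = target_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)


n = required_sample_size(0.03, 0.035)
weekly_visitors_per_variant = 5_000  # hypothetical traffic level
weeks = math.ceil(n / weekly_visitors_per_variant)
print(f"{n:,} visitors per variant, about {weeks} weeks at current traffic")
```

Because the required sample grows with the inverse square of the lift, halving the effect you want to detect roughly quadruples the traffic needed.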

If your MDE comes back at 3 percentage points on a page with a 3% baseline conversion rate, your test can only detect an effect that doubles your conversion rate. That is almost certainly not a realistic expectation from a UX change, which means the test is not worth running in its current form. Most teams discover this four weeks in, after the result comes back inconclusive.

Calculating the MDE before the test starts takes five minutes and prevents exactly that outcome.

Why Low Traffic Makes Power the Binding Constraint for Lean E-Commerce Teams

For enterprise e-commerce teams with millions of monthly sessions, power is rarely the problem. The traffic volumes involved mean that even small effects are detectable within sensible test windows. The MDE at that scale is often fractions of a percentage point, which means almost any real effect can be found.

But for e-commerce teams with tens of thousands of monthly sessions, power is almost always the binding constraint. The MDE at lower traffic volumes is substantially higher, which means tests can only reliably detect effects large enough to be visible through the noise of a smaller dataset. The commercial implication is direct: teams need to be more selective about what they test, more deliberate about where they test it, and more honest about what their traffic can and cannot tell them.
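The gap between those two situations is easy to quantify. The sketch below assumes a 3% baseline, a one-month test and an even split of monthly sessions across two variants. All three traffic levels are illustrative, and the MDE uses the same simplified normal approximation that most free calculators rely on.

```python
import math
from statistics import NormalDist


def absolute_mde(baseline_rate, n_per_variant, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided test (baseline variance in both arms)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)


baseline = 0.03
rows = []
for monthly_sessions in (20_000, 200_000, 2_000_000):
    n_per_variant = monthly_sessions // 2  # even split, one-month test
    mde = absolute_mde(baseline, n_per_variant)
    rows.append((monthly_sessions, mde))
    print(f"{monthly_sessions:>9,} sessions/month -> MDE {mde * 100:.2f} pp "
          f"({mde / baseline:.0%} relative)")
```

The MDE shrinks with the square root of traffic, so a site with 100 times the sessions can detect effects only 10 times smaller. For a lean team that is the case for concentrating tests on the highest-traffic pages and on changes large enough to clear the MDE.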

The worst outcome is not an inconclusive test. It is an inconclusive test that gets interpreted as a negative result, causing the team to abandon a genuinely valuable idea and move on to the next item in a roadmap.

Underpowered testing doesn’t just waste time; it actively misdirects the programme.

The Connection Between Power and Significance: Why Underpowered Significant Results Are the Most Dangerous Ones

Here is the counterintuitive part of the power story that most teams never encounter: an underpowered test that reaches statistical significance is not a lucky outcome. It is a warning sign.

When a test is underpowered, the only results that cross the significance threshold tend to be ones where random noise inflated the observed effect beyond its true size. The test was unlikely to detect a real 2% lift, but it did manage to detect what looked like a 5% lift, because an unusual cluster of sessions happened to fall in the variant at the right moment. The result is significant. The effect is overstated. And the variant gets released with expectations that production will never meet.
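This inflation can be reproduced with a small simulation. The sketch below is a simplified model, not a replay of any real test: it assumes a 3.0% baseline, a true variant rate of 3.2%, only 5,000 visitors per variant, and draws the observed lift from a normal approximation with a known standard error. It then looks only at the runs that came out as significant winners.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_winners(p_control, p_variant, n_per_variant,
                     alpha=0.05, runs=20_000, seed=7):
    """Return (share of runs declared winners, average observed lift among them)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    true_lift = p_variant - p_control
    # Normal approximation: standard error of the observed difference in rates.
    se = math.sqrt(p_control * (1 - p_control) / n_per_variant
                   + p_variant * (1 - p_variant) / n_per_variant)
    winning_lifts = []
    for _ in range(runs):
        observed_lift = rng.gauss(true_lift, se)
        if observed_lift / se > z_crit:  # significant and positive
            winning_lifts.append(observed_lift)
    avg = mean(winning_lifts) if winning_lifts else float("nan")
    return len(winning_lifts) / runs, avg


true_lift = 0.032 - 0.03
win_rate, avg_winning_lift = simulate_winners(0.03, 0.032, 5_000)
print(f"Share of runs declared winners: {win_rate:.1%}")
print(f"True lift: {true_lift * 100:.2f} pp")
print(f"Average lift reported by winners: {avg_winning_lift * 100:.2f} pp")
```

Under these assumptions fewer than one run in ten is declared a winner, and every one of those winners reports a lift several times larger than the true effect, because a smaller observed lift could not have crossed the threshold at this sample size.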

The statistical interplay between significance and power is complex and, to be honest, not something the majority of digital teams need to worry about. At the heart of the issue, however, is this:

Even if an AB test has found a statistically significant impact, you must check the statistical power to ensure it’s not just lucky chance.
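One hedged way to run that check is to ask what power the test had for a lift you would have considered realistic before seeing the data, rather than for the lift it happened to observe (power computed from the observed effect simply restates the p-value). The sketch below uses the normal approximation for a two-sided two-proportion test. The function name and every input are assumptions for illustration.

```python
import math
from statistics import NormalDist


def power_for_effect(p_control, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p_control * (1 - p_control) / n_per_variant
                   + p_variant * (1 - p_variant) / n_per_variant)
    z_effect = abs(p_variant - p_control) / se
    cdf = NormalDist().cdf
    # Probability of crossing the threshold in either direction.
    return cdf(z_effect - z_crit) + cdf(-z_effect - z_crit)


# Hypothetical: a "winning" test ran on 5,000 visitors per variant,
# and a realistic lift for this kind of change is 3.0% -> 3.2%.
realistic_power = power_for_effect(0.03, 0.032, 5_000)
print(f"Power for a realistic lift: {realistic_power:.0%}")
if realistic_power < 0.80:
    print("Treat the reported lift as likely overstated; consider a re-test.")
```

A significant result from a test with power this low is better read as a prompt to replicate than as a forecast of what production will deliver.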

Statistical Power in AB Testing: Ask the Question Before the Test Runs

Statistical power can be a confusing statistical concept. But it is also a practical question with a practical answer: given my traffic, my baseline, and my test duration, what is the smallest effect this test can reliably detect? If that number is commercially meaningful and realistic, the test is worth running. If it isn’t, running it anyway produces noise, not knowledge.

The discipline is straightforward: calculate the MDE before committing to a test. If the MDE is too large to be useful, adjust the inputs: test on a higher-traffic page, extend the duration, or focus on a metric with lower variance. If none of those adjustments are available, don’t run the test.

A null result from an adequately powered test is genuinely informative: it tells you the variation didn’t work at the magnitude you expected. A null result from an underpowered test tells you nothing at all. Knowing the difference between those two outcomes is what separates an experimentation programme that accumulates commercial knowledge from one that accumulates inconclusive results and calls it learning.

Strike a chord? We can help

If this article resonates with a challenge you are facing, we can help you build an experimentation programme that knows the difference between a test that failed and a test that was never capable of finding out.

Our Customer Experience Innovation programmes identify and respond to where the largest volumes of commercial value are being lost in your funnel, so even limited traffic is spent testing the insights that matter rather than the ones a best-practice library suggests.

We bring a forensic, scientific method to how tests are scoped, powered and read, so a null result tells you something real, an underpowered winner doesn’t get released on a number production will never meet, and every test you run earns its place against the traffic it consumes.

The outcome is a testing programme that accumulates trustworthy commercial impact, not academic wins.

Our mission

To combine expertise in data, insight and the scientific method, working with ambitious digital organisations to challenge, inform and support teams to deliver the greatest commercial impact from every investment in digital channels.