The Value of That P-Value

Decisions affecting the lives and livelihoods of millions of people should be made using the best possible information. That’s why researchers, public officials and anyone with views on social policy should pay attention to a long-running controversy in the world of statistics.

The lesson to be drawn from this debate: Whenever you see a claim of the form "x is significantly related to y," watch out.

At issue is a statistical test that researchers in a wide range of disciplines, from medicine to economics, use to draw conclusions from data. Let’s say you have a pill that’s supposed to make people rich. You give it to 30 people, and they wind up 1 percent richer than a similar group that took a placebo.

Before you can attribute this difference to your magic pill, you need to test your results with a narrow and dangerously subtle question: How likely would you be to get a result at least this large if your pill had no effect whatsoever? If this probability, or so-called p-value, is less than a stated threshold -- often set at 5 percent -- the result is deemed "statistically significant."
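To make that question concrete, here is a minimal sketch in Python of one way to estimate a p-value, a permutation test. It is an illustration, not the method any particular study used. The wealth figures are made up, and the function name `permutation_p_value` and its parameters are assumptions chosen for this example. The idea: if the pill does nothing, the labels "pill" and "placebo" are arbitrary, so shuffle them many times and count how often chance alone produces a gap at least as large as the one observed.

```python
import random
from statistics import mean

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose gap in group means is at least
    as large (in absolute value) as the gap actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the labels were handed out at random
        gap = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if gap >= observed:
            extreme += 1
    return extreme / n_shuffles

# Made-up wealth figures: 30 people per group, pill group about 1 percent richer.
data_rng = random.Random(42)
control = [data_rng.gauss(100.0, 10.0) for _ in range(30)]
treated = [data_rng.gauss(101.0, 10.0) for _ in range(30)]

p = permutation_p_value(treated, control)
print(f"Estimated p-value: {p:.3f}")
```

Note what the number answers: how surprising the data would be if the pill were useless. It does not say how likely it is that the pill works.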

The problem is, people tend to place great weight on this declaration of statistical significance without understanding what it really means. A low p-value does not, for example, mean that the pill almost certainly works. Any such conclusion would need more information -- including, for a start, some reason to think the pill could make you richer.

In addition, statistical significance is not policy significance. The size of the estimated effect matters, too. It might be so small as to lack practical or explanatory value, even though it’s statistically significant. The converse is also true: An estimated effect might be so strong as to demand attention, even though it fails the p-value test.
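A short Python sketch shows how the two kinds of significance can come apart. It uses a simple z-test that treats the spread of wealth as known, which is an assumption made here for simplicity. All the numbers, and the function name `two_sided_p`, are hypothetical.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a gap in means between two equal-sized groups,
    treating the standard deviation (sd) as known. A z-test, for simplicity."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical: a trivial 0.1-point gain, measured on a million people per group.
p_tiny_effect = two_sided_p(diff=0.1, sd=10.0, n_per_group=1_000_000)

# Hypothetical: a large 8-point gain, measured on only ten people per group.
p_big_effect = two_sided_p(diff=8.0, sd=10.0, n_per_group=10)

print(f"Tiny effect, huge sample:  p = {p_tiny_effect:.2g}")
print(f"Large effect, tiny sample: p = {p_big_effect:.2g}")
```

Under these assumptions the trivial gain clears the 5 percent bar easily and the large one does not. The p-value is tracking sample size as much as the size of the effect.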

These reservations apply even to statistical investigation done right. Unfortunately, it very often isn’t -- as the American Statistical Association made clear earlier this month in what amounts to an academic cri de coeur. Researchers commonly engage in "p-hacking," tweaking data in ways that generate low p-values but actually undermine the test. Absurd results can be made to pass the p-value test, and important findings can fail. Despite all this, a good p-value tends to be a prerequisite for publication in scholarly journals. As a result, only a small and unrepresentative sample of research ever sees the light of day.
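One common form of p-hacking can be simulated in a few lines of Python: measure many outcomes, then report only the best-looking one. The sketch below is an illustration under stated assumptions, not a description of any actual study. Every "study" here is pure noise, so any "significant" result is a false alarm. The function names and parameters are invented for the example.

```python
import math
import random
from statistics import fmean

def null_p_value(rng, n_per_group=30):
    """p-value from a z-test comparing two groups of pure noise (no real effect)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    z = (fmean(a) - fmean(b)) / math.sqrt(2.0 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2.0))

def share_significant(outcomes_tried, n_studies=500, alpha=0.05, seed=1):
    """Share of no-effect studies that can report p < alpha when the
    researcher tests several outcomes and keeps only the smallest p-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(null_p_value(rng) for _ in range(outcomes_tried))
        if best_p < alpha:
            hits += 1
    return hits / n_studies

honest = share_significant(outcomes_tried=1)
hacked = share_significant(outcomes_tried=20)
print(f"False alarms, one outcome tested:  {honest:.0%}")
print(f"False alarms, best of 20 outcomes: {hacked:.0%}")
```

With 20 independent tries, theory puts the chance of at least one p-value below 5 percent at 1 - 0.95^20, or about 64 percent. The simulation should land near that figure, against roughly 5 percent for the honest single test.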

Why aren’t bad studies rooted out? Sometimes they are, but academic success depends on publishing novel results, so researchers have little incentive to check the work of others. One rare replication project managed to confirm the results of only 11 out of 18 papers published in leading economics journals. That looks pretty good compared with psychology, where a similar (albeit contested) study of 98 papers produced a replication rate of less than half.

What to do? Journals that publish research, and institutions that fund it, should demand more transparency. Require researchers to document their work, including any negative or "insignificant" results produced along the way. Insist on replication. Supplement p-values with other measures, such as confidence intervals that indicate the size of the estimated effect as well as its statistical precision.
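As an illustration of the last suggestion, the Python sketch below reports an estimated effect together with a confidence interval. It relies on a normal approximation, which is an assumption that gets shakier with small samples. The data are made up, and `difference_interval` is a name chosen for this example.

```python
import math
import random
from statistics import NormalDist, mean, variance

def difference_interval(treated, control, level=0.95):
    """Estimated gap in means plus an approximate confidence interval,
    using a normal approximation (an assumption; rough for small samples)."""
    diff = mean(treated) - mean(control)
    standard_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95 percent
    margin = z * standard_error
    return diff, diff - margin, diff + margin

# Made-up wealth figures for 30 people per group.
data_rng = random.Random(7)
control = [data_rng.gauss(100.0, 10.0) for _ in range(30)]
treated = [data_rng.gauss(101.0, 10.0) for _ in range(30)]

diff, low, high = difference_interval(treated, control)
print(f"Estimated effect: {diff:.2f} (95% interval: {low:.2f} to {high:.2f})")
```

Unlike a bare p-value, the interval shows both how big the effect appears to be and how much uncertainty surrounds that estimate.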

Most important, users of statistics need to wise up to the limits of the science. Empirical studies are a vital guide to policy, but must be used carefully. Look at the evidence as a whole, and beware results that haven’t been repeated, or that depend on a single method of measurement. Hold findings to a higher standard if they conflict with common sense.

Policy makers can’t ask statistical analysis for certainty. That’s unattainable. But they can and should demand conclusions that are clear and realistic enough to withstand scrutiny.

To contact the senior editor responsible for Bloomberg View’s editorials: David Shipley at