How to Pick Your P-Value
The Oxford English Dictionary defines "significant" as "sufficiently great or important to be worthy of attention." It's a meaning that policy makers should keep in mind when weighing the statistical evidence for or against a course of action.
The word "significant" has a special place in the world of statistics, thanks to a test that researchers use to avoid jumping to conclusions from too little data. Suppose a researcher has what looks like an exciting result: She gave 30 kids a new kind of lunch, and they all got better grades than a control group that didn’t get the lunch. Before concluding that the lunch helped, she must ask the question: If it actually had no effect, how likely would I be to get a result at least this striking? If that probability, or p-value, is below a certain threshold -- typically set at 5 percent -- the result is deemed "statistically significant."
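One way to make that question concrete is a permutation test: if the lunch had no effect, the group labels are arbitrary, so shuffling them shows how often chance alone produces a gap as large as the one observed. The sketch below is only an illustration. The function name, the grade lists and the number of shuffles are all made up for the example, and the column does not say which test the researcher would actually use.

```python
import random


def permutation_p_value(treated, control, n_permutations=2000, seed=0):
    """Estimate a one-sided p-value for the gap in average grades.

    Hypothetical helper: it shuffles the group labels many times and counts
    how often a gap at least as large as the observed one appears by chance.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_treated = len(treated)
    observed = sum(treated) / n_treated - sum(control) / len(control)
    pooled = list(treated) + list(control)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        fake_treated = pooled[:n_treated]
        fake_control = pooled[n_treated:]
        gap = sum(fake_treated) / n_treated - sum(fake_control) / len(fake_control)
        if gap >= observed:
            at_least_as_large += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Made-up grades, purely for illustration (not data from any study).
lunch_grades = [82, 88, 91, 79, 85, 90, 87, 84, 93, 86]
control_grades = [78, 74, 81, 70, 77, 80, 72, 76, 79, 75]

p_value = permutation_p_value(lunch_grades, control_grades)
print(f"estimated p-value: {p_value:.4f}")
print("below the conventional 5 percent threshold:", p_value < 0.05)
```

The comparison on the last line is where the choice of threshold enters: the same p-value can pass one cutoff and fail another.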
Clearly, this statistical significance is not the same as real-world significance -- all it offers is an indication of whether you're seeing an effect where there is none. Even this narrow technical meaning, though, depends on where you set the threshold at which you are willing to discard the "null hypothesis" -- that is, in the above case, the possibility that there is no effect. I would argue that there's no good reason to always set it at 5 percent. Rather, it should depend on what is being studied, and on the risks involved in acting -- or failing to act -- on the conclusions.
Setting the threshold entails a trade-off. A lower threshold makes it less likely that a researcher commits what statisticians call a "Type I error" -- incorrectly rejecting the null hypothesis when it is actually true. But setting it higher can help avoid a "Type II error" -- in which the researcher fails to reject the null hypothesis when in fact it is false, as it would be if those lunches really did help kids get better grades.
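The trade-off can be put in numbers under a simple textbook model. The sketch below assumes a one-sided, two-sample z-test with known and equal variances, and it assumes the true effect is half a standard deviation; both are illustrative choices, not anything the lunch example specifies. In this model the chance of a Type I error equals the threshold itself, and the function (a hypothetical name) returns the matching chance of a Type II error.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def type_ii_error(alpha, effect_size, n_per_group):
    """Chance of missing a real effect at significance threshold alpha.

    Assumes a one-sided, two-sample z-test with unit variance in each group.
    effect_size is the assumed true gap, measured in standard deviations.
    """
    # How far the test statistic is shifted when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # The statistic must exceed this value for the null to be rejected.
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    # Probability that the shifted statistic still falls short.
    return STD_NORMAL.cdf(critical_value - shift)


# Assumed for illustration: 30 kids per group, true effect of 0.5 SD.
for alpha in (0.01, 0.05, 0.10, 0.20):
    beta = type_ii_error(alpha, effect_size=0.5, n_per_group=30)
    print(f"threshold {alpha:.2f}: Type I chance {alpha:.2f}, Type II chance {beta:.2f}")
```

Each step up in the threshold buys a lower chance of a Type II error at the price of a higher chance of a Type I error, which is the trade-off described above.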
Suppose a policy maker is deciding whether to undertake fiscal stimulus. The policy maker’s null hypothesis is that the stimulus will be slightly costly for the economy. However, there is some evidence and theory that suggests that the stimulus will be hugely beneficial. As a result, the policy maker believes that the cost of failing to do the stimulus when it actually would help (Type II error) is 10 times greater than the cost of doing it if it actually didn’t help (Type I error).
Should the policy maker use the conventional p-value threshold of 5 percent when testing the null hypothesis? In this case, setting a higher threshold might make a lot of sense, because a Type II error is 10 times as costly as a Type I error. As long as raising the threshold above 5 percent decreases the chance of a Type II error by at least a tenth as much as it increases the chance of a Type I error, it's worth doing.
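That rule amounts to choosing the threshold that minimizes a weighted sum: the chance of a Type I error times its cost, plus the chance of a Type II error times its cost. The sketch below searches a grid of thresholds under the same illustrative z-test model, with a hypothetical `shift` parameter standing in for how strongly the data could reveal a real benefit. Like the reasoning above, it weighs the two error chances directly and ignores how likely each hypothesis is to begin with. The value of `shift` and the grid are assumptions, not estimates for any real stimulus.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_chance(alpha, shift):
    """Chance of a Type II error in a one-sided z-test (illustrative model)."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift)


def weighted_cost(alpha, shift, cost_type_i, cost_type_ii):
    """Cost-weighted sum of the two error chances at threshold alpha."""
    return cost_type_i * alpha + cost_type_ii * type_ii_chance(alpha, shift)


def best_threshold(shift, cost_type_i, cost_type_ii, candidates):
    """Return the candidate threshold with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda alpha: weighted_cost(alpha, shift, cost_type_i, cost_type_ii),
    )


# Candidate thresholds from 1 percent to 50 percent, in 1-point steps.
grid = [i / 100 for i in range(1, 51)]

# Assumed for illustration: shift = 3.0; Type II error 10 times as costly.
chosen = best_threshold(shift=3.0, cost_type_i=1.0, cost_type_ii=10.0, candidates=grid)
equal = best_threshold(shift=3.0, cost_type_i=1.0, cost_type_ii=1.0, candidates=grid)
print(f"threshold with 10-to-1 costs: {chosen:.2f}")
print(f"threshold with equal costs:   {equal:.2f}")
```

Under these assumptions the 10-to-1 cost ratio pushes the chosen threshold well above the conventional 5 percent. A different `shift` or cost ratio would move it, which is why the cutoff should be chosen case by case.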
This example illustrates three lessons. First, researchers shouldn't blindly follow convention in picking an appropriate p-value cutoff. Second, in order to choose the right p-value threshold, they need to know how the threshold affects the probability of a Type II error. Finally, they should consider, as best they can, the costs associated with the two kinds of errors.
Statistics is a powerful tool. But, like any powerful tool, it can’t be used the same way in all situations.
This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.
To contact the author of this story:
Narayana Kocherlakota at firstname.lastname@example.org
To contact the editor responsible for this story:
Mark Whitehouse at email@example.com