How To Really Avoid P-Value Hacking In Statistics?

Cover image: a 2×2 table contrasting reality (effective / not effective) with the test outcome (fail / pass), with the four cells labelled false negative, true positive, true negative, and false positive.

P-value hacking is a phenomenon that goes a long way toward ensuring that most published statistical research findings are false. In my essay on statistical significance, I briefly covered what a p-value is.

In this essay, we will first do a quick recap of the statistical method and what the p-value means. Following this, we will see how p-value hacking occurs and what its direct consequences are. Finally, we will explore solutions to the problem at hand. In our increasingly data-driven world, everyone would benefit from awareness and knowledge of this topic.

The Statistical Method — A Recap

The modern statistical method can be summarized in the following steps:

1. Formulate the null hypothesis.

2. Conduct an experiment and record observations/data.

3. Calculate the probability of obtaining results at least as extreme as the observed ones, assuming that the null hypothesis is true. This probability value is known in the biz as the p-value.

4. If the p-value is less than a threshold value, your experimental results reject the null hypothesis and are consequently statistically significant. If the p-value is above the threshold value, the null hypothesis has not been rejected.

As a reminder, the null hypothesis is framed in such a way that it describes the hypothesis that you would like to reject/discredit. R.A. Fisher was one of the foremost pioneers of the modern statistical method. He preferred a threshold p-value of 0.05, and the convention stuck.

If all this information is too abstract for you, let us go through a quick example. Let us say that you are conducting a statistical experiment to see if a drug has the intended effect.

The corresponding null hypothesis is that the drug has no effect. Once you have the experimental results, you calculate the probability that results at least as extreme as yours would occur purely by chance if the drug indeed had no effect. This probability value is the p-value.

A very low p-value (say, below Fisher’s 0.05) suggests that the probability that the observed results occurred purely by chance is very low. So, it boosts your confidence that the drug indeed has the intended effect. So far, so good.
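To make these steps concrete, here is a minimal sketch in Python. The measurements are made up, and the choice of a two-sample t-test is my own illustrative assumption; the essay does not prescribe any particular test.

```python
# A minimal sketch of the drug example, assuming hypothetical measurements
# for a treatment group and a placebo group (the numbers are invented).
from scipy import stats

treatment = [5.1, 4.8, 6.2, 5.9, 5.4, 6.1, 5.7, 5.5]  # outcomes with the drug
placebo = [4.2, 4.9, 4.6, 5.0, 4.4, 4.8, 4.3, 4.7]    # outcomes without the drug

# Null hypothesis: the drug has no effect, i.e. both groups share the same mean.
t_stat, p_value = stats.ttest_ind(treatment, placebo)

print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Below Fisher's 0.05 threshold: the null hypothesis is rejected.")
else:
    print("Above the threshold: the null hypothesis is not rejected.")
```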

Before we move on to the issue of p-value hacking, it is beneficial to cover the issue of confidence hacking first.


The Philosophy of Statistics — The Issue of Confidence Hacking

Say that you are conducting a statistical analysis that tests 10,000 different organic ingredients for positive effects on human beings. Let us say that in reality, 20 of these ingredients truly have an effect (though we don’t know this at the time of the study).

After applying Fisher’s recommended p-value filter of 0.05, we end up with the following results table:

| Reality \ Test | Fail | Pass |
| --- | --- | --- |
| Effective | False negative: 10 (not good) | True positive: 10 (good!) |
| Not effective | True negative: 9,480 (good!) | False positive: 500 (very bad!) |
The Statistical Analysis Results Table — Created by the author

Out of the 20 ingredients that truly have a positive effect, 10 don’t make it past the 0.05 threshold (top-left corner). This is not good; but it is to be expected. The remaining 10 out of 20 ingredients that truly have an effect are correctly categorized as true positives (top-right corner).

Clearly, most of the ingredients that don’t have an effect are rejected by the analysis. This is a good thing, and is also to be expected (bottom-left corner). Finally, on the bottom-right corner, we have 500 ingredients that have no effect falsely classified as ingredients that have a positive effect. This is very bad. Still, this is ALSO to be expected: 500 is roughly 5% of the 9,980 ingredients that truly have no effect, which is exactly the false-positive rate a 0.05 threshold allows.

In short, we have ended up in a situation where we have statistically significant results that give us very little confidence: of the 510 ingredients that pass the test, only 10 (roughly 2%) truly have an effect.
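If you would like to check this thought experiment yourself, it is straightforward to simulate. The sketch below uses assumptions of my own (not from the essay): each ingredient gets a two-sample t-test on 30 measurements per group, and true effects shift the mean by about half a standard deviation, so the test detects only roughly half of them, mirroring the table above.

```python
# A rough simulation of the 10,000-ingredient thought experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_ingredients, n_effective, n_per_group = 10_000, 20, 30
effect_size = 0.5  # ~half a standard deviation: detects roughly half of the true effects

true_pos = false_neg = false_pos = true_neg = 0
for i in range(n_ingredients):
    effective = i < n_effective
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect_size if effective else 0.0, 1.0, n_per_group)
    p = stats.ttest_ind(treated, control).pvalue
    if effective and p < 0.05:
        true_pos += 1
    elif effective:
        false_neg += 1
    elif p < 0.05:
        false_pos += 1
    else:
        true_neg += 1

print(f"true positives:  {true_pos}")
print(f"false negatives: {false_neg}")
print(f"false positives: {false_pos}  (about 5% of the ~9,980 ineffective ingredients)")
print(f"true negatives:  {true_neg}")
```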

The principle of confidence hacking works as follows:

If “p < 0.05”, then the null hypothesis must be false. If “p > 0.05”, then the null hypothesis must be true.

Basically, the issue of confidence hacking treats the p-value filter as a hard truth-criterion (which it is not). If you wish to read an in-depth discussion on this topic, check out my essay on the philosophy of inferential statistics. But for now, let us move on to the main topic of this essay.

What is P-value Hacking?

P-value hacking is a special case of data manipulation that occurs when the p-value is right on the limit of being statistically significant (for example, p = 0.06). The biggest offenders in this domain are modern scientific researchers.

Say that a scientific researcher has been researching a hypothesis for the past three years. Finally, after gathering the funds to conduct a proper experiment, she sets off to work on the numbers. The statistics reveal that the p-value is 0.059 (for statistical significance, p needs to be less than 0.05 in her case).

After three years of hard work, the researcher is facing a situation where it might all end up meaningless. She goes through all of the data sets.

“Did I over-weight the pressure factor? Did I “by mistake” exclude any outliers in the data? Did I “by mistake” include any outliers in the data?

Okay, what happens if I remove these five data points? Wow, the p-value is now down to 0.041. Right, that settles that.”
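The cumulative effect of this kind of flexibility is easy to demonstrate with a toy simulation. The sketch below is entirely my own construction (not the researcher’s data): there is no real effect at all, yet allowing the analyst to drop up to five inconvenient data points noticeably inflates the rate of “significant” findings beyond the promised 5%.

```python
# Toy illustration of p-value hacking: the "drug" has NO effect, but the
# analyst may drop up to five inconvenient data points and keep the best p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n = 2_000, 40
honest_hits = hacked_hits = 0

for _ in range(n_experiments):
    treated = rng.normal(0.0, 1.0, n)  # no real effect in either group
    control = rng.normal(0.0, 1.0, n)

    p_full = stats.ttest_ind(treated, control).pvalue
    honest_hits += p_full < 0.05

    # "Hacked" analysis: also try dropping the 1..5 lowest treated values,
    # re-running the test each time, and keep the best p-value found.
    best_p = p_full
    sorted_treated = np.sort(treated)
    for k in range(1, 6):
        best_p = min(best_p, stats.ttest_ind(sorted_treated[k:], control).pvalue)
    hacked_hits += best_p < 0.05

print(f"honest analysis: {honest_hits / n_experiments:.1%} 'significant' results")
print(f"hacked analysis: {hacked_hits / n_experiments:.1%} 'significant' results")
```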

Would you blame the researcher? I would not. We all know of the teacher who gives away a couple of mercy points to students who are on the verge of failing. If two points are going to pass them, what’s wrong with giving the points away? Scientific researchers are human beings, just as teachers are.

Blame the system; don’t blame the players.

Even though I am empathetic to the researchers, small manipulative mistakes do add up. In 2005, John Ioannidis published a paper titled “Why Most Published Research Findings Are False” (linked in the references section at the end of the essay).

In this paper, he critically showed how such minor rule-bending adds up to catastrophic systemic consequences. Science involving statistical methods (especially social science) has become less trustworthy over the years.
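The core of Ioannidis’s argument is already visible in the ingredient table from earlier: how trustworthy a “significant” finding is depends on how rare true hypotheses are, how much power the test has, and the 0.05 threshold. Here is that calculation with the essay’s own numbers (the translation to code is mine).

```python
# The ingredient table reduced to its core arithmetic:
# what share of "statistically significant" findings is actually true?
prior = 20 / 10_000   # fraction of tested ingredients that truly have an effect
power = 10 / 20       # fraction of true effects the test catches (from the table)
alpha = 0.05          # false-positive rate the 0.05 threshold allows

true_positives = prior * power          # ~10 per 10,000 ingredients
false_positives = (1 - prior) * alpha   # ~500 per 10,000 ingredients
ppv = true_positives / (true_positives + false_positives)

print(f"Share of 'significant' findings that are real: {ppv:.1%}")  # roughly 2%
```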


How to Really Avoid P-Value Hacking in Statistics?

The issue we are facing here is systemic. So, it makes sense that the solution(s) be systemic as well. Firstly, there seems to be an asymmetry as to how statistical results are being handled. Currently, most science publications are interested in publishing “statistically significant” studies/research.

Research with statistically insignificant results goes into the back-office drawer, never to be seen again. This leads to two issues:

Firstly, the rest of the world does not know about the “failed” experiment/study. So, other researchers are bound to repeat the same experiment/study until the writing is on the wall for the entire community. This leads to the wastage of precious resources.

Secondly, researchers are incentivized to “cheat”. If a human being is punished for not p-value hacking and rewarded for p-value hacking, can you really blame the human being for his/her actions?

Publications need to start incentivizing statistically insignificant research results as well. This is not a silver bullet, but it is a start!

Fisher once wrote:

“A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”

– R.A. Fisher.

In short, even after a statistically significant result is published once, it could be purely by luck. So, publications also need to promote and publish follow-on research work that aims to reproduce results from first-time research. You’d be surprised to know how many studies are not reproducible or were purely due to luck.

Final Thoughts

To conclude, p-value hacking is rampant, and the situation is dire. However, the statistics community is already aware of the issue and is slowly taking action to resolve it.

As an end-user of statistical products in our data-driven world, it is immensely beneficial for you to be aware of the p-value hack. This enables you to ask the right questions before you make critical decisions!


References and credit: John Ioannidis, “Why Most Published Research Findings Are False” (PLoS Medicine, 2005), and Jordan Ellenberg.


Further reading that might interest you: Why Is The Hot Hand Fallacy Really A Fallacy? and How To Really Benefit From Curves Of Constant Width?
