Correlation (III): How To Avoid Berkson's Paradox

Illustration: a stick figure in a chef's hat stirs dough in a pan and says, "I'm skilled and ethical."

Berkson’s paradox is a veridical paradox that involves the notions of correlation and conditional probability. The physicist, physician and statistician Joseph Berkson first demonstrated this phenomenon in a retrospective study, and the paradox has carried his name ever since.

In my previous essays on correlation, I covered how two correlated variables are not necessarily causally linked.

There could be other variables, ones we are not even tracking, that actually cause the correlation. In the case of Berkson’s paradox, however, the observed correlation arises not from a common cause but from a common effect.

In this essay, I demonstrate this firstly using an intuitive example. Then, I cover the mathematics behind Berkson’s paradox. Finally, we look at how you can avoid this trap. Let us begin.


Skilled Chefs and Ethical Responsibility

For your new restaurant, let us say that you are looking for skilled chefs who have a strong sense of ethical responsibility. You invite chefs who tick at least one of the two boxes for an interview.

In other words, the interview candidates are ethically responsible or skilled or both. During the interview process, you realise that skill and ethical responsibility are inversely correlated.

That is, skilled chefs tend to have a low sense of ethical responsibility, and chefs with admirable ethical traits tend to be below average on skills. You get frustrated and wish to solve this problem for the whole gastronomy industry.

To start, you wish to get to the root cause of this correlation. You start by suspecting that the industry culture causes skilled chefs to lose ethical sense over time. But then, you hear a loud voice:

“Hang on a minute!”

That’s where I come in and say that there is something else going on here. Let me clarify.

Berkson’s Paradox Emerges

Firstly, your empirical observation (which is fictitious, by the way) is valid. Among the chefs you interviewed, there is indeed a negative correlation between skill and ethical responsibility.

However, this occurs not because of a common cause, but as a result of a common effect: being invited. Your entire dataset consists of “your” interview pool, and both skill and ethical responsibility influenced who got into it.

If you think about it, you never invited chefs who were below average on skills and below average on ethical responsibility to the interview.

Graphically, your interview pool would look like the shaded triangle inside the bigger square shown below. The unshaded region represents the subset of chefs who never made it to your interview list.

Figure: Skill vs. Ethical Responsibility. Ethical responsibility increases from left to right on the x-axis and skill increases from bottom to top on the y-axis. A relatively small shaded triangle sits in the top right-hand corner of the box. Illustration created by the author.

If you had considered the uninvited subset of chefs, your two variables might not be correlated at all or might even be positively correlated. We do not know because you never invited them.
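To see the effect in numbers, here is a minimal simulation sketch. It rests on assumptions that are not part of the example above: skill and ethics are modelled as independent uniform scores between 0 and 1, and "ticking a box" means scoring above a hypothetical threshold of 0.5. The function names are illustrative only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_chefs=20_000, threshold=0.5, seed=0):
    """Return (correlation among all chefs, correlation among invited chefs)."""
    rng = random.Random(seed)
    # Assumed population: skill and ethics are independent scores in [0, 1).
    chefs = [(rng.random(), rng.random()) for _ in range(n_chefs)]
    # Invitation rule: the chef ticks at least one of the two boxes.
    invited = [(s, e) for s, e in chefs if s > threshold or e > threshold]
    return pearson(*zip(*chefs)), pearson(*zip(*invited))


r_all, r_invited = simulate()
print(f"all chefs:     r = {r_all:+.3f}")
print(f"invited chefs: r = {r_invited:+.3f}")
```

Under these assumptions, the correlation among all chefs comes out close to zero, while the invited chefs show a clearly negative correlation, even though nothing in the simulated population links the two traits.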

Before we cover how to avoid this situation, let us look at the mathematics behind this phenomenon.


The Mathematics Behind Berkson’s Paradox

Let us now consider two independent events: A and B. Both ‘A’ and ‘B’ have an equal probability of occurring. That is:

P(A) = ½

P(B) = ½

To make things easier to visualise, let us consider a dataset of 100 chefs, with A representing a skilled chef and B representing an ethical chef.

Because A and B are independent and equally probable, each combination has a probability of ½ × ½ = ¼, so the 100 chefs would be distributed as shown in the table below:

Equal probability distribution of A and B (table created by the author):

|       | B  | NOT B |
|-------|----|-------|
| A     | 25 | 25    |
| NOT A | 25 | 25    |

Inviting chefs who are ethical or skilled or both is akin to reducing your sample space to the (yellow) shaded region of the table as shown below:

Shaded subset selection of A and B (table created by the author). Cells marked with * are shaded in yellow:

|       | B   | NOT B |
|-------|-----|-------|
| A     | 25* | 25*   |
| NOT A | 25* | 25    |

Given this setting, the probability that a chef on your interview list is skilled is given by:

P(A | A ∪ B)

Since your interview list is only made of chefs who are skilled or ethical or both, your probability arrives from the shaded region as follows:

P(A | A ∪ B) = 50/75 = 2/3 (considering shaded sample space only)

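Do you realise what is going on here? All of a sudden, the two independent events have become dependent! Within the shaded region, the probability that an ethical chef is also skilled is only 25/50 = ½, which is lower than the 2/3 for the interview pool as a whole. In other words, learning that an interviewed chef is ethical makes it less likely that the chef is skilled. The reason is that we have excluded the unshaded box.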
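The same numbers can be derived from the probabilities rather than the counts. The following is a short worked derivation using only the assumptions already stated (A and B independent, each with probability ½); the last line uses the fact that B ∩ (A ∪ B) is simply B.

```latex
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A)\,P(B) = \tfrac12 + \tfrac12 - \tfrac14 = \tfrac34 \\
P(A \mid A \cup B) &= \frac{P(A \cap (A \cup B))}{P(A \cup B)} = \frac{P(A)}{P(A \cup B)} = \frac{1/2}{3/4} = \frac23 \\
P(A \mid B \cap (A \cup B)) &= P(A \mid B) = P(A) = \tfrac12
\end{align*}
```

Inside the interview pool, then, the probability of A is 2/3 in general but drops to ½ once we know B has occurred, which is exactly the negative association observed during the interviews.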

Let us now repeat the calculation whilst considering the unshaded box from the table as well:

P(A | A ∪ B ∪ (NOT A) ∪ (NOT B)) = P(A) = 50/100 = ½ (considering both shaded and unshaded sample spaces)

When we consider the entire sample space (shaded and unshaded), the dependency goes away: P(A | B) = 25/50 = ½ = P(A), so knowing that a chef is ethical tells you nothing about skill, and the bias introduced by conditioning dissolves.
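The calculations in this section can be reproduced from the table counts. The sketch below is illustrative: the helper names `prob_skilled` and `phi` are my own, and the phi coefficient (the Pearson correlation of two yes/no variables) is added here as one convenient way to put a single number on the association; it is not part of the derivation above.

```python
from fractions import Fraction
from math import sqrt

# Counts from the 2x2 table; keys are (skilled, ethical).
counts = {
    (True, True): 25,
    (True, False): 25,
    (False, True): 25,
    (False, False): 25,
}


def prob_skilled(table, given=lambda skilled, ethical: True):
    """P(skilled | given), computed from a table of counts."""
    total = sum(n for (s, e), n in table.items() if given(s, e))
    hits = sum(n for (s, e), n in table.items() if s and given(s, e))
    return Fraction(hits, total)


def phi(table):
    """Phi coefficient: Pearson correlation of two binary variables."""
    n11, n10 = table.get((True, True), 0), table.get((True, False), 0)
    n01, n00 = table.get((False, True), 0), table.get((False, False), 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


# The interview pool drops the (NOT A) and (NOT B) cell.
pool = {cell: n for cell, n in counts.items() if cell[0] or cell[1]}

print("P(A | A or B)     =", prob_skilled(pool))                      # 2/3
print("P(A | B, in pool) =", prob_skilled(pool, lambda s, e: e))      # 1/2
print("P(A)              =", prob_skilled(counts))                    # 1/2
print("P(A | B)          =", prob_skilled(counts, lambda s, e: e))    # 1/2
print("phi, all chefs    =", phi(counts))                             # 0.0
print("phi, pool only    =", phi(pool))                               # -0.5
```

On the full table the association is exactly zero; on the interview pool it is -0.5, purely because one cell was removed.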

Now that we have covered the mathematics behind Berkson’s paradox, let us see how one can avoid it.

How To Avoid Berkson’s Paradox

Berkson’s paradox is a veridical paradox for a reason: the negative correlation you observed (again, in our fictitious example) is very much real within your sample. However, before we generalise any correlation beyond that sample, we have to look at how our dataset was put together.

To avoid Berkson’s paradox, ask yourself the following question:

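“Is my source dataset a random sample of the population I want to draw conclusions about?”

In the case of your interview list, you introduced a selection bias: by inviting only chefs who were skilled or ethical, you filtered the dataset on the very variables you later compared, and that filtering is what made them appear dependent. In our fictitious example, we could easily deduce the reason for the issue.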

However, in real-life applications, things are much more complex. Researchers can go very far with their clever statistical approaches without realising that there is a major issue with their source dataset.

Therefore, it makes sense to:

1. Be aware of Berkson’s paradox in the context of correlation.

2. Question your dataset for bias before claiming correlation.
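One rough way to act on the second point, when you happen to have records of the candidates who were not selected as well as those who were, is to check whether selection depended on both variables you intend to compare. The sketch below is a heuristic under that assumption, not an established test; the function names, the record layout and the 0.05 tolerance are all hypothetical, and each trait is assumed to have at least one record with and one without it.

```python
def selection_rates(records, trait):
    """Share of selected records among those with and without `trait`."""
    rates = {}
    for has_trait in (True, False):
        group = [r for r in records if r[trait] == has_trait]
        rates[has_trait] = sum(r["selected"] for r in group) / len(group)
    return rates


def selected_on_both(records, trait_x, trait_y, tolerance=0.05):
    """True if the selection rate differs noticeably across both traits."""
    def depends(trait):
        rates = selection_rates(records, trait)
        return abs(rates[True] - rates[False]) > tolerance
    return depends(trait_x) and depends(trait_y)


# Hypothetical applicant log built from the chef example: 25 chefs per cell,
# invited if they are skilled or ethical or both.
applicants = [
    {"skilled": s, "ethical": e, "selected": s or e}
    for s in (True, False)
    for e in (True, False)
    for _ in range(25)
]

print(selection_rates(applicants, "skilled"))              # {True: 1.0, False: 0.5}
print(selected_on_both(applicants, "skilled", "ethical"))  # True
```

A result of True says that both traits influenced who ended up in the dataset, which is the situation in which a correlation between them inside the dataset deserves suspicion.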


References and credit: Joseph Berkson and Jordan Ellenberg.
