Correlation (III): How To Avoid Berkson’s Paradox

Berkson’s paradox is a veridical paradox that involves the notions of correlation and conditional probability. The physicist/physician/statistician Joseph Berkson first demonstrated this phenomenon in a retrospective study, and his name stuck.

In my previous essays on correlation, I covered how two correlated variables might not necessarily be causally linked.

There could be other variables that we are not even tracking, which actually cause the correlation. But in the case of Berkson’s paradox, the observed correlation arises not from a common cause, but from a common effect.

In this essay, I first demonstrate this using an intuitive example. Then, I cover the mathematics behind Berkson’s paradox. Finally, we look at how you can avoid this trap. Let us begin.

This essay is supported by Generatebg

Skilled Chefs and Ethical Responsibility

For your new restaurant, let us say that you are looking for skilled chefs who have a strong sense of ethical responsibility. You invite chefs who tick at least one of the two boxes for an interview.

In other words, the interview candidates are ethically responsible or skilled or both. During the interview process, you realise that skill and ethical responsibility are inversely correlated.

That is, skilled chefs tend to have a low sense of ethical responsibility, and chefs with admirable ethical traits tend to be below average on skills. You get frustrated and wish to solve this problem for the whole gastronomy industry.

To start, you wish to get to the root cause of this correlation. You start by suspecting that the industry culture causes skilled chefs to lose ethical sense over time. But then, you hear a loud voice:

“Hang on a minute!”

That’s where I come in and say that there is something else going on here. Let me clarify.

Berkson’s Paradox Emerges

Firstly, your empirical observation (which is fictitious, by the way) is valid. There is indeed a negative correlation between skill and ethical responsibility among the chefs you interviewed.

However, this correlation is the result not of a common cause, but of a common effect. This effect is the fact that your entire dataset consists of “your” interview pool.

If you think about it, you never invited chefs who were below average on skills and below average on ethical responsibility to the interview.

Graphically, your interview pool would look like the shaded triangle inside the bigger square shown below. The unshaded region represents the subset of chefs who never made it to your interview list.

[Figure: Skill vs. Ethical Responsibility. A box-shaped plot in which ethical responsibility increases from left to right on the x-axis and skill increases from bottom to top on the y-axis. A relatively small triangle is marked and shaded in the top right-hand corner. Illustration created by the author.]

If you had considered the uninvited subset of chefs, your two variables might not be correlated at all or might even be positively correlated. We do not know because you never invited them.
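The effect can be reproduced with a small simulation. The sketch below is only an illustration under assumptions of my own: skill and ethical responsibility are drawn as independent uniform scores between 0 and 1, and "ticking a box" is taken to mean scoring above 0.5. None of these numbers come from real data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: each chef gets an independent (skill, ethics) score in [0, 1).
chefs = [(rng.random(), rng.random()) for _ in range(20_000)]

# Invitation rule: above 0.5 on skill, on ethics, or on both.
invited = [(s, e) for s, e in chefs if s > 0.5 or e > 0.5]

r_all = pearson(*zip(*chefs))
r_invited = pearson(*zip(*invited))

print(f"all chefs:     {r_all:+.2f}")
print(f"invited chefs: {r_invited:+.2f}")
```

Under these assumptions, the correlation across all chefs should come out close to zero, while the correlation among the invited chefs should come out clearly negative, even though nothing links the two traits.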

Before we cover how to avoid this situation, let us look at the mathematics behind this phenomenon.

The Mathematics Behind Berkson’s Paradox

Let us now consider two independent events: A and B. Both ‘A’ and ‘B’ have an equal probability of occurring. That is:

P(A) = ½

P(B) = ½

To make things easier to visualise, let us consider a dataset of 100 instances, with A representing the instance of skilled chef and B representing the instance of an ethical chef.

Based on equal probability, the occurrence distribution would look as shown in the below table:

Equal probability distribution of A and B (table created by the author):

	B	NOT B
A	25	25
NOT A	25	25

Inviting chefs who are ethical or skilled or both is akin to reducing your sample space to the shaded region of the table: the three cells in which A or B (or both) occur, 75 instances in all. The (NOT A) and (NOT B) cell stays unshaded.

Given this setting, the probability that a chef on your interview list is skilled is given by:

P(A | A ∪ B)

Since your interview list is only made of chefs who are skilled or ethical or both, this probability is computed from the shaded region as follows:

P(A | A ∪ B) = 50/75 = 2/3 (considering shaded sample space only)
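The arithmetic can be checked with a short Python sketch. The dictionary of counts mirrors the table of 100 instances; the variable names are my own and purely illustrative.

```python
from fractions import Fraction

# Counts from the 2x2 table; keys are (skilled, ethical), i.e. (A, B).
counts = {
    (True, True): 25,
    (True, False): 25,
    (False, True): 25,
    (False, False): 25,
}

# The interview pool keeps every cell where at least one trait is present (A ∪ B).
pool = {cell: n for cell, n in counts.items() if cell[0] or cell[1]}

pool_size = sum(pool.values())
skilled_in_pool = sum(n for (skilled, _), n in pool.items() if skilled)

p_a_given_pool = Fraction(skilled_in_pool, pool_size)
print(pool_size)        # 75
print(p_a_given_pool)   # 2/3
```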

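Do you realise what is going on here? All of a sudden, the two independent events have become dependent! Within the shaded region, knowing that a chef is ethical lowers the probability that the chef is skilled from 2/3 to 25/50 = ½, while knowing that a chef is not ethical raises it to 25/25 = 1. The reason is that we have excluded the unshaded box.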

Let us now repeat the calculation whilst considering the unshaded box from the table as well:

P[A | A ∪ B ∪ ~A ∪ ~B] = P(A) = 50/100 = ½ (considering both shaded and unshaded sample spaces)

When we consider the entire sample space (shaded AND unshaded), the dependency goes away, and the conditional probability bias dissolves.
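One way to watch the dependency appear and disappear is to compare the probability of A with and without knowledge of B in each sample space. The following is a minimal sketch based on the same counts of 25 per cell; the helper `prob_skilled` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

# Counts from the 2x2 table; keys are (skilled, ethical), i.e. (A, B).
counts = {
    (True, True): 25,
    (True, False): 25,
    (False, True): 25,
    (False, False): 25,
}

def prob_skilled(cells, given_ethical=None):
    """P(A) within `cells`, optionally restricted to B (True) or NOT B (False)."""
    if given_ethical is not None:
        cells = {c: n for c, n in cells.items() if c[1] == given_ethical}
    total = sum(cells.values())
    skilled = sum(n for (is_skilled, _), n in cells.items() if is_skilled)
    return Fraction(skilled, total)

full = counts                                              # shaded and unshaded
pool = {c: n for c, n in counts.items() if c[0] or c[1]}   # shaded only

for name, cells in (("full sample space", full), ("interview pool", pool)):
    print(
        name,
        "| P(A) =", prob_skilled(cells),
        "| P(A given B) =", prob_skilled(cells, True),
        "| P(A given NOT B) =", prob_skilled(cells, False),
    )
```

In the full sample space, all three values are 1/2, so learning about B tells you nothing about A. In the interview pool they are 2/3, 1/2 and 1, so learning that a chef is ethical makes it less likely that the chef is skilled.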

Now that we have covered the mathematics behind Berkson’s paradox, let us see how one can avoid it.

How To Avoid Berkson’s Paradox

Berkson’s paradox is a veridical paradox for a reason: the negative correlation you observed (again, from our fictitious example) is very much real within your interview pool. However, before we can claim that the correlation holds beyond that pool, we have to look at how our dataset was put together.

To avoid Berkson’s paradox, ask yourself the following question:

“Is my source dataset completely random?”

In the case of your interview list, you introduced a bias that made the dataset non-random enough to bring conditional probabilities into effect. In our fictitious example, we could easily deduce the reason for the issue.

However, in real-life applications, things are much more complex. Researchers can go very far with their clever statistical approaches without realising that there is a major issue with their source dataset.

Therefore, it makes sense to:

1. Be aware of Berkson’s paradox in the context of correlation.

2. Question your dataset for bias before claiming correlation.
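As one possible way to question a dataset, you could check whether some combination of the two variables is missing altogether. The sketch below is a rough heuristic and not a formal test; the function names, the cut-off of 0.5 and the five sample records are all made up for illustration.

```python
def quadrant_counts(records, x_cut, y_cut):
    """Count records in each (x is high, y is high) quadrant."""
    counts = {(xh, yh): 0 for xh in (False, True) for yh in (False, True)}
    for x, y in records:
        counts[(x >= x_cut, y >= y_cut)] += 1
    return counts

def empty_quadrants(records, x_cut, y_cut):
    """Return the quadrants that contain no records at all."""
    counts = quadrant_counts(records, x_cut, y_cut)
    return [quadrant for quadrant, n in counts.items() if n == 0]

# Hypothetical interview data: (skill, ethics) scores on a 0 to 1 scale.
interviewed = [(0.9, 0.2), (0.8, 0.4), (0.3, 0.9), (0.6, 0.7), (0.1, 0.8)]

missing = empty_quadrants(interviewed, 0.5, 0.5)
print(missing)  # [(False, False)]
```

An empty "low on both" quadrant, as in this made-up pool, is a hint that the way the data was collected may be producing the correlation rather than anything in the underlying population.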

References and credit: Joseph Berkson and Jordan Ellenberg.

