Berkson’s paradox is a veridical paradox that involves the notions of correlation and conditional probability. The physicist, physician and statistician Joseph Berkson first demonstrated this phenomenon in a retrospective study, and his name stuck.
In my previous essays on correlation, I covered how a correlation between two variables does not necessarily mean that one causes the other.
There could be other variables that we are not even tracking, which actually cause the correlation. But in the case of Berkson’s paradox, the observed correlation arises not from a common cause but from a common effect.
In this essay, I first demonstrate this using an intuitive example. Then, I cover the mathematics behind Berkson’s paradox. Finally, we look at how you can avoid this trap. Let us begin.
For your new restaurant, let us say that you are looking for skilled chefs who have a strong sense of ethical responsibility. You invite chefs who tick at least one of the boxes for an interview.
In other words, the interview candidates are ethically responsible or skilled or both. During the interview process, you realise that skill and ethical responsibility are inversely correlated.
That is, skilled chefs tend to have a low sense of ethical responsibility, and chefs with admirable ethical traits tend to be below average on skills. You get frustrated and wish to solve this problem for the whole gastronomy industry.
To start, you wish to get to the root cause of this correlation. You start by suspecting that the industry culture causes skilled chefs to lose ethical sense over time. But then, you hear a loud voice:
“Hang on a minute!”
That’s where I come in and say that there is something else going on here. Let me clarify.
Berkson’s Paradox Emerges
Firstly, your empirical observation (which is fictitious, by the way) is valid. There is indeed a negative correlation between skill and ethical responsibility in your interview pool.
However, this occurs not because of a common cause, but as a result of a common effect: both traits influence whether a chef gets invited, and your entire dataset consists of “your” interview pool.
If you think about it, you never invited chefs who were below average on skills and below average on ethical responsibility to the interview.
Graphically, your interview pool would look like the shaded triangle inside the bigger square shown below. The unshaded region represents the subset of chefs who never made it to your interview list.
Skill Vs. Ethical Responsibility — Illustration created by the author
If you had considered the uninvited subset of chefs, your two variables might not be correlated at all or might even be positively correlated. We do not know because you never invited them.
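The chef example can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the example itself: skill and ethical responsibility are drawn as independent uniform scores between 0 and 1, "ticking a box" means scoring above 0.5, and the names `pearson` and `simulate` are made up for this sketch.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_chefs=10_000, threshold=0.5, seed=0):
    """Return (correlation among all chefs, correlation in the interview pool)."""
    rng = random.Random(seed)
    # Assumption: skill and ethics are independent uniform scores,
    # so they are uncorrelated by construction.
    chefs = [(rng.random(), rng.random()) for _ in range(n_chefs)]
    # Interview pool: chefs above the threshold on at least one trait.
    pool = [(s, e) for s, e in chefs if s > threshold or e > threshold]
    r_all = pearson(*zip(*chefs))
    r_pool = pearson(*zip(*pool))
    return r_all, r_pool


r_all, r_pool = simulate()
print(f"all chefs:      r = {r_all:+.3f}")
print(f"interview pool: r = {r_pool:+.3f}")
```

Under these assumptions, the correlation among all chefs should come out close to zero, while the correlation inside the interview pool should be clearly negative, even though nothing in the simulated data links the two traits.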
Before we cover how to avoid this situation, let us look at the mathematics behind this phenomenon.
The Mathematics Behind Berkson’s Paradox
Let us now consider two independent events: A and B. Both ‘A’ and ‘B’ have an equal probability of occurring. That is:
P(A) = ½
P(B) = ½
To make things easier to visualise, let us consider a dataset of 100 instances, with A representing the instance of a skilled chef and B representing the instance of an ethical chef.
Based on equal probability and independence, each of the four combinations of A and B occurs 25 times, as shown in the table below:
Equal Probability distribution of A and B — Table created by the author
Inviting chefs who are ethical or skilled or both is akin to reducing your sample space to the (yellow) shaded region of the table as shown below:
Shaded subset selection of A and B — Table created by the author
Given this setting, the probability that a chef in your interview pool is skilled is given by:
P(A | A ∪ B)
Since your interview list is made up only of chefs who are skilled or ethical or both, this probability is computed from the shaded region as follows:
P(A | A ∪ B) = 50/75 = 2/3 (considering shaded sample space only)
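Do you realise what is going on here? All of a sudden, the two independent events have become dependent! Within the shaded region, 50 of the 75 chefs are skilled (2/3), but among the 50 ethical chefs in that region only 25 are skilled (½). In other words, learning that an invited chef is ethical lowers the chance that they are skilled. The reason is that we have excluded the unshaded box.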
Let us now repeat the calculation whilst considering the unshaded box from the table as well:
P(A | A ∪ B ∪ ¬A ∪ ¬B) = P(A) = 50/100 = ½ (considering both shaded and unshaded sample spaces)
When we consider the entire sample space (shaded AND unshaded), the dependency goes away, and the conditional probability bias dissolves.
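The same numbers can be reproduced by counting. The following is a minimal sketch, assuming the 100 instances split evenly into 25 per combination of A and B (which is what P(A) = P(B) = ½ with independence implies); the helper names `prob`, `skilled`, `ethical` and `invited` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
from itertools import product

# 100 chefs, 25 per cell of the table: (skilled, ethical) pairs.
chefs = [cell for cell in product([True, False], repeat=2) for _ in range(25)]


def skilled(chef):   # event A
    return chef[0]


def ethical(chef):   # event B
    return chef[1]


def invited(chef):   # event A ∪ B, the shaded region
    return chef[0] or chef[1]


def prob(event, given=lambda chef: True):
    """P(event | given), computed by counting chefs."""
    conditioned = [chef for chef in chefs if given(chef)]
    hits = [chef for chef in conditioned if event(chef)]
    return Fraction(len(hits), len(conditioned))


p_a = prob(skilled)                                   # whole sample space
p_a_pool = prob(skilled, given=invited)               # shaded region only
p_a_pool_ethical = prob(
    skilled, given=lambda c: invited(c) and ethical(c))
p_a_pool_unethical = prob(
    skilled, given=lambda c: invited(c) and not ethical(c))

print(p_a, p_a_pool, p_a_pool_ethical, p_a_pool_unethical)
# prints: 1/2 2/3 1/2 1
```

Inside the shaded region, the probability of being skilled depends on whether the chef is ethical (½ versus 1), which is exactly the dependence that selection introduces. Over the whole sample space it is ½ either way.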
Now that we have covered the mathematics behind Berkson’s paradox, let us see how one can avoid it.
How To Avoid Berkson’s Paradox
Berkson’s paradox is a veridical paradox for a reason; the negative correlation you observed (again, from our fictitious example) is very much true within your sample. However, before we can claim any correlation in general, we have to look at how our dataset was put together.
To avoid Berkson’s paradox, ask yourself the following question:
“Is my source dataset completely random?”
In the case of your interview list, you introduced a selection bias: by inviting only chefs who ticked at least one box, you made the dataset non-random enough to make two independent traits look dependent. In our fictitious example, we could easily deduce the reason for the issue.
However, in real-life applications, things are much more complex. Researchers can go very far with their clever statistical approaches without realising that there is a major issue with their source dataset.
Therefore, it makes sense to:
1. Be aware of Berkson’s paradox in the context of correlation.
2. Question your dataset for bias before claiming correlation.