Correlation (II): How To Spot The Surrogate Endpoint Problem?
Published on September 17, 2022 by Hemanth
--
When it comes to correlation, the surrogate endpoint problem is a persistent challenge that remains unsolved to this day. It subtly catches researchers off guard and leads to false conclusions and poor decisions.
In my previous essay on correlation, I covered Francis Galton’s ingenious discovery and some of the traps in analysing correlation, such as intransitivity and non-linear relationships between correlated variables.
In this essay, I will be covering the much more challenging issue of the surrogate endpoint problem. We will begin by setting up a simple correlation scenario between two binary variables.
Then, we will briefly cover how this simple setup leads to statistical traps. Following this, we will see how the surrogate endpoint problem presents itself naturally. Without any further ado, let us begin.
Consider the hypothetical example that a vegan diet is positively correlated with COVID-19 infection. Both of these factors could be treated as binary variables. That is, both variables could be answers to yes or no questions.
A person can either follow a complete vegan diet or not. Similarly, a person can be infected with COVID-19 or not. Based on the positive correlation, we could say that a vegan person is more likely than the average person to get infected by COVID-19.
This is the same as saying that a person infected by COVID-19 is more likely than the average person to be vegan. Do you realise that both of these are logically equivalent statements?
If not, why don’t we look at their mathematical expressions? The first statement can be expressed mathematically as follows:
(Vegans with COVID-19) / (All vegans) > (All people with COVID-19) / (All people)
Similarly, the mathematical expression for the second statement is as follows:
(Vegans with COVID-19) / (All people with COVID-19) > (All vegans) / (All people)
To see the logical equivalence more clearly, multiply both sides of the first expression by [(All vegans)*(All people)]. Similarly, multiply both sides of the second expression by [(All people with COVID-19)*(All people)]. Since all of these counts are positive, the direction of the inequality does not change. In both cases, you will get the following expression:
(Vegans with COVID-19) * (All people) > (All people with COVID-19) * (All vegans)
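If you would rather check this with numbers than with algebra, here is a minimal Python sketch. The function name `statements` and all the counts are invented purely for illustration; they are not data from any study.

```python
from fractions import Fraction

def statements(vegans_with_covid, all_vegans, all_covid, all_people):
    """Evaluate both statements and the cross-multiplied form for given counts."""
    # Statement 1: vegans are more likely than average to have COVID-19.
    first = Fraction(vegans_with_covid, all_vegans) > Fraction(all_covid, all_people)
    # Statement 2: people with COVID-19 are more likely than average to be vegan.
    second = Fraction(vegans_with_covid, all_covid) > Fraction(all_vegans, all_people)
    # The shared form obtained after multiplying out the denominators.
    cross = vegans_with_covid * all_people > all_covid * all_vegans
    return first, second, cross

# Hypothetical counts: 1000 people, 100 vegans, 200 infected, 30 infected vegans.
print(statements(30, 100, 200, 1000))  # (True, True, True)
# Fewer infected vegans flips all three comparisons together.
print(statements(10, 100, 200, 1000))  # (False, False, False)
```

Whatever positive counts you plug in, the three comparisons always agree with one another, which is exactly what logical equivalence means here.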
So, now that we have established the logical equivalence, let us move on to the statistical treatment of correlation.
Statistics and Correlation
The last expression exposes a prominent problem. For any meaningful sample size, there is only a very slim chance that the left-hand side of the expression would be exactly equal to the right-hand side.
Correlation presentation cartoon— illustrative art created by the author
What this means is that the two variables will almost always appear correlated in a sample, one way or another. And there is nothing special about being vegan or suffering from COVID. You can expect gender or blood type or any binary variable to be positively or negatively correlated with COVID risk.
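A small simulation can illustrate the point. The sketch below generates two binary variables that are independent by construction and counts how often the two sides of the expression come out exactly equal. The function name `exact_tie_rate`, the probabilities, and the sample sizes are all assumptions chosen for illustration.

```python
import random

def exact_tie_rate(n_people=1000, n_trials=500, p_a=0.1, p_b=0.2, seed=0):
    """Fraction of simulated samples in which both sides are exactly equal."""
    rng = random.Random(seed)
    ties = 0
    for _ in range(n_trials):
        # Two binary traits drawn independently of each other.
        a = [rng.random() < p_a for _ in range(n_people)]
        b = [rng.random() < p_b for _ in range(n_people)]
        both = sum(x and y for x, y in zip(a, b))
        # Left-hand side versus right-hand side of the cross-multiplied form.
        if both * n_people == sum(a) * sum(b):
            ties += 1
    return ties / n_trials

print(exact_tie_rate())
```

Even though neither variable influences the other, nearly every simulated sample shows a small positive or negative correlation rather than an exact tie.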
So, how do we solve this problem? That is where the statisticians come in. When you read reports on correlation based on scientific studies, what you get are statistically significant correlations.
The concept of statistical significance leads to a whole host of issues that I have covered in a dedicated essay. But the issues don’t stop there. Even with statistically significant correlation, causal misinterpretation shows up again!
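To make the idea of statistical significance a little more concrete, here is a minimal sketch of a permutation test, which is one common way (among several) to ask whether an observed association could plausibly be due to chance. The function name `permutation_p_value` and the toy data are hypothetical and only serve the illustration.

```python
import random

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for association between two binary lists."""
    rng = random.Random(seed)
    n = len(a)

    def excess(x, y):
        # Observed co-occurrences minus the count expected under independence.
        both = sum(1 for xi, yi in zip(x, y) if xi and yi)
        return both - sum(x) * sum(y) / n

    observed = abs(excess(a, b))
    shuffled = list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling b breaks any real link between a and b.
        rng.shuffle(shuffled)
        if abs(excess(a, shuffled)) >= observed:
            extreme += 1
    # The +1 terms keep the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Toy data: two traits that line up perfectly, so the p-value is tiny.
linked = [1] * 50 + [0] * 150
print(permutation_p_value(linked, linked))
```

A small p-value only says that chance alone is an unlikely explanation for the association. It says nothing about which variable, if any, causes the other.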
Causal Misinterpretation from Correlation
Let us say that we indeed managed to establish in a statistically significant fashion that being vegan is positively correlated with fatal COVID-19 infections. Consequently, the following statement is in order:
“If you are vegan, you are more likely than the average person to be fatally infected with COVID-19.”
This statement states factually what we know so far. But it is missing the flair and finesse that mainstream media generally looks for.
So, you will often find the following statement instead of the above one in mainstream media:
“If you were not vegan, you would be less likely to be fatally infected with COVID-19.”
The difference is subtle, but the implications are huge. While the first statement was factual, the second statement implies a causal link between the two variables.
We have in no way proved that there is a causal link between the variables from the statistically significant correlation alone. This issue directly leads us to the main issue we are covering in this essay.
The Surrogate Endpoint Problem
The surrogate endpoint problem arises naturally in many correlation scenarios. Consider the previous example of the correlation between being vegan and fatal COVID-19 infections.
It is very resource-intensive and time-consuming to run scientific studies that quantitatively measure the risk of fatal COVID-19 infection associated with veganism. The researchers would have to wait for vegans to die from COVID-19.
Correlation study cartoon — illustration created by the author
So, instead of waiting, the researchers try to find a surrogate endpoint. This could be some biomarker such as blood oxygen levels.
If the blood oxygen level of a vegan infected with COVID-19 drops below a threshold number, the researchers might count the case as one of fatal risk.
The surrogate endpoint, then, is a proxy that takes the place of a much more complex phenomenon.
The danger is that the actual phenomenon and the surrogate endpoint might not be causally linked at all. Both might instead be consequences of some factor that we never even tracked.
As a result, our analysis might lead to false conclusions and erroneous decisions.
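The following simulation sketches how this can happen. Everything in it is an assumption made up for illustration: a hidden factor `z` (think of some untracked underlying condition) drives both the surrogate `s` (say, low blood oxygen) and the real outcome `y` (death), while `s` itself has no effect on `y` at all.

```python
import random

def simulate(n=50_000, seed=0):
    """Generate (hidden, surrogate, outcome) triples; all probabilities are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.3                   # hidden factor, never tracked
        s = rng.random() < (0.8 if z else 0.1)   # surrogate depends only on z
        y = rng.random() < (0.5 if z else 0.05)  # outcome depends only on z
        rows.append((z, s, y))
    return rows

def outcome_rate(rows, surrogate, hidden=None):
    """Share of outcomes among rows with the given surrogate (and hidden) value."""
    selected = [y for z, s, y in rows
                if s == surrogate and (hidden is None or z == hidden)]
    return sum(selected) / len(selected)

rows = simulate()
# Ignoring z, the surrogate looks like a strong predictor of the outcome.
print("all:", round(outcome_rate(rows, True), 3), round(outcome_rate(rows, False), 3))
# Once z is held fixed, the surrogate tells us almost nothing.
for z in (True, False):
    print("z =", z, round(outcome_rate(rows, True, z), 3),
          round(outcome_rate(rows, False, z), 3))
```

In this toy world, the surrogate and the outcome are strongly correlated overall, yet the association all but disappears once the hidden factor is held fixed. Any intervention that changed the surrogate alone would leave the real outcome untouched.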
How to Spot the Surrogate Endpoint Problem
Whenever you read phrases like “This spray increases cancer risk in users…” or “Eating this food increases cardiovascular health risk…”, pay attention to what was actually measured to reach such conclusions.
Phrases such as “cardiovascular health risk” or “cancer risk” are usually quantified by some surrogate/proxy. This does not always lead to the surrogate endpoint problem, but the issue always lurks around such studies.
More than 130 years after Galton’s discovery of the notion of correlation, we still struggle to thread our way between correlation and causation thanks to issues such as statistical significance and the surrogate endpoint problem.
For all we know, the root cause of this issue could be our (human) nature rather than our (lack of) scientific progress. But that does not mean that we should stop trying to solve such problems.
Such is our strong drive for scientific progress and evolution as a species!