Be it the medical sciences or the social sciences, correlation is at the heart of scientific discovery. Say that you wish to invent a new drug to cure a disease. You could just gather a set of biomarkers that are positively correlated with curing the disease.
Then, you just test plausible chemical combinations that aid this "positive correlation". Trial and error is your friend. At some point, you should have a new drug that cures the disease.
Now, say that you wish to analyse whether the price of your new drug has any effect on your company's stock market price, or vice versa. You perform a "linear regression" analysis on the two variables (based on some past data).
This is standard practice in the biz, and your analysis shows that the price of your drugs and your company's stock market price are uncorrelated. So, you could go ahead and price your latest drug without regard to your company's stock market price. Pretty cool, right?
Well, not quite. The results of your drug development method as well as your linear regression analysis might be misleading you. And your misunderstanding of the word "correlation" is to blame.
This essay aims to correct that. We will be starting with the history of how correlation was discovered. Following this, we will swing back and forth between regression and correlation with a slight touch of analytic geometry.
It would help if you have already read my essay on regression. At the end of this essay, you should have a better understanding of both correlation and regression. This, in turn, should help you avoid the common pitfalls. Without any further ado, let us begin!
Regression vs. Heredity: Galton's Genius
Our story begins with the renowned polymath, Francis Galton. Among his many adventures, he was studying the nature of heredity. He had already figured out that children of tall parents tended to be taller than average, but still shorter than their parents.
On the other hand, children of short parents tended to be shorter than average, but still taller than their parents. Galton termed this phenomenon "regression". In essence, he had discovered regression to the mean.
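To see regression to the mean in action, here is a minimal simulation sketch. It is not Galton's data or method: the population mean, the spread, and the `HEREDITY` weight are all made-up assumptions, and `simulate_child` is a hypothetical helper that mixes an inherited part with a chance part.

```python
import random
import statistics

random.seed(0)

MEAN_HEIGHT = 170.0  # assumed population mean in cm (illustrative)
SD_HEIGHT = 7.0      # assumed population standard deviation in cm (illustrative)
HEREDITY = 0.5       # assumed weight of heredity: 0 = pure chance, 1 = pure heredity


def simulate_child(parent_height):
    """Hypothetical model: inherited share of the parent's deviation plus chance."""
    inherited = HEREDITY * (parent_height - MEAN_HEIGHT)
    # Scale the chance term so children have the same overall spread as parents.
    chance = random.gauss(0.0, SD_HEIGHT * (1 - HEREDITY ** 2) ** 0.5)
    return MEAN_HEIGHT + inherited + chance


parents = [random.gauss(MEAN_HEIGHT, SD_HEIGHT) for _ in range(50_000)]
pairs = [(p, simulate_child(p)) for p in parents]

# Split off the clearly tall and clearly short parents.
tall = [(p, c) for p, c in pairs if p > MEAN_HEIGHT + SD_HEIGHT]
short = [(p, c) for p, c in pairs if p < MEAN_HEIGHT - SD_HEIGHT]

tall_parent_avg = statistics.mean(p for p, _ in tall)
tall_child_avg = statistics.mean(c for _, c in tall)
short_parent_avg = statistics.mean(p for p, _ in short)
short_child_avg = statistics.mean(c for _, c in short)

print(f"tall parents:  {tall_parent_avg:.1f} cm, their children: {tall_child_avg:.1f} cm")
print(f"short parents: {short_parent_avg:.1f} cm, their children: {short_child_avg:.1f} cm")
```

Under these assumptions, the children of tall parents land between the population mean and their parents' average, and the children of short parents do the same from below. Changing `HEREDITY` changes how strongly the children are pulled back toward the mean.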
But he wasn't satisfied with just that. He wanted to quantify the effect of regression. Clearly, it wasn't just regression which decided a child's height. Heredity also played a role. But how much did each affect the final height? That's pretty much the question he was trying to answer.
Correlation: Galton's Ellipse
Galton collected parent-child height data and created a table which featured a map between parents' heights and adult children's heights. I'm skipping the fine details here on how exactly he computed these, but this suffices for our requirements in this essay.
He noticed something peculiar. He began seeing an elliptical shape that did not appear random. Here is a diagram he published based on his data in his 1886 paper "Regression Towards Mediocrity in Hereditary Stature":
This diagram reveals the interplay between heredity and chance. In the forthcoming section, we will briefly see how variations in both of these variables could affect this interplay.
Galton then decided to plot each parent-child height pair as a point in a two-dimensional plane; he put parents' heights on the horizontal axis and children's heights on the vertical axis. Mind you, back then Cartesian graphs were not remotely close to the norm.
By doing this, Galton had in effect arrived at what we know today as scatter plots. While we are on the topic of scatter plots, why don't we see how heredity and chance affect the outcome of the graphical results?
Introduction to Correlation via Scatter Plots
Let us now say that chance has no role to play in Galton's parent-child dataset. In such a case, heredity alone governs the adult child's height, and every child would be exactly as tall as the parent. The scatter plot would then look like this:
It is no wonder that we see just points along a diagonal line here, because we have a situation where (x = y). Having said that, note that the points thin out toward both ends of the line, since extreme heights are rarer than average ones.
Now, let us say that heredity has no role to play in Galton's parent-child dataset. In this realm, chance governs the outcome 100%. The corresponding scatter plot would look like this:
This scatter plot shows no affinity to the diagonal. In other words, the child's height and parent's height are independent of each other. So, regardless of the parent's genes, the child's characteristics are 100% luck of the draw (chance).
As you can imagine, both of these are extreme cases. The reality slots somewhere in between. Galton's dataset led to a scatter plot that looked like this:
Here, we can clearly see the ellipse running from the bottom-left corner to the top-right corner. Galton was so methodical about confirming this result that he went to the trouble of concealing the data's background (to remove prejudices) and consulting a mathematician to confirm his observation; old-school peer review, if you will.
Drawing Conclusions from the Scatter Plots
Comparing all three scatter plots, we could arrive at the following three empirical conclusions:
1. When the outcome is completely deterministic (that is, controlled 100% by heredity), the ellipse collapses into a straight (diagonal) line.
2. When the outcome is completely governed by chance, the ellipse expands to become a circle (roughly speaking).
3. When the outcome is governed by a mix of both, an ellipse of "some" level of eccentricity results.
Galton termed the measure of this eccentricity (of the ellipse) correlation. Over time, the measure of correlation has been advanced by impressive contributions from people such as Karl Pearson. Today, we apply the concept of correlation to data-sets that span multiple dimensions (a topic for another day).
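The three cases above can be reproduced with a small simulation. This is only a sketch: it uses the modern Pearson correlation coefficient rather than Galton's original measure, works with standardised heights (mean 0, spread 1), and the `heredity` weight and the helper names `pearson_r` and `simulate` are assumptions made for illustration.

```python
import math
import random

random.seed(1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(heredity, n=20_000):
    """Return (parents, children) as standardised heights.

    heredity = 1 means the child copies the parent exactly,
    heredity = 0 means the child's height is pure chance.
    """
    parents = [random.gauss(0, 1) for _ in range(n)]
    noise_sd = math.sqrt(1 - heredity ** 2)
    children = [heredity * p + random.gauss(0, noise_sd) for p in parents]
    return parents, children


r_line = pearson_r(*simulate(1.0))     # heredity only: points on the diagonal
r_circle = pearson_r(*simulate(0.0))   # chance only: a roughly circular cloud
r_ellipse = pearson_r(*simulate(0.5))  # a mix: an ellipse in between

print(f"heredity only: r = {r_line:.3f}")
print(f"chance only:   r = {r_circle:.3f}")
print(f"mix of both:   r = {r_ellipse:.3f}")
```

In this setup, the straight line corresponds to a coefficient of 1, the circular cloud to a coefficient near 0, and the ellipse to a value in between that tracks the assumed `heredity` weight.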
Back to our main challenge: Why are the methods you followed for your drug company misleading you? Let's jump right into the answers.
Correlation: One of the Most Misunderstood Concepts in Science
You might have heard the phrase "Correlation does not mean causation". This phrase has almost become mainstream. What it means is that two correlated variables need not be causally linked.
For instance, it could be that the biomarkers you have gathered are positively correlated with curing the disease. However, it is NOT a given that they are causally linked. In other words, having the required biomarkers does not guarantee a cure.
Since this issue has become more or less mainstream, many researchers are able to wrap their heads around reading correlation information without imposing causal biases. However, the intransitivity of correlation is something that still catches many researchers off guard.
To understand transitivity, let us consider the following relationship: (a > b > c). From this, we see that "a" is greater than "b" and "b" is greater than "c". Based on this, we could say for sure that "a" is greater than "c".
This property of carrying a relationship over from one pair of variables to another is known as transitivity. "Greater than" relationships are transitive. However, correlation is NOT.
Here's your situation: your new drug boosts the biomarker readouts. These biomarkers are in turn positively correlated with the disease's cure. Based on this, the intuitive conclusion many would make is that the new drug helps cure the disease. But you see, this is NOT a given.
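A toy simulation makes this concrete. Everything in it is hypothetical: `drug`, `cure` and `biomarker` are illustrative standardised scores, and the biomarker is assumed to respond to both the drug and the recovery, while the drug and the recovery are generated independently of each other.

```python
import math
import random

random.seed(2)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


N = 20_000
# Hypothetical standardised scores; the names are illustrative only.
drug = [random.gauss(0, 1) for _ in range(N)]  # amount of drug given
cure = [random.gauss(0, 1) for _ in range(N)]  # recovery, independent of the drug
# Assumption: the biomarker readout reacts to BOTH the drug and the recovery.
biomarker = [d + c for d, c in zip(drug, cure)]

r_drug_marker = pearson_r(drug, biomarker)
r_marker_cure = pearson_r(biomarker, cure)
r_drug_cure = pearson_r(drug, cure)

print(f"drug vs biomarker: r = {r_drug_marker:.2f}")
print(f"biomarker vs cure: r = {r_marker_cure:.2f}")
print(f"drug vs cure:      r = {r_drug_cure:.2f}")
```

In this construction, the drug is strongly correlated with the biomarker (about 0.7), and the biomarker is just as strongly correlated with the cure, yet the drug and the cure are essentially uncorrelated. Two positive correlations in a chain do not force a positive correlation between the ends.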
To drive home this point, consider a child that prefers ice-cream "a" to ice-cream "b". The same child prefers ice-cream "b" to ice-cream "c". Does this mean that the child prefers ice-cream "a" to ice-cream "c"?
Anyone who has interacted with an ice-cream-loving picky child would not make that "assumption". But wait, there's more!
Uncorrelated Variables Are NOT Always Unrelated
In the introduction, we saw that the price of your drugs and your company's stock market price were uncorrelated. This might well be the case, but that does NOT mean that they are unrelated.
The dirty little secret of most of the regression/correlation analyses conducted in scientific research is that they look for linear relationships. This is a conceptual simplification. Not all relationships are linear, and not all uncorrelated variables are unrelated.
A non-linear relationship between your drug prices and your company's stock market price could take you for a ride if your decisions ignore the possibility that these two variables "become" correlated in the future or outside the range of your dataset.
A linear analysis that says two variables are uncorrelated simply says that they do not have a linear relationship. Many researchers still fall into the trap of concluding that the two variables in question are independent of each other.
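Here is a minimal sketch of two variables that are uncorrelated yet completely dependent. The setup is an assumption chosen for illustration: `x` is drawn symmetrically around zero and `y` is simply `x` squared, so `y` is fully determined by `x` even though no straight line captures the relationship.

```python
import math
import random

random.seed(3)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


N = 20_000
x = [random.uniform(-1, 1) for _ in range(N)]  # symmetric around zero
y = [value ** 2 for value in x]                # fully determined by x, but not linearly

r_linear = pearson_r(x, y)
# Looking at the size of x instead of x itself exposes the hidden relationship.
r_magnitude = pearson_r([abs(value) for value in x], y)

print(f"x vs y:   r = {r_linear:.3f}")
print(f"|x| vs y: r = {r_magnitude:.3f}")
```

The linear coefficient between `x` and `y` comes out close to zero, while the coefficient between the magnitude of `x` and `y` is close to 1. A near-zero linear correlation rules out a straight-line relationship in the data at hand, and nothing more.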
References and credit: Francis Galton and Jordan Ellenberg.