Correlation: One Of The Most Misunderstood Concepts In Science - An illustration featuring a comic, where a stick figure standing behind an announcer table is pointing to a presentation of what appears like linear regression data. The stick figure says, "Here, we see correlation between correlation and causation!"

Be it the medical sciences or the social sciences, correlation is at the heart of scientific discovery. Say that you wish to invent a new drug to cure a disease. You could just gather a bunch of biomarkers that are positively correlated with a cure for that disease.

Then, you just test plausible chemical combinations that strengthen this “positive correlation”. Trial and error is your friend. At some point, you should have a new drug that cures the disease.

Now, say that you wish to analyse whether the price of your new drug would have any effect on your company’s stock market price, or vice versa. You perform a “linear regression” analysis on the two variables (based on some past data).

This is standard practice in the biz, and your analysis shows that the price of your drugs and your company’s stock market price are uncorrelated. So, you could go ahead and price your latest drug without regard to your company’s stock market price. Pretty cool, right?

Well, not quite. The results of your drug development method as well as your linear regression analysis might be misleading you. And your misunderstanding of the word “correlation” is to blame.

This essay aims to correct that. We will start with the history of how correlation was discovered. After that, we will swing back and forth between regression and correlation with a slight touch of analytic geometry.

It would help if you have already read my essay on regression. By the end of this essay, you should have a better understanding of both correlation and regression, which in turn should help you avoid the common pitfalls. Without any further ado, let us begin!

Regression Vs. Heredity: Galton’s Genius

Our story begins with the renowned polymath, Francis Galton. Among his many adventures, he was studying the nature of heredity. He had already figured out that children of tall parents tended to be taller than average, but still shorter than their parents.

On the other hand, children of short parents tended to be shorter than average, but still taller than their parents. Galton termed this phenomenon ‘regression’. In essence, he had discovered regression to the mean.
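To make regression to the mean concrete, here is a minimal simulation sketch. It is a toy model and not Galton's data: the average height (68 inches), the spread (2.5 inches) and the `HEREDITY` weight (0.5) are all assumptions picked purely for illustration.

```python
import random

# Toy model, NOT Galton's data. All three constants are illustrative assumptions.
MEAN = 68.0      # assumed average height in inches
SD = 2.5         # assumed spread of heights in inches
HEREDITY = 0.5   # assumed share of the parent's deviation passed to the child

rng = random.Random(0)  # fixed seed so the run is repeatable


def simulate_families(n):
    """Return a list of (parent_height, child_height) pairs."""
    families = []
    for _ in range(n):
        parent_dev = rng.gauss(0.0, SD)
        # The rest of the child's deviation is chance, scaled so that
        # children end up with the same overall spread as parents.
        chance = rng.gauss(0.0, SD) * (1.0 - HEREDITY ** 2) ** 0.5
        child_dev = HEREDITY * parent_dev + chance
        families.append((MEAN + parent_dev, MEAN + child_dev))
    return families


def average(values):
    return sum(values) / len(values)


families = simulate_families(20_000)

# Families whose parent is more than one SD above / below the average.
tall = [(p, c) for p, c in families if p > MEAN + SD]
short = [(p, c) for p, c in families if p < MEAN - SD]

tall_parents = average([p for p, _ in tall])
tall_children = average([c for _, c in tall])
short_parents = average([p for p, _ in short])
short_children = average([c for _, c in short])

print(f"tall parents:  {tall_parents:.1f}  their children: {tall_children:.1f}")
print(f"short parents: {short_parents:.1f}  their children: {short_children:.1f}")
```

In this sketch, the children of tall parents should land between the overall average and their parents, and the children of short parents should do the same from the other side.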

But he wasn’t satisfied with just that. He wanted to quantify the effect of regression. Clearly, it wasn’t just regression which decided a child’s height. Heredity also played a role. But how much did each affect the final height? That’s pretty much the question he was trying to answer.

Correlation: Galton’s Ellipse

Galton collected parent-child height data and created a table that mapped parents’ heights to their adult children’s heights. I’m skipping the fine details of how exactly he compiled it, but this suffices for the purposes of this essay.

He noticed something peculiar: an elliptical shape that did not appear random. Here is a diagram he published based on his data in his 1886 paper “Regression Towards Mediocrity in Hereditary Stature”:

Correlation: One Of The Most Misunderstood Concepts In Science - An image showing Galton’s diagram with a table which features a map between parents’ heights and adult children’s heights. One can see an elliptical shape emerging around regions of the data that feature constant density.
Galton’s correlation diagram - image from WikiCC

This diagram reveals the interplay between heredity and chance. In the forthcoming section, we will briefly see how variations in both of these variables could affect this interplay.

Galton then decided to plot each parent-child height pair as a point in a two-dimensional plane, with parents’ heights on the horizontal axis and children’s heights on the vertical axis. Mind you, back then Cartesian graphs were not remotely close to the norm.

By doing this, Galton had in fact invented what we know today as scatter plots. While we are on the topic of scatter plots, why don’t we see how heredity and chance affect what the plot looks like?


Introduction to Correlation via Scatter Plots

Let us now say that chance has no role to play in Galton’s parent-child dataset. Heredity alone then governs the adult child’s height, and every child would be exactly as tall as the parent. In such a case, the scatter plot would look like this:

Correlation: One Of The Most Misunderstood Concepts In Science - An illustration showing a graph with parents’ height on the x-axis and child’s height on the y-axis. The plot appears to be a straight diagonal line from the lower-left to the upper-right. The points are denser towards the centre than at the periphery.
Scatter plot for fully deterministic behaviour - illustration created by the author

It is no wonder that we see just points along a diagonal line here, because we have a situation where (x = y). Having said that, note that the points are sparser toward the peripheries than in the middle, since very tall and very short parents are rarer than average ones.

Now, let us say that heredity has no role to play in Galton’s parent-child dataset. In this realm, chance governs the outcome 100%. The corresponding scatter plot would look like this:

Correlation: One Of The Most Misunderstood Concepts In Science - An illustration showing a graph with parents’ height on the x-axis and child’s height on the y-axis. The plot appears to feature a random distribution of dots. The outline of the boundary dots roughly forms a circle.
Scatter plot for completely random behaviour - illustration created by the author

This scatter plot shows no affinity to the diagonal. In other words, the child’s height and parent’s height are independent of each other. So, regardless of the parent’s genes, the child’s characteristics are 100% luck of the draw (chance).

As you can imagine, both of these are extreme cases. The reality slots somewhere in between. Galton’s dataset led to a scatter plot that looked like this:

Correlation: One Of The Most Misunderstood Concepts In Science - An illustration showing a graph with parents’ height on the x-axis and child’s height on the y-axis. The plot appears to feature an elliptical distribution. The outline of the boundary dots roughly forms an ellipse.
Scatter plot depicting Galton’s correlation data - illustration created by the author

Here, we can clearly see the ellipse stretching from the bottom-left corner to the top-right corner. Galton was so methodical about confirming this result that he went to the trouble of concealing the data’s background (to remove prejudices) and consulting a mathematician to confirm his observation; old-school peer review, if you will.

Drawing Conclusions from the Scatter Plots

Comparing all three scatter plots, we could arrive at the following three empirical conclusions:

1. When the outcome is completely deterministic (that is, controlled 100% by heredity), the ellipse collapses into a straight (diagonal) line.

2. When the outcome is completely governed by chance, the ellipse expands to become a circle (roughly speaking).

3. When the outcome is governed by a mix of both, an ellipse of ‘some’ level of eccentricity results.

Galton termed the measure of this eccentricity (of the ellipse) correlation. Over time, the measure of correlation has been advanced by impressive contributions from people such as Karl Pearson. Today, we apply the concept of correlation to datasets that span multiple dimensions (a topic for another day).
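To see the three cases as numbers rather than pictures, here is a minimal Python sketch. It is a toy model, not Galton's actual data or his original method: it assumes normally distributed height deviations, uses a made-up `heredity` weight, and measures the outcome with the modern Pearson correlation coefficient (written out by hand as `pearson_r`).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def simulate(heredity, n=20_000, seed=0):
    """Toy parent/child height deviations in standard-deviation units.

    heredity = 1.0 means no chance at all, heredity = 0.0 means pure chance.
    Both the model and the parameter are illustrative assumptions.
    """
    rng = random.Random(seed)
    chance_weight = math.sqrt(1.0 - heredity ** 2)
    parents, children = [], []
    for _ in range(n):
        parent = rng.gauss(0.0, 1.0)
        child = heredity * parent + chance_weight * rng.gauss(0.0, 1.0)
        parents.append(parent)
        children.append(child)
    return parents, children


r_line = pearson_r(*simulate(1.0))     # heredity only: points on the diagonal
r_circle = pearson_r(*simulate(0.0))   # chance only: a round cloud
r_ellipse = pearson_r(*simulate(0.5))  # a mix of both: an ellipse

print(f"heredity only: r = {r_line:.3f}")
print(f"chance only:   r = {r_circle:.3f}")
print(f"mix of both:   r = {r_ellipse:.3f}")
```

Under these assumptions, the coefficient should sit at 1 for the straight line, near 0 for the circle, and somewhere in between for the ellipse.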

Back to our main challenge: Why are the methods you followed for your drug company misleading you? Let’s jump right into the answers.


Correlation: One of the Most Misunderstood Concepts in Science

You might have heard of the phrase “Correlation does not mean causation”. This phrase has almost become mainstream. What it means is that two correlated variables need not necessarily have causal links.

For instance, it could be that the biomarkers you have gathered are positively correlated with curing the said disease. However, it is NOT a given that they are causally linked. In other words, having the required biomarkers does not guarantee a cure for the said disease.

Since this issue has become more or less mainstream, many researchers are able to wrap their heads around reading correlation information without imposing causal biases. However, the intransitivity of correlation is something that still catches many researchers off guard.

To understand transitivity, let us consider the following relationship: (a > b > c). From this, we see that ‘a’ is greater than ‘b’ and ‘b’ is greater than ‘c’. Based on this, we could say for sure that ‘a’ is greater than ‘c’.

This property, where a relationship that holds between the first and second items and between the second and third items carries over to the first and third, is known as transitivity. “Greater than” relationships are transitive. However, correlation is NOT.

Here’s your situation: your new drug boosts the biomarker readouts. These biomarkers are in turn positively correlated with the disease’s cure. Based on this, the intuitive conclusion many would make is that the new drug helps cure the disease. But you see, this is NOT a given.
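A small numerical sketch shows how this can happen. The variables below (`drug`, `recovery`, `biomarker`) are hypothetical and generated at random; the construction `biomarker = drug + recovery` is an assumption chosen only because it makes the effect easy to see, not a claim about how any real biomarker behaves.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Hypothetical toy data: drug dose and recovery are drawn independently,
# so by construction they have nothing to do with each other.
drug = [rng.gauss(0.0, 1.0) for _ in range(n)]
recovery = [rng.gauss(0.0, 1.0) for _ in range(n)]

# The biomarker readout responds to both of them.
biomarker = [d + r for d, r in zip(drug, recovery)]

r_drug_marker = pearson_r(drug, biomarker)
r_marker_recovery = pearson_r(biomarker, recovery)
r_drug_recovery = pearson_r(drug, recovery)

print(f"drug vs biomarker:     r = {r_drug_marker:.2f}")
print(f"biomarker vs recovery: r = {r_marker_recovery:.2f}")
print(f"drug vs recovery:      r = {r_drug_recovery:.2f}")
```

In this toy setup, the first two coefficients should come out clearly positive (in theory 1/√2, roughly 0.71), while the third should hover around zero. The drug is correlated with the biomarker, the biomarker is correlated with recovery, and yet the drug tells you nothing about recovery.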

To drive home this point, consider a child that prefers ice-cream ‘a’ to ice-cream ‘b’. The same child prefers ice-cream ‘b’ to ice-cream ‘c’. Does this mean that the child prefers ice-cream ‘a’ to ice-cream ‘c’?

Anyone who has interacted with an ice-cream-loving picky child would not make that “assumption”. But wait, there’s more!

Uncorrelated Variables Are NOT Always Unrelated

In the introduction, we saw that the price of your drugs and your company’s stock market price were uncorrelated. This might well be the case, but that does NOT mean that they are unrelated.

The dirty little secret of most of the regression/correlation analyses conducted in scientific research is that they look for linear relationships. This is a conceptual simplification. Not all relationships are linear, and not all uncorrelated variables are unrelated.

A non-linear relationship between your drug prices and your company’s stock market price could take you for a ride if your decisions ignore the possibility that these two variables turn out to be correlated in the future, or outside the range covered by your dataset.

A linear analysis that says two variables are uncorrelated simply says that they do not have a linear relationship. Even so, many researchers still fall into the trap of concluding that the two variables in question are independent of each other.
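Here is a minimal sketch of that trap. The numbers are made up: `price_change` is a hypothetical list of symmetric price moves, and `stock_move` is assumed to be its square, so the second variable is completely determined by the first.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: price changes from -10 to +10, symmetric around zero.
price_change = [float(x) for x in range(-10, 11)]

# Assumed non-linear link: the stock reacts to the SIZE of the price change,
# not its direction. Knowing the price change fixes the stock move exactly.
stock_move = [p * p for p in price_change]

r = pearson_r(price_change, stock_move)
print(f"correlation between price change and stock move: {r:.6f}")
```

The coefficient comes out at essentially zero even though one variable is an exact function of the other. A linear measure simply has no way to see a relationship shaped like a parabola.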


References and credit: Francis Galton and Jordan Ellenberg.
