Click here to monitor SSC
Av rating:
Total votes: 41
Total comments: 8


Joe Celko
Causation, Correlation and Crackpots
15 September 2009

Joe Celko explores the dangers of muddling correlation and causation, emphasises the importance of determining how likely it is that a correlation has occurred by chance, and gets stuck into calculating correlation coefficients in SQL. Along the way, Joe illustrates the consequences of leaping to the wrong conclusion from correlations with tales of Pop Dread.

Correlation is not Causation

Database people are often not data mining people.  Oh, we can do a few simple descriptive statistics because they are built into SQL products these days; and we, at least, can produce an average with AVG() when called upon.  But the real purpose of data is to become information.  We want to have some idea at a higher level as to what the data means.  The ultimate goal is prediction of the future by understanding the present.  We want to know cause and effect. 

People forget, or don’t know, that correlation is not causation.  A necessary cause is one which must be present for an effect to happen -- a car has to have gas to run.  A sufficient cause will bring about the effect by itself -- dropping a hammer on your foot will make you scream in pain, but so will having your hard drive crash.  A contributory cause is one which helps the effect along, but would not be necessary or sufficient by itself to create the effect.  There are also coincidences, where one thing happens at the same time as another, but without a causal relationship. 

'A correlation between two
 measurements, say X and Y,
 is basically a formula that
allows you to predict one
measurement given the
other, plus or minus some
error range'

A correlation between two measurements, say X and Y, is basically a formula that allows you to predict one measurement given the other, plus or minus some error range.  For example, if I lock a cannon at a certain angle and keep shooting it with the same amount of gunpowder, I could expect to place the cannon ball, within, say, a two meter radius of the target most of the time.  Once in awhile, the cannonball will be dead on target and other times it could be several meters away, but most of the time I would be within my target area.   But how well do the angle and the weight of the gunpowder predict hitting the target? 

We like linear expressions and polynomials because the math is easier, but life is not always that kind.  You always start with a simple linear correlation; if you are lucky, it works and if not then you have a start.  The strength of the prediction is called the coefficient of correlation.  There are several kinds, but Pearson’s r is the most common one, and is denoted by the variable r where (-1 <= r <= 1).  A coefficient of correlation of 'negative one' is absolute negative correlation; when X happens, then Y never happens.  Positive one is “cause and effect” or perfect correlation; when X happens, then Y also happens.  Zero coefficient of correlation means that X and Y happen independently of each other. 

In the real world, you expect stronger correlations when you deal with physics, machinery and systems without a lot of variables.  But when you deal with people, if you get r = 0.7 or better then you need to look for reasons.  You are probably dealing with a restricted population and cannot generalize. 

Why variables associate

There are five ways two variables can be related to each other. 

  • The truth could be that X causes Y.  You can estimate the chirp rate of a cricket from the temperature in Fahrenheit using the formula (chirps = 3.777 * degrees - 137.22), with r = 0.9919 accuracy.  However, nobody believes that crickets cause temperature changes. 
  • The truth could be that Y causes X, case two.  This is where your get a logical fallacy and superstitions.  American supply planes in the Pacific theater in World War II would drop cargo via parachute on primitive islands where the US and Japan had military bases during and after World War II.  The Islanders went from the Stone Age to the 20-th century overnight; medicine, canned food, canvas tents, steel tools, and sometimes weapons are very impressive.  And even more so when they drop from the sky.  Local religions grew up around imitating what they had seen the soldiers do.  They carved wooden headphones to talk to the planes, built control towers, learned to wave landing signal flags and lit signal fires on runways.  Every now and then at first signal fires worked by tricking an off-course cargo plane to make a drop.  This behavior is called “sympathetic magic” and appears in many cultures.  You mimic the effects to get the cause.  The war ended and the airbases were abandoned, so the “Manna from Heaven” stopped.  But even today, the John Frum cult is still active on the island of Tanna, Vanuatu. 
  • The third case is that X and Y interact with each other.  Supply and demand curves are an example where as one goes up, the other goes down (negative feedback in computer terms).  A more horrible example is drug addiction where the user requires larger and larger doses to get the desired effect (positive feedback in computer terms), as opposed to a habituation where the usage hits an upper level and stays there. 
  • The fourth case is that any relationship is pure chance.  Any two trends in the same direction will have some correlation, so it should not surprise you that once in awhile, two will match very closely.  The Latin name for this fallacy is “Post hoc ergo propter hoc”; it means "after this, therefore because of this", and it is best seen in a place with gambling.  The dice do not know that you are wearing your lucky socks.  But if all you remember is you won when you wore the lucky socks and/or lost when you did not, then you create a superstition for yourself. 
  • The final case is where the two variables are effects of another variable which is outside the study.  The most common unseen variables are changes in a common environment.  For example, severe hay fever attacks go up when corn prices go down.  They share a common environmental element -- good weather.  Good weather means a bigger corn crop and hence lower prices, but it also means more ragweed and pollen and hence more hay fever attacks.  Likewise, spouses, who live pretty much the same lifestyle, will tend to have the same medical problems from a common shared environment and set of habits. 

If you assume the wrong one of these five, you can create superstition. Even experts get it wrong and cause waves of panic in popular culture. I was looking at some old magazine columns of mine from ten years ago.  The more things change, the more they stay the same.  The “Fear Du Jour” back then was “second hand” cigarette smoke and cell phones giving you brain cancer; today we have Global Warming (which replaced the previous “Pop Dread” of a new Ice Age) and a second Great Depression.  I remember a study done on popular fears which used the term “Pop Dread” and gave a list of cycles that Western media culture goes through; the author classified them by financial, ecological, cultural, medical and so forth.  Each had a regular period of so many years when it would dominate the popular press.  There would always be experts who had been warning us about the new danger and a ton of statistics to prove they were right. 

Pop Dread

The 1994 March issue of DISCOVERY magazine had commentary columns entitled "Counting on Dyscalculia" by John Allen Paulos, the man who coined the word "innumeracy" in his 1990 best seller of the same title.  His particular topic was health statistics since those create a lot of "pop dread" when they get played up in the media. 

One of his examples in the article was a widely covered lawsuit in which a man alleged a causal connection between his wife's frequent use of a cellular phone and her subsequent brain cancer.  But brain cancer is a rare disease that strikes approximately 7 out of 100,000 people per year.  Given the large population of the United States, this is still about 17,500 new cases per year and that number has held pretty steady for years. 

There are over 10 million cellular phone users in the United States.  If there were a causal relationship, then there would be an increase in cases as cellular phone usage increased.  In fact, if we found that there were less than 7000 cases among cellular phone users we could use the same argument to "prove" that cellular phones prevent brain cancer. 

In 2005, a Danish study looked at the cell phone usage of brain tumor patients and concluded that the two were unlikely linked.  More recent studies in France and Norway concluded the same thing.  London's Institute of Cancer Research study showed that there is no risk of developing brain cancer from cell phone usage.

The study investigated 966 brain cancer patients, and showed that there was no risk of gliomas, a serious type of brain cancer, when using a cell phone over time.  The nice part of such a study is that you have all the data in the billing records – first usage, years of ownership, the cumulative number of calls and the hours of use were all there.  They also looked at the type of cell phone since earlier cell phones used higher power analog signals compared to modern digital phones.  It did not matter; there was no increased risk of developing brain cancer. 

The original version of this was that living near power lines was a cause of cancer.  That hysteria started with one flawed epidemiogical study in 1979.  Hundreds of follow up studies found no correlation at all.  But a whole industry was built on re-fitting houses to shield them from EMF (Electro Magnetic Fields) and selling expensive detectors.  Myself, I just add another layer of aluminum foil to my hat I use to keep the Martians from reading my brain waves.  It is a nice simple engineering solution that does not involve government intervention in my life. 

Logic does not stop anecdotal stories or exceptional cases from over-ridding math.  The current line is that Senator Edward (“Ted”) Kennedy’s lethal brain cancer could have been prevented if he had not used his cell phone so much.  Celebrities carry a lot of weight in the popular press, so if you love the Kennedy family, you will fight brain tumors by opposing cell phones and sending me money.  Senator Arlen Specter was diagnosed and treated for a brain cancer in 1993 wants Senate hearings on the matter in September.  This will give us a run of people with brain cancers, those who have lost loved ones to the disease (actually there are many different kinds, but don’t let that ruin a good story) and a host of “cell phone danger” experts on television. 

Not exactly logic, but if you scream it loudly and dance around, you can gain political power and funding.  So the question becomes how do we cheat without outright lying?   The confidence level is related to the coefficient of correlation, but it is expressed as a percentage.  It says that x% of the time the relationship you are seeing would not happen by chance.  It also gives a plus or minus percentage of error.  You have heard that used in surveys that are quoted on television and radio.  The bigger the sample (assuming it is a true random sample), the smaller that plus or minus interval. 

The Confidence Level

A 99% confidence level is the standard for serious scientific studies.  A 95% confidence level is the one used in industry, surveying and such things.  A 90% confidence level doubles the chance of errors and lets you slip it a lot of data to meet your agenda.  Below that, you are using garbage stats for exploring or fraud. 

If you would like to play around with the concept, go to http://www.raosoft.com/samplesize.html.  This company has made survey software for many years and their website has an on-line sample size calculator.  You provide the confidence level, population size and some other simple values.  Push the button and get the minimum sample size you need back. 

Calculating Pearson's r

So, after arming you with several cautions,  here is a simple correlation in SQL. There are many ways of calculating correlation which will not give the same coefficient for the same data, but are all valid measures of the strength of association, and you need to choose the right one depending on the nature of the data. One of the most useful ones is Pearson's Product Moment Coefficient (Pearson's r), or the linear correlation coefficient.  It measures the strength of the linear association between two variables.  It is only possible to draw clear inferences from a correlation based on sample data if the data distribution has bivariate normality and is of a reasonable size

The formula translates into SQL in a straightforward manner. 

CREATE TABLE Samples

(sample_name CHAR(3) NOT NULL PRIMARY KEY, x FLOAT NOT NULL, y FLOAT NOT NULL);

 

INSERT INTO Samples

VALUES ('a', 1.0, 2.0),

       ('b', 2.0, 5.0),

       ('c', 3.0, 6.0);

 

-- r = 0.9608

 

DECLARE @x_avg FLOAT, @y_avg FLOAT;

 

SELECT @x_avg = AVG(X), @y_avg = AVG(y) FROM Samples;

 

SELECT SUM((x - @x_avg)*(y - @y_avg))

       / SQRT(SUM(POWER((x - @x_avg), 2)) * SUM(POWER((y - @y_avg), 2)))

       AS pearson_r FROM Samples;

 

 

But you can use an alternative method that gets round needing to know the averages before you do the aggregation. In the second version, the aggregation is only done once

SELECT ((N*SumXY)-(SumX*SumY))/(SQRT(N*SumX2-SumX*sumX)*SQRT(N*SumY2-SumY*sumY))

 FROM   (SELECT SUM(x) AS SumX,

                SUM(Y) AS sumY,

                SUM(X*X) AS SumX2,

                SUM(Y*Y) AS sumY2,

                SUM(X*Y) AS sumXY,

                COUNT(*) AS N

         FROM   samples

        ) result;

Calculating Spearman's Rank Order Correlation

Spearman's Rank Order Correlation coefficient is devised to measure the correlation between ordinal rankings, and is much used for checking how well the judges of any subjective assessment such as the quality of ballroom dancing or the breeding of dogs actually agree. Imagine we have two judges ranking the quality of flower arrangements. Each judge ranks the flower arrangements in order. You can then calculate the correlation as the measure of how well the judges agree

CREATE TABLE FlowershowRankings

(arranger_name VARCHAR(10) NOT NULL PRIMARY KEY,

 first_judge_ranking INTEGER NOT NULL,

 second_judge_ranking INTEGER NOT NULL);     

 

INSERT INTO FlowershowRankings

VALUES ('Bob', 1, 2),  

       ('Andrea', 3, 1),

       ('Katie', 2, 3),

       ('Richard', 4, 4), 

       ('Tony', 5 , 6),

       ('Chris',6 , 5),

       ('Alice', 7, 7);

 

--The  Spearman Correlation Coefficient is:

 

SELECT  1 - (6* SUM(POWER(first_judge_ranking - second_judge_ranking, 2)))

         / CAST((COUNT(*) * (POWER(COUNT(*), 2)-1)) AS FLOAT) AS spearman_rho

  FROM FlowershowRankings;

 



This article has been viewed 8158 times.
Joe Celko

Author profile: Joe Celko

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 41 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: You missed one
Posted by: Phileosophos (not signed in)
Posted on: Monday, September 21, 2009 at 9:39 PM
Message: First, let me compliment you on the wonderfully thoughtful approach you've taken in your writing. Second, let me observe that you missed one of hte ways in which two variables may be related.

Consider X and Y in the context of polygons such that X refers to the number of sides and Y refers to the number of interior angles. And for sake of discussion, consider the specific case of the triangle. In this case, the five relationships you suggest are inadequate.

The relation of the having of three sides (trilaterality) to the having of three internal angles (triangularity) is one of essential predication; not causation, interaction, chance, or the result of external factors. Trilaterality is necessarily conjoined with triangularity at the metaphysical level, hence the name "essential predication".

Subject: Nope; it was out of the Universe of discourse
Posted by: Anonymous (not signed in)
Posted on: Monday, September 21, 2009 at 10:02 PM
Message: In stats, I would call it a 1.0 correlation; in Plane Geometry, I would call it a definition.

Opps! I just moved to sphere!! The sum of the angles of a triangle > 360 degrees!

Damn! I just got onto a "saddle" plane and the angles of a triangle < 360 degrees!

You are confusing FORMAL systems with data.


Subject: Could do better
Posted by: Anonymous (not signed in)
Posted on: Tuesday, September 22, 2009 at 2:29 AM
Message: Nice article. But can interweb people please stop stealing article ideas from books like: Risk - the science and politics of fear, Black Swan, etc etc

Subject: Re: Could do better
Posted by: Phil Factor (view profile)
Posted on: Tuesday, September 22, 2009 at 9:29 AM
Message: When I trained in Statistics in the late sixties, the illustrations given in books and lectures on correlation and causation were very similar to Joe's, (not surprising, as he trained in Statistics at around the same time) though the examples were of their time, and used different cancer 'pop dread' examples. Recently-published books such as 'Risk - the science and politics of fear' are certainly not the sole source of examples of the way that scientists exploit the general confusion between correlation and causation. It is a well-worn theme.

Subject: "interweb people" ???
Posted by: Anonymous (not signed in)
Posted on: Tuesday, September 22, 2009 at 12:56 PM
Message: hahaha do you even have the slightest idea who Joe Celko is? or do you have some kind of hidden agenda? or with all due respect have you simply flat-out lost your mind? Joe Celko does not need to "steal" article ideas.

Subject: Good job
Posted by: Adam Machanic (view profile)
Posted on: Tuesday, September 22, 2009 at 4:25 PM
Message: A very enjoyable piece. Good job, Joe!

Subject: Great article. Can I make 3 suggestions, Joe?
Posted by: LiveWare (view profile)
Posted on: Monday, October 19, 2009 at 8:42 AM
Message: Joe, this is a terrific article. If there's one concept I try to help my users understand about data analysis, this is it!

Unfortunately, I think you blew it on 3 points:

1. Second-hand Smoke: The jury is still out on this one, but the most recent studies lean more heavily toward the conclusion that yes, there is a damaging effect from it. Using an ambiguous example only underscores the fact that....

2. Uncertainty Exists in a Real World: I know, this is obvious, right? But users, politicians, voters, and V.P.s want ANSWERS, don't they?? Clear, irrefutable, unambiguous answers.

Sometimes, we just don't know, and I think your piece would be more effective if you reminded folks that this is O.K. Even the issue about cell phones and cancer just hit the news, suggesting that the most rigorous of the studies may indeed show a causal link.

3. Pearson's R: This is an overused measure for measuring correlation, and you forgot to mention the assumptions inherent in its use (there is a linear relation between x and y, the data are representative of the population being measured, etc.). I get too many people quoting this calculated statistic as if if were "truth."

My first reply to them: "Have you looked at the data? Boxplots? Scatterplots? Why is it you think there _should_ be a linear relationship here?"

As for Pop Dread.... well, we were once told that the heliocentric model of the solar system, plate tectonics, and the germ theory of medicine were heretical concepts. I'll reserve judgment. Time is a potent truth filter.

Finally, here's a great cartoon about this subject:
http://xkcd.com/552/

Thanks again!

Subject: Favorite correlation
Posted by: Julie9 (view profile)
Posted on: Tuesday, October 20, 2009 at 3:05 PM
Message: My favorite correlation is the very strong positive correlation between shoe size and reading ability in elementary school children.

Why is this so? Because older children are both generally bigger and in higher grade levels.

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...

SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk