Statistics in SQL: Kendall’s Tau rank correlation

Statistical calculations in SQL are often perfectly easy to do. SQL was designed to be a natural fit for calculating correlation, regression and variance on large quantities of data. It just isn't always immediately obvious how. In the second of a series of articles, Phil factor shows how calculating a non-parametric correlation via Kendall's Tau or Spearman's Rho can be stress-free.

Kendall’s Tau rank correlation is a handy way of determining how correlated two variables are, and whether this is more than chance. If you just want a measure of the correlation then you don’t have to assume very much about the distribution of the variables. Kendall’s Tau is popular with calculating correlations with non-parametric data. Spearman’s Rho is possibly more popular for the purpose, but Kendall’s tau has a distribution with better statistical properties (the sample estimate is close to a population variance) so confidence levels are more reliable, but in general, Kendall’s tau and Spearman’s rank correlation coefficient are very similar. The obvious difference between them is that, for the standard method of calculation,  Spearman’s Rank correlation required ranked data as input, whereas the algorithm to calculate Kendall’s Tau does this for you.  Kendall’s Tau consumes any non-parametric data with equal relish.

Kendall’s Tau is easy to calculate on paper, and makes intuitive sense. It deals with the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs of rankings. All observations are paired with each of the others, A concordant pair is one whose members of one observation are both larger than their respective members of the other paired observation, whereas discordant pairs have numbers that differ in opposite directions. Kendall’s Tau-b takes tied rankings into account.

So, let’s imagine that we ask two observers to rank ten glasses of Loire valley wine in order. (These figures aren’t real since I prefer Vouvray and Gros Plant du Pays Nantais)

Wine

Phil

Tony

Menetou-Salon

1

3

Muscadet-Sèvre et Maine

2

4

Quincy

3

2

Reuilly

4

5

Muscadet-Coteaux de la Loire

5

11

Chinon

6

6

Muscadet-Côtes de Grand Lieu

7

12

Vouvray

8

9

Tourraine

9

8

Anjou-Saumur

10

10

Sancerre

11

1

Pouilly-Fumé

12

7

We’ve sorted the rankings in terms of my choices, so we can compare the two. In SQL, we do this anyway.

…and that gives us

Notice that we are not concerned with the individual wines, only with the extent to which the two judges agree in their subjective rankings of the wines.

Using the same data, we can also calculate Spearman’s Rho.

which gives

You can convince yourself that the calculation for Kendall’s Rho doesn’t  need ranked data. You can pretend that, instead of ranking their wines in order, Tony and I scored each wine on a zany scale between 1 and 100. Here is this un-ranked data.

You’ll see that, with this un-ranked data, Kendall’s Tau gives precisely the same result.

Now what if we couldn’t really distinguish some of the wines and gave them tied rankings? What if I put Menetou-Salon and Muscadet-Sèvre et Maine first in my ranking? Well, Spearman’s rho wouldn’t work. We now have to use a variety of Kendall’s tau called Tau-B. Kendall’s Tau is pretty resilient to ties, but Tau-B is better where there are ties.

… Which gives…

So there we have it. If your data really can’t be called parametric, then Kendall’s Tau is pretty useful. Your data doesn’t have to be rank orders, the statistic will consume all manner of data. Data scientists like it because it irons out those pesky outliers that chance throws at you. I’m wary of using it without understanding the data properly, because it is easy to draw unwarranted conclusions from it, but otherwise it is a useful inference tool.

See also the first in this series

and the subsequent..

Tags:

  • 2876 views

  • Rate
    [Total: 5    Average: 5/5]
  • Joe Celko

    I took my statistics courses in college, back in the day when we assumed everything had a normal distribution. The packages available were SPSS, SAS and Osiris. I learned SPSS because it was written in Fortran. I was the Fortran programmer. My first job as an actual statistician, however, was in the Georgia prison system doing research. People who go to a southern state prison have self-selected the worst possible way. About one third are legally retarded, and they often think that the inkblot test and the Rorschach series all look like dead animals or people (the old psychiatrist joke about, “Gee Doc, I know I’m supposed to tell you it’s a beautiful butterfly, but we both know it’s a dog with his head split open.” Was used as a gag in the original watchmen comics.

    We quickly learned nonparametric statistics, because nothing was normally distributed. Perhaps if we had had a seriously skewed to the bottom distribution, we could have done better. While I like to tell people at some of the best years of my life were spent in Georgia prisons (Georgia prison movies were actually a subclass of B movies in the 1950 – 1960 time frame), it is actually true. I did more calculus trying to get distributions and sampling sizes than I did when I was doing road design and bridges.

    Now in my old age, I am a volunteer judge for the local Science Fest. The sad part is, the kids often have good projects, but have no idea what to do with their data. For example, in the social sciences categories they very often compare “before and after” affects on the same subjects. But all the produce is pie charts or bar graphs because it’s easy to do a spreadsheet. They don’t know about a Wilcoxson statistic that would be a better measure of how much difference there was in whatever they’re measuring. We really need to teach just basic descriptive statistics at the high school level. The kids can log into websites that will let them just fill in their data push a button and get various stats out, without having to write any code at all! But if you don’t know how to use the statistics, it might as well be in Chinese (for Non Chinese speakers).

    rant, rant, rant.

  • I used to find the old The Kruskal–Wallis test by ranks, or H test very useful for testing whether samples originate from the same distribution. It stops the critics dead in their tracks because it makes so few assumptios about the nature of the distribution of data. If I spot a flicker of interest, I’ll show how to do it in SQL. It is certainly true that there are many people who call themselves data scientists who make entirely unwarranted claims about the significance of their tests. A/B testing? Bah!