Statistics in SQL: Simple Linear Regressions

Although linear regressions can get complicated, most jobs involving the plotting of a trendline are easy. Simple Linear Regression is handy for the SQL Programmer in making a prediction of a linear trend and giving a figure for the level probability for the prediction, and what is more, they are easy to do with the aggregation that is built into SQL.

The series so far:

  1. Statistics in SQL: Pearson’s Correlation
  2. Statistics in SQL: Kendall’s Tau Rank Correlation
  3. Statistics in SQL: Simple Linear Regressions
  4. Statistics in SQL: The Kruskal–Wallis Test

Let’s imagine that we have two variables, X and Y which we then plot using scatter graph.

C:\Users\andrew.clarke\SkyDrive\Documents\scatterPlot.gif

It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.

Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.

It gives the result…

Alpha was calculated by this expression, …

… Beta was calculated thus.

… and Rho by this

The Pearson’s Product Moment correlation can be used to calculate the probability of this being a significant correlation via the Fisher Transformation.

The slope (beta) and intercept (alpha) can be used to calculate any point on the trendline.

We can, for example calculate the value of Y when X is 100 by the equation
y’ = a + bx

Is, in our example

We can calculate the trendline X,Y points by slightly altering the SQL

We can then plot a trend line

C:\Users\andrew.clarke\SkyDrive\Documents\scatterPlot.gif

We are, of course, assuming a linear relationship and parametric data. With Pearson’s Rho and the number of X-Y pairs, we can go on to estimate the fisher transformation (t= 1.912) and probability of 0.061825.

The PowerShell to do the Gnuplot graphs is here

 

See also the previous in this series

Tags:

  • 6017 views

  • Rate
    [Total: 6    Average: 5/5]
  • Thank you for this article series that brings to the mind that one sometimes can actually compute some things with numbers other than just sorting and filtering them.

    However, I must point out that your method of calculating sums of squares

    SELECT SUM([@OurData].x * [@OurData].x) AS Sxx,

    is *numerically unstable*, meaning that this calculation is subject to severe rounding error problems. This is known since the early 1960ies. The proper way to compute a sum of squares is either to use a two-pass algorithm, first calculating the mean, say Sm = Sum(x) / Count(x), then

    SELECT SUM((x – Sm) * (x – Sm)) AS Sxx,

    or to use a “provisional means” algorithm that does the same in one pass.

    Moreover, you should point out that you have to exclude from the calculation all pairs (x, y) where either x is Null or y is Null, or the means will be incorrect.

    I’m sorry that I don’t have the proper academic references at hand right now. I’m just tired to see people making the same basic errors time and again. To put it bluntly: Rounding errors can kill people, and they have done so in the past. Just google for “patriot missile rounding error”, and there have been others.

    I have the very highest respect for you as an SQL expert, but statistics and numeric calculations are wholly different fields, and it is not appropriate to just translate a bunch of formulas from a statistics text book to the SQL programming language without the proper context..

    Matthias Kläy
    Kläy Computing AG
    http://www.kcc.ch