
Getting normally-distributed random numbers in TSQL

Published 7 February 2014 6:17 pm

When you are creating test data, you will, at some stage, hit the problem of simulating a distribution. There are plenty of different distributions. In the real world, variables such as height, rainfall or examination results conform to a Normal or Gaussian distribution, where the mean equals the median and values are most likely to fall close to it. The natural world is full of examples of variables that conform to the normal distribution, and exhibit the easily-recognised bell-curve when graphed by frequency. This is because the final value is the sum of a number of factors that are each subject to chance variation.

Were you to have, for example, a cat database that included their weights, and wanted to get some data in there to simulate real data, you might think of using the RAND() function. This would give you a 'flat' or uniform distribution, with microscopic cats appearing at the same frequency as cats the size of tigers, and an equal chance of everything in between. This is quite unlike the experience of a normal vet's practice. No: you would probably want a normal distribution, though if you were specialising in cats with weight problems it would be skewed.

In a normal distribution, the shape of the bell-curve is determined by the standard deviation of the variable. This is the dispersion from the mean within which 68.2% of all observations will fall. Double that range and just over 95% of observations will fall within it (this is the classic 'margin of error'). If we have cats that range widely in weight, then the population will have a high standard deviation. The standard deviation of a column is easily got using the StDev() aggregate function.

So, to return to our tests for our cat database, we need to get the distribution of weights right, because the distribution of values in any indexed column is likely to be stored in a special database object and used by the query optimiser to choose the best strategy for getting the data that you request via a SQL query. If the distribution of your test data is different from that of your production data, then you are unlikely to get the same performance characteristics as in your production system.

Male cats average around 2.8 kg and females around 2.4 kg, and we guess at the standard deviation in both cases being around 0.5 kg. With real production data you don't have this guessing problem, because you can get the mean and standard deviation via a simple SQL call.
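Such a call might look like this. The Cats table and its Sex and Weight columns are hypothetical stand-ins for whatever your production schema actually uses:

```sql
--Get the mean and standard deviation of weight for each sex from production data
Select Sex,
       Avg(Weight)   as [MeanWeight],
       StDev(Weight) as [StDevWeight]
  from Cats
  group by Sex;
```

One row per sex comes back, giving you the two figures you need for each population.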

In the following SQL, we have used the values for male cats. We can now generate the data for them:

Declare @RandomNumbers table (number float) --a simple table of numbers
Declare @Mean float --the mean (and median if normally distributed)
Declare @StandardDeviation float --the standard deviation you want
Declare @ii int --counter
Declare @TotalWanted int --how many numbers to generate

Select @Mean=2.8, @StandardDeviation=0.5, @ii=1, @TotalWanted=10000;

/* now we generate the numbers we want, with the mean and standard deviation we want */
while @ii <= @TotalWanted
  begin
  insert into @RandomNumbers (number)
    select ((rand()*2-1)+(rand()*2-1)+(rand()*2-1))*@StandardDeviation+@Mean;
  Select @ii = @ii + 1;
  end

--now we see what we've got and check its mean and standard deviation
Select count(*) as [count], avg(number) as [avg], stDev(number) as [stDev]
  from @RandomNumbers
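As a further sanity-check, we can verify the 68.2% rule mentioned earlier against the generated numbers. This sketch assumes the @RandomNumbers table variable from the listing above is still in scope, i.e. it is run in the same batch:

```sql
--What proportion of the generated values falls within one standard deviation of the mean?
Declare @avg float, @sd float;
Select @avg = avg(number), @sd = stDev(number) from @RandomNumbers;
Select 100.0 * Sum(case when number between @avg - @sd and @avg + @sd then 1 else 0 end)
         / Count(*) as [PctWithinOneSD]
  from @RandomNumbers;
--for a normally-distributed population this should come out at around 68.2
```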

 

And here are the summary statistics we get for the generated population:

 

count       avg                    stDev
----------- ---------------------- ----------------------
10000       2.79952921818089       0.697902053383761

…and the characteristic bell-curve

So we would have two populations, depending on the sex of the cat, and each time we wanted a value we'd use the SQL call with the correct mean and standard deviation. Three calls to RAND() will be the minimum to give you the normal distribution. You can add more, of course, to get it even finer, but this is good enough for me.
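The two-population idea can be sketched like this. The per-sex means and the standard deviation are the figures quoted above; the Sex variable and its 'M'/'F' coding are illustrative assumptions:

```sql
--Generate one weight for a cat of a given sex, using the means and SD quoted above
Declare @Sex char(1) = 'F';  --'M' or 'F'
Declare @Mean float = case @Sex when 'M' then 2.8 else 2.4 end;
Declare @StandardDeviation float = 0.5;
Select ((rand()*2-1)+(rand()*2-1)+(rand()*2-1)) * @StandardDeviation + @Mean as [Weight];
```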

There are, of course, plenty of other types of elliptical distribution, and then there is the Pareto distribution, which corresponds to the way wealth tends to be distributed. Most distributions will be skewed, so that the mean and median don't coincide. Beyond simple distributions such as these, the mathematics begins to get rather complex, so we'll leave it there for this blog. For more details on the various types of non-uniform distribution, see Dwain Camps' excellent article on the subject on SQL Server Central, called Generating Non-uniform Random Numbers with SQL.

8 Responses to “Getting normally-distributed random numbers in TSQL”

  1. Dwain.C says:

    Hey Phil. Did you know there’s a way to generate Normal RNs in SQL without a loop?

    http://www.sqlservercentral.com/articles/SQL+Uniform+Random+Numbers/91103/

  2. Dwain.C says:

    Oh sorry! Didn’t see that you already linked to that article at the end.

    Thanks for the plug!

  3. Phil Factor says:

    Yes, I did know, thanks. I wanted to keep the demo as simple as possible, in order to illustrate the technique as clearly as I could.

  4. dataminor says:

    Excellent article. My favorite part is the seemingly throwaway line “…Three calls to RAND() will be the minimum to give you the normal distribution…” I wish I’d worked that out years ago!

  5. richierocks says:

    A few thoughts:

    The principle of generating normally distributed numbers from several independent uniform ones is based on the Central Limit Theorem.

    Only using 3 uniform numbers gives a very ropey approximation. For example, the 75th %ile of the distribution should be 0.674; with this it will be somewhere around 0.706.

    You can improve the situation a bit by using more uniform random numbers. If you use m calls to rand(), then setup code looks like:

    DECLARE
    @m int = 10,
    @a float,
    @b float;

    SET @a = SQRT(12.0 / @m);
    SET @b = SQRT(3.0 * @m);

    Then your select statement looks like:

    SELECT (@a * (rand() + rand() + rand() + rand() + rand() + rand() + rand() + rand() + rand() + rand()) - @b) * @StandardDeviation + @mean;

    This is still not a great approximation however; no scientific software uses this algorithm for generating normal distributions.

    Which leads me to a question: why are basic stats tools not available by default on most databases?

    Normal random number generation is hardly cutting edge: the algorithms are at least as old as SQL itself.

  6. Phil Factor says:

    @richierocks Thanks a lot for the SQL! I use multiple rand()s for a closer approximation, (that’s what I meant by ‘Three calls to RAND() will be the minimum to give you the normal distribution’) but your SQL is much neater (and more accurate I guess). My excuse is that the simpler code was sufficient to match the Statistics histogram object in SQL Server – the objective of the blog.
    I very much agree that it is frustrating not to have stats tools available on databases. I think that there is a point where it makes more sense to do the work in R. However, simple stuff like regression and ANOVA would be great to have. What other stats tools would you want added to the SQL standard?

  7. Robert Young says:

    – I think that there is a point where it makes more sense to do the work in R.

    Ha!! See, now if you were using Postgres (or one of its spawn) you could just hook up to R with PL/R (inside the database, not just ODBC/whatever from R). SAP HANA (and, possibly, a couple of others; I forget at the nonce) has a proprietary hook, too ( https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDAQFjAA&url=https%3A%2F%2Fhelp.sap.com%2Fhana%2FSAP_HANA_R_Integration_Guide_en.pdf&ei=wIn-UvnUFeGh2gXHsYHgCg&usg=AFQjCNESDhZZhlZQCPF-2V07eip_4vNDkA&sig2=NqIGE9Zbk1n-I3ChgN599w&bvm=bv.61535280,d.b2I&cad=rja ); not that I’ve used that one.

    IBM integrates R (the Revolution Analytics version) with Netezza (PG spawn); SPSS otherwise, from what I can tell.

    Since the common (most places, standard) data structure is the data.frame, which is just a (flat) RDBMS table (which can be the result of a view operation), R is a natural extension.

    On the whole, hooks to dedicated stat packs seems more appropriate than attempting to get the SQL standard extended and implemented consistently across engines. We know how consistent SQL is now?

  8. Robert Young says:

    If one of the Editors strolls by, here’s a tiny-ed version of that mess (I was expecting that to happen automagically!): http://tinyurl.com/n364ao6

    And, I also forgot to mention that Oracle also has an R integration, Oracle Advanced Analytics Option, at about $30K.
