Joe Celko

Median Workbench

05 April 2009

/* The SQL Server database engine doesn't have a MEDIAN() aggregate function. This is probably because there are several types of median, such as the statistical, financial or vector medians. Calculating a median is essentially a row-positioning task, since the median is the middle value of an ordered result. Easy to do in SQL? Nope. Joe Celko explains why.

The workbench requires SQL Server 2008, but is easily modified for SQL Server 2005. */

 

/*

A Date for a Mock Feud

 

If you are an older SQL programmer, you will remember that back in the 1990s there were two database magazines on the newsstands, DBMS and DATABASE PROGRAMMING & DESIGN.  They started as competitors, but Miller-Freeman bought DBMS from M&T and kept publishing both titles.  Later, the two titles were merged into INTELLIGENT ENTERPRISE from CMP Publications, which was more management-oriented than either of the original magazines. 

 

I started a regular freelance column in DATABASE PROGRAMMING & DESIGN in 1990 and switched to DBMS in 1992 as an employee.  Chris Date took over my old slot at DATABASE PROGRAMMING & DESIGN.  There was a period when we did a version of the famous Fred Allen and Jack Benny mock feud from the classic era of American radio.  Chris would stake out a position on some topic in his column, and I would counter it in my next one.  It was great fun, it made people think and (most important) it sold magazines and kept me employed. 

 

One of the topics that bounced around the SQL community in those columns was how to compute the Median with SQL.  So Chris and I had to get involved in that and toss code around.  Today, the Median is easy to do, but it was a good programming exercise in its day. 

 

The 'Simple', or Financial, Median

 

The Median is not a built-in function in Standard SQL, like AVG(), and it has several different versions.  In fact, most descriptive statistics have multiple versions, which is why SQL stayed away from them and left that job to specialized packages like SAS and SPSS.  I will not even mention floating point error handling. 

 

The simple Median is defined as the value for which there are just as many cases with a value below it as above it.  If such a value exists in the data set, this value is called the statistical Median.  If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size such that all values in one half are lower than any value in the other half.  The Median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial Median.  In English, sort the data, cut it in the middle and look at the guys on either side of the cut.

 

The financial Median is the most common term used for this Median, so we will stick to it.  Let us use Date's famous Parts table, from several of his textbooks, which has a column for part_wgt in it, like this:

*/

CREATE TABLE Parts

(part_nbr VARCHAR(5) NOT NULL PRIMARY KEY,

 part_name VARCHAR(50) NOT NULL,

 part_color VARCHAR(50) NOT NULL,

 part_wgt INTEGER NOT NULL,

 city_name VARCHAR(50) NOT NULL);

 

INSERT INTO Parts (part_nbr, part_name, part_color, part_wgt, city_name)

VALUES ('p1', 'Nut', 'Red', 12, 'London'),

       ('p2', 'Bolt', 'Green', 17, 'Paris'),

       ('p3', 'Cam', 'Blue', 12, 'Paris'),

       ('p4', 'Screw', 'Red', 14, 'London'),

       ('p5', 'Cam', 'Blue', 12, 'Paris'),

       ('p6', 'Cog', 'Red', 19, 'London');

 

/*

First sort the table by weights and find the three rows in the lower half of the table.  The greatest value in the lower half is 12; the smallest value in the upper half is 14; their average, and therefore the Median, is 13.  If the table had an odd number of rows, we would have looked at only one row after the sorting.
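
To see this with the sample data:
*/

SELECT part_wgt
  FROM Parts
 ORDER BY part_wgt;
-- returns 12, 12, 12, 14, 17, 19: the cut falls between the third and
-- fourth rows, so the financial Median is (12 + 14) / 2.0 = 13

/*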

 

Date's First Median

 

Date proposed two different solutions for the Median.  His first solution was based on the fact that if you duplicate every row in a table, the Median stays the same.  The duplication guarantees that you always work with a table that has an even number of rows.  Since Chris and I are both strongly opposed to redundant data, this was an unusual approach for him to use.

 

The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions.  Here is a corrected version of his first solution:

*/

go

CREATE VIEW Temp1

AS SELECT part_wgt FROM Parts

   UNION ALL

   SELECT part_wgt FROM Parts;

go

CREATE VIEW Temp2

AS SELECT part_wgt

     FROM Temp1

    WHERE (SELECT COUNT(*) FROM Parts)

           <= (SELECT COUNT(*)

                 FROM Temp1 AS T1

                WHERE T1.part_wgt >= Temp1.part_wgt)

      AND (SELECT COUNT(*) FROM Parts)

            <= (SELECT COUNT(*)

                  FROM Temp1 AS T2

                 WHERE T2.part_wgt <= Temp1.part_wgt);

go

SELECT AVG(DISTINCT part_wgt * 1.0) AS median   -- * 1.0 avoids integer averaging

  FROM Temp2;

/*

This involves the construction of a table of duplicated values, which can be expensive in terms of both time and storage space.  The use of AVG(DISTINCT x) is important because leaving it out would return the simple average instead of the Median.  Consider the set of weights (12, 17, 17, 14, 12, 19).  The doubled table, Temp1, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19).  But because of the duplicated values, Temp2 becomes (14, 14, 17, 17, 17, 17), not just (14, 17).  The simple average is (96 / 6.0) = 16; it should be (31/2.0) = 15.5 instead.
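
Here is a quick check against the actual Parts data (my numbers, worked out by hand):
*/

SELECT AVG(part_wgt * 1.0) AS simple_avg,        -- 12.5
       AVG(DISTINCT part_wgt * 1.0) AS median    -- 13.0, the financial Median
  FROM Temp2;

/*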

 

Celko's First Median

 

I found that a slight modification of Date's solution avoids the doubled table, but it depends on a CEILING() function. 

*/

SELECT MIN(part_wgt)    -- smallest value in upper half

  FROM Parts

 WHERE part_wgt

       IN (SELECT P1.part_wgt

             FROM Parts AS P1, Parts AS P2

            WHERE P2.part_wgt >= P1.part_wgt

            GROUP BY P1.part_wgt

           HAVING COUNT(*)

                  <= (SELECT CEILING(COUNT(*) / 2.0)

                        FROM Parts))

UNION

SELECT MAX(part_wgt)    -- largest value in lower half

  FROM Parts

 WHERE part_wgt

       IN (SELECT P1.part_wgt

             FROM Parts AS P1, Parts AS P2

            WHERE P2.part_wgt <= P1.part_wgt

            GROUP BY P1.part_wgt

           HAVING COUNT(*)

                  <= (SELECT CEILING(COUNT(*) / 2.0)

                        FROM Parts));

 

--or using the same idea and a CASE expression:

 

SELECT AVG(DISTINCT CAST(part_wgt AS FLOAT)) AS median

  FROM (SELECT MAX(part_wgt)

          FROM Parts AS B1

        WHERE (SELECT COUNT(*) + 1

                 FROM Parts

                WHERE part_wgt < B1.part_wgt)

             <= (SELECT CEILING (COUNT(*) / 2.0)

                   FROM Parts)

       UNION ALL

       SELECT MAX(part_wgt)

         FROM Parts AS B

        WHERE (SELECT COUNT(*) + 1

                 FROM Parts

                WHERE part_wgt < B.part_wgt)

              <= CASE (SELECT (COUNT(*) % 2)

                         FROM Parts)

                 WHEN 0

                 THEN (SELECT CEILING (COUNT(*)/2.0) + 1

                         FROM Parts)

                 ELSE (SELECT CEILING (COUNT(*)/2.0)

                         FROM Parts)

                 END) AS medians(part_wgt);

/*

The CEILING() function makes sure that when there is an odd number of rows in Parts, the two halves overlap on the middle value. 
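
A quick check of the arithmetic:
*/

SELECT CEILING(6 / 2.0) AS half_of_even,   -- 3: the halves of six rows just meet
       CEILING(7 / 2.0) AS half_of_odd;    -- 4: the halves of seven rows share the middle row

/*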

 

Date's Second Median

 

Now the war was on!  Did you notice that we had to use a lot of self-joins in those days?  That stinks.  Date's second solution was based on Celko's Median, folded into one query:

*/

SELECT AVG(DISTINCT Parts.part_wgt) AS median

  FROM Parts

 WHERE Parts.part_wgt IN

        (SELECT MIN(part_wgt)

           FROM Parts

          WHERE Parts.part_wgt

                IN (SELECT P2.part_wgt

                      FROM Parts AS P1, Parts AS P2

                     WHERE P2.part_wgt <= P1.part_wgt

                     GROUP BY P2.part_wgt

                    HAVING COUNT(*)

                        <= (SELECT CEILING(COUNT(*) / 2.0)

                              FROM Parts))

        UNION

        SELECT MAX(part_wgt)

          FROM Parts

         WHERE Parts.part_wgt

               IN (SELECT P2.part_wgt

                     FROM Parts AS P1, Parts AS P2

                    WHERE P2.part_wgt >= P1.part_wgt

                    GROUP BY P2.part_wgt

                   HAVING COUNT(*)

                       <= (SELECT CEILING(COUNT(*) / 2.0)

                             FROM Parts)));

 

/*

Date mentions that this solution will return a NULL for an empty table and that it assumes there are no NULLs in the column.  But you have to remember that Chris Date, unlike Dr. Codd and SQL, does not use NULLs and thinks they should not be in the Relational Model.  If there are NULLs, the COUNT(*) will include their rows and throw off the computations.  You will need to get rid of them by counting the column itself, COUNT(part_wgt), instead.  Play with some sample data and see the differences. 
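
The DDL above declares part_wgt to be NOT NULL, but if NULLs could creep in, one simple guard (my own sketch, not part of the original workbench) is to run the counting queries against a VIEW that filters them out first:
*/

go
CREATE VIEW WeighedParts (part_nbr, part_wgt)   -- hypothetical helper VIEW
AS SELECT part_nbr, part_wgt
     FROM Parts
    WHERE part_wgt IS NOT NULL;
go

/*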

 

Murchison's Median

 

Others joined in the battle.  Rory Murchison of the Aetna Institute had a solution that modifies Date's first method by concatenating the key to each value to make sure that every value is seen as a unique entity.  Selecting the middle values is then a special case of finding the n-th item in the table.

*/

SELECT AVG(part_wgt * 1.0)   -- * 1.0 avoids integer averaging

  FROM Parts AS P1

 WHERE EXISTS

      (SELECT COUNT(*)

         FROM Parts AS P2

        WHERE CAST(P2.part_wgt AS CHAR(5)) + P2.part_nbr >=

              CAST(P1.part_wgt AS CHAR(5)) + P1.part_nbr

       HAVING COUNT(*) = (SELECT FLOOR((COUNT(*) + 1) / 2.0)

                            FROM Parts)

           OR COUNT(*) = (SELECT CEILING((COUNT(*) + 1) / 2.0)

                            FROM Parts));

-- NB: the CAST to CHAR(5) assumes the weights all have the same number of digits

/*

This method depends on being able to have a HAVING clause without a GROUP BY clause, which is part of the ANSI standard but often missed by new programmers. 
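
With no GROUP BY, the whole table is treated as a single group, so the HAVING clause filters the one aggregate row.  A minimal illustration:
*/

SELECT COUNT(*) AS part_cnt
  FROM Parts
HAVING COUNT(*) >= 6;   -- one row (6) for the sample data; no rows at all if the test fails

/*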

 

Occurrence Count Table

 

I came back with a working table of the distinct values and a tally of their occurrences in the original table.  From that table, you can build a summary showing, for each part_wgt, its number of occurrences and the running total of data elements before and after it.

 

Philip Vaughan of San Jose, CA proposed a simple Median technique along these lines, based on a VIEW of the unique weights with their occurrence counts and then a VIEW of the middle set of weights.

*/

go

CREATE VIEW ValueSet(part_wgt, occurrence_cnt)

AS SELECT part_wgt, COUNT(*)

     FROM Parts

    GROUP BY part_wgt;

go

/*

 

The MiddleValues VIEW is used to get the Median by taking an average.  The clever part of this code is the way the outermost WHERE clause handles the empty result sets you get when every weight in the table has the same value.  Empty sets sum to NULL because there is nothing for the SUM() to total.

*/

CREATE VIEW MiddleValues(part_wgt)

AS SELECT part_wgt

     FROM ValueSet AS VS1

    WHERE (SELECT SUM(VS2.occurrence_cnt)/2.0 + 0.25

             FROM ValueSet AS VS2) >

          (SELECT SUM(VS2.occurrence_cnt)

             FROM ValueSet AS VS2

            WHERE VS1.part_wgt <= VS2.part_wgt) - VS1.occurrence_cnt

      AND (SELECT SUM(VS2.occurrence_cnt)/2.0 + 0.25

             FROM ValueSet AS VS2) >

          (SELECT SUM(VS2.occurrence_cnt)

             FROM ValueSet AS VS2

            WHERE VS1.part_wgt >= VS2.part_wgt) - VS1.occurrence_cnt;

go

SELECT AVG(part_wgt * 1.0) AS median FROM MiddleValues;   -- * 1.0 avoids integer averaging

/*

 

Median with Characteristic Function

 

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the Median (SQL Forum 1993, 1994).  They define a characteristic function, which they call delta, using the SIGN() function.  The delta or characteristic function accepts a Boolean expression as an argument and returns one if it is TRUE and zero if it is FALSE or UNKNOWN.  We can construct the delta function easily with a CASE expression.
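
For example, here is the same test written both ways (my sketch; the SIGN() version assumes non-NULL integer arguments):
*/

SELECT SIGN(1 + SIGN(14 - 12)) AS delta_sign,               -- 1, because 12 <= 14
       CASE WHEN 12 <= 14 THEN 1 ELSE 0 END AS delta_case;  -- 1, the same test as a CASE expression

/*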

 

The authors also distinguish between the statistical Median, whose value must be a member of the set, and the financial Median, whose value is the average of the middle two members of the set.  A statistical Median exists when there is an odd number of items in the set.  If there is an even number of items, you must decide if you want to use the highest value in the lower half (they call this the left Median) or the lowest value in the upper half (they call this the right Median).

 

The left statistical Median of a unique column can be found with this query, if you will assume that we have a column called bin_nbr that represents the storage location of a part in sorted order.

*/

SELECT P1.bin_nbr

  FROM Parts AS P1, Parts AS P2

 GROUP BY P1.bin_nbr

HAVING SUM(CASE WHEN (P2.bin_nbr <= P1.bin_nbr) THEN 1 ELSE 0 END)

         = (COUNT(*) / 2.0);

 

/*

Changing the direction of the test in the HAVING clause will allow you to pick the right statistical Median if a central element does not exist in the set.  You will also notice something else about the Median of a set of unique values: It is usually meaningless.  What does the Median bin_nbr number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a Median.

 

The statistical Median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap; thus, the left statistical Median is found by

*/

SELECT P1.part_wgt

  FROM Parts AS P1, Parts AS P2

 GROUP BY P1.part_wgt

HAVING SUM(CASE WHEN P2.part_wgt <= P1.part_wgt

                 THEN 1 ELSE 0 END)

            >= (COUNT(*) / 2.0)

   AND SUM(CASE WHEN P2.part_wgt >= P1.part_wgt

                THEN 1 ELSE 0 END)

            >= (COUNT(*) / 2.0);

/*

Should Parts contain an even number of entries, with the (COUNT(*) / 2) entry not repeated, averaging this query's result may return FLOOR(AVG(DISTINCT part_wgt)), because SQL Server truncates when it averages integers.  This can be fixed by changing the SELECT clause to:

 

*/

        SELECT (P1.part_wgt * 1.0) AS part_wgt

 

/*

Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items.  Switching the comparison operators in the two CASE expressions will give you the right statistical Median.

 

Henderson's Median

 

Another version of the financial Median, which uses the CASE expression in both of its forms, is:

*/

SELECT CASE COUNT(*) % 2

       WHEN 0        -- even sized table

       THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt

                                  THEN P2.part_wgt

                                  ELSE NULL END))/2.0

       ELSE P1.part_wgt --odd sized table

       END AS median

  FROM Parts AS P1, Parts AS P2

 GROUP BY P1.part_wgt

HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt

                  THEN 1

                  ELSE NULL END)

       = (COUNT(*) + 1) / 2;

/*

This answer is due to Ken Henderson.  The bare statistical Median (the ELSE branch alone, without the averaging in the THEN branch) is correct only when Parts has an odd number of entries and the middle (COUNT(*) / 2 + 1) entry is not repeated in the data. 

 

Celko's Third Median

 

Another approach involves looking at a picture of a line of sorted values and seeing where the Median would fall.  Every value in column part_wgt of the table partitions the table into three sections, the values that are less than part_wgt, equal to part_wgt or greater than part_wgt.  We can get a profile of each value with a tabular subquery expression. 

 

Deriving the query is a great exercise in SQL logic and math.  Don't do it unless you just like the exercise. The query finally becomes:

*/

SELECT AVG(DISTINCT part_wgt * 1.0)   -- * 1.0 avoids integer averaging

  FROM (SELECT P1.part_wgt

          FROM Parts AS P1, Parts AS P2

         GROUP BY P1.part_nbr, P1.part_wgt

        HAVING SUM(CASE WHEN P2.part_wgt = P1.part_wgt

                        THEN 1 ELSE 0 END)

        >= ABS(SUM(CASE WHEN P2.part_wgt < P1.part_wgt THEN 1

                        WHEN P2.part_wgt > P1.part_wgt THEN -1

                        ELSE 0 END)))

        AS Partitions;

/*

 

OLAP Median

 

The OLAP functions, new in the SQL-99 Standard, allow you to replace the COUNT() tricks with row numbering.  Let’s assume we have a table or VIEW named RawData that has only one numeric column.  The obvious solution for data with an odd number of elements is:

*/
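
-- RawData is not defined in the workbench; for a quick test against the
-- sample data it can simply be a VIEW over Parts (my assumption, for
-- illustration only):
go
CREATE VIEW RawData (x)
AS SELECT part_wgt FROM Parts;
go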

WITH SortedData (x, sort_place)

AS

(SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC)

   FROM RawData)

SELECT x AS median 

  FROM SortedData 

 WHERE sort_place = ((SELECT COUNT(*) FROM RawData)/ 2) + 1;

 

--A slight modification for an even number of elements:

 

WITH SortedData (x, sort_place)

AS

(SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC)

   FROM RawData)

SELECT AVG(x * 1.0) AS median 

  FROM SortedData 

 WHERE sort_place

       IN ((SELECT COUNT(*) FROM RawData) / 2,

           ((SELECT COUNT(*) FROM RawData) / 2) + 1);

/*

You can combine both of these into one query with a CASE expression and I leave that as an exercise for the reader. 
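
One possible answer, a sketch along the lines of dnoeth's version in the comments below (it uses an IN() list rather than a CASE expression):
*/

WITH SortedData (x, sort_place, cnt)
AS
(SELECT x,
        ROW_NUMBER() OVER(ORDER BY x ASC),
        COUNT(*) OVER ()
   FROM RawData)
SELECT AVG(x * 1.0) AS median
  FROM SortedData
 WHERE sort_place IN ((cnt + 1) / 2, (cnt / 2) + 1);

/*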

 

But there is another question we have not asked about an even number of rows.  Assume the data set looks like {1, 2, 2, 3, 3, 3}, so the usual Median would be AVG({2, 3}) = 2.5 as the measure of central tendency.  But if we consider where the real trend of the data is, then we would use the set-oriented weighted Median, with AVG({2, 2, 3, 3, 3}) = 13/5 = 2.6 as the measure of central tendency. 

 

This leads to the final query which can handle odd and even counts and do the weighted Median:

*/

WITH SortedData (x, hi, lo)

AS

(SELECT x,

        ROW_NUMBER() OVER(ORDER BY x ASC),

        ROW_NUMBER() OVER(ORDER BY x DESC)

   FROM RawData)

SELECT AVG(x * 1.0) AS median 

  FROM SortedData 

 WHERE hi IN (lo, lo+1, lo-1);

/*

This might be slower than you would like, since the CTE might do two sorts to get the "hi" and "lo" values.  At the time of this writing, ROW_NUMBER() is new to all SQL products, so not much work has gone into optimizing it yet.  A smarter optimizer would map the i-th element of a sorted list of (n) items in ASC order to the (n-i+1)-th element for DESC. 
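
You do not have to wait for the optimizer, though; this rewrite (which also appears in the comments below) computes "lo" from "hi" directly and needs only one sort:
*/

WITH SortedData (x, hi, lo)
AS
(SELECT x,
        ROW_NUMBER() OVER(ORDER BY x ASC),
        COUNT(*) OVER () + 1
        - ROW_NUMBER() OVER(ORDER BY x ASC)
   FROM RawData)
SELECT AVG(x * 1.0) AS median
  FROM SortedData
 WHERE hi IN (lo, lo+1, lo-1);

/*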

Want to try your own versions?

*/

 

 

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his dry wit. He is also interested in Science Fiction.


Subject: Hi Joe!
Posted by: Dave Winters (not signed in)
Posted on: Friday, April 10, 2009 at 12:44 PM
Message: Well I see you are doing OK!

It's amazing that after everything I taught you at EES you are still messing around with SQL. :)

Dave Winters

Subject: Hi Dave!
Posted by: Joe Celko (not signed in)
Posted on: Monday, April 13, 2009 at 8:58 AM
Message: I stick with what I know :) You used to write an SQL Server column, too!

Subject: Was this content reused from somewhere else?
Posted by: Adam Machanic (view profile)
Posted on: Monday, April 13, 2009 at 11:36 AM
Message: You mention characteristic functions, then use a CASE expression in the actual code example... Which is a good thing, but it's a bit confusing.

Aside from that, interesting article.

Subject: Re: Content reuse
Posted by: Phil Factor (view profile)
Posted on: Monday, April 13, 2009 at 4:14 PM
Message: Some of this was originally in 'SQL for Smarties', which is one of my favorite books on SQL, along with Ken's 'Guru's guide'. We all liked the idea of an update done as a workbench.

Agreed that 'characteristic Functions' were clever, but made superfluous by the CASE statement. Joe says in this workbench ...

'The delta or characteristic function accepts a Boolean expression as an argument and returns one if it is TRUE and zero if it is FALSE or UNKNOWN. We can construct the delta function easily with a CASE expression'.

We'll discuss with Joe a slight re-wording to take out any possible confusion.

Subject: Hmm...
Posted by: Adam Machanic (view profile)
Posted on: Tuesday, April 14, 2009 at 4:25 PM
Message: IMO you should remove the reference altogether. Every six months or so I run into someone on a forum who just discovered characteristic functions and believes that because they look cool and difficult to understand that they're faster than CASE expressions... I would rather just forget about them altogether at this point!

Subject: Implementation paranoia
Posted by: puzsol (view profile)
Posted on: Thursday, April 16, 2009 at 7:08 AM
Message: I hate to spoil the party on the final solution... but isn't it slightly (in theory, not in practice) implementation-dependent?

I mean if the asc and desc sorts are done separately then there is no guarantee of correlation between the hi and lo values. It would (in theory) be possible for four (or more) identical values in the middle to have the data:

x, hi, lo
12, 10, 12
12, 11, 13
12, 12, 10
12, 13, 11

which never matches the in condition, and fails to give a median value... (AVG(null))

Of course in practice that won't happen, as you are more likely to either get exactly the same indexes (10 to 13 matching 10 to 13, resulting in all similar values being included in the average) or exactly opposite indexes (10 to 13 matching 13 to 10, resulting in the correct middle two being returned) - either case being, in effect, the average of a single value, and so returning the correct result.

Hey it may be something you dug up from an old column, but it still has the ability to make you think!

Thanx

Subject: Your math is wrong ...
Posted by: Celko (view profile)
Posted on: Thursday, April 16, 2009 at 4:28 PM
Message: WITH SortedData (x, hi, lo)
AS
(SELECT x,
ROW_NUMBER() OVER(ORDER BY x ASC),
ROW_NUMBER() OVER(ORDER BY x DESC)
FROM RawData)
SELECT AVG(x * 1.0) AS median
FROM SortedData
WHERE hi IN (lo, lo+1, lo-1);


>> I mean if the ASC and DESC are done separately then there is no guarantee of correlation of the hi and lo values. <<

There is a strong relationship between the hi and lo values. A good optimizer will look at the first ROW_NUMBER() function and see that it differs from the second ROW_NUMBER() function only by direction. The second ROW_NUMBER() function can be computed from the results of the first one -- (COUNT(x)+1 - ROW_NUMBER() OVER(ORDER BY x ASC)), so there is only one sort.

>> It (in theory) would be possible for four (or more) identical values in the middle to have the data:<<

No. Do the math:
Odd case:

X: 10 11 12 12 12 12 12 13 14
COUNT(*) + 1 = 10
hi 1 2 3 4 5 6 7 8 9
lo 9 8 7 6 5 4 3 2 1
match ^

Even case:

X: 10 11 12 12 12 12 13 14
COUNT(*) + 1 = 9
hi 1 2 3 4 5 6 7 8
lo 8 7 6 5 4 3 2 1
match ^ ^

There is a classic logic problem based on this principle. A Zen monk walks up a hill to a temple the first day. The next day, he walks down the hill. Was he ever at one spot on the trail at the same time on both days? Yes, one and only one such spot exists. See:

http://www.puzzles.com/PuzzlePlayground/MonkTravel/MonkTravelPrintPlay.pdf


Subject: Wrong?
Posted by: puzsol (view profile)
Posted on: Thursday, April 16, 2009 at 7:14 PM
Message: I did say it wouldn't be the outcome of a good optimiser... I didn't even say it was practical - just theoretical... if you look at the numbers I put out, there is no match, yet the hi and lo values do form a correct sequence (albeit jumbled).

I was merely pointing out that IF two independent sorts are done, then it COULD depend on the sorting algorithm. Sorting does not have to be in a straight line (like the Monk), it just has to look like it's in order, and different algorithms will do different things when values are said to be equal (to swap or not to swap?).

Case in point... In SQL Server 2005 create a table (called Median) with primary key "id" (int) identity field, "val" (int) as a secondary field, then insert 9 identical values in the val column (I used "5")... now try these:
select * from Median order by val asc
select * from Median order by val desc
The listings are identical... ie listed in primary key order, because the secondary values are all the same... so on the second day did the Monk warp back to the bottom of the hill, climb up again, then warp back? (Beam me up Scotty!)...

So if these were the median values, they would all be included by the query! (as the Monk walked an identical path)... Turn the select AVG into a select * - I had nine values included from the setup above! (of course it doesn't matter, as they are identical, but it's the principle that I am talking about, not whether the output happens to be correct.)

My question would be: what happens if identical median values cross a page boundary - do you get a disjointed sequence (asc vs desc)?

ie
val, hi, lo
5, 1, 4
5, 2, 5
5, 3, 6
5, 4, 7
--- page boundary
5, 5, 1
5, 6, 2
5, 7, 3

Don't ask me how the Monk would make that journey!



Subject: puzsol is right
Posted by: StefanG (view profile)
Posted on: Friday, April 17, 2009 at 9:29 AM
Message: Joe Celko's last query has exactly the flaws puzsol is talking about.

It is very easy to demonstrate: we just have to use Joe's own example data set on SQL Server 2005.

Like this:

create table RawData (x int)
insert into RawData (x) values (1)
insert into RawData (x) values (2)
insert into RawData (x) values (2)
insert into RawData (x) values (3)
insert into RawData (x) values (3)
insert into RawData (x) values (3)
GO

WITH SortedData (x, hi, lo)
AS
(SELECT x,
ROW_NUMBER() OVER(ORDER BY x ASC),
ROW_NUMBER() OVER(ORDER BY x DESC)
FROM RawData)
SELECT AVG(x * 1.0) AS median
FROM SortedData
WHERE hi IN (lo, lo+1, lo-1);

This returns NULL!

The reason is easy to see if we look at the full SortedData result with:

WITH SortedData (x, hi, lo)
AS
(SELECT x,
ROW_NUMBER() OVER(ORDER BY x ASC),
ROW_NUMBER() OVER(ORDER BY x DESC)
FROM RawData)
SELECT * FROM SortedData

which returns:

x hi lo
--- -------------------- --------------------
3 4 1
3 5 2
3 6 3
2 2 4
2 3 5
1 1 6

None of these rows satisfy the where-condition

This nicely illustrates exactly the problem puzsol is talking about - the correlation between lo and hi is much weaker than Joe assumes.

Subject: Single query for even and odd counts
Posted by: StefanG (view profile)
Posted on: Friday, April 17, 2009 at 10:15 AM
Message: Joe's last query simply does not work.

A much better single query that really works for both even and odd number of elements is this:

WITH SortedData (x, sort_place)
AS
(SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC)
FROM RawData)
SELECT AVG(x*1.0) AS median
FROM SortedData
WHERE sort_place BETWEEN ((SELECT COUNT(*) FROM RawData)/ 2.0) AND ((SELECT COUNT(*)+2 FROM RawData)/ 2.0);

If you really want a set-based median (which I find a strange concept) you could use this:

WITH SortedData (x, sort_place)
AS
(SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC)
FROM RawData)
SELECT AVG(x*1.0) AS median
FROM SortedData
WHERE x in (
SELECT x FROM SortedData
WHERE sort_place BETWEEN
((SELECT COUNT(*) FROM RawData)/ 2.0) AND
((SELECT COUNT(*)+2 FROM RawData)/ 2.0))

Subject: Slight change to OLAP version
Posted by: dnoeth (view profile)
Posted on: Wednesday, May 27, 2009 at 3:57 PM
Message: Why mix aggregates and OLAP?
I usually prefer to access a table only once, especially if it might be more complex, involving joins/groups/conditions.

This is what I have used for years (because Teradata has implemented OLAP functions since 1999); it works for odd and even counts:

WITH SortedData (x, row_num, cnt)
AS
(
SELECT x,
ROW_NUMBER() OVER(ORDER BY x ASC),
COUNT(*) OVER ()
FROM RawData
)
SELECT AVG(x * 1.0) AS median
FROM SortedData
WHERE
row_num in
((cnt + 1) / 2, (cnt / 2) + 1)
;


Btw, Puzsol is right, but as Celko already pointed out, the second ROW_NUMBER is easily replaced by
(COUNT(x)+1 - ROW_NUMBER() OVER(ORDER BY x ASC)):

WITH SortedData (x, hi, lo)
AS
(
SELECT x,
ROW_NUMBER() OVER(ORDER BY x ASC),
COUNT(*) OVER () + 1
- ROW_NUMBER() OVER(ORDER BY x ASC)
FROM RawData
)
SELECT AVG(x * 1.0) AS median
FROM SortedData
WHERE hi IN (lo, lo+1, lo-1);

Subject: Median and Weighted Median found here
Posted by: Peso (view profile)
Posted on: Saturday, September 19, 2009 at 5:30 AM
Message: http://sqlblog.com/blogs/peter_larsson/archive/2009/09/18/median-and-weighted-median-calculations.aspx

 
