The workbench requires SQL Server 2008, but is easily modified to 2005.
A Date for an Mock Feud
If you are an older SQL programmer, you will remember back in the 1990’s there were two database magazines on the newsstands, DBMS and DATABASE PROGRAMMING & DESIGN. They started as competitors, but Miller-Freeman bought DBMS from M&T and kept publishing both titles. Later the two titles are merged into INTELLIGENT ENTERPRISE from CMP Publications, which is more management oriented than the two original magazines.
I started a regular freelance column in DATABASE PROGRAMMING & DESIGN in 1990 and switched to DBMS in 1992 as an employee. Chris Date took my old slot back at DATABASE PROGRAMMING & DESIGN. There was a period where we did a version of the famous Fred Allen and Jack Benny mock feud from the classic era of American radio. Chris would run a column with some position on a topic in his column and I would counter it in my next column. It was great fun, it made people think and (most important) it sold magazines and kept me employed.
One of the topics that bounced around the SQL community in those columns was how to compute the Median with SQL. So Chris and I had to get involved in that and toss code around. Today, the Median is easy to do, but it was a good programming exercise in its day.
The ‘Simple’, or financial Median
The Median is not a built-in function in Standard SQL, like AVG()
, and it has several different versions. In fact, most descriptive statistics have multiple versions, which is why SQL stayed away from them and left that job to specialized packages like SAS and SPSS. I will not even mention floating point error handling.
The simple Median is defined as the value for which there are just as many cases with a value below it as above it. If such a value exists in the data set, this value is called the statistical Median. If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size such that all values in one half are lower than any value in the other half. The Median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial Median. In English, sort the data, cut it in the middle and look at the guys on either side of the cut.
The financial Median is the most common term used for this Median, so we will stick to it. Let us use Date’s famous Parts table, from several of his textbooks, which has a column for part_wgt
in it, like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
CREATE TABLE Parts (part_nbr VARCHAR(5) NOT NULL PRIMARY KEY, part_name VARCHAR(50) NOT NULL, part_color VARCHAR(50) NOT NULL, part_wgt INTEGER NOT NULL, city_name VARCHAR(50) NOT NULL); INSERT INTO Parts (part_nbr, part_name, part_color, part_wgt, city_name) VALUES ('p1', 'Nut', 'Red', 12, 'London'), ('p2', 'Bolt', 'Green', 17, 'Paris'), ('p3', 'Cam', 'Blue', 12, 'Paris'), ('p4', 'Screw', 'Red', 14, 'London'), ('p5', 'Cam', 'Blue', 12, 'Paris'), ('p6', 'Cog', 'Red', 19, 'London'); |
First sort the table by weights and find the three rows in the lower half of the table. The greatest value in the lower half is 12; the smallest value in the upper half is 14; their average, and therefore the Median, is 13. If the table had an odd number of rows, we would have looked at only one row after the sorting.
Date’s First Median
Date proposed two different solutions for the Median. His first solution was based on the fact that if you duplicate every row in a table, the Median will stay the same. The duplication will guarantee that you always work with a table that has an even number of rows. Since Chris (and myself) are strongly opposed to redundant data, this was an unusual approach for him to use.
The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions. Here is a corrected version of his first solution:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
go CREATE VIEW Temp1 AS SELECT part_wgt FROM Parts UNION ALL SELECT part_wgt FROM Parts; go CREATE VIEW Temp2 AS SELECT part_wgt FROM Temp1 WHERE (SELECT COUNT(*) FROM Parts) <= (SELECT COUNT(*) FROM Temp1 AS T1 WHERE T1.part_wgt >= Temp1.part_wgt) AND (SELECT COUNT(*) FROM Parts) <= (SELECT COUNT(*) FROM Temp1 AS T2 WHERE T2.part_wgt <= Temp1.part_wgt); go SELECT AVG(DISTINCT part_wgt) AS median FROM Temp2; |
This involves the construction of a table of duplicated values, which can be expensive in terms of both time and storage space. The use of AVG(DISTINCT x)
is important because leaving it out would return the simple average instead of the Median. Consider the set of weights (12, 17, 17, 14, 12, 19)
. The doubled table, Temp1
, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19)
. But because of the duplicated values, Temp2
becomes (14, 14, 17, 17, 17, 17)
, not just (14, 17)
. The simple average is (96 / 6.0) = 16
; it should be (31/2.0) = 15.5
instead.
Celko’s First Median
I found a slight modification of Date’s solution will avoid the use of a doubled table, but it depends on a CEILING()
function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
SELECT MIN(part_wgt) -- smallest value in upper half FROM Parts WHERE part_wgt IN (SELECT P1.part_wgt FROM Parts AS P1, Parts AS P2 WHERE P2.part_wgt >= P1.part_wgt GROUP BY P1.part_wgt HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0) FROM Parts) UNION SELECT MAX(part_wgt) -- largest value in lower half FROM Parts WHERE part_wgt IN (SELECT P1.part_wgt FROM Parts AS P1, Parts AS P2 WHERE P2.part_wgt <= P1.part_wgt GROUP BY P1.part_wgt HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0) FROM Parts))) --or using the same idea and a CASE expression: SELECT AVG(DISTINCT CAST(part_wgt AS FLOAT)) AS median FROM (SELECT MAX(part_wgt) FROM Parts AS B1 WHERE (SELECT COUNT(*) + 1 FROM Parts WHERE part_wgt < B1.part_wgt) <= (SELECT CEILING (COUNT(*) / 2.0) FROM Parts) UNION ALL SELECT MAX(part_wgt) FROM Parts AS B WHERE (SELECT COUNT(*) + 1 FROM Parts WHERE part_wgt < B.part_wgt) <= CASE (SELECT (COUNT(*)% 2) FROM Parts) WHEN 0 THEN (SELECT CEILING (COUNT(*)/2.0) + 1 FROM Parts) ELSE (SELECT CEILING (COUNT(*)/2.0) FROM Parts) END) AS medians(part_wgt); |
The CEILING()
function is to be sure that if there is an odd number of rows in Parts, the two halves will overlap on that value.
Date’s Second Median
Now the war was on! Did you notice that we had to use a lot of self-joins in those days? That stinks. Date’s second solution was based on Celko’s Median, folded into one query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
SELECT AVG(DISTINCT Parts.part_wgt) AS median FROM Parts WHERE Parts.part_wgt IN (SELECT MIN(part_wgt) FROM Parts WHERE Parts.part_wgt IN (SELECT P2.part_wgt FROM Parts AS P1, Parts AS P2 WHERE P2.part_wgt <= P1.part_wgt GROUP BY P2.part_wgt HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0) FROM Parts)) UNION SELECT MAX(part_wgt) FROM Parts WHERE Parts.part_wgt IN (SELECT P2.part_wgt FROM Parts AS P1, Parts AS P2 WHERE P2.part_wgt >= P1.part_wgt GROUP BY P2.part_wgt HAVING COUNT(*) <= (SELECT CEILING(COUNT(*) / 2.0) FROM Parts))); |
Date mentions that this solution will return a NULL
for an empty table and that it assumes there are no NULL
s in the column. But you have to remember Chris Date, unlike Dr. Codd and SQL, does not use NULL
s and thinks they should not be in the Relational Model. If there are NULL
s, the COUNT(*)
will include them and mess up the computations. You will need to get rid of them with a COUNT(part_nbr)
instead. Play with some sample data and see the differences.
Murchison’s Median
Others joined in the battle. Rory Murchison of the Aetna Institute had a solution that modifies Date’s first method by concatenating the key to each value to make sure that every value is seen as a unique entity. Selecting the middle values is then a special case of finding the n-th item in the table.
1 2 3 4 5 6 7 8 9 10 11 |
SELECT AVG(part_wgt) FROM Parts AS P1 WHERE EXISTS (SELECT COUNT(*) FROM Parts AS P2 WHERE CAST(part_wgt AS CHAR(5)) + P2.part_nbr >= CAST(part_wgt AS CHAR(5)) + P1.part_nbr HAVING COUNT(*) = (SELECT FLOOR(COUNT(*) / 2.0) FROM Parts) OR COUNT(*) = (SELECT CEILING((COUNT(*) / 2.0)) FROM Parts)); |
This method depends on being able to have a HAVING
clause without a GROUP BY
clause, which is part of the ANSI standard but often missed by new programmers.
Occurrence Count Table
I came back with a working table with the values and a tally of their occurrences from the original table. Now that we have this table, we want to use it to construct a summary table that has the number of occurrences of each part_wgt
and the total number of data elements before and after we add them to the working table.
Philip Vaughan of San Jose, CA proposed a simple Median technique based on a VIEW
with unique weights and number of occurrences and then a VIEW
of the middle set of weights.
1 2 3 4 5 6 |
go CREATE VIEW ValueSet(part_wgt, occurrence_cnt) AS SELECT part_wgt, COUNT(*) FROM Parts GROUP BY part_wgt; go |
The MiddleValues VIEW
is used to get the Median by taking an average. The clever part of this code is the way that it handles empty result sets in the outermost WHERE
clause that result from having only one value for all weights in the table. Empty sets sum to NULL
because there is no element to map the index
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
CREATE VIEW MiddleValues(part_wgt) AS SELECT part_wgt FROM ValueSet AS VS1 WHERE (SELECT SUM(VS2.occurrence_cnt)/2.0 + 0.25 FROM ValueSet AS VS2) > (SELECT SUM(VS2.occurrence_cnt) FROM ValueSet AS VS2 WHERE VS1.part_wgt <= VS2.part_wgt) - VS1.occurrence_cnt AND (SELECT SUM(VS2.occurrence_cnt)/2.0 + 0.25 FROM ValueSet AS VS2) > (SELECT SUM(VS2.occurrence_cnt) FROM ValueSet AS VS2 WHERE VS1.part_wgt >= VS2.part_wgt) - VS1.occurrence_cnt; go SELECT AVG(part_wgt) AS median FROM MiddleValues; |
Median with Characteristic Function
Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the Median (SQL Forum 1993, 1994). They define a characteristic function, which they call delta, using the SIGN()
function. The delta or characteristic function accepts a Boolean expression as an argument and returns one if it is TRUE
and zero if it is FALSE
or UNKNOWN
. We can construct the delta function easily with a CASE
expression.
The authors also distinguish between the statistical Median, whose value must be a member of the set, and the financial Median, whose value is the average of the middle two members of the set. A statistical Median exists when there is an odd number of items in the set. If there is an even number of items, you must decide if you want to use the highest value in the lower half (they call this the left Median) or the lowest value in the upper half (they call this the right Median).
The left statistical Median of a unique column can be found with this query, if you will assume that we have a column called bin_nbr
that represents the storage location of a part in sorted order.
1 2 3 4 5 |
SELECT P1.bin_nbr FROM Parts AS P1, Parts AS P2 GROUP BY P1.bin_nbr HAVING SUM(CASE WHEN (P2.bin_nbr <= P1.bin_nbr) THEN 1 ELSE 0 END) = (COUNT(*) / 2.0); |
Changing the direction of the test in the HAVING
clause will allow you to pick the right statistical Median if a central element does not exist in the set. You will also notice something else about the Median of a set of unique values: It is usually meaningless. What does the Median bin_nbr
number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a Median.
The statistical Median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING
clause to allow for overlap; thus, the left statistical Median is found by.
1 2 3 4 5 6 7 8 9 |
SELECT P1.part_wgt FROM Parts AS P1, Parts AS P2 GROUP BY P1.part_wgt HAVING SUM(CASE WHEN P2.part_wgt <= P1.part_wgt THEN 1 ELSE 0 END) >= (COUNT(*) / 2.0) AND SUM(CASE WHEN P2.part_wgt >= P1.part_wgt THEN 1 ELSE 0 END) >= (COUNT(*) / 2.0); |
Should Parts contain an even number of entries with the (COUNT(*)/ 2)
entry not repeated, this query may return FLOOR(AVG(DISTINCT part_wgt))
, as happens when SQL Server computes an average of integers. This can be fixed by changing the inner SELECT
to:
1 |
SELECT (P1.part_wgt * 1.0) AS part_wgt |
Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items. Switching the comparison operators in the two CASE
expressions will give you the right statistical Median.
Henderson’s Median
Another version of the financial Median, which uses the CASE
expression in both of its forms, is:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT CASE COUNT(*) % 2 WHEN 0 -- even sized table THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt THEN P2.part_wgt ELSE NULL END))/2.0 ELSE P1.part_wgt --odd sized table END AS median FROM Parts AS P1, Parts AS P2 GROUP BY P1.part_wgt HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt THEN 1 ELSE NULL END) = (COUNT(*) + 1) / 2; |
This answer is due to Ken Henderson. The only instance in which [[2]]
is correct is when Parts has an odd number of entries and the middle (COUNT(*) / 2 + 1)
entry is not repeated in the data.
Celko’s Third Median
Another approach involves looking at a picture of a line of sorted values and seeing where the Median would fall. Every value in column part_wgt
of the table partitions the table into three sections, the values that are less than part_wgt
, equal to part_wgt
or greater than part_wgt
. We can get a profile of each value with a tabular subquery expression.
Deriving the query is a great exercise in SQL logic and math. Don’t do it unless you just like the exercise. The query finally becomes:
1 2 3 4 5 6 7 8 9 10 |
SELECT AVG(DISTINCT part_wgt) FROM (SELECT P1.part_wgt FROM Parts AS P1, Parts AS P2 GROUP BY P1.part_nbr, P1.part_wgt HAVING SUM(CASE WHEN P2.part_wgt = P1.part_wgt THEN 1 ELSE 0 END) >= ABS(SUM(CASE WHEN P2.part_wgt < P1.part_wgt THEN 1 WHEN P2.part_wgt > P1.part_wgt THEN -1 ELSE 0 END))) AS Partitions; |
OLAP Median
The new OLAP function in the SQL-99 Standard allow you to replace COUNT()
functions with row numberings. Let’s assume we have a table or VIEW
named RawData that has only one numeric column. The obvious solution for data with an odd number of elements is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
WITH SortedData (x, sort_place) AS (SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC) FROM RawData) SELECT x AS median FROM SortedData WHERE sort_place = ((SELECT COUNT(*) FROM RawData)/ 2) + 1; --A slight modification for an even number of elements: WITH SortedData (x, sort_place) AS (SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC) FROM RawData) SELECT AVG(x) AS median FROM SortedData WHERE sort_place IN ((SELECT COUNT(*) FROM RawData)/ 2), (SELECT COUNT(*) FROM RawData)/ 2) + 1); |
You can combine both of these into one query with a CASE
expression and I leave that as an exercise for the reader.
But there is another question we have not asked about an even number of rows. Assume the data set look like {1, 2, 2, 3, 3, 3}
so the usual median would be AVG({2, 3}) = 2.5
as the measure of central tendency. But if we consider where the real trend is, then we would use the set-oriented weighted median with AVG({2, 2, 3, 3, 3}) = 13/5 = 2.6
as the measure of central tendency.
This leads to the final query which can handle odd and even counts and do the weighted Median:
1 2 3 4 5 6 7 8 9 |
WITH SortedData (x, hi, lo) AS (SELECT x, ROW_NUMBER() OVER(ORDER BY x ASC), ROW_NUMBER() OVER(ORDER BY x DESC) FROM RawData) SELECT AVG(x * 1.0) AS median FROM SortedData WHERE hi IN (lo, lo+1, lo-1); |
This might be slower than you would like, since the CTE might do two sorts to get the “hi” and “lo” values. At the time of this writing, ROW_NUMBER()
is new to all SQL products, so not as much work has gone into optimizing them. A smarter optimizer would map the i-th element of a sorted list of (n)
items in ASC order to the (n-i+1)-th
element for DESC
.
Want to try your own versions?
Keep up to date with Simple-Talk
For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter
This post has been viewed 21307 times – thanks for reading.
Tags: Database, SQL, SQL Server, T-SQL Programming, Workbench-
Rate
[Total: 23 Average: 4.3/5]
- Share