SQL's windowing functions are surprisingly versatile, and allow us to cut out all those self-joins and explicit cursors. Joe Celko explains how they are used, and shows a few tricks such as calculating deltas in a time series and filling in gaps.
Windowing functions were added to the ANSI/ISO Standard SQL:2003 and then extended in ANSI/ISO
Standard SQL:2008. Microsoft was late to this game. DB2, Oracle, Sybase, PostgreSQL and other products have had full
implementations for years. SQL Server did not catch up until SQL Server 2012. Better late than never.
Here is a skeleton table we can use for the rest of this article. It has the personnel, their
salary amounts and the department to which they are assigned.
CREATE TABLE Personnel_Assignments
(emp_name VARCHAR(10) NOT NULL,
dept_name VARCHAR(15) NOT NULL,
PRIMARY KEY (emp_name, dept_name),
salary_amt DECIMAL (8,2) NOT NULL
CHECK (salary_amt > 0.00));
Load it with dummy data.
INSERT INTO Personnel_Assignments
VALUES
('Aaron', 'acct', 3000.00),
('Abaddon', 'acct', 3000.00),
('Abbott', 'acct', 3000.00),
('Abel', 'acct', 3500.00),
('Absalom', 'acct', 5500.00),
('Shannen', 'ship', 1000.00),
('Shannon', 'ship', 2000.00),
('Shaquille', 'ship', 3000.00),
('Sheamus', 'ship', 4000.00),
('Shelah', 'ship', 3000.00),
('Shelby', 'ship', 4500.00),
('Sheldon', 'ship', 5500.00),
('Hamilton', 'HR', 2300.00),
('Hamish', 'HR', 1000.00),
('Hamlet', 'HR', 1200.00),
('Hammond', 'HR', 800.00),
('Hamuel', 'HR', 700.00),
('Hanael', 'HR', 600.00),
('Hanan', 'HR', 1000.00);
Just as people annoyingly say “Sequel” instead of “SQL”, the window functions usually get called
“over functions” because the set of rows to which these functions apply are defined using the syntax:
<window function> OVER
([PARTITION BY <expression list>]
[ORDER BY <expression [ASC|DESC] list>]
[ROWS|RANGE <window frame>])
Partition is a set theory concept which we have seen in the GROUP BY clause of the SELECT..FROM..
statement. Think of it as a local version of grouping.
But we start losing naive set theory after that. The ORDER BY is the ordering concept we have seen
in cursors, done within a partition. I like to think it came from von Neumann's definition of ordinal numbers using nested
sets. I was a math major and it shows. Some functions are easier to define with an ordering than with von Neumann's
ordinals.
The [ROWS|RANGE] subclause is trickier. It is a local cursor. Yes, Celko said cursor. We start with
the current row where the window function is invoked. This is where the imaginary read head is resting.
This subclause defines a frame extent of ordered rows that precede or follow the current
row.
Let's do this step by step:
<ROW or RANGE clause> ::=
{ROWS | RANGE} <window frame extent>
There is a subtle difference between ROWS and RANGE. ROWS uses a count of the rows, and it is the
most common usage. RANGE uses a count of distinct values. Think of a SELECT DISTINCT operation. None of this makes sense
without an ORDER BY subclause.
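The difference shows up when the ORDER BY column has duplicate values. As a sketch against the sample table (the aliases rows_total and range_total are mine), compare a running total computed with ROWS against one computed with RANGE:

```sql
SELECT emp_name, salary_amt,
       SUM(salary_amt)
         OVER (ORDER BY salary_amt
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND CURRENT ROW) AS rows_total,
       SUM(salary_amt)
         OVER (ORDER BY salary_amt
               RANGE BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW) AS range_total
FROM Personnel_Assignments
WHERE dept_name = 'acct';
```

The three accountants tied at 3000.00 get three different rows_total values (3000, 6000, 9000, in some arbitrary order), but all three share a range_total of 9000.00, because RANGE treats tied values as a single unit of peers.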
The next parts of the subclause specify the rows preceding and the rows following the
current row. If no following value is given, the default is the current
row.
<window frame extent> ::=
{<window frame preceding>
 | <window frame between>}
<window frame between> ::=
BETWEEN <window frame preceding> AND <window frame following>
The keyword BETWEEN is used here with the expected meaning; the <window frame preceding> rows
precede the <window frame following> rows.
<window frame bound> ::=
{<window frame preceding>
 | <window frame following>}
<window frame preceding> ::=
{UNBOUNDED PRECEDING
 | <unsigned_value_specification> PRECEDING
 | CURRENT ROW}
The keyword UNBOUNDED PRECEDING is a shorthand for the start of the partition. CURRENT ROW
explains itself. You can also provide a constant count of rows with the PRECEDING option.
<window frame following> ::=
{UNBOUNDED FOLLOWING
 | <unsigned_integer> FOLLOWING
 | CURRENT ROW}
The keyword UNBOUNDED FOLLOWING is a shorthand for the end of the partition.
CURRENT ROW explains itself. You can also provide a constant count of rows with the FOLLOWING option.
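Bounded frames give you moving computations. Here is a sketch against the sample table: a three-row moving average of salaries within each department (the frame simply shrinks at the partition edges; the alias moving_avg is mine):

```sql
SELECT dept_name, emp_name, salary_amt,
       AVG(salary_amt)
         OVER (PARTITION BY dept_name
               ORDER BY salary_amt DESC
               ROWS BETWEEN 1 PRECEDING
                        AND 1 FOLLOWING) AS moving_avg
FROM Personnel_Assignments;
```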
The most common usages are for running totals:
OVER (PARTITION BY .. ORDER BY ..
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
and for the entire partition as a unit:
OVER (PARTITION BY .. ORDER BY ..
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
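Here are both patterns fleshed out against the sample table. The running_total column accumulates within each department while dept_total repeats the department sum on every row (both alias names are mine):

```sql
SELECT emp_name, dept_name, salary_amt,
       SUM(salary_amt)
         OVER (PARTITION BY dept_name
               ORDER BY emp_name
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND CURRENT ROW) AS running_total,
       SUM(salary_amt)
         OVER (PARTITION BY dept_name
               ORDER BY emp_name
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING) AS dept_total
FROM Personnel_Assignments;
```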
Aggregate Functions
The <window function> can be a common aggregate function, which includes
SUM(), AVG(), MIN(), MAX()
and COUNT(). The PARTITION BY subclause is always implied; when it is not explicitly given, the whole result
set in the FROM clause of the query is used as the partition. For these common aggregate functions, the ORDER BY subclause makes no sense. The computations are commutative; remember that word from high school algebra? The results do
not depend on ordering.
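Unlike GROUP BY, a windowed aggregate keeps every detail row, so you can put a row and its group summary side by side. A sketch (the alias diff_from_avg is mine):

```sql
SELECT emp_name, dept_name, salary_amt,
       AVG(salary_amt)
         OVER (PARTITION BY dept_name) AS dept_salary_avg,
       salary_amt
       - AVG(salary_amt)
           OVER (PARTITION BY dept_name) AS diff_from_avg
FROM Personnel_Assignments;
```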
You have to “nest” basic aggregate functions with CTEs or derived tables to preserve information.
For example, the largest average department salary is done with this query:
WITH X
AS
(SELECT dept_name,
AVG(salary_amt) AS dept_salary_avg
FROM Personnel_Assignments
GROUP BY dept_name)
SELECT MAX(dept_salary_avg) FROM X;
Did you notice I lost the dept_name with the basic
GROUP BY? But I can write:
WITH X1 -- get departmental averages
AS
(SELECT dept_name,
AVG(salary_amt)
AS dept_salary_avg
FROM Personnel_Assignments
GROUP BY dept_name),
X2 -- use a windowed MAX() to keep the department name
AS
(SELECT X1.dept_name, X1.dept_salary_avg,
MAX(X1.dept_salary_avg)
OVER () AS dept_salary_avg_max
FROM X1)
SELECT X2.dept_name, X2.dept_salary_avg
FROM X2
WHERE X2.dept_salary_avg_max = X2.dept_salary_avg;
Ranking Functions
The next <window function>s are the ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK() and
NTILE(n). By now, you have probably used ROW_NUMBER() or DENSE_RANK() to get a column to use for picking subsets from
partitions of a result set. The classic example is to find the top (n) salaries by department.
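The three numbering functions differ only in how they treat ties. A sketch against the shipping department, where Shelah and Shaquille are tied at 3000.00 (the alias names are mine):

```sql
SELECT emp_name, salary_amt,
       ROW_NUMBER() OVER (ORDER BY salary_amt DESC) AS row_nbr,
       RANK()       OVER (ORDER BY salary_amt DESC) AS salary_rank,
       DENSE_RANK() OVER (ORDER BY salary_amt DESC) AS salary_dense_rank
FROM Personnel_Assignments
WHERE dept_name = 'ship';
```

ROW_NUMBER() breaks the tie arbitrarily (4 and 5); RANK() gives both rows rank 4 and then skips to 6 for the next salary; DENSE_RANK() gives both rows rank 4 and continues with 5, leaving no gap.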
You cannot nest aggregate functions inside each other. A function returns a scalar value; that
would mean the innermost function would pass a scalar to the containing aggregate, not a set.
Here is an example of a DENSE_RANK() that is passed to a containing query that lets you pick the
top (n) salaries in each department.
WITH X
AS
(SELECT emp_name, dept_name, salary_amt,
DENSE_RANK()
OVER (PARTITION BY dept_name
ORDER BY salary_amt DESC) AS salary_rank
FROM Personnel_Assignments)
SELECT X.emp_name, X.dept_name, X.salary_amt
FROM X
WHERE X.salary_rank <= @n;
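NTILE(n) deserves a quick illustration, since it behaves differently from the other three: it deals the ordered rows into n roughly equal buckets. A sketch that splits each department into an upper and a lower salary half (the alias salary_half is mine):

```sql
SELECT emp_name, dept_name, salary_amt,
       NTILE(2)
         OVER (PARTITION BY dept_name
               ORDER BY salary_amt DESC) AS salary_half
FROM Personnel_Assignments;
```

When a partition does not divide evenly, the earlier buckets get the extra rows; the seven people in shipping split four and three.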
Analytic Functions
The third group of <window function>s are the analytic functions. These are somewhat like the
aggregate functions, but they do not make sense without the ORDER BY. Rather than returning a computation like the
aggregate functions, or an ordinal integer like the ranking functions, these return a value from the partition.
The two simplest such functions are the FIRST_VALUE and
LAST_VALUE analytic functions. The
FIRST_VALUE function returns the first value from a sorted partition, and the LAST_VALUE
function returns the last
value. In the following example, the SELECT statement uses both of these functions to report salary_amt values in
each department:
SELECT emp_name, dept_name, salary_amt,
FIRST_VALUE(salary_amt)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC)
AS top_salary,
LAST_VALUE(salary_amt)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC)
AS bottom_salary
FROM Personnel_Assignments;
The results are:
emp_name   dept_name  salary_amt  top_salary  bottom_salary
Absalom    acct       5500        5500        5500
Abel       acct       3500        5500        3500
Aaron      acct       3000        5500        3000
Abaddon    acct       3000        5500        3000
Abbott     acct       3000        5500        3000
Hamilton   HR         2300        2300        2300
Hamlet     HR         1200        2300        1200
Hamish     HR         1000        2300        1000
Hanan      HR         1000        2300        1000
Hammond    HR         800         2300        800
Hamuel     HR         700         2300        700
Hanael     HR         600         2300        600
Sheldon    ship       5500        5500        5500
Shelby     ship       4500        5500        4500
Sheamus    ship       4000        5500        4000
Shelah     ship       3000        5500        3000
Shaquille  ship       3000        5500        3000
Shannon    ship       2000        5500        2000
Shannen    ship       1000        5500        1000
As expected, the PARTITION BY subclause gives us departmental sets, then the
ORDER BY subclause
puts the salary_amt values in descending order. Look at the output; there are problems. The result set is grouped by the
dept_name column, with the salary_amt values in each group sorted. The
top_salary in each department is a constant, and it is the highest salary for that department. But the bottom_salary
is a mess; just seeing multiple values tells you something is wrong.
The FIRST_VALUE and LAST_VALUE functions support the [ROWS|RANGE]
subclause, and there is a default
that makes no sense. Look at the accounting department, where the top salary is $5500.00 and the bottom salary is $3000.00.
The department partition is being scanned in the order given by the ORDER BY clause. This is fine for the
FIRST_VALUE
function with a DESC order, but it makes LAST_VALUE() behave as a "last value so far" function. This can be handled
with the [ROWS|RANGE] subclause applied to the entire partition.
SELECT emp_name, dept_name, salary_amt,
FIRST_VALUE(salary_amt)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS top_salary,
LAST_VALUE(salary_amt)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS bottom_salary
FROM Personnel_Assignments;
The first ROWS subclause is redundant, but it documents the query.
The remaining two analytic functions are LAG() and LEAD(). These functions extend the [ROWS|RANGE]
concept of a current row: we assume the rows are in a sorted order, and we can move the imaginary cursor from the current
row to a preceding row, LAG(), or from the current row to a following row,
LEAD().
The LEAD(<column>, [<step size>], [<default>]) function retrieves a value from a row that is
<step size> rows following the current one. The LAG(<column>, [<step size>], [<default>]) function retrieves a value from a
row that is <step size> rows preceding the current one.
Obviously, the first row in a partition has no preceding row and the last row in a partition has
no following row. As expected in SQL, these missing values are returned as NULLs. Well, not quite. If you specify a
<default>, then it is used.
SELECT dept_name, salary_amt, emp_name,
LAG(salary_amt, 1, 0.00)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC)
AS preceding_salary,
LEAD(salary_amt, 1, 0.00)
OVER(PARTITION BY dept_name
ORDER BY salary_amt DESC)
AS following_salary
FROM Personnel_Assignments;
dept_name  salary_amt  emp_name   preceding_salary  following_salary
acct       5500        Absalom    0                 3500
acct       3500        Abel       5500              3000
acct       3000        Aaron      3500              3000
acct       3000        Abaddon    3000              3000
acct       3000        Abbott     3000              0
HR         2300        Hamilton   0                 1200
HR         1200        Hamlet     2300              1000
HR         1000        Hamish     1200              1000
HR         1000        Hanan      1000              800
HR         800         Hammond    1000              700
HR         700         Hamuel     800               600
HR         600         Hanael     700               0
ship       5500        Sheldon    0                 4500
ship       4500        Shelby     5500              4000
ship       4000        Sheamus    4500              3000
ship       3000        Shelah     4000              3000
ship       3000        Shaquille  3000              2000
ship       2000        Shannon    3000              1000
ship       1000        Shannen    2000              0
While the parameters can be expressions, the usual values are a step size of one and perhaps zero
for the default, if NULL would cause computation or display problems.
As the results show, each row displays salary_amt values from the preceding and following rows
within each partition, unless it is a first or last row, in which case the default of 0.00 is returned.
Charles Babbage and the Difference Engine
If you do not know who Charles Babbage is, it is time to
Google! What you probably do not know is the finite-difference mathematics behind his Difference Engine. Some classic references:
- "Schaum's Outline of Calculus of Finite Differences and Difference Equations" by Murray Spiegel; ISBN: 9780070602182.
- "Introduction to Difference Equations" by Samuel Goldberg; ISBN: 9780486650845.
- "Finite Difference Equations" by H. Levy; ISBN: 9780486672601.
This is an older, pre-calculus method to interpolate or tabulate functions by using a small set of
polynomial coefficients. Both logarithmic and trigonometric functions can be approximated by polynomials, so this was a
handy way to generate mathematical tables by teams of people doing simple operations. It took a long time, and those
people were not error free. Before you laugh at this technology, it was used for scientific computation well into the
1950s. We did not have cheap digital computers, and analog computers were rare and expensive to configure because you
had to hardwire them (but they were very fast once they were "programmed"). The improvements were
mimeographed worksheets and the use of Smith-Corona Marchant mechanical calculators (they had multiplication and a
government contract).
Babbage's goal was to automate the process with a programmable digital computer in the mid-1800s!
Talk about being ahead of your time.
The math is simple. Given a series of values of a polynomial, {x₁, x₂, x₃, x₄, ..}, we can
subtract consecutive values and get what is called the first difference or first delta. This is written as Δxₙ = (xₙ₊₁ − xₙ).
As expected, the second delta is Δ²xₙ = Δ(Δxₙ), and so forth.
This is easier to see with an example of a simple polynomial, p(x) = 2x² – 3x + 2.
x   p(x)   Δp(x)   Δ²p(x)
0    2
             -1
1    1                4
              3
2    4                4
              7
3   11                4
             11
4   22
To calculate p(5), use the pattern from the bottom cells to get (4 + 11 + 22) = 37. You can verify
this with algebra: (2*5*5) − (3*5) + 2 = 37. The advantage is that we did not use anything but addition and subtraction.
Those operations are a lot easier to implement in mechanical devices than multiplication and division.
Those of you who still remember freshman calculus will see that this is the basis for derivatives
of polynomials. We just need to take limits instead of deltas and we have calculus! Obviously, I have used an extremely
small data set.
We can do deltas in SQL now. Consider this generic table for trend data. Since we already worked
out a textbook example, let's use it.
CREATE TABLE Data
(time_point INTEGER NOT NULL PRIMARY KEY,
value DECIMAL(5,2) NOT NULL);
INSERT INTO Data
VALUES (0, 2.0), (1, 1.0), (2, 4.0), (3, 11.0), (4, 22.0);
This could be readings from a meter, stock prices over time, or whatever. Translating
the mathematics to SQL is pretty straightforward.
WITH Delta_1 (time_point, value, delta_1)
AS
(SELECT time_point, value,
(value - LAG(value, 1) OVER (ORDER BY time_point))
FROM Data),
Delta_2 (time_point, delta_1, delta_2) -- this is a pattern!
AS
(SELECT time_point, delta_1,
(delta_1 - LAG(delta_1, 1) OVER (ORDER BY time_point))
FROM Delta_1)
SELECT * FROM Delta_2;
As expected, we get the same results. I can get delta_n by repeating the second CTE pattern.
time_point  delta_1  delta_2
0           NULL     NULL
1           -1       NULL
2            3       4
3            7       4
4           11       4
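Repeating the pattern once more gives the third deltas; for a clean second-degree polynomial they come out as zeros, which is itself a useful sanity check. A sketch (the CTE and column names are mine):

```sql
WITH Delta_1 (time_point, delta_1)
AS
(SELECT time_point,
        (value - LAG(value, 1) OVER (ORDER BY time_point))
 FROM Data),
Delta_2 (time_point, delta_1, delta_2)
AS
(SELECT time_point, delta_1,
        (delta_1 - LAG(delta_1, 1) OVER (ORDER BY time_point))
 FROM Delta_1),
Delta_3 (time_point, delta_2, delta_3)
AS
(SELECT time_point, delta_2,
        (delta_2 - LAG(delta_2, 1) OVER (ORDER BY time_point))
 FROM Delta_2)
SELECT * FROM Delta_3;
```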
In the real world, the data is not this clean. But the principle still holds. Look at what happens
when we “dirt up” the data.
DELETE FROM Data; -- replace the clean data

INSERT INTO Data
VALUES
(0, 2.00),
(1, 1.10),
(2, 3.75),
(3, 11.20),
(4, 22.90);
The results are in this table:
time_point  delta_1  delta_2
0           NULL     NULL
1           -0.90    NULL
2            2.65    3.55
3            7.45    4.80
4           11.70    4.25
If you have a graphing calculator, you can plot the original polynomial and overlay these slightly
off data points.
The second deltas are no longer a constant, but their spread is a measure of the goodness of fit of
this data to a second-degree polynomial. If I had rounded the values, it would be smaller. In fact, I can find the
average of the second deltas and subtract it from each delta: ((3.55 − 4.20) + (4.80 − 4.20) + (4.25 − 4.20))/3.0 is
virtually zero. This is a better measurement of the fit.
There has been no advanced math used. This is also fast enough to use in real
time to monitor a process. Bet you did not think SQL could be used for time series this way.
Filling in Gaps
This is not the only use for LAG() and LEAD(). Imagine you have a test that is automatically
conducted every x minutes, but sometimes we do not get a reading, so it shows up as a NULL.
CREATE TABLE Tests
(test_id INTEGER NOT NULL,
test_timestamp DATETIME2(0) NOT NULL,
PRIMARY KEY (test_id, test_timestamp),
test_score INTEGER);
INSERT INTO Tests
VALUES
(2, '2012-01-01 15:15:00', 15),
(2, '2012-01-01 15:25:00', NULL),
(2, '2012-01-01 15:55:00', NULL),
(2, '2012-01-01 16:10:00', 12),
(2, '2012-01-01 16:30:00', 12),
(2, '2012-01-01 16:45:00', 9),
(2, '2012-01-01 16:55:00', NULL),
(2, '2012-01-01 17:12:00', 10);
Our business requirement is to calculate the delta on a newer row (based on the test_timestamp)
from the value in the preceding row, except when the preceding row is NULL, in which case I need to look back until I can find a
value.
CREATE VIEW Deltas
(test_id, test_timestamp, test_score, delta_test_score)
AS
(SELECT test_id, test_timestamp, test_score,
(test_score -
LAG(test_score)
OVER (PARTITION BY test_id
ORDER BY test_id, test_timestamp))
FROM Tests
WHERE test_score IS NOT NULL
UNION
SELECT test_id, test_timestamp, test_score, 0
FROM Tests
WHERE test_score IS NULL);
test_id  test_timestamp         test_score  delta_test_score
2        '2012-01-01 15:15:00'  15          NULL
2        '2012-01-01 16:10:00'  12          -3
2        '2012-01-01 16:30:00'  12          0
2        '2012-01-01 16:45:00'  9           -3
2        '2012-01-01 17:12:00'  10          1
2        '2012-01-01 15:25:00'  NULL        0
2        '2012-01-01 15:55:00'  NULL        0
2        '2012-01-01 16:55:00'  NULL        0
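If your product supports the optional IGNORE NULLS clause on LAG() (Oracle has it, and SQL Server added it in SQL Server 2022), the "look back until I can find a value" requirement can be met without splitting the query in two. A sketch, to be checked against your product's documentation:

```sql
SELECT test_id, test_timestamp, test_score,
       (test_score
        - LAG(test_score) IGNORE NULLS
            OVER (PARTITION BY test_id
                  ORDER BY test_timestamp)) AS delta_test_score
FROM Tests;
```

Rows where test_score itself is NULL still yield a NULL delta here, so a COALESCE() or CASE expression would be needed to force them to zero the way the view does.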
Looking at Old Code
We have to maintain old code, so it is worth looking at how we did analytic functions without the
current syntax. The short answer is self-joins from hell. When you find this pattern, it is worth the effort to replace
it.
The trick was to build a CTE that adds a row number to the original data. This number is then used
in outer self-joins with a step size, usually one. If you want a default, you can add a COALESCE()
to the appropriate columns. Here is a skeleton that will need a lot of work.
WITH Sequenced_Data
AS
(SELECT *,
ROW_NUMBER() OVER (ORDER BY sequencing_value) AS rn
FROM Data)
SELECT Preceding_Rows.*, Current_Rows.*, Following_Rows.*
FROM Sequenced_Data AS Current_Rows
LEFT OUTER JOIN
Sequenced_Data AS Preceding_Rows
ON Preceding_Rows.rn = Current_Rows.rn - @step_size
LEFT OUTER JOIN
Sequenced_Data AS Following_Rows
ON Following_Rows.rn = Current_Rows.rn + @step_size;
The FIRST_VALUE() and LAST_VALUE() equivalents are done in a similar fashion. This skeleton uses the ASC/DESC options in
ROW_NUMBER() to get a value of one in the first and last positions of the original data.
WITH Sequenced_Data
AS
(SELECT *,
ROW_NUMBER() OVER (ORDER BY sequencing_value ASC) AS up,
ROW_NUMBER() OVER (ORDER BY sequencing_value DESC) AS dn
FROM Data)
SELECT *
FROM Sequenced_Data
WHERE 1 IN (up, dn);