T-SQL Window Functions Speed Phreakery

Before SQL Server had SQL window functions, SQL Server developers had to use all manner of tricks and algorithms to come up with ways to rapidly process large amounts of data. Do we need these techniques now that we can express a SQL task in terms of window functions? Kathi revisits a famous SQL challenge to find out

Back in 2009 and 2010, Phil Factor hosted the SQL Speed Phreak Competition, in which programmers were challenged to construct the fastest solution to a series of common programming challenges, such as calculating a running total, or implementing a FIFO (First In, First Out) queue, where the first items purchased or returned must be the first items sold.

The submitted solutions ranged from poorly performing ‘triangular joins’, subqueries, and recursive CTES to safe-and-steady cursors, to lightning-fast but often complex and occasionally unorthodox set-based code.

I wrote a series of articles describing some of the solutions to the running total, FIFO and SSN Matching problems. My basic philosophy is that for most problems there is an easy, maintainable solution and there is a fast solution, and it’s critical to know the size of the performance difference between the two. My goal in writing the articles was to attempt to demystify some of the common techniques the SQL gurus used to wring every ounce of performance out of their solutions, and so make them more widely accessible to developers who need a faster and more scalable solution than more conventional tools, such as cursors, can offer.

The original winning solution to the running total problem, as submitted by Peter Larsson, executed in just milliseconds against a dataset of one million rows. It used some very clever pre-aggregation and pivoting techniques, but also the rather-controversial “quirky update” method. At that time, SQL Server 2008 and 2008 R2 were the latest versions of SQL Server. However, SQL Server 2012 introduced some very powerful new T-SQL window functions and enhancements, which I’ve since used widely. It occurred to me that they could offer a route to a safer, more maintainable solution for some of these problems, without compromising on speed. This article will explain what I discovered when I took another look at the running total problem, in light of these functions.

The Subscription List Problem

The original Subscription List Problem consisted of a table containing a list of subscribers, with the dates that the subscribers joined, and the dates that they cancelled their subscription, as illustrated by the sample data in Figure 1.

Registration_ID FirstName LastName DateJoined DateLeft
1 Judith Palmer 2004-01-01 NULL
2 Christa Jackson 2004-01-01 NULL
3 Sophia Grimes 2004-01-01 NULL
19 Bobbie Brennan 2004-01-05 2006-09-14

Figure 1

The challenge was to produce a report showing a running total of current subscribers for each month. There were two more things to keep in mind about the requirements: don’t calculate past the final month that has new registrations, and the original data doesn’t skip any registration months (if the source data did skip registration months, then all the solutions would need refinement to account for this).

In the original article, I started by describing a very simple cursor-based solution, which also performed expensive date manipulations and ran rather slowly. Next, I reviewed and explained the techniques used in the winning entry, and finally produced a third solution that exploited some of what I’d learned, but still used a cursor for the final running totals, rather than the ‘quirky update’.

I adopt a similar strategy here, presenting solutions as follows:

  1. The Aunt Kathi solution – the simplest solution I could create, which exploits the capability in SQL Server 2012, and later, to use the OVER clause and a ‘sliding window’ of data, with aggregate functions.
  2. Peter Larsson’s original winning code – I provide a brief review of the key techniques that made Peter’s solution fly (for further details, refer to my original article)
  3. The Window Function Speed Phreak Solution – clever date manipulations, pivoting and a Window function.

You can use the SubscriptionList.sql file, available for download above, to create the sample table and ten thousand subscribers. The original tests ran using the same table but containing 1 million subscribers.

On my laptop, with 16 GB of RAM and SSD storage and running SQL Server 2014, solutions 2 and 3 ran so fast for 1 million subscribers that I had to increase the number of rows in the sample data to over 10 million! With this data set, the Aunt Kathi Easy Solution ran in 3 seconds, and the other two were inseparable, at 1.7 seconds.

The Aunt Kathi Easy Solution

This problem is easy to solve using a SQL Server 2012 accumulating window aggregate function, plus the new EOMONTH function, as shown in Listing 1.

Listing 1: The Aunt Kathi Solution

The challenge requires us to return the current number of subscribers each month. The dates in the data can be any day of the month, so the first task is to manipulate the dates so that they are all the same day of the month. Then, we can count them to arrive at that month’s total.

The cursor solution of the previous article used some rather expensive date conversions to achieve this.

As you can see, in my new solution, for SQL Server 2012, I used the EOMONTH function to change the DateJoined and DateLeft values to the last day of the month.

Each month, new subscribers join and some subscriptions cancel or expire. The problem is that each row of the data contains both the DateJoined and DateLeft for one subscription. Therefore, we need to obtain a count by month for each type of date.

One easy way to do this is to create a separate aggregate query to calculate the counts for each type of. In Listing 1, each of these queries is contained within a CTE. SQL Server has to scan the Registrations table twice, once to get the DateJoined count, and once to get the DateLeft count. This is where The Aunt Kathi Solution takes a performance hit of about 1.3 seconds; the original winning solution, as you’ll see later, scans the table just once.

The outer query joins the two CTEs in a LEFT OUTER join on the dates so that all months with new subscriptions are included. The outer query returns the DateJoined as "TheMonth", the number of new and cancelled subscriptions, and the running total.

To calculate the running total, I used a T-SQL accumulating aggregate window function. To change a window aggregate to an accumulating window aggregate, you add an ORDER BY to the OVER clause, in this case ordering by DateJoined. To avoid a performance penalty, I specified the frame ROWS UNBOUNDED PRECEDING. This is a shortcut for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If the frame is not specified, the query will use RANGE instead of ROWS. RANGE almost always performs worse than ROWS because it creates a worktable in tempdb instead of in memory.

The Original Winning Code

The original winning code ran in 1.7 seconds on my laptop against 10 million rows. When my article was originally published, it caused a bit of controversy, since it takes advantage of what has been called “the quirky update”, an undocumented method that is not guaranteed, by Microsoft, to work. At the time, the quirky update was the fastest way to calculate a running total. Listing 2 presents the winning code.

Listing 2: Peter Larsson’s original winning solution

First, the dates must be manipulated. While the Aunt Kathi Easy Solution uses the EOMONTH function, the original winning solution uses a combination of DATEDIFF and DATEADD to turn each date to the first of the month.

The DATEDIFF function is used to calculate the number of months since the date represented by 0. That number is then added back to date 0 to get the original month and year. The solution applies DATEDIFF before aggregating and DATEADD after aggregating. If the functions were used in one step, the formula would look like this:

The Aunt Kathi Easy Solution used two CTEs to calculate the count for new subscriptions and cancellations for each month. The winning solution achieves the same list of months, actually the number of months past date 0, with counts, in such a way that it requires only a single scan of the Registrations table, instead of two. It uses the UNPIVOT method to cause the two date columns to line up into one column. Then the CASE expression is used to re-pivot the data into a list with a count of new subscriptions and cancellations. By using this method, the winning solution scans the table just once. Figure 2 demonstrates the transformations that the data takes in this solution.

2274-1-91c771d3-52c7-459f-a919-6b304f931

Figure 2

The number of months since date 0 and the counts are stored in a temp table, #stage. The temp table is used because the quirky update actually updatesthe source data. Using a temp table with a column to hold the answer avoids updating the original table. Months with no new subscriptions are filtered out in the HAVING clause.

The final step uses a CTE (called Yak), in which the DATEADD function turns the number of months back into a date and orders the data. The quirky update relies on the data being returned in the correct order, but updates do not support an ORDER BY clause. Usually, ordering the data is not allowed in a CTE, but the solution gets around this by using the TOP keyword, which allows you to retrieve a number or percentage of rows, in this case largest possible integer value (2147483647), to make sure that all the rows will be returned. In order to control which rows a query returns when using TOP, you can use ORDER BY to sort the data.

This is a controversial technique since the idea behind using TOP and ORDER BY within a CTE is to get the correct rows, not to actually order them. It’s possible that the rows will not be returned in order in a future version of SQL Server.

The outer query uses a variable to store a subtotal from row to row. Not only is the value of the variable updated for each row, the Subscribers column from the CTE is also updated each time. The quirky update code looks like this:

The PeopleJoined value is added to and the PeopleLeft value is subtracted from the previous value of the variable. That total is saved in the Subscribers column and is also passed to the @Subscribers variable. It’s a row-by-row process, but it runs extraordinarily fast.

To avoid taking a performance hit later by selecting rows from the temp table once the values are calculated, the OUPUT clause is used. The OUTPUT clause returns data that is updated, inserted, or deleted, and, in this case, the results of the puzzle.

The Window Function Speed Phreak Solution

This solution adopts the use of DATEDIFF and DATEADD functions to convert dates to the first day of the month, and uses the “UNPIVOT and re-pivot with CASE" to calculate the count for new subscriptions and cancellations for each month with only a single table scan.

However, it avoids the quirky update and doesn’t need a temp table. It uses a T-SQL window accumulating aggregate to calculate the running total. This solution also runs in 1.7 seconds.

Listing 3: The Window Function Speed Phreak Solution

Instead of storing the list of month numbers and the counts in a temp table, this solution adds to a CTE the code that would have been used to populate the temp table.

The outer query just returns the month, the number of people who joined and left, and uses the T-SQL SUM window aggregate function, along with the new OVER clause, to calculate the running total. This solution is guaranteed to work, unlike the quirky update, and performs exactly the same.

Conclusion

When speaking at SQL Server events, I always say that the 2012 T-SQL Window functions have given T-SQL super powers. Armed with these new powers, plus a few aggregation and data manipulation techniques that can help minimize the work that SQL Server needs to perform, we arrive at a running total solution that is fast, guaranteed to work, and not so quirky.

In the next article, I’ll explain how using the new functionality helps solve the FIFO Stock Inventory Problem.

If you are inspired to learn more about Window Functions…

I’ve written a book on the topic, Expert T-SQL Window Functions in SQL Server, and a Pluralsight online training course.

Downloads

Tags: , , ,

  • 12878 views

  • Rate
    [Total: 23    Average: 4.7/5]
  • MartyP

    Nice Approach
    This is a good article and provided a lot of validation of my work in prior years. I recently had a table with >14MM rows and the results came back within seconds using a sproc that avoided temp tables and cursors. The only thing that I fail to recognize in the article is the ‘phreakiness’ of the code. The T-SQL seems to use standard aggregates and functions that have been around for years. I would like to read your book though.
    As a front-end developer for decades, it’s a shame that I see more and more devs separating themselves from good ol’ T_SQL and database design. Code First and NoSQL approaches seem to cause more confusion than good. A sound application always starts with a sound Db design IMHO. Thanks.

  • Dr. Les Cardwell, DCS-DSS

    ANSI Standard SQL
    Hi Kathi,

    It’s probably worth mentioning that the standard SQL set-based solution I submitted post-contest (and subsequently used in my dissertation) outperformed the rest, with far less code. It’s rare I’ve found a Transact-SQL (linear) solution that can outperform Set-SQL (non-linear) solutions.

    Best,
    Les

  • Dwain.C

    Another Phreaky Solution
    Hi Kathi,

    I love those WINDOW functions too, especially when they allow for the running total. I was wondering how this rather odd looking construction stacks up against the ones you posted in your article.

    WITH PreAggregate AS
    (
    SELECT DateJoined, PeopleJoined=COUNT(*)
    FROM dbo.Registrations
    GROUP BY DateJoined
    )
    SELECT TheMonth, PeopleJoined=SUM(PeopleJoined), PeopleLeft=SUM(PeopleLeft)
    ,CurrentSubscriptions=SUM(SUM(PeopleJoined)-SUM(PeopleLeft)) OVER(ORDER BY TheMonth ROWS UNBOUNDED PRECEDING)
    FROM
    (
    SELECT TheMonth=EOMONTH(DateJoined), PeopleJoined, PeopleLeft=0
    FROM PreAggregate a
    UNION ALL
    SELECT EOMONTH(DateLeft), 0, COUNT(*)
    FROM dbo.Registrations
    WHERE DateLeft IS NOT NULL
    GROUP BY EOMONTH(DateLeft)
    ) a
    GROUP BY TheMonth
    HAVING SUM(PeopleJoined) 0;

    Seemed pretty good when I ramped up the test data to 1M rows.

    As a side note, a big howdy to Les Cardwell whom I haven’t heard from in a coon’s age.

  • DarekDan

    Outstanding insights
    Great article with many excellent points!

    After running all three TSQLs through Adam Mechanic’s SQLQueryStress a 100 times (10 threads, 10 iterations), on 10M rows with 30% NULLs for DateLeft, RAID 10 on SAS, Dwain’s solution outperforms the other two:

    PreAggregate: 1m26s and 167516 avg logical reads (ALR)
    Totals: 2m30s and 83164 ALR
    NewSubscriptions: 2m12s and 166033 ALR

  • Dwain.C

    To: DarekDan RE: Outstanding Insights
    @DarekDan – thanks for taking the time to look at that. My results (although it seems I didn’t use such sophisticated timing tools) sound like they were similar.

    To keep in line with Kathi’s excellent explanation for why individual solutions are faster, let me suggest two reasons why mine works reasonably well:

    1. The PreAggregation – I noticed in the test data that there were quite a few people joining on each individual date. So I figured a "rows reduction" approach could apply here, so that when I proceeded to the next step of retrieving the people leaving, there’d be quite a few rows to have to process through the EOMONTH build-in function.

    2. The INDEX Kathi provided worked quite well when it came to identifying the rows that had a leaving date (since I was able to exclude the NULLs).

  • kkellenbe@hotmail.com

    Why Speed Phreakery?
    That was the name of the contest back in 2009. This is a follow up to an article explaining the winning solution.

  • kkellenbe@hotmail.com

    Great alternate solution, Dwaine!
    Even though your solution scans the table twice, it does run much faster. Great work!!!

  • Eirikur Eiriksson

    Improvement to Dwain’s solution
    I know I’m a little late to the party but I think it’s worth pointing out an opportunity for improvement in Dwain’s solution. The first part is adding two indices, one on each date column, the second one is aggregating both dates (to the month) in two separate CTEs before combining them for the final aggregation and the running total. These changes will shave off apprx. 15%

    Code:
    ;WITH REGISTRATION_JOINS AS
    (
    SELECT
    EOMONTH(RG.DateJoined) AS DateJoined
    ,COUNT(*) AS JOINED
    ,0 AS RLEFT
    FROM dbo.Registrations RG
    GROUP BY RG.DateJoined
    )
    ,REGISTRATION_LEFT AS
    (
    SELECT
    EOMONTH(RG.DateLeft) AS DateLeft
    ,0 AS JOINED
    ,COUNT(*) AS RLEFT
    FROM dbo.Registrations RG
    WHERE RG.DateLeft IS NOT NULL
    GROUP BY RG.DateLeft
    )
    ,JOINED_AND_LEFT_ON_DATE AS
    (
    SELECT
    RJ.DateJoined AS RDATE
    ,RJ.JOINED
    ,RJ.RLEFT
    FROM REGISTRATION_JOINS RJ
    UNION ALL
    SELECT
    RL.DateLeft AS RDATE
    ,RL.JOINED
    ,RL.RLEFT
    FROM REGISTRATION_LEFT RL
    )
    ,AGGREGATED_JOINED_LEFT AS
    (
    SELECT
    JALOD.RDATE
    ,SUM(JALOD.JOINED) AS JOINED
    ,SUM(JALOD.RLEFT ) AS HAS_LEFT
    FROM JOINED_AND_LEFT_ON_DATE JALOD
    GROUP BY JALOD.RDATE
    )
    SELECT
    AJL.RDATE
    ,AJL.JOINED
    ,AJL.HAS_LEFT
    ,SUM(AJL.JOINED – AJL.HAS_LEFT) OVER
    (
    ORDER BY AJL.RDATE
    ROWS BETWEEN UNBOUNDED PRECEDING
    AND CURRENT ROW
    )
    FROM AGGREGATED_JOINED_LEFT AJL;