Kathi Kellenberger
Writing Efficient SQL: Set-Based Speed Phreakery
04 February 2010

 Phil Factor's SQL Speed Phreak challenge is an event where coders battle to produce the fastest code to solve a common reporting problem on large data sets. It isn't that easy on the spectators, since the programmers don't score extra points for commenting their code. Mercifully, Kathi is on hand to explain some of the TSQL coding secrets that go to producing blistering performance.

SQL is a declarative language, designed to work with sets of data. However, it does support procedural, "row-by-row" constructs in the form of explicit cursors, loops and so on. The use of such constructs often offers the most intuitive route to the desired SQL solution, especially for those with backgrounds as application developers. Unfortunately, the performance of these solutions is often sub-optimal; sometimes very sub-optimal.

Some of the techniques used to achieve fast SQL code, and avoid row-by-row processing can, at first, seem somewhat obscure. This, in turn, raises questions over the maintainability of the code. However, in fact, there are really only a handful of these techniques that need to be dissected and understood in order to open a path towards fast, efficient SQL code. The intent of this article is to investigate a very common "running total" reporting problem, offer some fairly typical "procedural" SQL solutions, and then present much faster "set-based" solutions and dissect the techniques that they use.

The examples in this article are based on the classic "running total" problem, which formed the basis for Phil Factor’s first SQL Speed Phreak Competition. This challenge is not some far-fetched scenario that you will never encounter in reality, but a common reporting task, similar to a report that your boss may ask you to produce at any time.

In my experience, I've mostly found that the ease of producing the solution is inversely proportional to its performance and scalability. In other words, there is an easy solution and a fast solution, as well as many solutions in between. One may argue that for a report that runs once a month, clarity and maintainability are as important as speed, and there is some truth is this. However, bear in mind that while you can get away with a simple solution on a table with a few thousand rows, it won't scale. As the number of rows grows so the performance will degrade, sometimes exponentially.

Furthermore, if you don't know what's possible in terms of performance, then you have no basis to judge the effectiveness of your solution. Once you've found the fastest solution possible then, if necessary, you can "back out" to a solution that is somewhat slower but more maintainable, in full knowledge of the size of the compromise you've made.

The Subscription List Challenge

The Subscription List challenge, as presented in Phil Factor's Speed Phreak competition, consisted of one table containing a list of subscribers, with the dates that the subscribers joined, and the dates that they cancelled their subscription. The task was to produce a report, by month, listing:

  • The number of new subscriptions
  • The number of subscribers who have left
  • A running total of current subscribers

Just about anything you wished to do was allowed in this challenge. You could use any version of SQL Server, change the indexes on the original table, use temp tables, or even use cursors. The person whose T-SQL solution returned the correct results in the fastest time was the winner.

In a nice twist, Phil Factor reported the standings daily and allowed multiple submissions. This allowed the entrants to fine-tune their solutions, and to help out some of the other contestants. The original solutions were developed against a sample dataset of 10,000 rows. However, the top entries performed so well that they were also tested against a 1 million row dataset. Likewise, all of the solutions in this were developed against the 10K-row dataset, but tested against both.

The eventual winner was Peso, the SQL Server MVP Peter Larsson, whose solution we will present and dissect, and which returned the required report, running against a 1 million row table, in just over 300 milliseconds. Eight other entrants ran very close, with solutions that ran in under 600 milliseconds.

Speed was all important in this contest; this was a super-charged drag race and there was little point turning up to the starting line in a mini-van. However, for lesser SQL mortals, you could be forgiven for regarding Peso's winning solution with an equal mix of amazement and confusion. Hopefully, after reading the rest of this article, the solution will be clearer and the techniques accessible to you in solving your own T-SQL problems.

Setting Up

All you need to do to get started is to:

  1. Create a new database in your test or development SQL Server instance.
  2. Download the sample data from http://ask.sqlservercentral.com/questions/92/the-subscription-list-sql-problem.
  3. Play the script found in the download to create and populate the table in your sample database.

Go ahead and take a look at the data by running SELECT * FROM Registrations. Figure 1 shows a sample of the data that we will be using to create the report.

Figure 1: The data found in the Registrations table

Notice that many registrations do not have a DateLeft value. These subscriptions are ongoing and are active past the time period required for the report.

Validation

In order to prove that our solution works, we need to know what the correct answer should look like. In this case, of course, we could simply run Peso's winning script. However, in practice, we won't have that script and so we'll need to do our own investigation of the data, to make sure we fully understand the report's requirements.

The report must provide a list of the months, from the beginning of the data to the month of the contest (September, 2009), along with the count of new subscribers and cancellations for each month. After September 2009, there are no new registrations, only cancellations. The report must also supply a running total of the number of active subscriptions. The simple calculation for the number of active subscriptions in a given month is:

Number of subscriptions from previous month
+ new subscribers in current month
– cancellations in current month

To get started, run these two queries from Listing 1:

-- new registrations per month

SELECT  count(*) AS newRegistrations ,

        year(datejoined) AS Year ,

        month(datejoined) AS Month

FROM    REGISTRATIONS

GROUP BY year(datejoined) ,

        month(datejoined)

ORDER BY Year ,

        Month ;

       

-- Unsubscribes per month

SELECT  count(*) AS Cancellations ,

        year(dateleft) AS Year ,

        month(dateleft) AS Month

FROM    REGISTRATIONS

GROUP BY year(dateleft) ,

        month(dateleft)

ORDER BY Year ,

        Month ;

Listing 1: Another look at the data

Figure 2 shows the partial results. The first query shows the new registrations for each month, and the second query shows the cancellations. Notice that there are no cancellations until June 2004.

Figure 2: The counts of new registrations and cancellations.

Let’s manually calculate the first few results. The first month, January, 2004, has 167 new subscriptions with no cancellations; therefore, the running total is 167. There are no cancellations until June 2004, so we can just add the new subscriptions for each month through May. In June, we have to subtract one cancellation. Table 1 shows the results for the first few months of 2004, manually calculated, which provide a good representation of the report.

Table 1: The report for 2004

Now we have a means to validate the results returned by our report, we're ready to start building it. Remember that, just like Peso, we are unlikely to come up with the best solution first; more likely we'll create a workable solution first and then tweak it to improve performance. Of course, after each tweak of your own solutions, be sure to implement regression testing to make sure that the improved solution still returns the correct results.

Row by Agonizing Row

The reason that so much SQL code found out in the wilds performs poorly is that SQL is nobody's "first language" (with the possible exception of Joe Celko). Developers tend to master the basic SQL syntax quickly, but then start firing out SQL data retrieval code in a manner strongly influenced by the philosophy and practices of their primary, procedural programming language. When speaking at my local VB.NET group, I found that some of the developers in the audience were shocked that there is another, better way to perform updates than one row at a time.

At its most brutal, the row-by-row solution might look something like the one shown in Listing 2. There is no aggregation going on at all. What we have here is basically an explicit cursor that runs through every single row in the dataset, one at a time, performing the necessary calculations and keeping track of the running totals in a temporary table.

--Variables

DECLARE @the_month DATETIME

DECLARE @last_month DATETIME

DECLARE @date_joined DATETIME

DECLARE @date_left DATETIME

DECLARE @outside_month DATETIME

 

--Table to hold results

CREATE TABLE #subscriptions

    (

      theMonth DATETIME ,

      PeopleJoined INT ,

      PeopleLeft INT ,

      Subscriptions INT

    )

 

--Set variables for processing

SELECT  @the_month = min(DateJoined) ,

        @last_month = max(DateJoined)

FROM    Registrations

 

SELECT @the_month = cast(year(@the_month)
                                AS VARCHAR) + '-'

       + cast(month(@the_month) AS VARCHAR) + '-1' ,

        @last_month = cast(year(@last_month)
                                AS VARCHAR) + '-'

       + cast(month(@last_month) AS VARCHAR) + '-1'

 

SET @outside_month = dateadd(M, 1, @last_month)

 

--Insert a row for each month in the report

WHILE @the_month <= @last_month

    BEGIN

        INSERT  INTO #subscriptions

                ( theMonth ,

                  PeopleJoined ,

                  PeopleLeft ,

                  Subscriptions

                )

        VALUES  ( @the_month ,

                  0 ,

                  0 ,

                  0

                )

        SET @the_month = dateadd(M, 1, @the_month)

    END

 

--Cursor to look at each row

DECLARE regs CURSOR FAST_FORWARD FOR

SELECT dateJoined, coalesce(dateLeft,@outside_month)

FROM Registrations

 

OPEN regs

FETCH NEXT FROM regs INTO @date_joined, @date_left

WHILE @@FETCH_STATUS = 0

    BEGIN

 

        SELECT  @date_joined = cast(year(@date_joined)
                                   
AS VARCHAR) + '-'

                + cast(month(@date_joined)
                                    AS VARCHAR) + '-1' ,

                @date_left = cast(year(@date_left)
                                  
 AS VARCHAR) + '-'

                + cast(month(@date_left)
                                    AS VARCHAR) + '-1'

 

   --Update every row that the subscription is valid

        UPDATE  #subscriptions

        SET     PeopleJoined = PeopleJoined + 1 ,

                Subscriptions = Subscriptions + 1

        WHERE   theMonth >= @date_joined

                AND theMonth < @date_left

 

   --Process calculations

        UPDATE  #subscriptions

        SET     PeopleLeft = PeopleLeft + 1

        WHERE   theMonth = @date_left  

 

        FETCH NEXT FROM regs INTO @date_joined, @date_left

    END

CLOSE regs

DEALLOCATE regs

 

--The report!

SELECT  *

FROM    #subscriptions

DROP TABLE #subscriptions

Listing 2: The RBAR solution

The term row-by-agonizing row (RBAR), first coined by Jeff Moden on SQLServerCentral.com, is apt here. This solution, while providing the correct results, runs in 7 seconds on 10,000 rows, on the hardware used to test the solutions in the contest. When applied to the 1 million row dataset, it took 13 minutes!

Imagine a box of marbles sitting in a box on a table. If you needed to move all the marbles to another box on the other side of the room without picking up either box, how would you do it? Would you pick up each marble, one at a time and march it across the room to the second box? Unless you were getting paid by the minute, you would probably scoop them all up and carry them all in one trip. You would avoid the marble-by-marble method, if you wanted to be efficient.

Faster, but Still Iterative

A kinder, though still iterative solution might take the following approach:

  • Create a temporary table (#subscriptions), load it with the DateJoined data from the Registrations table, and then aggregate on this column, by month and year, to get a count of the number of people who joined in each month
  • Create a CTE (cancellations), load it with the DateLeft data from the Registrations table, and then aggregate on this column by month and year to get a count of the number of people who left in each month, in each year. In lieu of a CTE, you could also use another temp table or a derived table.
  • Update the PeopleLeft column in #subscriptions with the number of people who left in each month, from the CTE
  • Use a cursor to loop through each month in the temporary table, calculating the required running total of subscribers

This solution is shown in Listing 3.

--variables

DECLARE @total INT

DECLARE @people_joined INT

DECLARE @people_left INT

DECLARE @the_month DATETIME

 

--create a table to hold the results

CREATE TABLE #subscriptions

    (

      theMonth DATETIME ,

      PeopleJoined INT ,

      PeopleLeft INT ,

      Subscriptions INT

    )

 

--insert a row for each month

--in the data

INSERT  INTO #subscriptions

        ( theMonth ,

          PeopleJoined ,

          PeopleLeft ,

          Subscriptions

        )

        SELECT  cast(year(DateJoined) AS VARCHAR) + '-'

                + cast(month(DateJoined) AS VARCHAR) + '-1' ,

                count(*) ,

                0 ,

                0

        FROM    Registrations

        GROUP BY cast(year(DateJoined) AS VARCHAR) + '-'

                + cast(month(DateJoined) AS VARCHAR) + '-1'

 

--update for cancellations

;

WITH    CANCELLATIONS

          AS ( SELECT   count(*) CANC_COUNT ,

                        cast(year(DateLeft) AS VARCHAR) + '-'

                        + cast(month(DateLeft) AS VARCHAR) + '-1' AS Dateleft

               FROM     Registrations

               WHERE    DATELEFT IS NOT NULL

               GROUP BY cast(year(DateLeft) AS VARCHAR) + '-'

                        + cast(month(DateLeft) AS VARCHAR) + '-1'

             )

    UPDATE  S

    SET     PeopleLeft = CANC_COUNT

    FROM    #subscriptions S

            INNER JOIN CANCELLATIONS C
               ON S.theMonth = Dateleft

 

SET @total = 0

--set up a cursor to update the total subscriptions

--for each month

DECLARE SUBSCRIPTIONS CURSOR FOR

SELECT THEMONTH, PEOPLEJOINED, PEOPLELEFT

FROM #SUBSCRIPTIONS ORDER BY THEMONTH

 

OPEN SUBSCRIPTIONS

FETCH NEXT FROM SUBSCRIPTIONS INTO @the_month, @people_joined,

   @people_left

WHILE @@FETCH_STATUS = 0

    BEGIN

 

        SET @total = @total + @people_joined -
                                  @people_left

        UPDATE  #subscriptions

        SET     Subscriptions = @total

        WHERE   theMonth = @the_month

 

        FETCH NEXT FROM SUBSCRIPTIONS
           INTO @the_month, @people_joined, @people_left

    END

CLOSE SUBSCRIPTIONS

DEALLOCATE SUBSCRIPTIONS

 

--the report

SELECT  *

FROM    #subscriptions

ORDER BY THEMONTH

 

DROP TABLE #subscriptions

Listing 3: The Easy, Iterative Solution

So, instead of looping through every row in the data, we aggregate the data to the month level, and then loop through each possible month in the report performing the calculations. We have a lot of date calculations going on in our INSERT and UPDATE statements. These date calculations are very expensive because multiple functions must operate on each date.

When tested, this solution took 360 milliseconds on 10,000 rows and three seconds on 1 million rows. While still not performing as well as the winning solution, it is a big improvement over the original row-by-row solution.

So, this solution, while still flawed, is at least one that will work adequately on a small report table. The other advantage is that the logic behind it is crystal clear, and it will be an easily maintainable solution. BUT…it doesn’t scale. As the results indicate, as more and more rows are added to the registrations table, the report will take longer and longer to run.

We could spend some time tweaking to improve performance, but let’s move on to an even better way.

The Fastest Solution: DATEDIFF, UNPIVOT, and a “Quirky Update”

The winning script in the competition, script "4e" submitted by Peter Larsson (Peso), is reproduced in Listing 4. As noted earlier, it performs so well that, against 10K rows it is almost un-measurably fast, and even when tested against a dataset of one million rows, it took just milliseconds to run. In other words, Peso’s solution is scalable!

/*****************************************************

    Peso 4e - 20091017

*******************************************************/

--A

CREATE TABLE #Stage

   (

     theMonth SMALLINT NOT NULL ,

     PeopleJoined INT NOT NULL ,

     PeopleLeft INT NOT NULL ,

     Subscribers INT NOT NULL

   )

--B

INSERT   #Stage

         ( theMonth ,

           PeopleJoined ,

           PeopleLeft ,

           Subscribers       

         )

--C

SELECT   u.theMonth ,

         sum(case WHEN u.theCol = 'DateJoined'
                          THEN u.Registrations

                  ELSE 0

             END) AS PeopleJoined ,

         sum(case WHEN u.theCol = 'DateLeft'
                          THEN u.Registrations

                  ELSE 0

             END) AS PeopleLeft ,

         0 AS Subscribers

--D

   FROM     (              

--E

      SELECT datediff(MONTH, 0, DateJoined) AS DateJoined ,

             datediff(MONTH, 0, DateLeft) AS DateLeft ,

             count(*) AS Registrations

           FROM   dbo.Registrations

           GROUP BY datediff(MONTH, 0, DateJoined) ,

                    datediff(MONTH, 0, DateLeft)

--F

          

            ) AS d
      UNPIVOT ( theMonth
                FOR theCol IN ( d.DateJoined, d.DateLeft )
               ) AS u

--G

   GROUP BY u.theMonth

--H

   HAVING   sum(case WHEN u.theCol = 'DateJoined'
                          THEN u.Registrations

                  ELSE 0

             END) > 0

--I        

DECLARE @Subscribers INT = 0 ;

--J           

;

WITH  Yak ( theMonth, PeopleJoined, PeopleLeft,
               Subscribers )

  AS (  

--K

       SELECT TOP 2147483647

               dateadd(MONTH, theMonth, 0) AS theMonth ,

               PeopleJoined ,

               PeopleLeft ,

               Subscribers

       FROM    #Stage

       ORDER BY theMonth

--L

           )

   --M

UPDATE   Yak

SET      @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft

--N

OUTPUT   inserted.theMonth ,

         inserted.PeopleJoined ,

         inserted.PeopleLeft ,

         inserted.Subscribers

--O      

DROP TABLE #stage

--P

Listing 4 Peso 4e

As you can see, we still have the temporary table and the CTE, but what goes on inside each is quite different. Also, this solution has no need for the dreaded cursor, although it does use a few slightly "unconventional" techniques to arrive at the answer. Let’s break this bad boy apart and see if we can learn something.

The overall strategy behind this solution is to pre-aggregate the data down to the rows required in the report while minimizing the date calculations. By doing so, the query engine makes just one pass through the initial data and works with as little data as possible when calculating the running total.

There are three major sections to the code:

  • Aggregation Process– the goal here is to reduce the working dataset, stored in a temp table called #Stage, to the minimum possible number of rows, so that date calculations and the subsequent running total calculation are inexpensive.
  • Running Total calculation – here, we run an UPDATE statement on a CTE to calculate the running total
  • Producing the report –  making clever use of the OUTPUT clause

The Aggregation Process

The essential goal of this step is the same in Peso's solution (Listing 4) as it was in the solution seen in Listing 3. We need to build a temp table that contains a row for each month and a count of the new subscriptions and cancellations, upon which to base the running total calculation. However, the way that each solution populates the temp table is very different.

Take a look at the table definition for #Stage table (lines A to B), and you will notice something unusual right away. Whereas in the previous solutions we used a conventional DATETIME type for the theMonth column, Peso uses an integer – more about this later. Go ahead and run the table creation statement (A to B).

The SELECT statement used to populate the table is an aggregate query based on the UNPIVOT operator. Run the SELECT statement without the INSERT (C to I) to see what will actually populate #Stage. Figure 3 shows the partial results. Instead of a date, theMonth contains a number from 1248 to 1316 along with the PeopleJoined and PeopleLeft column totals. The Subscribers column contains zeros at this point. This is the data that will be used in the second part of the solution. Except for having integers in the month, it looks very similar to the spreadsheet we created at the beginning of the article to validate the results.

Figure 3: The data used to populate #Stage

Let's now break this SELECT down, step-by-step to see how it works.

The Pre-aggregation Step

In Listing 3, in order to arrive at a temp table that contained a row for each month and a count of the new subscriptions and cancellations, we had to make two full passes through the Registrations table, once aggregating on the DateJoined column, and once on the DateLeft, performing some expensive date calculations as we went.

Using a "pre-aggregation" involving the DATEDIFF function, followed by a clever trick using the UNPIVOT operator, to normalize the data, Peso managed to take just one pass through the table and so minimize the impact of date manipulations.

In order to understand the final result set shown in Figure 3, we need to first  look at the query that populate the derived table, on which the UNPIVOT operator works. Run lines E to F of Listing 3 will return the contents of this derived table, as shown in Figure 4.

SELECT   datediff(MONTH, 0, DateJoined) AS DateJoined ,

         datediff(MONTH, 0, DateLeft) AS DateLeft ,

         count(*) AS Registrations

FROM     dbo.Registrations

GROUP BY datediff(MONTH, 0, DateJoined) ,

         datediff(MONTH, 0, DateLeft)

Figure 4: The heart of the unpivoted data

It may be hard to recognize the fact at first glance, but this is, in many respects, the critical moment in the script. In it, we make a single pass through the 10K (or 1 million) row base table, and "summarize" everything we need from it, in order to solve the reporting problem, into only 586 (or 1580) rows. All subsequent operations are performed on this limited working data set and so are relatively inexpensive.

If you look at the original data, you will see that a date can fall anywhere within a month. In order to efficiently eliminate the “day,” this query uses the DATEDIFF function to convert all dates to the number of months that have passed since the date represented by 0 (1900-01-01). So, for example, any date within January 2004 becomes 1248 (104*12). Instead of grouping by month and year, or converting all dates to the first of the month, this solution just converted all the dates to an integer. Eventually, as we perform the running total calculation in the second part of the solution, these integer values are converted back to dates, using DATEADD.

The query counts the number of rows grouped by the DateJoined and DateLeft calculated values. Therefore, all subscribers who joined and left at the same time are grouped together, or “connected”, as part of the pre-aggregation.

You may be wondering why the solution leaves the dates as integers in the INSERT statement and converts them back to dates at the end of the process. It would be simple to just apply a DATEADD function to add the months back in the first statement:

DATEADD(M,DATEDIFF(M,0,DateJoined),0)

This would convert all the dates to the first of the month at one time, but Peso chose to convert them back to dates later, during the running totals calculation. By doing so, the DATEADD function to convert back to date was performed on just 69 values instead of up to 20,000 values (the 10,000 rows times two dates for each row minus the NULLs)

The UNPIVOT Trick

The next step is to "normalize", or unpivot, the pre-aggregated data. This is the step that means we can avoid any further reads of the base table, because it gives us all the data we need calculate the monthly number of subscribers.

Let’s look at the UNPIVOT part of this statement (F to G), by running the query in Listing 5, which temporarily removes the GROUP BY and HAVING clauses and replaces the SELECT list found in the original solution.

SELECT   *

FROM     ( SELECT datediff(MONTH, 0, DateJoined) AS DateJoined ,

                  datediff(MONTH, 0, DateLeft) AS DateLeft ,

                  count(*) AS Registrations

           FROM   dbo.Registrations

           GROUP BY datediff(MONTH, 0, DateJoined) ,

                  datediff(MONTH, 0, DateLeft)

         ) AS d UNPIVOT        

            ( theMonth FOR theCol IN ( d.DateJoined, d.DateLeft ) ) AS u

Listing 5: The un-aggregated UNPIVOT query

Figure 5 shows the data after the pre-aggregation together with the unpivoted results:

Figure 5: Before and after unpivoting

As you can see, the UNPIVOT operator, introduced with SQL Server 2005, takes column headings and turns them into data. This is the opposite of the PIVOT function that turns data into column headings.

NOTE:
Pre SQL Server 2005, we could have used a derived table with UNION ALL to mimic the UNPIVOT functionality. However, that would have led to two passes on the source table, and not just one.

In this case, the DateJoined and DateLeft column names now become values in the theCol column. The integer values stored in DateJoined and DateLeft become individual entries in a new column, theMonth.  Now, each row contains a number representing the month, a number of registrations, and a value showing whether the data represents a new subscription or a cancellation.

Now we have a list of registrations by month and action (see Figure 6). If you take a closer look at the data after the UNPIVOT is applied, you will see that we have multiple rows for each theMonth and theCol combination. For example, Figure 6 shows two rows for month 1306 and DateLeft. The duplicates exist because the initial aggregation grouped the data on the combination of DateJoined and DateLeft. Since we have split those values apart, there are now duplicates and further aggregation must be done.

Figure 6: The UNPIVOT results

Repivot and Aggregate

Moving out a bit more, let’s look at how we re-pivot the data and perform the final aggregation. The SELECT list (C to D) creates a sum of PeopleJoined and a sum of PeopleLeft grouped by theMonth (G to H). The code to focus on here is shown in Listing 6.

SELECT   u.theMonth ,

         sum(case WHEN u.theCol = 'DateJoined'
                     THEN u.Registrations

                  ELSE 0

             END) AS PeopleJoined ,

         sum(case WHEN u.theCol = 'DateLeft'
                     THEN u.Registrations

                  ELSE 0

             END) AS PeopleLeft ,

         0 AS Subscribers

--D

FROM     ( --------UNPIVOTED RESULTS-------- ) AS u

--E

--G

GROUP BY u.theMonth

   HAVING  sum(case WHEN u.theCol = 'DateJoined'
                         THEN u.Registrations

                         ELSE 0

                    END) > 0

Listing 6: Using CASE to pivot

The CASE function in the SELECT list determines which rows to count for PeopleJoined and which to count for PeopleLeft. If theCol equals DateJoined, then the row belongs in PeopleJoined. If theCol equals DateLeft, then the row belongs to PeopleLeft. For those of you familiar with the PIVOT function, you probably know that using CASE is another way to pivot data. In this case, the query “pivots” the theCol column to create a total for PeopleJoined and PeopleLeft for each month. Peso could have chosen to use PIVOT instead, but he found that the CASE method was slightly faster. Figure 7 shows how the data looks after pivoting with CASE.

Figure 7: After aggregating

Finally, the HAVING clause (H to I) filters out any rows with no new subscriptions. In the 10,000 rows, all months after September 2009 have cancellations, but no new subscriptions, and future months are not part of the report. Since HAVING filters rows out after aggregates are applied, the CASE function in the WHERE clause returns the total number of new subscriptions after grouping on theMonth. If a month contains only cancellations, as in the months after September 2009, the CASE function returns zero, and the HAVING clause filters those rows out.

We now have 69 rows which is also the number of rows (months) that will appear in the report. For each month (still represented by the integer result of DATEDIFF), we know the number of subscribers and the number of cancellations. The only thing missing is the number of subscriptions, which is still set at zero. Peso has managed to get to this point with only one pass through the data by his masterful pre-aggregation, followed by use of UNPIVOT and CASE.

Go ahead and run lines B to I to populate #Stage.

Calculating the Running Total

The next part of the solution calculates the required running total of subscribers via a "quirky" update command operating on a common table expression (CTE) (lines J to O).

Again, let's walk through it step-by-step.

Use a Variable to help you Count: no Cursors!

So far, the solution has manipulated our data to get it ready for calculating the running total. Instead of using a while loop or a cursor to calculate this total, a variable @Subscribers is defined and set to zero (I to J). Initializing a variable on the same line that the variable was defined is new to SQL Server 2008. If you are using SQL Server 2005, you will have to change this line:

DECLARE @Subscribers int = 0

Into these two lines

DECLARE @Subscribers INT
SET @Subscribers = 0

Even though we are not using any kind of loop, behind the scenes SQL Server updates one row at a time. Luckily, the pre-aggregation used before getting to this point has transformed the 10,000 original rows into 69, and the implicit cursor that SQL Server uses is faster than an explicit cursor.

The Ordered CTE

CTEs were introduced with SQL Server 2005. At a minimum, you can use CTEs to replace temp tables and views or to separate out a part of the logic of your query. CTEs also have some special uses, for example, for performing recursive queries.

Inside the CTE, we convert back to "normal" dates, and we do this by loading the data from our staging table into a CTE, using the DATEADD function as we go. Since DATEADD is used just once for this small number of rows, it is not that expensive.

If you run just the query inside the CTE definition (K to L), you will see that now theMonth has been converted back into a date and that the data is sorted. Only the running total of subscribers is missing. Figure 8 shows the partial results.

SELECT TOP (2147483647)

         dateadd(MONTH, theMonth, 0) AS theMonth ,

         PeopleJoined ,

         PeopleLeft ,

         Subscribers

FROM     #Stage

ORDER BY theMonth

Figure 8: The CTE results

Normally, you cannot order data inside a CTE. One way around this rule is to use the TOP keyword, which allows you to retrieve a number or percentage of rows. In order to control which rows a query returns when using TOP, you can use ORDER BY to sort the data. In this case, Peso chose to use the largest possible integer, 2147483647, to make sure that all the rows will be returned.

The Quirky Update

You can use a CTE in one statement to select, update, delete, or insert data. This is the first time that I have seen the UPDATE statement update the data in the CTE itself (M to N) and not just another table joined to the CTE, though there is nothing stopping you from doing so as long as no rules for updating data are violated.  The statement actually updates the #Stage table, which is the basis for the CTE. The UPDATE statement includes an OUTPUT clause (N to O) which we will discuss shortly.

The SET clause of the UPDATE statement has an interesting formula:

SET @Subscribers = Subscribers = @Subscribers + PeopleJoined – PeopleLeft

For each row of our data, the Subscribers column is set to the current value of the variable @Subscribers and then increased or decreased based on what happened that month. The value stored in Subscribers, once it is calculated, is saved in the variable. The variable is then available for the next row of data.

This style of update is unofficially referred to as a “quirky update”. It refers to the ability to update variables and columns in a table simultaneously and in an ordered fashion. Here, the ordering is guaranteed by the ORDER BY in our CTE. If this technique were applied to a permanent table the ordering would be established by the table's clustered index; the update would be done according to the clustered index, but would fail of the table was partitioned, or if parallelism occurred. In any event, it is very important that the calculation is done in the correct order, and that order is guaranteed in this example.

Since the #Stage table is a heap we won’t have problems with the ordering being affected by a clustered index. We won’t get parallelism either because the data is now only 69 records, and holds much less data than a page. Finally, partitioning is not possible on temporary tables.

Producing the Report: A clever use for OUTPUT

The UPDATE statement populates the Subscribers column of the #Stage table. After the update, if we just select the current contents of #Stage, we have the report.  Instead of doing that, Peso chose to use an OUTPUT clause, another relatively new T-SQL feature, introduced with SQL Server 2005. The OUTPUT clause can be used along with any data manipulation statement. It provides deleted and inserted tables, just like triggers do. The deleted table returns the data before the transaction. It can be data that is deleted or the data before it is updated. The inserted table returns the data after the transaction. It can be data that is inserted or data after the update. In this case, the inserted table returns the data how it looks after the update. By using this technique, we avoided accessing the table again to display the data.

Run the last part of the script (I to P) to see that the report works. Figure 9 shows the partial results.

Figure 9: The report

One Small Hole

It was stated as part of the competition rules that it could be assumed that all months had at least one new subscriber.

However, it's worth noting that if there was a particular month that had no new subscriptions, then that month would not be included in the results. In fact, if a month in the middle of the data had no new subscriptions but had some cancellations, the report will not count the cancellations, and the report will be incorrect. To demonstrate this, change the DateJoined values to 2004-07-01 in the rows where the DateJoined values are within June, 2004 in the Registrations table. Then run the script again. You will see that the report no longer includes the cancellation and all the subsequent values are incorrect.

Fortunately, this problem is easily overcome, via something like an outer join to a sequence or numbers table (see, for example, http://weblogs.sqlteam.com/peterl/archive/2009/11/03/Superfast-sequence-generators.aspx)

Not Quite as Fast, but a Bit More Conventional

Not all queries and scripts require that you take the time to tune them to be fastest they can possibly be. The performance of a query that runs once a month does not matter as much as a query that runs 1000 times a minute. Of course, as you learn more, you will automatically incorporate better performing techniques into your code.

In the real world, there is a balance to be struck between maintainability and outright speed. However, once you've started to understand the considerations that are involved in getting the code to run as fast as possible, you can then decide to "back out" to what might be a slightly slower but more maintainable solution. Alternatively, you might decide that the time required to get the code as fast as possible is more time than you have to spend.

So, for example, if we had established Peter's solution (Listing 4) as the fastest, but felt that we were willing to sacrifice some of the speed in return for ease of understanding and maintainability, then we might opt for a compromise solution, such as that shown in Listing 7. It is based on the "faster but still iterative" solution from Listing 3, but borrows Peso’s DATEDIFF technique to reduce the number of required date calculations.

--variables

DECLARE @total INT

DECLARE @people_joined INT

DECLARE @people_left INT

DECLARE @the_month INT

--create a table to hold the results

CREATE TABLE #subscriptions

   (

     theMonth INT ,

     PeopleJoined INT ,

     PeopleLeft INT ,

     Subscriptions INT

   )

--insert a row for each month

--in the data

INSERT   INTO #subscriptions

         ( theMonth ,

           PeopleJoined ,

           PeopleLeft ,

           Subscriptions

         )

         SELECT   datediff(M, 0, DateJoined) ,

                  count(*) ,

                  0 ,

                  0

         FROM     Registrations

         GROUP BY datediff(M, 0, DateJoined)

 

--update for cancellations

;

WITH  CANCELLATIONS

        AS ( SELECT  count(*) CANC_COUNT ,

                     datediff(M, 0, DateLeft) AS
                                             
 DateLeft

             FROM    Registrations

             WHERE   DATELEFT IS NOT NULL

             GROUP BY datediff(M, 0, DateLeft)

           )

   UPDATE   S

   SET      PeopleLeft = CANC_COUNT

   FROM     #subscriptions S

            INNER JOIN CANCELLATIONS C ON S.theMonth = Dateleft

SET @total = 0

--set up a cursor to update the total subscriptions

--for each month

DECLARE SUBSCRIPTIONS CURSOR FOR

SELECT THEMONTH, PEOPLEJOINED, PEOPLELEFT

FROM #SUBSCRIPTIONS ORDER BY THEMONTH

OPEN SUBSCRIPTIONS

FETCH NEXT FROM SUBSCRIPTIONS INTO @the_month,

      @people_joined, @people_left

WHILE @@FETCH_STATUS = 0

   BEGIN

      SET @total = @total + @people_joined - @people_left

      UPDATE   #subscriptions

      SET      Subscriptions = @total

      WHERE    theMonth = @the_month

      FETCH NEXT FROM SUBSCRIPTIONS

      INTO @the_month, @people_joined, @people_left

   END

CLOSE SUBSCRIPTIONS

DEALLOCATE SUBSCRIPTIONS

--the report

SELECT   dateadd(M, theMonth, 0) ,

         PeopleJoined ,

         PeopleLeft ,

         Subscriptions

FROM     #subscriptions

ORDER BY THEMONTH

DROP TABLE #subscriptions

Listing 7: The compromise solution

When this "compromise" solution was tested, it ran in 74 ms against 10,000 rows and in 847 ms against 1 million rows. Still quite a bit slower than Peso's solution, but not bad! Even though it uses a cursor for the running the total calculations, the population of the temp table has been improved, and the performance is acceptable.

The important point to remember is that, really, it is the aggregation process that is the most critical. Once you are working with a small data set, the exact manner in which the running total calculation is performed becomes less of an issue, although it will obviously still have some impact.

Table 2 shows how the four solutions stack up:

Solution

10,000 Rows

1 Million Rows

Row-by-row (Listing 2)

7 seconds

13 minutes

Faster but still iterative (Listing 3)

360 milliseconds

3 seconds

Peso  (Listing 4)

0 milliseconds (too fast to measure)

300 milliseconds

Not quite as fast, but conventional (Listing 7)

74 milliseconds

840 milliseconds

Summing Up

The real proof of performance to Peso’s brilliant solution is that, even when Phil Factor tested it against one million rows, the report ran in milliseconds. Here are some of the important points:

  • Avoid row-by-row processing, unless on just a handful of rows
  • Pass through the data as few times as possible, preferably just once
  • Use DATEDIFF to convert the date to an integer if the day of the month is not needed
  • Minimize calculations whenever possible by pre-aggregating to a small number of rows first
  • Use UNPIVOT and CASE to realign the columns
  • Use a variable, instead of a cursor, to calculate running totals
  • The OUTPUT clause can be used to display data during an update instead of running an additional SELECT statement

At the very least, I hope you have learned that there can be many ways to come up with the correct answer, but not all solutions are created equal.

The fastest solution presented in this article has reached perfection in terms of performance and scalability. Is it always worth attempting to achieve perfection? Even the SQL Server query optimizer comes up with a plan that is “good enough” and not necessarily perfect. But some techniques perform so poorly on large datasets that learning how to avoid them will pay tremendous dividends.



This article has been viewed 7569 times.
Kathi Kellenberger

Author profile: Kathi Kellenberger

Kathi Kellenberger, SQL Server MVP, is a database administrator for Bryan Cave LLP, a law firm headquartered in St. Louis, MO. She is author of "Beginning T-SQL 2008" (Apress 2009) and co-author of "Professional SQL Server 2005 Integration Services" (Wrox 2006). Kathi enjoys speaking and writing about SQL Server and teaches the occasional SQL Server class. When she is not busy working with SQL Server, you might find her singing at the local karaoke bar or climbing the stairs in tall buildings.

Search for other articles by Kathi Kellenberger

Rate this article:   Avg rating: from a total of 32 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Great Analysis
Posted by: Bill (view profile)
Posted on: Thursday, February 04, 2010 at 9:18 AM
Message: This article is an excellent breakdown of a complex solution. I'm sure that gurus such as Jeff Moden, Phil Factor and Joe Celko can look at Peso's SQL and immediately understand it. For the cranium-challenged such as myself, this article is a must read to understand the power of set-based solutions.

Subject: Brilliant written article...
Posted by: DeafProgrammer (view profile)
Posted on: Thursday, February 04, 2010 at 9:46 PM
Message: I have been writing SQL for one and half decades. I have been struggling to explain to the ICT contractors and developers that the procedural language such as T-SQL is the only solution that works with sets of data, not in the application side that involves a lot of OOP stuff and the likes. To achieve the best "set-based" solution is to write the most efficient SQL code in which you called it as “fast SQL code”. I have to agree of what you wrote in a well excellent written article, however, I am still studying of how to write “fast SQL code” that makes the code more maintainability and readability.

Keep it going mate!

Subject: Brilliant analysis and a real pleasure to read
Posted by: Jeff Moden (view profile)
Posted on: Friday, February 05, 2010 at 1:57 PM
Message: The subject line above says it all. Incredible article about some incredible code. "Well done" to you and Peso both!

Subject: Fantastic
Posted by: laerte (view profile)
Posted on: Saturday, February 06, 2010 at 2:33 PM
Message: The solution, as our friends said, was brilliant. And the explanation even better. Congratulations to both. :-)

Subject: Article
Posted by: Michael Justus (not signed in)
Posted on: Sunday, February 07, 2010 at 4:22 AM
Message: Seeing the solution given by Peso, it's true what you mentioned that to most developers, SQL is not their native language; but your summary at the end of the article certainly sums up the mentality we should employ when writing any sort of code: minimize data access as much as possible.

And the results do speak for themselves.

Your article was an informative read!

Subject: Wonderful explanation of excellent techniques
Posted by: Tim (view profile)
Posted on: Sunday, February 07, 2010 at 12:47 PM
Message: I remember reading about this speed challenge when it was going on but never took the time to dissect the winning solution. So reading your step by step explanation was very much appreciated! I learned a lot of new techniques that I can use in other aspects of my job. In particular, after being stuck on SQL Server 2000 for so long I didn't even know about the PIVOT and UNPIVOT operators. Now that I understand them I have a new tool in my arsenal.

Subject: Selecting just the month and year
Posted by: Anonymous (not signed in)
Posted on: Sunday, February 07, 2010 at 6:38 PM
Message: I'm a newbie at this, but what about using convert(varchar(6), startdate, 112) to get the joinMonth?? This would negate the need to re-convert the integer back to a meaningful month

Subject: thankyou
Posted by: Hayden (not signed in)
Posted on: Sunday, February 07, 2010 at 11:08 PM
Message: What a well structured article, thanks very much for this. I'll ensure I frequently refer back to this for set-based wisdom.

Subject: Selecting just the month and year
Posted by: gianluca.sartori (view profile)
Posted on: Monday, February 08, 2010 at 2:41 AM
Message: [...] what about using convert(varchar(6), startdate, 112) to get the joinMonth?? [...]

The reason is that integers get aggregated faster than varchars.

Subject: Quirky Update is Flawed
Posted by: TheSQLGuru (view profile)
Posted on: Monday, February 08, 2010 at 4:56 AM
Message: It has been shown on numerous occassions that the "quirky update" is NOT GUARANTEED TO PRODUCE THE CORRECT RESULTS!!! I see no disclaimer to that effect, which is a horrible oversight!! If I did miss such, it should be in big red 72 point capital underlined letters. People using this can get the WRONG answers and never know it.

Subject: Prove me wrong
Posted by: Peso (view profile)
Posted on: Monday, February 08, 2010 at 6:09 AM
Message: TheSQLGuru, this is not the traditional "quirky update" (once coined by Barry Young).
It is a normal CTE which is documented in Books Online here http://msdn.microsoft.com/en-us/library/ms175972.aspx.
The UPDATE syntax using variables and columns is documented here http://msdn.microsoft.com/en-us/library/ms177523.aspx.

You are more than welcome to prove me wrong. If that should happen, please use Kathi's version with cursor at the end.


Subject: 'Quirky update'
Posted by: Phil Factor (view profile)
Posted on: Monday, February 08, 2010 at 6:13 AM
Message: I followed the debate on SSC with great interest, and read several blogs on the subject. Robyn Page invented actually the 'Quirky' name to describe its' fragility, as well as its oddness. (Quirky Update is the use of an accumulating variable to update a column during an update, relying on the order of update). Various MVPs have stated that it shouldn't be used because is it is 1/undocumented and 2/unsupported. It is by no means a new technique. As far as I know, it has been used in all versions of SQL Server and Sybase. I first saw it used on Sybase! I have to admit that I'd prefer to have explicit details of where it breaks. I've never had an occasion myself where a system that uses it suddenly breaks. You can certainly break it by adding an index, but one normally uses this technique on a temporary table over which one has a high level of control.

If you don't like 'Quirky' there are plenty of techniques in the Speed Phreak challenge that don't use it.

Subject: Quirky Update
Posted by: Peso (view profile)
Posted on: Monday, February 08, 2010 at 6:25 AM
Message: Itzik Ben-Gan recently posted an excellent article about the matter here
http://www.sqlmag.com/Article/ArticleID/103533/sql_server_103533.html

Subject: Breaking Issues for Quirky Update
Posted by: TheSQLGuru (view profile)
Posted on: Monday, February 08, 2010 at 7:08 AM
Message: 1) http://www.sqlservercentral.com/Forums/Topic802558-203-3.aspx last post
2) http://www.sqlservercentral.com/Forums/Topic802558-203-11.aspx last post
(I believe there were other issues brought up in that thread I don't list here)
3) parallelized operations
4) enterprise "merry-go-round" in-progress-scan join-ups. We were unable to actually prove this one but it seems logical that it would cause problems.
5) It is not guaranteed to work by Microsoft. Further discussion on this is prevented due to NDA.
6) Even if, as Jeff Moden has done, you come up with rules under which you or others could NOT break it, that list is subject to modification when something pops up that requires a new rule. yet all those people that have been using it will almost certainly not acquire the new rule and retrofit it to all their existing code.

Subject: download is not case sensitive compatible
Posted by: Anonymous (not signed in)
Posted on: Monday, February 08, 2010 at 7:24 AM
Message: Why are scripts never written for case sensitive databases? Commence find and replace on the download.....

Subject: case sensitive scripts
Posted by: TheSQLGuru (view profile)
Posted on: Monday, February 08, 2010 at 8:39 AM
Message: this is probably not done because a) I would venture that a very small fraction of installed databases are case sensitive and b) even if you have a server that is, can't you build a database with a case insensitive collation for testing?

Subject: Quirky
Posted by: Kathi (not signed in)
Posted on: Monday, February 08, 2010 at 9:05 AM
Message: Tony and I debated on whether we should use the term "quirky" or not but went with it in the end. Once you get the data aggregated down to just a handfull of rows, it doesn't make as much difference how you do the last step. The real beauty of the winning entry is the "one pass" pre-aggregation.

Subject: Brilliant
Posted by: MrDee (view profile)
Posted on: Monday, February 08, 2010 at 2:14 PM
Message: Excellent solution explained extremely well.
I will be recommending this article to my fellow workers.

Subject: Breaking issues
Posted by: Peso (view profile)
Posted on: Monday, February 08, 2010 at 2:15 PM
Message: I am happy you mentioned the discussion here http://www.sqlservercentral.com/Forums/Topic802558-203-3.aspx, TheSQLGuru. As written by Hugo, the ordered update can only work if a) or b) is satisfied. If you look at the execution plan for the second query of my solution, you can see that a) is satisfied.
Also, according to http://www.sqlservercentral.com/Forums/Topic802558-203-11.aspx, my solution has no WHERE clause.

Subject: Breaking Issues
Posted by: TheSQLGuru (view profile)
Posted on: Monday, February 08, 2010 at 2:45 PM
Message: Still a bad idea:

1) You cannot guarantee that you (nor some other machine with different settings, configuration, data, etc) always get the type of plan you currently get

2) There are other things than what Hugo posted that can break it

3) Microsoft states the order isn't guaranteed

4) If you address and/or caveat all the known points that currently can break it that still leaves someone finding yet another scenario where it breaks and people not getting the new rule.

Besides, people will take this type of logic and use it where it is GUARANTEED to give them the WRONG ANSWERS but they won't know it.

Signing off from this one since we appear to have to agree to disagree. Sorry, but I must state without reservation that it is inappropriate and I believe irresponsible to promulgate this type of processing at this time.

Subject: Took only 16 Ms
Posted by: Anonymous (not signed in)
Posted on: Monday, February 08, 2010 at 3:36 PM
Message: select * from Registrations

declare @starTime datetime
declare @minDate datetime
declare @runTotal int
select @runTotal = 0
select @starTime = GETDATE()

select @minDate = MIN(DateJoined)from Registrations

SELECT
DATEDIFF(MM,@minDate,Datejoined) as s1,
COUNT(*) as counts,
@runTotal as runningTotal
INTO #1
FROM
Registrations
GROUP BY DATEDIFF(MM,@minDate,Datejoined)
order by DATEDIFF(MM,@minDate,Datejoined) asc

select
DATEDIFF(MM,@minDate,DateLeft) as s1,
COUNT(*) as counts,
@runTotal as runningTotal
INTO #2
FROM Registrations
where DateLeft is not null
GROUP BY DATEDIFF(MM,@minDate,DateLeft)
order by DATEDIFF(MM,@minDate,DateLeft) asc



update #1
set @runTotal= runningTotal = @runTotal + c.counts - ISNULL(#2.counts,0)
from #1 as c
left outer join #2 on c.s1 = #2.s1



SELECT
DATEADD(MM,#1.s1,@minDate) as Monthly ,
#1.counts as NewJoins ,
#2.counts as LeftUs,
#1.runningTotal
FROM
#1
LEFT JOIN #2 on #1.s1 = #2.s1

order by runningTotal Asc

select DATEDIFF(mcs,@starTime,getdate())

Subject: what if using a CTE for #stage?
Posted by: Anonymous (not signed in)
Posted on: Monday, February 08, 2010 at 5:06 PM
Message: Thanks for the article, definitely worth reading.

I would appreciate if somebody explain what happens if we use a CTE instead of the #Stage table in the solution "Peso 4e"? Will it be possible at all and how performance will be affected?

Just curicity from my side, and general perseption that keeping everything in memory (avoiding writing to the tempdb) is beneficial.

Appreciate your thoughts.

Subject: Conditions for Quirky Update to fail
Posted by: puzsol (view profile)
Posted on: Monday, February 08, 2010 at 5:53 PM
Message: It's not just that the CTE has too few rows to use a parallel operation... in my experience (at least on SQL 2005) table variables (and by extension CTEs?) will never use parallel operations. Spent a bit of time converting all table variables to temporary tables for just that reason!

If my assertion is still correct (and I suspect it is, and will remain so), then all three criteria are met by using a CTE/table variable:
- it is ordered
- it will not parellelise itself (no matter how big)
- it will not partition.

So it won't break in this use.... just because someone could misuse a technique doesn't mean you shouldn't use it. Hey I've seen insert/update triggers written as if they were for only one row... so should we stop using triggers altogether or is that a different discussion?

Subject: what if using a CTE for #stage?
Posted by: Anonymous (not signed in)
Posted on: Monday, February 08, 2010 at 5:57 PM
Message: Thanks for the article, definitely worth reading.

I would appreciate if somebody explain what happens if we use a CTE instead of the #Stage table in the solution "Peso 4e"? Will it be possible at all and how performance will be affected?

Just curicity from my side, and general perseption that keeping everything in memory (avoiding writing to the tempdb) is beneficial.

Appreciate your thoughts.

Subject: Conditions for Quirky Update to fail
Posted by: puzsol (view profile)
Posted on: Monday, February 08, 2010 at 6:13 PM
Message: It's not just that the CTE has too few rows to use a parallel operation... in my experience (at least on SQL 2005) table variables (and by extension CTEs?) will never use parallel operations. Spent a bit of time converting all table variables to temporary tables for just that reason!

If my assertion is still correct (and I suspect it is, and will remain so), then all three criteria are met by using a CTE/table variable:
- it is ordered
- it will not parellelise itself (no matter how big)
- it will not partition.

So it won't break in this use.... just because someone could misuse a technique doesn't mean you shouldn't use it. Hey I've seen insert/update triggers written as if they were for only one row... so should we stop using triggers altogether or is that a different discussion?

Subject: It didn't break... it was forced to not work.
Posted by: --Jeff Moden (not signed in)
Posted on: Monday, February 08, 2010 at 7:21 PM
Message: >>>There are other things than what Hugo posted that can break it

Hugo posted things that wouldn't allow it to work to begin with. The rest of it, I posted to show that my poor explanation of why it worked and when it won't needed to be changed.

So far as "could break" when a future CU or SP comes out, so what? That could happen with anything and if you don't regression test before such installations, you're begging for trouble. An example of a change that was not on the deprecation list that broke a lot of code was when they changed the very heavily documented sp_MakeWebTask to require "SA" privs in SQL Server 2000 SP4. Although Microsoft is pretty good about following "the plan" in the form of deprecation lists, Microsoft can, has, and will continue to change anything they need to without warning... documented or not.

Subject: what if using a CTE for #stage?
Posted by: Anonymous (not signed in)
Posted on: Monday, February 08, 2010 at 8:52 PM
Message: Thanks for the article, definitely worth reading.

I would appreciate if somebody explain what happens if we use a CTE instead of the #Stage table in the solution "Peso 4e"? Will it be possible at all and how performance will be affected?

Just curiosity from my side, and general perception that keeping everything in memory (avoiding writing to the tempdb) is beneficial.

Appreciate your thoughts.

Subject: CTE instead of #Stage
Posted by: Anonymous (not signed in)
Posted on: Tuesday, February 09, 2010 at 2:21 PM
Message: "I would appreciate if somebody explain what happens if we use a CTE instead of the #Stage table in the solution "Peso 4e"? Will it be possible at all and how performance will be affected?"

You can try it, but I don't think you can nest a CTE inside another CTE which is what you would have to do if I understand your proposal.

Subject: Temp Tables can live in memory, too...
Posted by: --Jeff Moden (not signed in)
Posted on: Tuesday, February 09, 2010 at 6:58 PM
Message: -----------------------------------------------
>>> "Just curicity from my side, and general perseption that keeping everything in memory (avoiding writing to the tempdb) is beneficial."
-----------------------------------------------
Using a Temp table doesn't mean it's not in memory. The first place a Temp table will try to live is the same place a Table variable or the result set of a CTE may try to live... memory. When a Table Variable or Temp Table get to big for memory, they both do the same thing... start using disk space in TempDB.

Subject: Ordered CTE
Posted by: Plamen (view profile)
Posted on: Tuesday, February 09, 2010 at 9:44 PM
Message: The concept of ordered CTEs seems very interesting... Can you explain the relational theory background behind that? Is there a guarantee that we can create ordered views (since we can base views on ordered CTEs)?

Subject: puzsol
Posted by: Peso (view profile)
Posted on: Wednesday, February 10, 2010 at 7:50 AM
Message: Table variable operations can be parallellized, but ONLY for using them directly, or in a JOIN, for a SELECT query.
For all other queries (such as INSERT, DELETE and UPDATE), parallellism is prohibited for table variables.

Subject: Plamen
Posted by: Peso (view profile)
Posted on: Wednesday, February 10, 2010 at 7:54 AM
Message: No, I can't.
However, the Ordered CTE Update will fail if there is a clustered index on the underlying table, which is the exact opposite for the traditional "Quirky Update" which must have a clustered index to work.
Looking at the execution plans for the Ordered CTE (which is based on a heap in my suggestion) reveals the table scan is ordered and without any of the pitfalls the traditional "Quirky update" has; parallellism and partitioning. And to add, the WHERE and join clauses displayed by Hugo Kornelis.

Subject: I wrote a longer rant here
Posted by: Peso (view profile)
Posted on: Wednesday, February 10, 2010 at 8:57 AM
Message: http://weblogs.sqlteam.com/peterl/archive/2010/02/10/Unsupported-andor-undocumented-features.aspx

Subject: Ordered table expressions
Posted by: Plamen (view profile)
Posted on: Wednesday, February 10, 2010 at 9:27 AM
Message: Peter,
You misunderstood my question. Leave aside the undocumented "quirky update"... The point is about trying to tell people that there is "ordered table expression". I don't want to look at the execution plan, execution plans change by day and night, with or without the next update. But some fundamentals do not change and people should be told what is right and what is wrong.

Subject: Ordered CTEs
Posted by: Plamen (view profile)
Posted on: Wednesday, February 10, 2010 at 1:12 PM
Message: Here is a statement by Conor Cunningham (Principal Software Architect, SQL Server Engine, Microsoft) on the topic of "ordered CTEs":

"There are no guarantees on rows coming out of a CTE in any order from the Query Optimizer in any case (not just updates)."

Subject: Ordered CTEs
Posted by: sqlPac (view profile)
Posted on: Wednesday, February 10, 2010 at 3:32 PM
Message: The underlying issue with the solution seems to be its dependence on ordered processing. If you remove the ordered processing from the equation your solution is still fast, but it will throw a second table scan in there (which appears to be what you're trying to eliminate).

Of course the table scans are performed on your #Stage table which you used UNPIVOT to reduce in size significantly (10,000 rows to 69 rows -- the really clever bit). Even with a million rows you're probably not going to have more than a few hundred rows after you unpivot. For example, 25 solid years of monthly aggregated data is only 300 rows worth.

Not a huge performance hit, but you can also declare a clustered index/pk on theMonth column to get an index scan + seek. The main thing is it eliminates the reliance on undocumented/poorly documented features that just happened to work this time.

SELECT dateAdd(Month, theMonth, 0) AS theMonth,
PeopleJoined,
PeopleLeft,
(
SELECT SUM(PeopleJoined) - SUM(PeopleLeft)
FROM #Stage s2
WHERE s2.theMonth <= s1.theMonth
) AS Subscribers
FROM #Stage s1;

Subject: sqlPac
Posted by: Peso (view profile)
Posted on: Thursday, February 11, 2010 at 5:22 AM
Message: You are right about the clever part :-)
I posted two other solutions at the original competition. Using the pre-aggregation technique and then
1) CURSOR for running total (550 ms)
2) "Triangular join" for running total (450 ms)

for 1000000 sample records.



Subject: Ahem.
Posted by: Phil Factor (view profile)
Posted on: Thursday, February 11, 2010 at 11:14 AM
Message: As far as I remember, almost all the well-performing solutions in the three challenges so far used perfectly conventional and well-supported TSQL. Quirky update is a bit of a distraction as it doesn't give a clear performance improvement over other solutions. It doesn't get used that much in the challenges.

Incidentally, there is a new challenge out now http://cli.gs/GphQGM so now is your chance to come up with a solution.

Subject: Using a CTE instead of #stage
Posted by: Kathi (not signed in)
Posted on: Thursday, February 11, 2010 at 8:26 PM
Message: I think that the only way to replace #stage with a CTE is to combine it with the triangular join method of calculating the running totals. At least on my laptop, it was taking a really long time to run, so I just stopped it.

You can't combine a CTE/stage replacement with any of the other methods because you have to do an update to something and there wouldn't be anything able to update.

Subject: The quirky update firestorm
Posted by: Kathi (not signed in)
Posted on: Thursday, February 11, 2010 at 8:35 PM
Message: My intenion in writing this article was to explain just what Peso (Peter Larsson) did to create this report so quickly. I have to admit that until I took the time to really dig in, I had no idea what he was doing. The point was to see what the rest of us can learn from SQL gods like Peter.

So, I think that the big controversy is can we really rely on the TOP / ORDER BY in a CTE. From what I understand is that we really can't.

The other possible solution is the triangular join which was not mentioned in the article. This is not a good solution on a large number of rows, but, just like the cursor, isn't too bad after all the pre-aggregation was done. Again, Peter's brilliant use of UNPIVOT, CASE, and DATEDIFF to aggregate the rows to the number needed in the final report.

Subject: dataentryjob
Posted by: Anonymous (not signed in)
Posted on: Saturday, February 13, 2010 at 10:03 AM
Message: I recently came across your article and have been reading along. I thought I would leave my first comment. I don't know what to say except that I have enjoyed reading. Nice article. I will keep visiting Simple-Talk very often.

Lucy

Subject: Optimized Peso solution
Posted by: LesCardwell (view profile)
Posted on: Monday, February 22, 2010 at 4:27 PM
Message: FWIW, by using the Quirky Update, and adding a bit of abstraction, I was able to simplify the code and reduce Peso's time. For 1-mill records it processes in 60ms...

/****** Object: Table [dbo].[Registrations] Script Date: 02/22/2010 14:24:54 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Registrations](
[Registration_ID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](80) NOT NULL,
[LastName] [varchar](80) NOT NULL,
[DateJoined] [datetime] NOT NULL,
[DateLeft] [datetime] NULL,
[YrMoJoined] AS (CONVERT([int],CONVERT([varchar](4),datepart(year,[DateJoined]),(0))+case when datepart(month,[DateJoined])<(10) then '0'+CONVERT([char](2),datepart(month,[DateJoined]),(0)) else CONVERT([char](2),datepart(month,[DateJoined]),(0)) end,(0))),
[YrMoLeft] AS (CONVERT([int],CONVERT([varchar](4),datepart(year,[DateLeft]),(0))+case when datepart(month,[DateLeft])<(10) then '0'+CONVERT([char](2),datepart(month,[DateLeft]),(0)) else CONVERT([char](2),datepart(month,[DateLeft]),(0)) end,(0))),
CONSTRAINT [PK_Registrations] PRIMARY KEY CLUSTERED
(
[DateJoined] ASC,
[LastName] ASC,
[FirstName] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO
SET ANSI_PADDING OFF

------------------------

DECLARE @begTime DATETIME

SELECT @begTime = GETDATE()

CREATE TABLE #subscriptions
(YrMo int,
Subscribed int,
UnSubscribed int,
Subscribers int
)

INSERT INTO #subscriptions
SELECT TOP (100) PERCENT YrMoJoined AS YrMo,
COUNT(*),
(SELECT COUNT(*) FROM dbo.Registrations AS R2 WHERE YrMoLeft = R1.YrMoJoined),
0
FROM dbo.Registrations AS R1
GROUP BY R1.YrMoJoined
ORDER BY R1.YrMoJoined


DECLARE @subscribers INT
SET @subscribers = 0
;
UPDATE #subscriptions
SET @subscribers = Subscribers = @Subscribers + Subscribed - UnSubscribed
FROM #subscriptions S2

SELECT YrMo,
Subscribed,
UnSubscribed,
Subscribers

FROM #subscriptions
ORDER BY YrMo

DROP TABLE #subscriptions

SELECT @begTime, GETDATE(), DATEDIFF(ms,@begTime,GETDATE())
------------------------------------'

Enjoy!
Les Cardwell

Subject: Re: Prove me wrong
Posted by: RBarryYoung (view profile)
Posted on: Monday, February 22, 2010 at 7:48 PM
Message: Just to clarify, Peso, it was Phil Factor who coined that particular phase, not me. :-)

Subject: Optimized Peso solution #2
Posted by: LesCardwell (view profile)
Posted on: Wednesday, February 24, 2010 at 1:37 PM
Message: FWIW - In thinking about this a bit more, a more standard (simpler) solution is almost as fast (within 5-10ms) and still outperforms Peso's solution by ~3x (70ms vs 180ms on my w/s).

BTW, in my 'create table' DDL above, I noticed that it did not put indexes on the two computed columns, YrMoJoined and YrMoLeft. If anyone wants to run this up, be sure to add those two indexes.

Les

---------------------------------

DECLARE @begTime DATETIME
SELECT @begTime = GETDATE()

CREATE TABLE #subscriptions
(YrMo int,
Subscribed int,
UnSubscribed int,
Subscribers int
)

INSERT INTO #subscriptions
SELECT TOP (100) PERCENT YrMoJoined AS YrMo,
COUNT(*),
(SELECT COUNT(*) FROM dbo.Registrations AS R2 WHERE YrMoLeft = R1.YrMoJoined),
0
FROM dbo.Registrations AS R1
GROUP BY R1.YrMoJoined
ORDER BY R1.YrMoJoined


SELECT YrMo,
Subscribed,
UnSubscribed,
( SELECT SUM(S2.Subscribed)- SUM(S2.UnSubscribed)
FROM #subscriptions S2
WHERE S2.YrMo <= #subscriptions.YrMo
) AS Subscribers
FROM #subscriptions
ORDER BY YrMo

DROP TABLE #subscriptions

SELECT @begTime, GETDATE(), DATEDIFF(ms,@begTime,GETDATE())

 










Phil Factor
Exploring your database schema with SQL
 In the second part of Phil's series of articles on finding stuff (such as objects, scripts, entities, metadata) in... Read more...



 View the blog
Mission Critical: Database Design
 There is nothing like a checklist to make sure you've completed all the tasks in designing a database,... Read more...

Transparent Data Encryption
  Transparent Data Encryption is designed to protect data by encrypting the physical files of the... Read more...

SQL Toolbelt 2008: Predominantly an Engineering Task
 The conversion of the Red-Gate tools to be compatible with SQL Server 2008 might not seem, on first... Read more...

SQL Response: The dim sum interview
 Richard Morris met David and Nigel of the SQL Response team, in a dim sum Restaurant in Cambridge. They... Read more...

Data Correlation Optimization Internals
 Having adroitly introduced us, in his previous article, to the Date Correlation ability of the Query... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

SQL Server Full Text Search Language Features
 SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 150,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk