Kathi Kellenberger

Solving Complex T-SQL Problems, Step-By-Step

18 March 2013

What should you do if your first, most intuitive solution to a problem ends up scanning the data more than is necessary, resulting in poor performance? Have you missed a new SQL Server feature that can remove inefficiency from your technique? Alternatively, do you need a little help, and some lateral thinking, to open the path to a different approach? Sometimes, the answer is "both".

Repeatedly on the SQL Server forums, I see people struggling to solve common, complex business problems in T-SQL. More to the point, I see them struggling to solve them in a way that performs well and scales gracefully to large numbers of rows. My systematic approach to writing "difficult" T-SQL is always the same:

  1. Examine the data set carefully, and note down any particular characteristics, especially the appearance of NULL values, the use of dates, and so on.
  2. Make sure I fully understand the expected results and validate them, manually.
  3. Break the problem down into a series of simple steps, write the T-SQL, view the results, adjust if necessary, and move on to the next step.

The key at each step is to understand and accept that the first, most intuitive solution is unlikely to be the fastest or most scalable. As I explained in my previous series, Writing Efficient SQL, we need to look for opportunities to use set-based techniques that will help avoid non-scalable row-by-row processing, work with the smallest data set for any expensive calculations, scan the data as few times as possible, and so on. If my solution still fails to meet my expectations for performance, I try to "pair" with a co-worker, or ask advice from a T-SQL guru in the community, to help me find a better solution.

I recently helped solve a business problem that illustrates this process, and these ideas, perfectly. Based on data from a time clock, the challenge was to calculate, from a series of entry and exit times for employees, the amount of time between each recorded entry and exit, for each person. I present first a solution that will work on SQL Server 2008 and SQL Server 2008 R2, but has to scan the data a number of times to arrive at the result. Since that solution was not acceptable, I enlisted the help of a T-SQL guru, Peter Larsson (known as "SwePeso" or "Peso" on the various community forums), who changed the approach to the problem and found a better solution.

Finally, I'll demonstrate how new analytic functions in SQL Server 2012, LAG and LEAD, make it possible to find a solution that performs just as well, and requires a little less lateral thinking.

Understanding the Business Requirements and Data

Listing 1 provides the code to create the StaffHours table and populate it with data that represents fairly the time clock data set with which I worked. It uses row constructors, introduced with SQL Server 2008, and so the code will not run in any earlier versions (see the code download for the 2005 version).

--2008 & 2012 version
IF EXISTS ( SELECT * 
      FROM sys.tables 
      WHERE name = 'StaffHours' ) 
   DROP TABLE StaffHours; 
   
CREATE TABLE StaffHours
   ( 
      StaffMember CHAR(1) , 
      EventDate DATE , 
      EventTime TIME , 
      EventType VARCHAR(5) 
   ); 

INSERT INTO StaffHours
   ( StaffMember, EventDate, EventTime, EventType ) 
VALUES 
   ( 'A', '2013-01-07', '08:00', 'Enter' ), 
   ( 'B', '2013-01-07', '08:10', 'Enter' ), 
   ( 'A', '2013-01-07', '11:30', 'Exit' ), 
   ( 'A', '2013-01-07', '11:35', 'Exit' ), 
   ( 'A', '2013-01-07', '12:45', 'Enter' ), 
   ( 'B', '2013-01-07', '16:45', 'Exit' ), 
   ( 'A', '2013-01-07', '17:30', 'Exit' ), 
   ( 'A', '2013-01-08', '01:00', 'Exit' ), 
   ( 'C', '2013-01-07', '08:33', 'Enter' ), 
   ( 'C', '2013-01-07', '17:33', 'Exit' ), 
   ( 'C', '2013-01-07', '17:35', 'Exit' );

Listing 1: Creating and Populating the StaffHours table

Figure 1 shows the "time clock" data set, essentially a series of entry and exit times from each member of staff.

Figure 1: The original time clock data set

The requirement is to calculate the amount of time in hours, minutes and seconds between each Enter and Exit, for each staff member. The first point to note is that while there is usually a matching Exit row for each Enter, sometimes there is an Enter with no Exit or an Exit with no Enter. Here are the business rules:

  • If there are consecutive Enter rows, use the first one.
  • If there are consecutive Exit rows, use the last one.
  • Ignore any Exit rows with no matching Enter row (i.e. if the first row for an employee is an Exit, then ignore it).
  • Ignore any Enter rows with no matching Exit row (i.e. if the last row for an employee is an Enter, then ignore it).
  • The Exit for an Enter could be a subsequent day.

We can validate our expected results, manually. Employee A starts a work session at 08:00 on Jan 7. Two Exit times follow and, according to our business rules, we use the last one (11:35), for a work session duration of 3 hours 35 minutes. Employee A enters again at 12:45 on the same day and again we have two Exit entries, the latter falling on the next day, to give a duration of 12:15:00. In a similar fashion, we can hand-calculate the full result set, as shown in Figure 2.

StaffMember   EnterDateTime         ExitDateTime          WorkTime
A             2013-01-07 08:00:00   2013-01-07 11:35:00   03:35:00
A             2013-01-07 12:45:00   2013-01-08 01:00:00   12:15:00
B             2013-01-07 08:10:00   2013-01-07 16:45:00   08:35:00
C             2013-01-07 08:33:00   2013-01-07 17:35:00   09:02:00

Figure 2: Expected results
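
If you want to double-check the arithmetic, a quick DATEDIFF sanity check (not part of the solutions that follow) confirms the first two durations for employee A:

-- Check the hand-calculated durations from Figure 2, in minutes
SELECT DATEDIFF(MINUTE, '2013-01-07 08:00', '2013-01-07 11:35') AS SessionA1Minutes , -- 215 minutes = 03:35:00
       DATEDIFF(MINUTE, '2013-01-07 12:45', '2013-01-08 01:00') AS SessionA2Minutes ; -- 735 minutes = 12:15:00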

A Pre-SQL Server 2012 Solution: Pairing Events Using Row Numbers

Listing 2 shows my first solution, which uses row numbers and self-joins that match rows based on those row numbers. It takes a very systematic approach, using five levels of CTEs, as follows:

  • Level1 – partition the data by StaffMember and apply ROW_NUMBER
  • Level2 – perform two self-joins to find, for each row, the value of the EventType for the previous and then the subsequent rows, in each partition
  • Level3 – filter out unneeded rows and reapply row numbers
  • Level4 – perform another self-join to pair up ENTER and EXIT rows
  • Level5 – use DATEDIFF to calculate work session durations for each employee

SET STATISTICS IO ON; 
WITH    Level1
           AS (-- apply row numbers 
               SELECT StaffMember , 
                     CAST(EventDate AS DATETIME) 
                     +   CAST(EventTime AS DATETIME) AS EventDateTime , EventType , 
                     ROW_NUMBER() OVER ( PARTITION BY StaffMember 
                                  ORDER BY EventDate, EventTime ) AS RowNum 
              FROM StaffHours 
              ), 
        Level2
           AS (-- find the last and next event type for each row 
              SELECT   A.StaffMember , 
                       A.EventDateTime , 
                       A.EventType , 
                       COALESCE(LastVal.EventType, 'N/A') AS LastEvent , 
                       COALESCE(NextVal.EventType, 'N/A') AS NextEvent 
              FROM Level1 A 
                   LEFT JOIN Level1 LastVal 
                      ON A.StaffMember = LastVal.StaffMember 
                         AND A.RowNum - 1 = LastVal.RowNum 
                   LEFT JOIN Level1 NextVal 
                      ON A.StaffMember = NextVal.StaffMember 
                         AND A.RowNum + 1 = NextVal.RowNum ), 
        Level3 
           AS (-- reapply row numbers to row-eliminated set 
              SELECT  StaffMember , 
                      EventDateTime , 
                      EventType , 
                      LastEvent , 
                      NextEvent , 
                      ROW_NUMBER() OVER ( PARTITION BY StaffMember 
                          ORDER BY EventDateTime ) AS RowNBr 
              FROM Level2 
              WHERE   NOT ( EventType = 'Enter' 
                           AND LastEvent = 'Enter' 
                          ) 
                     AND NOT ( EventType = 'Exit' 
                              AND NextEvent = 'Exit' 
                              ) 
                     AND NOT ( EventType = 'Enter' 
                              AND NextEvent = 'N/A' 
                              ) 
                     AND NOT ( EventType = 'Exit' 
                              AND LastEvent = 'N/A' 
                              ) 
              ), 
        Level4
           AS (-- pair enter and exit rows. 
               SELECT A.StaffMember , 
                      A.EventDateTime , 
                      B.EventDateTime AS ExitDateTime 
               FROM Level3 A 
                    JOIN Level3 B ON A.StaffMember = B.StaffMember 
                                     AND A.RowNBr + 1 = B.RowNBr 
               WHERE A.EventType = 'Enter' 
                     AND B.EventType = 'Exit' 
               ), 
        Level5 
           AS (--Calculate the work session duration 
               SELECT StaffMember , 
                      DATEDIFF(second, EventDateTime, ExitDateTime) 
                               AS Seconds , 
                      EventDateTime , 
                      ExitDateTime 
               FROM Level4
              ) 
   SELECT StaffMember , 
          EventDateTime , 
          ExitDateTime , 
          seconds , 
          RIGHT('0' +   CAST(Seconds / 3600 AS VARCHAR(2)), 2) + ':' 
          + RIGHT('0' +   CAST(Seconds % 3600 / 60 AS VARCHAR(2)), 2) + ':' 
          + RIGHT('0' +   CAST(Seconds % 3600 % 60 AS VARCHAR(2)), 2) 
                               AS WorkTime 
   FROM Level5;

Listing 2: Calculating work session duration using row numbers, CTEs and self-joins

I spent quite a bit of time working on this query, taking it a step at a time, checking the results and moving on to the next level. However, as soon as I saw the STATISTICS IO information, I knew there was a problem.

(4 row(s) affected)

Table 'StaffHours'. Scan count 6, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The query scanned the table six times! Of course, with a small amount of data, this was not a problem, but the solution will never scale. Figure 3 visualizes the problem.

Figure 3: Why we need six scans to arrive at the result

While my solution is intuitive, I'd fallen into the trap of thinking about what I needed to do with each row, instead of considering what I could do by looking at the data as a set. This skill does not come easily to most of us, especially those with a background in programming that teaches an iterative approach.

A Better Pre-SQL Server 2012 Approach: Islands and Gaps

When I wrote a series of articles on the SQL Server Speed Phreak competitions (see https://www.simple-talk.com/author/kathi-kellenberger/), I discussed the compromise between solutions that were simple to understand, and "fast enough", and the absolute fastest solutions possible. I spent time trying to explain how the latter worked, for the benefit of those with only medium-sized SQL brains.

On several occasions, the fastest solution I dissected was the work of Peter Larsson. I turned to him for help on this problem, and he devised a very elegant solution that returns the answer in just one table scan instead of six. However, even Peter didn't conjure his best solution straight off the bat. His first attempt, shown in Listing 3, reduces the scan count from six to three.

; 
WITH cteSource ( StaffMember, EventDateTime, EventType,  rn ) 
        AS ( SELECT   StaffMember , 
                      CAST(EventDate AS DATETIME) 
                      +   CAST(EventTime AS DATETIME) AS EventDateTime , 
                      EventType , 
                      ROW_NUMBER() OVER ( PARTITION BY StaffMember 
                              ORDER BY EventDate, EventTime ) AS  rn 
             FROM dbo.StaffHours 
           ), 
     cteGrouped ( StaffMember, EventDateTime, grp ) 
        AS ( SELECT c.StaffMember , 
                    c.EventDateTime , 
                    ROW_NUMBER() OVER ( PARTITION BY c.StaffMember 
                                 ORDER BY c.EventDateTime ) 
                    - 1 AS grp 
             FROM cteSource AS  c 
                  LEFT JOIN cteSource AS p 
                     ON p.StaffMember = c.StaffMember 
                        AND p.rn = c.rn - 1 
                  LEFT JOIN cteSource AS n 
                     ON n.StaffMember = c.StaffMember 
                        AND n.rn = c.rn + 1 
             WHERE ( COALESCE(p.EventType, 'Exit') = 'Exit' 
                     AND c.EventType = 'Enter' 
                   ) 
                   OR ( c.EventType = 'Exit' 
                        AND COALESCE(n.EventType, 'Enter') = 'Enter' 
                       ) 
            ) 
   SELECT StaffMember , 
          MIN(EventDateTime) AS EventDateTime , 
          MAX(EventDateTime) AS ExitDateTime , 
          DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)) 
                                                         AS Seconds ,
          CONVERT(CHAR(8), DATEADD(SECOND, 
                               DATEDIFF(SECOND, MIN(EventDateTime),
                                  MAX(EventDateTime)), 
                                     '00:00:00'), 8) AS WorkTime 
   FROM cteGrouped 
   GROUP BY StaffMember , 
            grp / 2 
   HAVING COUNT(*) = 2;

Listing 3: Peter's First Solution Eliminates One Self-Join

Peter's initial solution contains the two self-joins to determine the previous and next event types just as my solution does, but he manages to eliminate the final self-join (level 4 in my solution), and with it half the scans.

He achieved this by creating groups of rows representing each Enter and Exit pair, using row numbers generated in the cteGrouped level. At this level, he also used a simplified WHERE clause to eliminate most of the invalid rows, keeping every Enter row that had a previous Exit row and every Exit row with a subsequent Enter row. Figure 4 shows how the data looks at the cteGrouped stage.

Figure 4: The values from cteGrouped

Notice that the grp values for all of the remaining Enter rows are even, and those for the remaining Exit rows are odd. Integer division discards the remainder, so when the grp values are divided by two we end up with group 0 for the first enter/exit pair, group 1 for the second enter/exit pair, and so on. This observation is important because it means we can now find each enter/exit pair using that grouping, in the final step. The HAVING COUNT(*) = 2 clause ensures that we are working with pairs of rows and not any orphaned Exit events at the beginning, or Enter events at the end.
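
To see the effect of the integer division in isolation, here is a minimal sketch that uses made-up grp values rather than the CTE output:

-- Pairing consecutive grp values with integer division (hypothetical values)
SELECT grp , 
       grp / 2 AS PairNumber 
FROM ( VALUES ( 0 ), ( 1 ), ( 2 ), ( 3 ), ( 4 ), ( 5 ) ) AS v ( grp ); 
-- grp 0 and 1 form pair 0, grp 2 and 3 form pair 1, grp 4 and 5 form pair 2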

The other point to remember, at this stage, is that the Enter date and time is the MINIMUM for the group, and the Exit date and time is the MAXIMUM for the group. Peter takes advantage of this in the final step to match up the Enter and Exit times without doing another self-join. Remember that the final self-join, if performed, doubles the work with six scans of the data.

In the final step, the query performs the grouping as well as the calculations. Peter uses an interesting technique to do the calculation.

CONVERT(CHAR(8), 
      DATEADD(SECOND, DATEDIFF(SECOND, MIN(EventDateTime), 
         MAX(EventDateTime)), 
            '00:00:00'), 8) AS WorkTime

The DATEDIFF calculates the seconds between the MIN(EventDateTime), or Enter time, and the MAX(EventDateTime), or Exit time. He then adds those seconds back to time zero (00:00:00). He uses style 8 (hh:mm:ss) to format the value returned. Often, people use the CAST function in preference to the CONVERT function, but the benefit of the latter is that you can specify a style that formats the date.
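
As a standalone sketch of the same trick, here is the calculation applied to an assumed duration of 12,900 seconds (3 hours 35 minutes):

-- Format a duration in seconds as hh:mm:ss by adding it to time zero
SELECT CONVERT(CHAR(8), DATEADD(SECOND, 12900, '00:00:00'), 8) AS WorkTime; 
-- Returns 03:35:00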

However, not happy with three scans of the table, Peter tried again to improve the performance and arrived at a masterful "Islands and Gaps" solution that eliminated the initial self-joins, thus reducing the table scans to just one!

; 
WITH cteIslands ( StaffMember, EventType, EventDateTime, grp ) 
       AS ( SELECT StaffMember , 
                 EventType , 
                 CAST(EventDate AS DATETIME) 
                 +   CAST(EventTime AS DATETIME) AS EventDateTime , 
                 ROW_NUMBER() 
                    OVER ( ORDER BY StaffMember, EventDate, EventTime ) 
                 - ROW_NUMBER() 
                    OVER ( ORDER BY EventType, StaffMember, 
                                    EventDate, EventTime ) AS grp 
             FROM dbo.StaffHours 
           ), 
     cteGrouped ( StaffMember, EventType, EventDateTime ) 
      AS ( SELECT StaffMember , 
                  MIN(EventType) , 
                  CASE WHEN MIN(EventType) = 'Enter' 
                       THEN MIN(EventDateTime) 
                       ELSE MAX(EventDateTime) 
                  END AS EventDateTime 
           FROM cteIslands 
           GROUP BY StaffMember , 
                    grp 
         ), 
     cteResult ( StaffMember, EventType, EventDateTime, grp ) 
       AS ( SELECT StaffMember , 
                   EventType , 
                   EventDateTime , 
                   ROW_NUMBER() OVER ( PARTITION BY StaffMember ORDER BY EventDateTime ) 
                   - 1 AS grp 
            FROM cteGrouped 
          ) 
   SELECT StaffMember , 
          MIN(EventDateTime) AS EventDateTime , 
          MAX(EventDateTime) AS ExitDateTime , 
          DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)) AS Seconds ,
          CONVERT(CHAR(8), DATEADD(SECOND, 
                           DATEDIFF(SECOND, MIN(EventDateTime),
                                    MAX(EventDateTime)), '00:00:00'), 8) AS WorkTime 
   FROM cteResult 
   GROUP BY StaffMember , 
            ( grp - CASE WHEN EventType = 'Exit' 
                              AND Grp = 0 THEN NULL 
                         ELSE 0 
                    END ) / 2 
   HAVING COUNT(*) = 2

Listing 4: The Islands and Gaps Pre-2012 Solution

The first CTE, cteIslands, applies row numbers to the data sorted in two different ways and subtracts the two values. In order to see what is going on more easily, let's extract and execute the query from cteIslands and look at the results it returns.

SELECT StaffMember , 
       EventType , 
       CAST(EventDate AS DATETIME) +   CAST(EventTime AS DATETIME) AS EventDateTime , 
       ROW_NUMBER() OVER ( ORDER BY StaffMember, EventDate, EventTime ) AS ByTime , 
       ROW_NUMBER() OVER ( ORDER BY EventType, StaffMember, EventDate, EventTime ) AS ByType , 
       ROW_NUMBER() OVER ( ORDER BY StaffMember, EventDate, EventTime ) 
       - ROW_NUMBER() OVER ( ORDER BY EventType, StaffMember, EventDate, EventTime) AS grp 
FROM dbo.StaffHours;

Listing 5: Viewing the Islands

Listing 5 applies the ByTime row numbers in order of StaffMember, and then EventDateTime. It applies the ByType row numbers in order of EventType, StaffMember and then EventDateTime. If an Exit event follows another Exit event, they will end up in the same GRP (ByTime – ByType). The same goes for Enter events.

If the data were perfect, an Exit event would follow every Enter event. Had that been the case, all the Enter rows would have odd ByTime values and all the Exit rows even ones, and so, within each staff member, the difference between the ByTime and ByType values would increase by one for each successive Enter row and by one for each successive Exit row, giving us no duplicate grp numbers. Figure 6 shows how the results of Listing 5 would look assuming ideal data.

Figure 6: Results of Listing 5, if the data were perfect

In our imperfect data, two adjacent Exit events increase both ByTime and ByType by one each (Figure 5), so the difference between ByTime and ByType is the same for both rows, and those two adjacent rows become a group (an "island"). This enables us to use MIN for Enter rows and MAX for Exit rows to find the valid row for each grp.
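
As a minimal illustration of the islands idea in isolation, the following sketch uses a short, made-up event sequence; notice that the two adjacent Exit rows (Seq 2 and 3) end up sharing a grp value:

-- Consecutive rows of the same event type share a grp value (hypothetical data)
WITH Events ( Seq, EventType ) 
   AS ( SELECT Seq , 
               EventType 
        FROM ( VALUES ( 1, 'Enter' ), ( 2, 'Exit' ), ( 3, 'Exit' ), 
                      ( 4, 'Enter' ), ( 5, 'Exit' ) ) AS v ( Seq, EventType ) 
      ) 
   SELECT Seq , 
          EventType , 
          ROW_NUMBER() OVER ( ORDER BY Seq ) 
          - ROW_NUMBER() OVER ( ORDER BY EventType, Seq ) AS grp 
   FROM Events;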

The remainder of this solution uses the same techniques as Listing 3 to pair Enter and Exit rows, and to calculate the difference in time.

Note that Peter submitted a further iteration to his solution, which uses the same basic algorithm but with refinements, including small bug fixes for two edge conditions. It is available as part of the code download (see the speech bubble to the right of the article title).

SQL Server 2012-only Solution: New Analytic Functions

While this solution performs far better than my original attempt, and will perform well on all SQL Server versions up to and including SQL Server 2012, a solution that is easier to develop, and performs just as well, takes advantage of some new functions available in SQL Server 2012.

SQL Server 2012 brings us several new functions, three of which will be very helpful here: LAG, LEAD, and TIMEFROMPARTS. LAG and LEAD are two new analytic window functions. TIMEFROMPARTS is one of several new functions that build a particular date or time value from its component parts.

In our previous solution, we created partition "windows" based on StaffMember, and used the ROW_NUMBER function to number sequentially each row in each partition (Listing 2). However, we then had to perform two extra table scans to get the event type values for the previous and next rows. The new LAG and LEAD functions allow us to retrieve these event type values from each "window" without needing to apply row numbers first. The LAG function returns the previous value from a particular column without rescanning the table. With LEAD, we can find the next value. Instead of rescanning, SQL Server performs the operation on data already in memory. This means fewer scans of the data and fewer logical reads.

The LAG and LEAD functions each accept three parameters. The first parameter specifies the column from which to retrieve the value. The second parameter is optional and is the number of rows to go back or forward, with a default of one. The third parameter, also optional, defines a default value to return in place of NULL.

Listing 6 shows the code using these functions to return the previous and next values of the event type. Notice that we also return N/A instead of NULL, which renders redundant the COALESCE logic we previously needed for this task.

SELECT StaffMember , 
       EventType , 
       CAST(EventDate AS DATETIME) + CAST(EventTime AS DATETIME) AS EventDateTime , 
       LAG(EventType, 1, 'N/A') OVER ( PARTITION BY StaffMember ORDER BY EventDate, EventTime ) AS LastEvent , 
       LEAD(EventType, 1, 'N/A') OVER ( PARTITION BY StaffMember ORDER BY EventDate, EventTime ) AS NextEvent 
FROM StaffHours;
GO

Listing 6: Using LAG and LEAD to find the previous and next event type values

Critically, we've achieved this with only one scan of the table, instead of three. The next step is to filter out the unneeded rows. SQL Server evaluates the WHERE clause before the LAG and LEAD functions, so we'll have to use the CTE strategy again. Listing 7 shows the query for this step.

; WITH Level1 
    AS ( -- Use LAG and LEAD to find the previous and next event type 
         -- values (Listing 6) 
        ) 
        SELECT StaffMember , 
               EventType , 
               EventDateTime , 
               LastEvent , 
               NextEvent 
        FROM Level1 
        WHERE NOT ( EventType = 'Enter' 
                    AND LastEvent = 'Enter' 
                   ) 
          AND NOT ( EventType = 'Exit' 
                    AND NextEvent = 'Exit' 
                   ) 
          AND NOT ( EventType = 'Enter' 
                    AND NextEvent = 'N/A' 
                   ) 
          AND NOT ( EventType = 'Exit' 
                    AND LastEvent = 'N/A' 
                   ); 
GO

Listing 7: Filter out unneeded rows

Crucially, we are still at one table scan. We now have the rows we need, but we can't do the time calculations because the EventDateTime values are on different rows. We can use the LEAD function again to access the EventDateTime value on the next row. Because we can nest functions, we can use the results of the LEAD function inside the DATEDIFF function to calculate seconds. Listing 8 shows the additional code for this step.

;
WITH Level1 
       AS ( 
            -- Using LAG and LEAD to find the previous and next event type 
            -- values (Listing 6) 
           ), 
     Level2 
       AS ( 
            -- Filter out unneeded rows (Listing 7) 
           ) 
  SELECT StaffMember , 
         EventType , 
         EventDateTime , 
         DATEDIFF(second, EventDateTime, 
                  LEAD(EventDateTime) 
                    OVER ( PARTITION BY StaffMember 
                       ORDER BY EventDateTime )) AS Seconds 
  FROM Level2; 
GO

Listing 8: Calculate work session duration in seconds

The final step is to turn Seconds into hours, minutes, and seconds. We can use the new TIMEFROMPARTS function to simplify the formula. Each date or time data type in SQL Server 2012 has a corresponding *FROMPARTS function. These allow us to build a date or time value from its component parts; there is no casting or appending. Finally, we filter to return just the Enter rows. Listing 9 shows the final 2012-only code.
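
Before looking at the full query, here is a minimal standalone sketch of TIMEFROMPARTS with an assumed duration of 12,900 seconds; the parameters are hours, minutes, seconds, fractions, and fractional-second precision:

-- Build a TIME value from a duration in seconds (12900 is an assumed sample value)
DECLARE @Seconds INT = 12900; 
SELECT TIMEFROMPARTS(@Seconds / 3600, @Seconds % 3600 / 60, 
                     @Seconds % 3600 % 60, 0, 0) AS WorkTime; 
-- Returns 03:35:00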

; 
WITH Level1 
       AS ( SELECT StaffMember , 
                   EventType , 
                   CAST(EventDate AS DATETIME) 
                   +   CAST(EventTime AS DATETIME) AS EventDateTime , 
                   LAG(EventType, 1, 'N/A') 
                      OVER ( PARTITION BY StaffMember 
                          ORDER BY EventDate, EventTime ) AS LastEvent , 
                   LEAD(EventType, 1, 'N/A') 
                      OVER ( PARTITION BY StaffMember 
                         ORDER BY EventDate, EventTime ) AS NextEvent 
            FROM StaffHours ), 
     Level2 
       AS ( SELECT StaffMember , 
                   EventType , 
                   EventDateTime , 
                   LastEvent , 
                   NextEvent 
            FROM   Level1 
            WHERE  NOT ( EventType = 'Enter' 
                         AND LastEvent = 'Enter' 
                       ) 
               AND NOT ( EventType = 'Exit' 
                         AND NextEvent = 'Exit' 
                       ) 
               AND NOT ( EventType = 'Enter' 
                         AND NextEvent = 'N/A' 
                       ) 
               AND NOT ( EventType = 'Exit' 
                         AND LastEvent = 'N/A' 
                       ) 
          ), 
     Level3 
       AS ( SELECT StaffMember , 
                   EventType , 
                   EventDateTime , 
                   DATEDIFF(second, EventDateTime, 
                            LEAD(EventDateTime) 
                               OVER ( PARTITION BY StaffMember 
                                  ORDER BY EventDateTime )) AS Seconds 
            FROM Level2 
           ) 
   SELECT StaffMember , 
          EventDateTime , 
          TIMEFROMPARTS(Seconds / 3600, Seconds % 3600 / 60, 
                        Seconds % 3600 % 60, 0, 0) AS WorkTime 
   FROM Level3 
   WHERE EventType = 'Enter';

Listing 9: The final SQL 2012-only code

Here are the results, just as expected. Thanks to the new functions in SQL Server 2012, the query is simpler and easier to write and is much more efficient, scanning the data just one time. If you are working with SQL Server 2012, keep the new functions in mind to help you solve those tricky queries. If you're interested in seeing Peter's SQL 2012-only solution, which is similar to mine but with some refinements, it's available in the code download file.

Summary

Writing T-SQL queries to solve complex problems is often difficult. Frequently, I will assist others, either on forums or by direct request, in solving those problems. My methodology involves breaking the problem down and solving it systematically. At each step, I examine the results, adjust if necessary, and move on to the next step. I don't expect to get the right answer right off the bat; writing a complex query takes time. I hope that the individual asking for help will not just take the answer, but will learn something and be better able to handle the next challenging query that comes along. The same applies to me, of course. Peter managed to find a more efficient pre-2012 solution and I can take what I learned from that and apply it to future problems.

Luckily for SQL Server developers, T-SQL functionality has improved immensely starting with the introduction of the ranking functions and the OVER clause in 2005, and continuing with the introduction of very useful analytic functions, such as LAG and LEAD, in SQL Server 2012. If you haven't kept up with these improvements, you owe it to yourself to spend some time learning about them. I found that once I started working with such functions, and Common Table Expressions, I relied on them more and more.

In particular, if you are fortunate enough to run SQL Server 2012, be sure to learn about the new analytic and other functions. The improved performance in many queries may be the tipping point to get your organization to upgrade!

Kathi Kellenberger

Author profile:

Kathi Kellenberger is a consultant with Pragmatic Works. She is author of "Beginning T-SQL 2008" and co-author of "Professional SQL Server 2005 Integration Services," "Beginning T-SQL 2012," and "SQL Server MVP Deep Dives." Kathi enjoys speaking and writing about SQL Server and teaches the occasional SQL Server class. When she is not busy working with SQL Server, you might find her singing at the local karaoke bar or climbing the stairs in tall buildings.



Subject: Could it be done easier?
Posted by: AlexK (not signed in)
Posted on: Wednesday, March 27, 2013 at 1:11 PM
Message: Kathi,

This is a very interesting read, of course, but I don't think I would want to use such solutions in production, until I have exhausted other options.

First, such problems are very easy to solve on the client. Someone straight out of college can create a covering index, read the data once, and do the rest in Java or C#. The code would be very simple and easy to maintain. The database performance would be predictable, as we only scan a range of a covering index.

Second, how can I know if this complex query will deliver consistent performance? How do I know it will not all of a sudden start performing 500 times faster, ruining my users' day and my weekend and my reputation? I cannot know it for sure, because SQL Server optimized is closed source, so I have know idea if it can switch to another plan.

This is especially true for new features, which are sometimes released without enough testing and may perform poorly.

On the other hand, if I solve this problem on the client, I am in full control and am better positioned to guarantee stable performance.

Third, we can redesign the table, so that the matching entry and exit are stored in the same row. That would make selecting so very much easier. It is no longer 1995 when refactoring database tables was prohibitively expensive - it has become much cheaper, and may be much cheaper than optimizing complex queries.

What do you think?

Subject: Thoughts about practical things in real world
Posted by: Bo (not signed in)
Posted on: Thursday, March 28, 2013 at 4:35 AM
Message: The problem with matching entry and exit columns on the same row, is that it does not work in the real world.
Often when you arrive at a door, you sneak in with someone else and thus miss a new entry row. Or the opposite, when you leave the office you go through the door someone else opened just before you and you don't swipe your own card.

A stable performance is not the same as fast performance.

What is the point of getting a stable performance of between 5.6 and 5.7 seconds every time, when the database solution gets the result within 0.2 and 0.4 seconds?
And you are not guaranteed to get a stable performance either when you send all the needed rows over the network to the client. Bandwidth issues may occur, and why clutter the network when there is no need for it?

If you sense SQL Server is not a stable product for your need, use a text file instead, because that is what you are suggesting.

Subject: I can't wait to learn from your next answer
Posted by: AlexK (not signed in)
Posted on: Monday, April 01, 2013 at 2:37 PM
Message: Bo,

In fact, if matching entry and exit columns can work when we select, then it can work as we save rows as well. Not all entries and exits have matching ones, of course, but some will, and matching values can be stored as we save entries and exits.

In general, whenever we encounter a complex problem, we want to consider not having to solve it at all, such as selecting pre-matched entries.

Another great example of avoiding having to solve a complex problem was recently posted by the great SQL Server expert Brent Ozar. You must have heard the name, and you should read the following post: "The fastest query is the one you never make", http://www.brentozar.com/archive/2013/02/the-fastest-query-is-the-one-you-never-make/

Also I am not with you on stable performance vs. database solution performance. For some queries against time series, client side solutions are much faster, like 10 times or more, not 25-30 times slower as you suggested.

And, of course, the closing argument, "use a text file instead, because that is what you are suggesting", is a typical logical fallacy. It is known as "Straw man – an argument based on misrepresentation of an opponent's position", as well as "Appeal to ridicule – an argument is made by presenting the opponent's argument in a way that makes it appear ridiculous".

My previous post is right here, unedited, and clearly I am suggesting to use other SQL Server features, and I am not even mentioning a text file.

Usually, whenever someone uses something from the list of fallacies, http://en.wikipedia.org/wiki/List_of_fallacies,
this means that they have nothing better to say. However, I think that in this particular case you do have a lot to say and could make a brilliant contribution.

I can't wait to learn from your next answer.


Subject: Not quite
Posted by: Professor (not signed in)
Posted on: Monday, April 01, 2013 at 10:14 PM
Message: AlexK,
Your response is typical of application developers who are not comfortable with using SQL for fast optimized queries.

"This is a very interesting read, of course, but I don't think I would want to use such solutions in production, until I have exhausted other options."

Why? If you can control the database code as well as the application code you will produce a better application than one that optimizes only one of those.

"First, such problems are very easy to solve on the client."

Yes but control-flow code such as loops are not fast for large data-sets. SQL and relational databases are born of set theory, which is entirely geared around doing operations to an entire data-set at once, rather than slowly, row-by-row.


"Someone straight out of college can create a covering index, read the data once, and do the rest in Java or C#. The code would be very simple and easy to maintain. The database performance would be predictable, as we only scan a range of a covering index."

And this would be a mistake. If you are retrieving all rows from the database (even filtered by an index) and then doing all transformation in the client, your application's slowness will scale exponentially, as will the slowness of the query and the network transfer speed as the data is sent from the database to the client. Fresh college kids who are great with C# have great difficulty understanding the different mindset required to work with SQL which is why you find most DBA's are older and have industry experience.
The database performance will be predictable, yes, predictably bad. You are not the first to think that application code is easier to maintain. Truly excellent software engineers understand how to use the right tool. It's never the same tool in all situations.


"Second, how can I know if this complex query will deliver consistent performance? How do I know it will not all of a sudden start performing 500 times faster, ruining my users' day and my weekend and my reputation? I cannot know it for sure, because SQL Server optimized is closed source, so I have know idea if it can switch to another plan."

Wrong. You test the code like any other code. You create a test system, you fill the table with 100000000 rows and you test how your query scales. You will find that row-by-row queries scale exponentially and set-based queries scale geometrically. Your mind will expand and you will become wiser. You might already be very smart but you can always be wiser. The database engine behind SQL Server, or Oracle or any other database might be closed source, but you can still test it, and should. Note that C# compiles to CLR which is then run JIT on the CLR runtime engine. This runtime engine is also closed source. Yes, there is mono, but that is a reverse engineered open source version of the CLR runtime. You are arguing that having this engine open source means that you are able to see exactly how the code works under the hood. I would argue that unless you are intercepting and interpreting the machine code as it executes on the processor, that you have no idea how you code is being executed in a direct sense. And why would you care? You test your code and measure how it works indirectly by timing the execution and checking the data integrity. It's the same on any compiled, interpreted or in SQL's case declarative language. If you truly crave control over exactly how code is run, then start writing in machine language! There is a time and place for that, and I strongly argue that in business applications that it is a complete waste of time.

"This is especially true for new features, which are sometimes released without enough testing and may perform poorly."
Agreed, and you make my point about testing. You always have to test your code, even old functions.

The lead/lag functions mentioned in this article are part of ISO standard, have been available in Oracle for some years and are only now in SQL Server 2012. All new features can have bugs though, and the same is true of your favourite language, so this is not a valid argument against writing good SQL over poor SQL.

"On the other hand, if I solve this problem on the client, I am in full control and am better positioned to guarantee stable performance."

This is only true because you are comfortable with client application code, and only to the point where your application code is as optimized as it can be. Once you hit that point the only way you can improve performance is by improving database query time and reducing network transfer, which in turn reduces the amount of work your application code needs to perform.

"Third, we can redesign the table, so that the matching entry and exit are stored in the same row. That would make selecting so very much easier. It is no longer 1995 when refactoring database tables was prohibitively expensive - it has become much cheaper, and may be much cheaper than optimizing complex queries."

Queries are always cheaper to optimize than changing database tables, but both are options. I would advise against putting all your data in the same table because you will increase redundancy and reduce integrity. Relational databases follow ACID principles for good reason which I suggest you read up on. NoSQL proponents argue otherwise, but if you are prepared to sacrifice integrity and security for flexibility then that's a choice you should be testing and validating.
Of course, your suggestion here might be totally appropriate but I encourage you to be scientific and empirical about your methodology. Test your assumptions. Try code options for yourself.

Really, this is a tired old argument and perpetuates this ridiculous war between application developers and DBAs. The best solution is always going to be harmony between the two camps and compromises on both sides where appropriate.

It's nice that you've read a bit of philosophy about arguments, but really the analogy of a text file is a pretty good approximation of what you suggest. If you ignore the tools within a database, and you are dumping all the raw data to an application then you are essentially ignorant of the set-based power of relational databases and may as well resort to really crude methods of data retrieval or even do something really pointless like write your own database.

Subject: Not to rain on the parade...
Posted by: puzsol (view profile)
Posted on: Tuesday, April 02, 2013 at 8:55 AM
Message: I hate to say it, as I like Peso's pre2012 solution, but (as downloaded) it falls over for non-exceptional data that starts with an enter... eg, add rows:

( 'B', '2013-01-08', '08:10', 'Enter' ),
( 'B', '2013-01-08', '16:45', 'Exit' ),

to the data, and you get 1 day for the work durations for 'B'.... this is because the grouping includes the first exit with the second enter. I think that the only way to resolve this is to ensure there is always an exit first row, or multiple exits/enters for the first (and subsequent?) group. The SQL 2012 solution does not suffer this problem - I guess that's the danger of using maths to fake set logic...

I think it might also fall over rather spectacularly if the recorded events are in the order; Enter, Enter, Exit, Enter, Exit - but I need some sleep now.

In my version, I had to resort to table variables to reduce the 4 scans of the StaffHours table to 1 (using a WITH clause resulted in 4)... for reference:

declare @staff_rnum table(StaffMember char(1), rnum int, EventDateTime datetime, EventType varchar(5), primary key (StaffMember, rnum))
insert into @staff_rnum
select h.StaffMember, ROW_NUMBER() OVER (partition by h.StaffMember order by cast(h.EventDate as datetime) + cast(h.EventTime as datetime)), cast(h.EventDate as datetime) + cast(h.EventTime as datetime), h.EventType
from StaffHours h

declare @staff_gaps table(StaffMember char(1), rnum int, EnterDateTime datetime, ExitDateTime datetime, primary key (StaffMember, rnum))
insert into @staff_gaps
select coalesce(r1.StaffMember, r2.StaffMember) as StaffMember
, ROW_NUMBER() over (partition by coalesce(r1.StaffMember, r2.StaffMember) order by r1.EventDateTime) as rnum
, r2.EventDateTime as EnterDateTime
, r1.EventDateTime as ExitDateTime
from @staff_rnum r1
full join @staff_rnum r2 on r1.StaffMember = r2.StaffMember and r2.rnum = r1.rnum + 1
where (r1.EventType is null or r1.EventType = 'Exit')
and (r2.EventType is null or r2.EventType = 'Enter')

select g1.StaffMember, g1.EnterDateTime, g2.ExitDateTime, datediff(DAY, 0, g2.ExitDateTime - g1.EnterDateTime) as WorkDays, cast(g2.ExitDateTime - g1.EnterDateTime as Time) as WorkHours
from @staff_gaps g1
inner join @staff_gaps g2 on g2.StaffMember = g1.StaffMember and g2.rnum = g1.rnum + 1
order by g1.StaffMember, g1.EnterDateTime


It works, but nowhere near as well as the SQL 2012 solution... so yes an upgrade would be well worth it if you do these sorts of joins.

Thanks for the great article.

Subject: Can I avoid RBAR
Posted by: markandersonus (view profile)
Posted on: Tuesday, April 02, 2013 at 9:09 AM
Message: Hi All

I use RBAR a lot in SPs that import a lot of data, process it and then insert/update (typically a few thousand rows max). I'll often import into temp tables, massage the data and then insert/update into the target tables.

The reason I use the while loop is so that I can create an audit log (could probably do a lot of that with the OUTPUT clause on insert/update -- just learned that), but it also lets me handle row errors, so if an insert/update fails I can grab the key and not abort the whole batch.

Is there a better way to do this without RBAR?

Regards

Mark

Subject: Not raining part 2
Posted by: puzsol (view profile)
Posted on: Tuesday, April 02, 2013 at 6:41 PM
Message: Peso's first solution in the article also falls over when the sequence of StaffEvents starts with an Exit... eg Exit, Enter, Exit... Because the maths groups Exit-Enter, rather than Enter-Exit.

To fix this, I had to modify the where clause for cteGrouped to:
WHERE ( COALESCE(p.EventType, 'Exit') = 'Exit'
AND c.EventType = 'Enter'
)
OR ( c.EventType = 'Exit'
AND COALESCE(n.EventType, 'Enter') = 'Enter'
and p.EventType is not null
)

Which effectively filters out the null-Exit (or starting with an Exit event) record... there may be other fringe cases where this falls over, but I couldn't think of any.

Subject: A: Can I avoid RBAR
Posted by: puzsol (view profile)
Posted on: Tuesday, April 02, 2013 at 7:04 PM
Message: More than likely... though you would have to state why a row might fail.

If it is because of a missing foreign key, then you might only try to insert/update those that have those foreign keys in the first place.

If it is because a value won't convert to another type (eg string to int), then in SQL2012 you might use the new TRY_CAST or TRY_PARSE to pre-filter these rows out, or use some other function/filter to exclude the rows that just won't insert.

One approach would be:
1 - Delete all rows from temp table that won't convert and add to exception log.(repeat for each similar bad-data test you can perform)... this can be done as a set.

2 - Move all the rows from the temp table to dest table that have all references required. Repeat until no new rows are moved (to cover the case of linked data being inserted)... Again done as a set

3 - Remove and add an exception for all remaining rows as missing linking information.

But it really depends on your process... if you want more specific help, you will have to be more specific with the problem.


Subject: A: Can I avoid RBAR
Posted by: markandersonus (view profile)
Posted on: Wednesday, April 03, 2013 at 7:15 AM
Message: Thanks for reply puzsol

Should have provided a bit more info:

I'm stuck with SQL 2008 R2 for the foreseeable future. Is there a way to implement something similar to TRY_CAST or TRY_PARSE (without CLR) for SQL 2008?

Reasons for failure are typically invalid data for type (i.e. invalid date, string too long, etc.) or missing lookup data (i.e. a contact with a CompanyID that does not exist, because of referential integrity issues in source data).

So for your suggestions am I correct in thinking the following (with 2008R2):

1. Delete bad rows based on my defined criteria and use OUTPUT clause to populate exception log with those rows

2. Insert into the dest table. Use the OUTPUT clause to get some kind of reference to successful records so that I can delete only the successful inserts?

A lot of the time my RBAR processing will use SELECT to get a list of IDs to be inserted and put those in a table variable. I will then iterate through the actual source rows (which can't be modified/deleted) using the list of IDs. I normally create a separate try/catch block inside the while loop, so I can handle the failures, log the errors and move to the next row.

This makes it easy to trap exceptions based on invalid data (the insert fails).

On a data set with under 10K records (for one-off usage), the RBAR processing is probably quicker than the time I would spend having to write a test for every column to see if it's a valid type, or am I missing something? Say I have some source data that has three text fields (extreme example) that need to be inserted into a date, a string (max 50 char) and a smallint. What's a good way in 2008 to use set-based processing to test that they're all valid data types and log an exception if not?

Regards

Mark

Subject: Puzsol
Posted by: Peter Larsson (not signed in)
Posted on: Wednesday, April 03, 2013 at 3:47 PM
Message: Great catch, and thank you for taking the time to test another edge condition for me.
The solution is simple, since the WHERE clause is evaluated before the ROW_NUMBER() function.

Change "ORDER BY IslandCandidate" to "ORDER BY MIN(EventDateTime)" and you will get the same result as the 2012 solution.


//Peter

Subject: Peter
Posted by: puzsol (view profile)
Posted on: Wednesday, April 03, 2013 at 5:01 PM
Message: Thanks Peter,

That seems to have fixed the pre-2012 query.

Last edge case you might want to test is the one that starts with Exit Exit (really fringe I know) on the 2012 solution.

Subject: RBAR Avoidance
Posted by: puzsol (view profile)
Posted on: Wednesday, April 03, 2013 at 6:21 PM
Message: Pre 2012, you have to perform your own conversion / testing.

One method would be using the CLR - I'm sure you could even find a library ready to use.

Or you could fix/expand this example for your own use...

go
;create function safeInt(@val nvarchar(max))
returns int as
begin

if @val like '%[^-0-9]%' or @val like '_%-%'
return null;
if ABS(cast(@val as bigint)) > 2147483647
return null;
return cast(@val as int)
end

go
;create function safeFloat(@val nvarchar(max))
returns float as
begin
if @val like '%[^-.0-9]%' or @val like '%.%.%' or @val like '_%-%'
return null;
return cast(@val as float)
end

go
;create function safeDate(@val nvarchar(max))
returns datetime as
begin
if ISDATE(@val) > 0
return cast(@val as datetime)
return null
end

go
;create function safeMoney(@val nvarchar(max))
returns money as
begin
if @val like '%[^$-.0-9]%' or @val like '%.%.%' or @val like '_%-%' or @val like '_%$%'
return null;
return cast(@val as money)
end

go
IF EXISTS ( SELECT *
FROM sys.tables
WHERE name = 'RawData' )
DROP TABLE RawData;

create table RawData
(info nvarchar(50), data nvarchar(200))
go

INSERT INTO RawData
( info, data )
VALUES
( 'int', '12345' ),
( 'negative int', '-12345' ),
( 'int math', '543-12' ),
( 'float', '123.45' ),
( 'date 1', convert(nvarchar, cast('29/Feb/2004' as datetime), 106) ),
( 'date 2', convert(nvarchar, cast('29/Feb/2004' as datetime), 120) ),
( 'date 3', convert(nvarchar, cast('29/Feb/2004' as datetime), 112) ),
( 'money', '$123.45' ),
( 'null', null ),
( 'too big for int', '2147483648')
go

select info, data
, dbo.safeInt(data) as AsInt
, dbo.safeFloat(data) as AsFloat
, dbo.safeDate(data) as AsDate
, dbo.safeMoney(data) as AsMoney
from RawData


(So I cheated on the int overflow testing, and didn't perform any overflow checks on the other numeric types, but hey, it's just one example of what might be done... you could probably find more robust tests from other people.) I took the approach of just returning null if the value is invalid; you could refactor it to be a test instead... but I figure you get the idea of how it could be used for both. eg:

-- Int value is not allowed to be null
select info, dbo.safeInt(data) as AsInt from RawData where dbo.safeInt(data) is not null
-- Int value can be null if set to null before conversion
select info, dbo.safeInt(data) as AsInt from RawData where (data is null or dbo.safeInt(data) is not null)

Subject: Professor, lots of errors
Posted by: AlexK (not signed in)
Posted on: Wednesday, April 03, 2013 at 7:22 PM
Message: Professor,

There are so many errors in your post that I do not have the time to address them all. I will correct just a few - that does not mean I agree with the rest.

"control-flow code such as loops are not fast for large data-sets. SQL and relational databases are born of set theory, which is entirely geared around doing operations to an entire data-set at once, rather than slowly, row-by-row"

Set based operations are implemented as loops, most likely in C/C++, and under the hood set-based commands process sets row by row, although maybe in parallel. There is nothing inherently slow about row-by-row processing. Only T-SQL implementation of cursors is slow. PL/SQL, on the other hand, boasts fast cursors. This is why if we google up "gaps and islands", lots of T-SQL solutions show up, and no Oracle ones. Solving gaps and islands in Oracle is trivial: all we need to do is write a simple cursor, so there is nothing to write books and articles about.

"Fresh college kids who are great with C# have great difficulty understanding the different mindset required to work with SQL"

This was true in 1995. Now C# is just as set-based as SQL, so "different mindset" is not needed.

"You test the code like any other code. You create a test system, you fill the table with 100000000 rows and you test how your query scales."

This is not good enough in real life: we can add 0.5% more rows and observe the performance decrease 50 times. For serious database development, we use slightly more sophisticated approaches.

In general, black-box testing can only get you that far. Consider, for example, Microsoft Zune which did not work on 366th day of the leap year of 2008. No amount of black-box testing in 2007 would expose this bug. Professional code review, on the other hand, should find it.

So, if you are OK with the quality and commercial success of Zune, then go ahead and keep testing as you recommended. Many of us who work in the field have to do better than that.

" The database engine behind SQL Server, or Oracle or any other database might be closed source, but you can still test it, and should. Note that C# compiles to CLR which is then run JIT on the CLR runtime engine. This runtime engine is also closed source."

After just a little bit of learning, I have no problem using CLR, or Linux Kernel, or git - all these things just work. Yes, I "have no idea how you code is being executed in a direct sense", but I do not care if it consistently works.

Theoretically, all software has some problems. Practically, however, there is a huge difference between a new Honda and a 20-year old Lada VAZ 2011. Both have a non-zero probability to break down during the next hour. However, to drive one of these two, I need to have a deep knowledge of its internals, and carry quite a few spare parts and tools in the trunk. On the other hand, I will ignore that probability.

"Queries are always cheaper to optimize than changing database tables"

This was true in 1995. Since then, optimizing queries became much more involved - in 2013 we need to learn more information to optimize queries. On the other hand, refactoring database tables became much easier - we regularly refactor large live OLTP tables without downtime for users, overtime for developers and DBAs, and without issues.

I could go on about how "set-based power of relational databases" may in real life be quite weak for temporal data, because time series are not inherently unordered as relation theory assumes, but this is all from us for now.

Maybe I should write an article or something...

Subject: Set-based vs client side processing
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 9:13 AM
Message: AlexK, I have tried to do the first "Phil Factor Speed Phreak" competition using client side code.

No matter what I do, it takes about 9 seconds to transfer the 1 million rows to the client for further processing. And then it takes another 6 seconds to process the data. A total of 15 seconds. This is 50 times slower than the set-based solution written in SQL Server 2012.

Please share your superior solution which gives the correct result in less than a second. You can use my email address swepeso at sqltopia dot com.

Subject: RBAR Avoidance
Posted by: markandersonus (view profile)
Posted on: Thursday, April 04, 2013 at 11:31 AM
Message: Thanks Puzsol

Will give it a shot on my next project

Regards

Mark

Subject: Your contribution is appreciated
Posted by: AlexK (not signed in)
Posted on: Thursday, April 04, 2013 at 12:13 PM
Message: Peter Larsson,

Your contribution to the following problem is appreciated

http://dba.stackexchange.com/questions/39272/gaps-and-islands-client-solution-vs-t-sql-query/39275#39275

I hope to learn from your expertise.

Subject: Stack Exchange
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 12:43 PM
Message: I am looking at the sample data and realize that it is a completely different data model than the one revealed by Kathi.

Subject: Your contribution is appreciated
Posted by: AlexK (not signed in)
Posted on: Thursday, April 04, 2013 at 12:52 PM
Message: Peter Larsson,

Your contribution to the following problem is appreciated

http://dba.stackexchange.com/questions/39272/gaps-and-islands-client-solution-vs-t-sql-query/39275#39275

I hope to learn from your expertise.

Subject: AlexK
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 1:12 PM
Message: It took a while to understand that the sample data provided by AlexK is not about gaps and islands; it's about packing date and time intervals.

However, here is a piece of code that runs in just 4 seconds - half the time of the C# code.


WITH cteRaw(ts, type, e, s)
AS (
    -- All start times, numbered in start order (s); e is NULL for these rows
    SELECT  StartedAt,
            1 AS type,
            NULL,
            ROW_NUMBER() OVER (ORDER BY StartedAt)
    FROM    dbo.Tasks

    UNION ALL

    -- All finish times, numbered in finish order (e); s is NULL for these rows
    SELECT  FinishedAt,
            -1 AS type,
            ROW_NUMBER() OVER (ORDER BY FinishedAt),
            NULL
    FROM    dbo.Tasks
), cteCombined(ts, e, s, se)
AS (
    -- Number all events in time order; at equal timestamps, starts sort before finishes
    SELECT  ts,
            e,
            s,
            ROW_NUMBER() OVER (ORDER BY ts, type DESC)
    FROM    cteRaw
), cteFiltered(ts, grpnum)
AS (
    -- Keep only the events that open or close an island, then pair them up
    SELECT  ts,
            (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 AS grpnum
    FROM    cteCombined
    WHERE   COALESCE(s + s - se - 1, se - e - e) = 0
)
SELECT  MIN(ts) AS starttime,
        MAX(ts) AS endtime
FROM    cteFiltered
GROUP BY grpnum;

Subject: Fast solution
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 1:41 PM
Message: And here is a solution which runs in 1 second vs the 8 seconds for the C# code


;WITH cteSource(StartedAt, FinishedAt)
AS (
    -- Pair each start with the previous finish to find the gap boundaries
    SELECT  s.StartedAt,
            e.FinishedAt
    FROM    (
                SELECT  StartedAt,
                        ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                FROM    dbo.Tasks
            ) AS s
            INNER JOIN (
                SELECT  FinishedAt,
                        ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                FROM    dbo.Tasks
            ) AS e ON e.rn = s.rn
    WHERE   s.StartedAt > e.FinishedAt

    UNION ALL

    -- Add the outer boundaries of the whole period
    SELECT  MIN(StartedAt),
            MAX(FinishedAt)
    FROM    dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    -- Unpivot the boundary pairs into a single time column and pair consecutive values
    SELECT  u.theTime,
            (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM    cteSource AS s
            UNPIVOT (
                theTime
                FOR theColumn IN (StartedAt, FinishedAt)
            ) AS u
)
SELECT  MIN(theTime),
        MAX(theTime)
FROM    cteGrouped
GROUP BY grp
ORDER BY grp;

Subject: Packing Date and Time
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 2:03 PM
Message: And here is a more optimized version of the previous one. It executes in slightly less time and uses 15% less CPU.

WITH cteSource(StartedAt, FinishedAt)
AS (
    SELECT  s.StartedAt,
            e.FinishedAt
    FROM    (
                SELECT  StartedAt,
                        ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                FROM    dbo.Tasks
            ) AS s
            INNER JOIN (
                SELECT  FinishedAt,
                        ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                FROM    dbo.Tasks
            ) AS e ON e.rn = s.rn
    WHERE   s.StartedAt > e.FinishedAt

    UNION ALL

    SELECT  MIN(StartedAt),
            MAX(FinishedAt)
    FROM    dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    SELECT  u.theTime,
            (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM    cteSource AS s
            CROSS APPLY (
                VALUES  (s.StartedAt),
                        (s.FinishedAt)
            ) AS u(theTime)
)
SELECT  MIN(theTime),
        MAX(theTime)
FROM    cteGrouped
GROUP BY grp;

Subject: Performance comparison
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 4:17 PM
Message: Alex and I are running a performance comparison right now, and we are seeing some discrepancies.

I am running SQL Server on my laptop connected to my home network. When running SSMS on my development machine against my laptop (4GB RAM and 4 cores), my solution runs in 1 second and Alex's solution takes about 15 seconds. I think most of the time here is spent pushing all 2 million rows over the network.

When Alex tests the two solutions at his place, the C# solution clocks in at 11 seconds whereas my solution clocks in at 12 seconds.

We are investigating why my solution is running so much slower at his place.

Subject: Call for help
Posted by: Peter Larsson (not signed in)
Posted on: Thursday, April 04, 2013 at 4:21 PM
Message: It would be great if someone else could help out testing the two solutions.

Have the database on a server and run the two solutions from another machine on the same network to simulate a production environment.

No one really runs enterprise applications on the database server, right?

Subject: New version of Peter Larsson's code
Posted by: Tony Davis (view profile)
Posted on: Friday, April 05, 2013 at 3:11 AM
Message: I've uploaded a version called PeterLarssonCode_v2 (see the speech bubble to the right of the article title) containing fixes for the bugs spotted by puzsol.

Cheers,
Tony.

Subject: use two test cases
Posted by: AlexK (not signed in)
Posted on: Friday, April 05, 2013 at 12:08 PM
Message: Peter,

I encourage you to use two sets of test data: with one gap and with 2m-1 gaps, and post two sets of numbers. When you return 2m-1 gaps, you are more exposed to the speed of your network.

With one gap, my solution returns in 1.4 sec which is 8.5 times faster than the 12 sec from your best solution. My setup is quite similar to what is actually used in production, although a little bit slower.

Subject: Kathi Kellenberger, help me please
Posted by: Naresh Ellanti (view profile)
Posted on: Wednesday, April 17, 2013 at 6:50 AM
Message: Kathi Kellenberger,

I have a similar requirement to Peter's in Listing 4: The Islands and Gaps Pre-2012 Solution.

Further, I need to do some calculation:
I want to calculate the sum of working hours for a particular day and a particular period.

Subject: Help me please
Posted by: Naresh Ellanti (view profile)
Posted on: Wednesday, April 17, 2013 at 6:54 AM
Message:
I have a similar requirement to Peter's in Listing 4: The Islands and Gaps Pre-2012 Solution.

Further, I need to do some calculation:
I want to calculate the SUM of working hours for a particular day and a particular period, ordered by staff member.
Do I need to insert the result data into a temp table and then write a query against it?
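A minimal sketch of one way to do this (not from the article: it assumes the output of the Listing 4 query has first been saved into a temporary table with the hypothetical name #WorkIntervals, holding StaffMember, EnterDateTime and ExitDateTime columns) is to aggregate by staff member and day:

SELECT  StaffMember,
        CAST(EnterDateTime AS DATE) AS WorkDay,
        SUM(DATEDIFF(SECOND, EnterDateTime, ExitDateTime)) / 3600.0 AS WorkHours
FROM    #WorkIntervals
WHERE   EnterDateTime >= '2013-01-07'   -- the reporting period here is illustrative
  AND   EnterDateTime <  '2013-01-14'
GROUP BY StaffMember,
         CAST(EnterDateTime AS DATE)
ORDER BY StaffMember,
         WorkDay;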

Subject: Peter
Posted by: Naresh Ellanti (view profile)
Posted on: Wednesday, April 17, 2013 at 7:10 AM
Message: I want to calculate the following structure:
[Employee], [Shift], [In], [Out], [Total Hrs], [Shift Hrs], [Break Hrs], [Work hours], [Late minutes], [Regular OT], [Week end OT], [Holiday OT]

I now have something working from Listing 4: The Islands and Gaps Pre-2012 Solution;
please help me to proceed further.

Subject: Similar Need, but Without EndDateTime
Posted by: TheSQLGuru (view profile)
Posted on: Thursday, April 25, 2013 at 9:19 AM
Message: I recently had to solve a similar need, processing temporal API-execution logging data for a Web 2.0 client. All I had to work with was AppID, UserID and a datetime stamp for each API call, and I had to come up with session-based metrics. It took multiple LAG/LEAD operations and some complex logic to get the needed output. I wound up with a query plan with TWENTY-SEVEN operators but just a SINGLE HIT on the table!! The enhanced SQL 2012 windowing function support is AWESOME!! Given that the enhancements are basically a v1.0 implementation, with both functional omissions and some significant performance issues in some cases (RANGE vs ROWS, for example), it should continue to improve too!
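As an illustration of the general pattern only (a minimal sketch - the dbo.ApiCalls table, its columns and the 30-minute timeout are assumptions for the example, not the actual client solution):

WITH Flagged AS
(
    -- Flag a call as the start of a new session when it is the first call
    -- for the user, or more than 30 minutes after the previous call
    SELECT  AppID,
            UserID,
            CalledAt,
            CASE WHEN LAG(CalledAt) OVER (PARTITION BY AppID, UserID
                                          ORDER BY CalledAt) IS NULL
                   OR DATEDIFF(MINUTE,
                               LAG(CalledAt) OVER (PARTITION BY AppID, UserID
                                                   ORDER BY CalledAt),
                               CalledAt) > 30
                 THEN 1 ELSE 0
            END AS IsNewSession
    FROM    dbo.ApiCalls
), Numbered AS
(
    -- A running total of the flags turns them into session numbers
    SELECT  AppID,
            UserID,
            CalledAt,
            SUM(IsNewSession) OVER (PARTITION BY AppID, UserID
                                    ORDER BY CalledAt
                                    ROWS UNBOUNDED PRECEDING) AS SessionNo
    FROM    Flagged
)
SELECT  AppID,
        UserID,
        SessionNo,
        MIN(CalledAt) AS SessionStart,
        MAX(CalledAt) AS SessionEnd,
        COUNT(*)      AS Calls
FROM    Numbered
GROUP BY AppID, UserID, SessionNo;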

Subject: Mind-bending SQL Substituting for Elegant Design
Posted by: iposner (view profile)
Posted on: Tuesday, May 21, 2013 at 3:00 AM
Message: While I admire the mind-bending SQL gymnastics on show, I can't help but feel that this is yet another replay of that tired record where heroic SQL is played out yet again.

Any database design that requires relative positioning of records one-to-another in the same table is a design prone to iterative scans - and is not purely set driven as there is additional data implied by the positioning of records within the set.

What's needed here is not the medication of T-SQL but the surgeon's knife of DDL!

As another poster has suggested, a design incorporating check-in check-out information in the same row would be a far better and simpler solution.

Subject: This solution is something to be admired BUT...
Posted by: vbp1 (view profile)
Posted on: Tuesday, May 21, 2013 at 8:00 AM
Message: This is a very elegant and functional solution.
I have always been an admirer of people who can come up with a gem like this,
and one that actually works nicely and reliably.

However, practice shows that if something out there requires this much engineering and workaround,
and needs it on a regular basis, then it is not a very good design to begin with.

As it happens, I have worked with applications dealing with time clock data and the like.
One of my previous employers was a company that was developing a commercial application for work scheduling and time keeping.
It started with a manual time scheduling and time keeping application, progressed to importing time data from external systems
like ADP, Kronos, Time Plus and several custom-built proprietary systems, as well as simple Excel spreadsheets,
and later to developing an in-house time tracking system with its own (punch-in time clock) entry terminals.

All in all, the whole solution provided myriad UI screens and on-screen and printed reports,
many of which were customized to customer needs and likes.


Let us dissect the Spec a little bit…

The business logic spec is

Based on data from a time clock, the challenge was to calculate, based on a series of entry and exit times for employees, the amount of time between each entry and exit recorded, for each person. The requirement is to calculate the amount of time in hours, minutes and seconds between each Enter and Exit, for each staff member. The first point to note is that while there is usually a matching Exit row for each Enter, sometimes there is an Enter with no Exit or an Exit with no Enter. Here are the business rules:
• If there are consecutive Enter rows, use the first one.
• If there are consecutive Exit rows, use the last one.
• Ignore any Exit rows with no matching Enter row (i.e. if the first row for an employee is an Exit, then ignore it).
• Ignore any Enter rows with no matching Exit row (i.e. if the last row for an employee is an Enter, then ignore it).
• The Exit for an Enter could be a subsequent day.

#1. We need to understand the DATA and the DATA SOURCE.
a. The data is from a time clock/entry point terminal (based on the spec above),
thus we can conclude that the data is ALWAYS in a linearly ordered structure.
Also, as each record has a time field, we can (or should be able to) reconstruct the entry order based on that as well.
The EXCEPTION here is only when the time clock/terminal is corrupt and sends outdated, bad data along with the good data.

#2. Is there any other useful data in another database/application that we can use for some smart data processing, using fuzzy-logic techniques or similar?

#3. We need to understand WHY this is needed.
• Is it needed for a performance report?
• Is the data used for a tardiness report?
• Is the data used in payroll?
• Is the data preserved in any way to be used for something else, or just for its historical significance?
• Why are we ignoring orphaned records (entry-only or exit-only records)?
• Are there any consequences for the employee based on the data?

As illustrated by poster “BO” in “Thoughts about practical things in real world”:
<QUOTE> The problem with matching entry and exit columns on the same row is that it does not work in the real world.
Often when you arrive at a door, you sneak in with someone else and thus miss a new entry row.
Or the opposite: when you leave the office you go through the door someone else opened just before you, and you don't swipe your own card.<END QUOTE>
However, many of these situations can be remedied by
a. using logic during data processing
b. implementing, and more importantly enforcing, company policies that would force employees to do the right thing,
leaving only the situations that occur due to human error or laziness.
That is, if the Entry/Exit system is used for payroll as well as security.

The spec above clearly shows that the data is not from an Entry/Exit system but from a time clock.
So basically this is employee time card data.

If an employee wants to get paid, then he/she MUST punch the time card in/out as needed.
Depending on the system used, bad records should only occur by accident, due to our error-prone human nature, as everyone wants to be paid for their work,
or due to malicious intent in an attempt to circumvent the system (which in some situations is possible if the records are kept manually and person-to-person relations are employed).


When I see a spec outlined as above, I read it very much like this:

#1. We have data we have no control over, thus we bring it in AS IS.
a. The data is a mess, since none of the employees are doing what is expected of them, because their actions have no consequences and/or we have no enforceable business policies to remedy this.

#2. We do not mind doing our payroll by hand and spending hours on entering, verifying and correcting the data for payroll, instead of looking into a good business solutions system that would do this for us.

#3. We do not mind some data loss here, as we will do everything we need manually (see #2).


I have seen this before many times. In fact, I saw it during my employment with the company mentioned at the beginning of this post.
Every time we would go to a prospective customer with our system, we were met with resistance and caution (which is understandable).
In comparison to companies like ADP, Kronos, Time Plus, etc., we were nobody.

But once we showed what our system could do for HR, Payroll and other business entities, the hook was in.


So the proper solution here would not be a query like the one above (if it were a one-time thing, fine, but if it is a recurring run, a more intelligent solution is needed), but rather a set of changes in business policies (where possible), plus a proper data-processing engine that would fix and clean up the data in a manner that eliminates the need for such acrobatics in SQL.

At the very least, the data import process should:
1. Look at the employee record set(s) and either generate matching records as needed
(this can only be done IF employee schedule data is available, i.e. we have a table that tells us the hours an employee is scheduled to work;
also, we would need to mark such records as NOT FOR PAYROLL USE, as they require verification of the actual time the employee entered/exited),
or remove them from the table (do not import the records into the reporting table, but keep them in the source for historical use).
Best practice here would be to create a report table and a set of processing routines that process the data on a schedule.
I very much doubt that this report needs to be run as an ad hoc thing, so we can make sure that all data is available by the time we run it.

2. Validate the entries to make sure that the resulting set conforms to all the rules.
3. Calculate the time frames the customer seeks. We might also need to calculate the SUM() of all entries per employee per day, if the customer wants a comparison of total working hours vs total away hours.

Subject: Simple is Good
Posted by: drueter (view profile)
Posted on: Tuesday, May 21, 2013 at 2:02 PM
Message: While developers do need to consider a risk of poor future performance with large datasets, developers must also consider other risks:

*Risk of error in original code
*Risk of introducing future error due to another developer not understanding code
*Risk of introducing future error due to original developer not remembering the details of the code
*Risk of certain data conditions resulting in erroneous results
*etc.

Simplicity helps safeguard against these risks. Simplicity also often helps optimize performance...at least to a reasonable level.


I'd suggest starting with a very basic set-based solution like this:

-- EventDate and EventTime are combined via explicit casts, since the DATE type
-- does not support the + operator
SELECT
    x.StaffMember,
    x.EnterDateTime,
    x.ExitDateTime,
    x.WorkTime
FROM (
    SELECT
        sh1.StaffMember,
        CAST(sh1.EventDate AS datetime) + CAST(sh1.EventTime AS datetime) AS EnterDateTime,
        CAST(sh2.EventDate AS datetime) + CAST(sh2.EventTime AS datetime) AS ExitDateTime,
        CONVERT(CHAR(8), DATEADD(SECOND, DATEDIFF(SECOND, CAST(sh1.EventDate AS datetime) + CAST(sh1.EventTime AS datetime), CAST(sh2.EventDate AS datetime) + CAST(sh2.EventTime AS datetime)), '00:00:00'), 8) AS WorkTime,
        ROW_NUMBER() OVER (PARTITION BY sh1.StaffMember, sh1.EventDate, sh1.EventTime ORDER BY CAST(sh2.EventDate AS datetime) + CAST(sh2.EventTime AS datetime) DESC) AS Seq
    FROM
        StaffHours sh1
        JOIN StaffHours sh2 ON
            sh1.StaffMember = sh2.StaffMember AND
            CAST(sh1.EventDate AS datetime) + CAST(sh1.EventTime AS datetime) <= CAST(sh2.EventDate AS datetime) + CAST(sh2.EventTime AS datetime)
        LEFT JOIN StaffHours sh3 ON
            sh1.StaffMember = sh3.StaffMember AND
            sh1.EventType = sh3.EventType AND
            CAST(sh3.EventDate AS datetime) + CAST(sh3.EventTime AS datetime) > CAST(sh1.EventDate AS datetime) + CAST(sh1.EventTime AS datetime) AND
            CAST(sh3.EventDate AS datetime) + CAST(sh3.EventTime AS datetime) < CAST(sh2.EventDate AS datetime) + CAST(sh2.EventTime AS datetime)
    WHERE
        sh1.EventType = 'Enter' AND
        sh2.EventType = 'Exit' AND
        sh3.StaffMember IS NULL
) x
WHERE
    x.Seq = 1;


This query should be very easy for most people to read, and performs fairly well: it results in Scan Count of 3 (instead of 6 in the original query).


sh1 is the Enter transactions, some of which will be omitted (in the case of extraneous unmatched Enters)

sh2 is the possible Exit transactions (Exits that are later than the Enter), that will be filtered by a ROW_NUMBER sequence to get only the latest one

sh3 is the extraneous Enter transactions to exclude

This simple query follows the business rules point-by-point to avoid error and confusion:

a) "If there are consecutive Enter rows, use the first one": handled by the "sh3.StaffMember IS NULL" WHERE condition

b) "If there are consecutive Exit rows, use the last one": handled by the "x.Seq = 1" WHERE condition

c) "Ignore any Exit rows with no matching Enter row": handled by the JOIN conditions for sh2

d) "Ignore any Enter rows with no matching Exit row": handled by the JOIN conditions for sh2

e) "The Exit for an Enter could be a subsequent day": handled by evaluating EventDate + EventTime


When it does come time to optimize, simply looking at Scan Count is not enough: an index seek is much less expensive than a table scan, but both get treated as a "Scan Count".

Also, by having to build EventDateTime from the EventDate and EventTime components on the fly, we have built-in performance limitations. A DDL change to add an EventDateTime column that could be indexed would be much better. In the absence of that possibility, even a view or an indexed view could help.
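For example, a minimal sketch of such a DDL change (assuming the StaffHours table from Listing 1; the column and index names here are illustrative, not from the article):

-- Persist the combined value so it can be indexed and reused
ALTER TABLE StaffHours
    ADD EventDateTime AS ( CAST(EventDate AS datetime) + CAST(EventTime AS datetime) ) PERSISTED;

CREATE INDEX IX_StaffHours_Member_EventDateTime
    ON StaffHours ( StaffMember, EventDateTime )
    INCLUDE ( EventType );

With the combined value persisted and indexed, the pairing queries can seek on StaffMember and EventDateTime rather than computing the expression for every row.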

Furthermore, in the "real world" (I say this having built this type of payroll system in the past), you almost certainly want to crunch a stream of Enter / Exit transactions into a higher-level table of intervals worked. (Reasons: manual editability of time data, ease of reporting, performance, et al)

The processing of this data needs to happen only once, and only on the unprocessed rows. Thus, even if you do end up with millions of rows of Enter / Exit transactions, you only need to touch a small handful of them--so resorting to complicated queries is usually not needed.
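A minimal sketch of that incremental pattern (the WorkedInterval table, the watermark logic, and the vMatchedIntervals view - standing in for whichever pairing query you prefer - are all hypothetical names, not from the article):

IF OBJECT_ID('dbo.WorkedInterval') IS NULL
    CREATE TABLE dbo.WorkedInterval
        (
          StaffMember CHAR(1),
          EnterDateTime DATETIME,
          ExitDateTime DATETIME
        );

-- Only rows newer than the last interval already stored get processed
DECLARE @Watermark DATETIME =
    ( SELECT COALESCE(MAX(ExitDateTime), '19000101') FROM dbo.WorkedInterval );

INSERT INTO dbo.WorkedInterval ( StaffMember, EnterDateTime, ExitDateTime )
SELECT  StaffMember,
        EnterDateTime,
        ExitDateTime
FROM    dbo.vMatchedIntervals
WHERE   EnterDateTime > @Watermark;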

The neat thing about SQL is that there are a lot of ways to accomplish something. When selecting a particular way I want to do something, I usually put more emphasis on avoiding errors than I do on (premature) optimization. When it does come time to optimize (because performance demands it), there are generally lots of ways of accomplishing that too.

Subject: Does this work with MySQL?
Posted by: anhdv (view profile)
Posted on: Tuesday, October 22, 2013 at 10:01 PM
Message: Could you tell me if this is supported in MySQL?

Subject: Convert this SQL Query to LINQ
Posted by: aRTx (view profile)
Posted on: Tuesday, December 10, 2013 at 1:43 AM
Message: Is this possible?
To convert this beautiful SQL query to LINQ?
