Dwain Camps

Calculating Gaps Between Overlapping Time Intervals in SQL

14 February 2014

A number of real-life reporting tasks in SQL require a 'gaps and islands' analysis.  There are several techniques around that work, but finding ones that scale well makes for a tougher, but more interesting, challenge.

When building a web site that requires login access, it is normal practice, and even a statutory requirement in some places, to keep a record of user logins.  For a high-usage web site, this log can be invaluable for identifying the periods when the site is infrequently used, so that a suitable time can be scheduled for maintenance such as operating system patching.

Since you have no control over when authorized users log into your web sites, you’ll always have periods when more than one user is logged in at the same time.  Login and logout times will be staggered, so the intervals when users are logged in will overlap.  Today we’ll ultimately be showing you how to find the gaps of inactivity, but along the way we will also show you a neat way to identify the periods where there is activity.

Although this may seem like a slightly contrived problem, there are plenty of times that you are faced with the sort of reporting where special SQL techniques are required. For example:

  • If you have a fleet of vehicles where events represent departure and return, you may wish to ask during what periods at least one of your vehicles is out, and when none of them are in use.
  • In a manufacturing plant with multiple assembly lines where events are line starts and line stops, you can find out during what periods at least one line is in use or the periods when all lines are idle.
  • By adding a bit of PARTITIONing to these queries, you can handle the case of a hotel that has multiple types of rooms.  Then you can ask during what times at least one room of a given type is occupied, or when all rooms of a particular type are unoccupied.

New features in SQL 2012 make short work of some of these gaps problems, so we will need SQL 2012 to run many of the examples, specifically those that relate to data clean-up.  These solutions are not limited to login/logout intervals; any defined interval between specific events will do.

The accompanying resource file: Sample Queries (2012).sql contains all of the sample queries shown in the following sections.  All SQL resource files are appended with either (2008) or (2012) to indicate which version of SQL they can be run in.

Problem Setup and Explanation

First we’ll create a table with some basic set-up data representing logins by three users.  The login periods will overlap and contain a gap, to illustrate the essentials of what we are trying to do.

IF OBJECT_ID('tempdb.dbo.#Events', 'U') IS NOT NULL
DROP TABLE #Events;

CREATE TABLE #Events
(
    ID BIGINT                      IDENTITY PRIMARY KEY
    ,UserID                        INT
    ,EventTime                     DATETIME
    ,[Event]                       TINYINT -- 0=Login, 1=Logout
);

INSERT INTO #Events
SELECT 1, '2013-09-01 15:33', 0
UNION ALL SELECT 2, '2013-09-01 16:15', 0
UNION ALL SELECT 2, '2013-09-01 17:00', 1
UNION ALL SELECT 3, '2013-09-01 17:10', 0
UNION ALL SELECT 1, '2013-09-01 18:20', 1
UNION ALL SELECT 3, '2013-09-01 19:10', 1
UNION ALL SELECT 1, '2013-09-02 11:05', 0
UNION ALL SELECT 1, '2013-09-02 11:45', 1;

SELECT * FROM #Events ORDER BY UserID, EventTime;

The raw sample data displayed by the final SELECT are as follows:

ID    UserID  EventTime                   Event
1     1       2013-09-01 15:33:00.000     0
5     1       2013-09-01 18:20:00.000     1
7     1       2013-09-02 11:05:00.000     0
8     1       2013-09-02 11:45:00.000     1
2     2       2013-09-01 16:15:00.000     0
3     2       2013-09-01 17:00:00.000     1
4     3       2013-09-01 17:10:00.000     0
6     3       2013-09-01 19:10:00.000     1

We can see that logins ([Event]=0) and logouts ([Event]=1) are evenly matched, meaning that a logout time was recorded for each login.  Let’s take a look graphically at these four login sessions.

Each login ID is represented by a different color.  You can see the overlaps within the group of logins on the afternoon of 01 Sep, and the separate, single login by UserID=1 on 02 Sep.  From the data, we ultimately seek to expose the gap between the logout of UserID=3 (blue) and the second login of UserID=1 (2013-09-01 19:10 to 2013-09-02 11:05).
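
If the chart isn’t available, a rough text sketch of the four sessions conveys the same picture (not to scale):

UserID=1 (01 Sep)  15:33 |--------------------------| 18:20
UserID=2 (01 Sep)        16:15 |------| 17:00
UserID=3 (01 Sep)              17:10 |----------| 19:10
                                      ... gap: 01 Sep 19:10 to 02 Sep 11:05 ...
UserID=1 (02 Sep)                     11:05 |---| 11:45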

Pairing of Different, Sequential Events

The following solutions assume that concurrent logins by a single account are not allowed.  When your login/logout events are evenly paired, i.e., there are no missing logout endpoints, a solution such as the one I’ll be describing will group these events so that your query can place the start/end times of each session on a single row.
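
If you want to verify that the pairing assumption holds in your own data, a quick check like the following will flag any user with two consecutive events of the same type.  This is a sketch using the SQL 2012 LAG function; on SQL 2008 a self-join would be needed instead.

SELECT UserID, EventTime, [Event]
FROM
(
    SELECT UserID, EventTime, [Event]
        ,PrevEvent=LAG([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime)
    FROM #Events
) a
-- Two consecutive logins (or logouts) for the same user break the pairing assumption
WHERE [Event] = PrevEvent;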

Sample Query #1: How to Pair up Logins with Logouts

SELECT UserID, LoginTime=MIN(EventTime), LogoutTime=MAX(EventTime)
FROM
(
    -- Construct a grouping number (rn) that is 0, 0, 1, 1, 2, 2, ... across the rows
    SELECT UserID, EventTime
        ,rn=(ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EventTime)-1)/2
    FROM #Events
) a
-- When we group by rn we have paired sessions
GROUP BY UserID, rn
ORDER BY UserID, LoginTime;

Results from Sample Query #1:

UserID  LoginTime                 LogoutTime
1       2013-09-01 15:33:00.000   2013-09-01 18:20:00.000
1       2013-09-02 11:05:00.000   2013-09-02 11:45:00.000
2       2013-09-01 16:15:00.000   2013-09-01 17:00:00.000
3       2013-09-01 17:10:00.000   2013-09-01 19:10:00.000

Note how, in the derived table, we’ve created a grouping column as (ROW_NUMBER()-1)/2, which is a well-known trick for pairing consecutive rows.  Remembering that integer division truncates the result, here are the values of rn for the first four rows:

1.  (1-1)/2 = 0
2.  (2-1)/2 = 0
3.  (3-1)/2 = 1
4.  (4-1)/2 = 1

Many times, usually due to incomplete implementation, web sites will not record or generate a logout event.  This clutters the data with spurious, unpaired logins.  Let’s add a couple of these to our data.

INSERT INTO #Events
SELECT 1, '2013-09-01 18:00', 0
UNION ALL SELECT 2, '2013-09-01 18:01', 0;

SELECT * FROM #Events ORDER BY UserID, EventTime;

The sample data now appears as follows.  Note that the logins with IDs 1 and 10 end up unpaired: the new login at 18:00 (ID 9) will pair with the 18:20 logout, leaving the earlier login by UserID=1 (ID 1) without a matching logout, while the new login by UserID=2 (ID 10) has no subsequent event at all.

ID   UserID   EventTime                  Event
1    1        2013-09-01 15:33:00.000    0
9    1        2013-09-01 18:00:00.000    0
5    1        2013-09-01 18:20:00.000    1
7    1        2013-09-02 11:05:00.000    0
8    1        2013-09-02 11:45:00.000    1
2    2        2013-09-01 16:15:00.000    0
3    2        2013-09-01 17:00:00.000    1
10   2        2013-09-01 18:01:00.000    0
4    3        2013-09-01 17:10:00.000    0
6    3        2013-09-01 19:10:00.000    1

In this case a different approach is required to pair logins with logouts.  Note that OUTER APPLY (rather than CROSS APPLY) is needed to retain in the results set the final, unpaired login for UserID=2.

Sample Query #2: Pairing Logins with no Corresponding Logout

SELECT UserID, LoginTime=a.EventTime, LogoutTime=b.EventTime
FROM #Events a
-- Use an OUTER APPLY to retain outer query rows when the inner query returns no row
OUTER APPLY
(
    -- Find the first of any events for this user after the login.  If it is a logout,
    -- pick up its EventTime as the time of logout; if it is another login, return NULL
    SELECT TOP 1 EventTime=CASE WHEN [Event] = 1 THEN EventTime END
    FROM #Events b
    WHERE a.UserID = b.UserID AND b.EventTime > a.EventTime
    ORDER BY b.EventTime
) b
-- Filter on only login events
WHERE [Event] = 0
ORDER BY UserID, LoginTime;

Since the execution plan of this query shows two Clustered Index Scans, compared to the single Clustered Index Scan in the plan for the first query, we’d expect the second query to be a bit slower than the first.  So this data clean-up comes at a cost.
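
If you want to confirm this on your own system, the standard SET STATISTICS approach works well (nothing specific to this article's code):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- Run Sample Query #1 here, then Sample Query #2, and compare the CPU/elapsed
-- times and logical reads reported in the Messages tab
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;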

Results from Sample Query #2:

UserID  LoginTime                  LogoutTime
1       2013-09-01 15:33:00.000    NULL
1       2013-09-01 18:00:00.000    2013-09-01 18:20:00.000
1       2013-09-02 11:05:00.000    2013-09-02 11:45:00.000
2       2013-09-01 16:15:00.000    2013-09-01 17:00:00.000
2       2013-09-01 18:01:00.000    NULL
3       2013-09-01 17:10:00.000    2013-09-01 19:10:00.000

Unfortunately, further testing of this pairing method on a large row set showed that it is quite slow.  To find a better alternative, we’ll turn to the new LEAD analytic function in SQL 2012.  You’ll find that the following query returns exactly the same row set as the prior one, but performs much better (in fact it takes about 10% of the time to run).

Sample Query #3: Using LEAD to Improve Performance (where some logins have no logout)

SELECT UserID, LoginTime, LogoutTime
FROM
(
    SELECT ID, UserID
        ,LoginTime=CASE WHEN [Event] = 0 THEN EventTime END
        ,LogoutTime=CASE
            -- If the next Event by EventTime is a logout
            WHEN LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime) = 1
            -- Then we pick up the corresponding EventTime
            THEN LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
            -- Otherwise logout time will be set to NULL
            END
        ,[Event]
    FROM #Events
) b
-- Filter only on login events
WHERE [Event] = 0
ORDER BY UserID, LoginTime;

It is quite easy to eliminate the rows with missing LogoutTimes in either of the above queries by adding a check for NULL LogoutTime to the WHERE clause.
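
For example, the filter in Sample Query #3 would become (this is exactly the filter that Sample Query #4 will use in its first CTE):

-- Keep only logins that were successfully paired with a logout
WHERE [Event] = 0 AND LogoutTime IS NOT NULL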

It would also be possible to implement other business rules for cases where no logout time is recorded for a login event, for example:

  • Use the average duration of that user’s logins to predict an approximate LogoutTime.
  • Assume the user is still logged in, so substitute GETDATE() for the LogoutTime.

We will leave the first of these as an exercise for the interested reader; a minimal sketch of the second follows.
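
This is Sample Query #3 with ISNULL substituting the current time for any missing logout; it assumes that an "open" session should be treated as still in progress:

-- A sketch of the GETDATE() business rule; assumes open sessions are still active
SELECT UserID, LoginTime, LogoutTime=ISNULL(LogoutTime, GETDATE())
FROM
(
    SELECT UserID
        ,LoginTime=CASE WHEN [Event] = 0 THEN EventTime END
        ,LogoutTime=CASE
            WHEN LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime) = 1
            THEN LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
            END
        ,[Event]
    FROM #Events
) a
WHERE [Event] = 0
ORDER BY UserID, LoginTime;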

Packing the Overlapping Login Intervals

Sometime back, a good friend of mine pointed me to an article that describes a way to pack overlapping intervals.  That article is called Packing Intervals and it is by none other than SQL MVP and all-around guru, Mr. Itzik Ben-Gan.  When I first read that article, I was impressed by the elegantly simple explanation of how this highly imaginative query works, so I won’t try to inadequately reproduce it here.  What I can tell you is that it took me all of about three milliseconds (coincidentally the resolution of the SQL DATETIME data type) to realize that this was something I needed to retain in my tool box.  Let’s look at how we would apply this technique to our sample data, by first putting Sample Query #3 (eliminating the spurious logins) into a Common Table Expression and then applying the Packing Intervals technique.

Sample Query #4: Pack the Overlapping Intervals down to non-overlapping Events

WITH GroupedLogins AS
(
    -- Essentially the same as Sample Query #3 except for a different filter
    SELECT UserID, LoginTime, LogoutTime
    FROM
    (
        SELECT ID, UserID
            ,LoginTime=CASE WHEN [Event] = 0 THEN EventTime END
            ,LogoutTime=CASE
                -- If the next Event by EventTime is a logout
                WHEN LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime) = 1
                -- Then we pick up the corresponding EventTime
                THEN LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
                -- Otherwise logout time will be set to NULL
                END
            ,[Event]
        FROM #Events
    ) a
    -- Filter only on login events that are not "open" (no logout)
    WHERE [Event] = 0 AND LogoutTime IS NOT NULL
),
C1 AS
(
    -- Since the CTE above produces rows with a start and end date, we'll first unpivot
    -- those using CROSS APPLY VALUES and assign a type to each.  The columns e and s will
    -- either be NULL (s NULL for an end date, e NULL for a start date) or row numbers
    -- sequenced by the time.  UserID has been eliminated because we're looking for
    -- overlapping intervals across all users.
    SELECT ts, [Type]
        ,e=CASE [Type]
            WHEN 1 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Type] ORDER BY LogoutTime) END
        ,s=CASE [Type]
            WHEN -1 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Type] ORDER BY LoginTime) END
    FROM GroupedLogins
    CROSS APPLY (VALUES (1, LoginTime), (-1, LogoutTime)) a([Type], ts)
),
C2 AS
(
    -- Add a row number ordered as shown
    SELECT C1.*, se=ROW_NUMBER() OVER (ORDER BY ts, [Type] DESC)
    FROM C1
),
C3 AS
(
    -- Create a grpnm that pairs the rows
    SELECT ts, grpnm=FLOOR((ROW_NUMBER() OVER (ORDER BY ts)-1) / 2 + 1)
    FROM C2
    -- This filter is the magic that eliminates the overlaps
    WHERE COALESCE(s-(se-s)-1, (se-e)-e) = 0
),
C4 AS
(
    -- Grouping by grpnm restores the records to only non-overlapped intervals
    SELECT StartDate=MIN(ts), EndDate=MAX(ts)
    FROM C3
    GROUP BY grpnm
)
SELECT *
FROM C4
ORDER BY StartDate;

It is instructive to see how each successive CTE (ignoring the first because we’ve already seen what it does) returns a row set that brings us closer to our target of non-overlapping intervals.  Rather than being repetitive, I’ll refer you to the comments at each step (above).
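
To reproduce these intermediate row sets yourself, you can temporarily swap the final SELECT of Sample Query #4 for one that targets an earlier CTE, for example:

-- Substitute for the final SELECT to inspect the output of C2
SELECT * FROM C2 ORDER BY se;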

Results from C1:

ts                       Type   e      s
2013-09-01 17:00:00.000  -1     1      NULL
2013-09-01 19:10:00.000  -1     3      NULL
2013-09-01 18:20:00.000  -1     2      NULL
2013-09-02 11:45:00.000  -1     4      NULL
2013-09-01 16:15:00.000  1      NULL   1
2013-09-01 17:10:00.000  1      NULL   2
2013-09-01 18:00:00.000  1      NULL   3
2013-09-02 11:05:00.000  1      NULL   4

Results from C2:

ts                       Type   e      s      se
2013-09-01 16:15:00.000  1      NULL   1      1
2013-09-01 17:00:00.000  -1     1      NULL   2
2013-09-01 17:10:00.000  1      NULL   2      3
2013-09-01 18:00:00.000  1      NULL   3      4
2013-09-01 18:20:00.000  -1     2      NULL   5
2013-09-01 19:10:00.000  -1     3      NULL   6
2013-09-02 11:05:00.000  1      NULL   4      7
2013-09-02 11:45:00.000  -1     4      NULL   8

Results from C3:

ts                       grpnm
2013-09-01 16:15:00.000  1
2013-09-01 17:00:00.000  1
2013-09-01 17:10:00.000  2
2013-09-01 19:10:00.000  2
2013-09-02 11:05:00.000  3
2013-09-02 11:45:00.000  3

The results produced by this fascinating method are shown below.

Final Results from Sample Query #4:

StartDate                 EndDate
2013-09-01 16:15:00.000   2013-09-01 17:00:00.000
2013-09-01 17:10:00.000   2013-09-01 19:10:00.000
2013-09-02 11:05:00.000   2013-09-02 11:45:00.000

If we compare these with the results from Sample Query #2, we find that they represent the packed intervals across all of the UserIDs.  You can also think of these as “islands” of time where users are logged into our system.

Before we move on to finding the gaps between these islands, let’s take a closer look at what’s happening in CTEs GroupedLogins and C1.  The only reason we have GroupedLogins to begin with is to remove the spurious logins.  Looking at the work being done by C1, it is ungrouping that which we just grouped.  This leads us to believe that if the spurious logins were not present it might be possible to simplify the query, so let’s try to do this after first deleting the two rows we added to our original sample data.

Sample Query #5: Simplify the Packing Method by using our logins/logouts directly

DELETE FROM #Events WHERE ID IN (9,10);

WITH C1 AS
(
    -- It is no longer necessary to CROSS APPLY VALUES to UNPIVOT.  The columns e and s will
    -- either be NULL (s NULL for an end date, e NULL for a start date) or row numbers
    -- sequenced by the time.  UserID has been eliminated because we're looking for
    -- overlapping intervals across all users.  Now we're using the native event type
    -- [Event] instead of creating one during the UNPIVOT.
    SELECT EventTime, [Event]
        ,e=CASE [Event]
            WHEN 0 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
        ,s=CASE [Event]
            WHEN 1 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
    FROM #Events
),
C2 AS
(
    -- Add a row number ordered as shown
    SELECT C1.*, se=ROW_NUMBER() OVER (ORDER BY EventTime, [Event] DESC)
    FROM C1
),
C3 AS
(
    -- Create a grpnm that pairs the rows
    SELECT EventTime, grpnm=FLOOR((ROW_NUMBER() OVER (ORDER BY EventTime)-1) / 2 + 1)
    FROM C2
    -- This filter is the magic that eliminates the overlaps
    WHERE COALESCE(s-(se-s)-1, (se-e)-e) = 0
),
C4 AS
(
    -- Grouping by grpnm restores the records to only non-overlapped intervals
    SELECT StartDate=MIN(EventTime), EndDate=MAX(EventTime)
    FROM C3
    GROUP BY grpnm
)
SELECT *
FROM C4
ORDER BY StartDate;

Here we have used the native event type, instead of constructing one using CROSS APPLY VALUES as Mr. Ben-Gan did in his original C1 CTE.  The results from this query are shown below.

Results from Sample Query #5:

StartDate                EndDate
2013-09-01 15:33:00.000  2013-09-01 19:10:00.000
2013-09-02 11:05:00.000  2013-09-02 11:45:00.000

These correspond to the packed intervals in our original data, including the depiction in our diagram.  Note that they are slightly different from the packed results we got from Sample Query #4, because the “spurious logins” that we removed resulted in slightly different overlapping intervals.

Converting Packed Intervals to the Corresponding Gaps

Once we have our packed intervals (“islands”) it is simple enough to add just a little bit of code to convert these into gaps.  We draw upon a previous Simple-Talk article called The SQL of Gaps and Islands in Sequence Numbers, specifically its CROSS APPLY VALUES method of converting islands to gaps.  We will add this technique to Sample Query #5 to arrive at:

Sample Query #6: Convert Packed Intervals into Gaps Between them

WITH C1 AS
(
    -- The columns e and s will either be NULL (s NULL for an end date, e NULL for a
    -- start date) or row numbers sequenced by the time.  UserID has been eliminated
    -- because we're looking for overlapping intervals across all users.  Now we're
    -- using the native event type [Event] instead of creating one during the UNPIVOT.
    SELECT EventTime, [Event]
        ,e=CASE [Event]
            WHEN 0 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
        ,s=CASE [Event]
            WHEN 1 THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
    FROM #Events
),
C2 AS
(
    -- Add a row number ordered as shown
    SELECT C1.*, se=ROW_NUMBER() OVER (ORDER BY EventTime, [Event] DESC)
    FROM C1
),
C3 AS
(
    -- Create a grpnm that pairs the rows
    SELECT EventTime, grpnm=FLOOR((ROW_NUMBER() OVER (ORDER BY EventTime)-1) / 2 + 1)
    FROM C2
    -- This filter is the magic that eliminates the overlaps
    WHERE COALESCE(s-(se-s)-1, (se-e)-e) = 0
),
C4 AS
(
    -- Grouping by grpnm restores the records to only non-overlapped intervals
    SELECT StartDate=MIN(EventTime), EndDate=MAX(EventTime)
    FROM C3
    GROUP BY grpnm
)
-- CROSS APPLY VALUES to convert islands to gaps
SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
FROM
(
    SELECT EventTime
        ,rn=ROW_NUMBER() OVER (ORDER BY EventTime)/2
    FROM C4 a
    CROSS APPLY (VALUES (StartDate), (EndDate)) b(EventTime)
) a
GROUP BY rn
HAVING COUNT(*) = 2
ORDER BY StartTime;

Comparing the results of this query with the Results from Sample Query #5, we can see by inspection that the row returned is the gap between the two islands.

Results from Sample Query #6:

StartTime                 EndTime
2013-09-01 19:10:00.000   2013-09-02 11:05:00.000

Obviously, if we needed to handle spurious logins, the little bit of code we added after C4 could be applied just as easily to Sample Query #4.
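
Incidentally, if the report also needs the length of each gap, the islands-to-gaps segment can compute it in place.  A minimal sketch follows (reporting minutes is an arbitrary choice):

-- Drop-in replacement for the final SELECT of Sample Query #6 (everything after C4)
SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
    ,GapMinutes=DATEDIFF(minute, MIN(EventTime), MAX(EventTime))
FROM
(
    SELECT EventTime
        ,rn=ROW_NUMBER() OVER (ORDER BY EventTime)/2
    FROM C4 a
    CROSS APPLY (VALUES (StartDate), (EndDate)) b(EventTime)
) a
GROUP BY rn
HAVING COUNT(*) = 2
ORDER BY StartTime;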

Building a Test Harness

Now that we have a way to calculate gaps between overlapping intervals, we want to make sure that the code we’ve built will scale to large row sets in a reasonable fashion.  To do this, we must build a test harness with a realistic number of rows.

We can simulate the duration of a login by starting the login at a particular time of day and then assuming that the user remains logged in for a random period of time.  In SQL, we can generate uniform random numbers like this:

SELECT 10+ABS(CHECKSUM(NEWID()))%120

This will generate a uniformly-distributed random integer between 10 and 129, which could, for example, represent the number of minutes the user was logged in.  But we’re going to have a little fun and say that our login durations are not drawn from a uniform distribution; rather, the time logged in will behave more like a normal distribution with a mean of 240 seconds and a standard deviation of 60 seconds.  To do this, we need a way to generate normally-distributed random numbers.

CREATE FUNCTION dbo.RN_NORMAL
(
    @Mean       FLOAT
    ,@StDev     FLOAT
    ,@URN1      FLOAT
    ,@URN2      FLOAT
)
RETURNS FLOAT WITH SCHEMABINDING
AS
BEGIN
    -- Based on the Box-Muller transform; note that LOG() is the natural logarithm in T-SQL
    RETURN (@StDev * SQRT(-2 * LOG(@URN1)) * COS(2 * ACOS(-1.) * @URN2)) + @Mean
END

The function presented above does this.  Generating Non-uniform Random Numbers with SQL contains an empirical proof that it does, and you will also find several other random number generators for other distributions there.  To use this function, you must pass it two uniform random numbers on the interval 0 to 1.
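
For example, here is a quick smoke test drawing one value from a normal distribution with mean 240 and standard deviation 60, using the same RAND(CHECKSUM(NEWID())) pattern the test harness below uses:

-- Each RAND(CHECKSUM(NEWID())) supplies a uniform random number on [0, 1)
SELECT dbo.RN_NORMAL(240., 60., RAND(CHECKSUM(NEWID())), RAND(CHECKSUM(NEWID())));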

The test data (1,000,000 rows) can be generated based on 500 days and 1,000 UserIDs as follows:

DECLARE @StartDate          DATETIME    = '2005-01-01'
    -- 500 days x 1000 users x 2 endpoints = 1,000,000 rows
    ,@UserIDs               INT         = 1000
    ,@Days                  INT         = 500
    ,@MeanLoginSeconds      INT         = 240
    ,@StdDevLoginSeconds    INT         = 60;

WITH Tally (n) AS
(
    SELECT TOP
    (
        SELECT CASE WHEN @UserIDs > @Days THEN @UserIDs ELSE @Days END
    )
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO #Events
SELECT UserID=f.n, EventTime, [Event]
-- Create 2 uniform random numbers
FROM (SELECT RAND(CHECKSUM(NEWID()))) a(URN1)
CROSS APPLY (SELECT RAND(CHECKSUM(NEWID()))) b(URN2)
-- Create 1 normally-distributed random number from the 2 URNs using the
-- Box-Muller transform function
CROSS APPLY
(
    SELECT dbo.RN_NORMAL(@MeanLoginSeconds, @StdDevLoginSeconds, URN1, URN2)
) c(NRN)
-- Use our Tally table to generate the required number of @Days then convert to a date
CROSS APPLY (SELECT n FROM Tally WHERE n BETWEEN 1 AND @Days) d
CROSS APPLY (SELECT DATEADD(day, d.n-1, @StartDate)) e([Date])
-- Use our Tally table to generate the required number of @UserIDs and logins/logouts
CROSS APPLY (SELECT n FROM Tally WHERE n BETWEEN 1 AND @UserIDs) f
CROSS APPLY (SELECT DATEADD(second, ABS(CHECKSUM(NEWID()))%84000, e.[Date])) g(LoginStart)
CROSS APPLY
(
    VALUES (0, LoginStart), (1, DATEADD(second, NRN, LoginStart))
) h([Event], EventTime);

Each UserID logs in once per day.  Rows resulting from the query can be increased in increments of 1,000,000 rows by changing @Days to 1000, 1500 and 2000.  Using Sample Query #6 with the above table generates about 30,000 packed intervals and consequently nearly the same number of gaps (actually exactly one less).  This can be demonstrated by running the SQL code stored in the resource file: Check Test Harness (2008).sql.
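
One simple way to run that check yourself is to capture the islands and gaps into temp tables and compare counts.  This is a sketch, where #Islands and #Gaps are hypothetical tables populated by adding INTO clauses to Sample Queries #5 and #6 respectively:

-- #Islands and #Gaps are assumed names, not from the article's resource files
SELECT Islands=(SELECT COUNT(*) FROM #Islands)
    ,Gaps=(SELECT COUNT(*) FROM #Gaps);
-- Expect: Gaps = Islands - 1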

Since each of our queries is built up from a series of steps, we’ll break the steps down and get timings on each.  Two additional test harness SQL files are provided:

  • Test Harness 1 (2008).sql – tests 2 queries which are like Sample Query #5 plus the additional code to convert those islands to gaps (Sample Query #6).
  • Test Harness 2 (2012).sql – tests 3 queries which are like Sample Query #3 (removing the logins with missing logouts), Sample Query #4 (to calculate the packed intervals) and finally with the additional code to convert those islands into gaps. 

The DELETE statement below appears in Test Harness 2 (2012).sql to remove a few logout events.

WITH Logouts AS
(
    SELECT *, rn=ROW_NUMBER() OVER (ORDER BY EventTime)
    FROM #Events
    WHERE [Event] = 1
)
DELETE FROM Logouts
WHERE rn % 300 = 2;

This deletes about 1,667 of the 500,000 logout rows (about 0.17% of the 1,000,000 total rows).

Performance Results

Using Test Harness 1 (2008).sql (run in SQL 2008) we can demonstrate that the additional CPU and elapsed time incurred by adding the CROSS APPLY VALUES islands-to-gaps method to the interval packing is nearly insignificant.  Both solutions scale quite linearly and perform quite reasonably when there are no missing logouts.

The result at 4M rows, where elapsed time for Interval Gaps is less than for Pack Intervals, can only be explained by an increasing amount of parallelism introduced into the query by SQL Server.

We see a similar story when reviewing the results for the case where there are missing logouts (using Test Harness 2 (2012).sql, run in SQL 2012).  Simply removing the logins with missing logouts takes slightly more than 50% of the CPU time, but when we layer the gap calculation on top of the packing algorithm, that additional work is quite minimal.

Elapsed times tell a similar story, except that the time for removing the open-ended logins does not comprise nearly as large a percentage of the overall time.  Nonetheless, both interval packing and calculating the gaps between the intervals scale quite linearly.

Note that test results for the cases of no vs. missing logouts were run on different machines due to the need for SQL 2012 for the latter, so the CPU/Elapsed times recorded may not be comparable in terms of raw magnitude.  Test machines:

  • No Missing Logouts: Windows 7 Professional and SQL 2008 R2 (both 64 bit), Dell Inspiron laptop with Intel Core i5-2430M @ 2.4 GHz, 8GB RAM
  • Missing Logouts: Windows 7 Enterprise and SQL 2012 (both 64 bit), Lenovo laptop with Intel Core i5-3230M @ 2.60 GHz, 12GB RAM

Conclusion

The important concepts we have demonstrated in this article are:

  • Combining rows where login and logout events are paired.
  • A fast solution using the SQL 2012 analytic function LEAD for combining rows where login and logout events are paired, but some of the logout events are missing.
  • The excellent solution proposed by SQL MVP Itzik Ben-Gan for packing intervals, with modifications to support both of the above cases.
  • Using the CROSS APPLY VALUES approach to convert islands (packed intervals) to gaps (the periods of no login activity).
  • Finally, and very importantly, that these solutions seem linearly scalable to multi-million row sets.

Further Reading

Dealing with event intervals is a fascinating and challenging problem in SQL.  If you are interested in pursuing this further, I strongly recommend that you take a look at some of the links below.  They provide even more advanced, high-performing techniques for dealing with time intervals.

Thank you for your attention and we hope you found the information in this article interesting and ultimately useful.

Dwain Camps

Author profile:

Dwain Camps has been a project manager for many years. Because performance of applications can be a critical success factor for projects, he has been evangelizing on the need to develop highly performing SQL. By mentoring and authoring articles on SQL, he hopes to train a future generation of software engineers on the right and wrong ways to deliver SQL code. He also has a special interest in developing solutions to complex, data intensive problems using high performance SQL because the declarative nature of SQL allows development of algorithmically unique solutions that procedural languages may not be capable of.



Subject: SQL 2012 Version
Posted by: puzsol
Posted on: Monday, March 03, 2014 at 9:32 PM
Message: Hi Dwain,

Having a stupid obsession with the islands and gaps problem, I played around with the SQL 2012 version, and found that there are two issues:

1 - Some "gaps" are of 0 length... ok, so essentially they are gaps between the groups and your definition of the problem didn't state what to do in this case... I figure it's just a matter of the ordering of the type that will modify this behaviour to eliminate them (if that was desired).... which leads me to the second issue:

2 - On my machine the row_number() order for the C2 is inadequate, in that on some of my tests, I had a situation where two logout times were identical... their (se) numbers came out 14,15 while their (e) numbers were 7,6 respectively, resulting in a 'gap' in the wrong spot. This may be hard to reproduce due to machine configuration, test data etc... but for me, changing C2 to:
C2 AS
(
-- Add a row number ordered as shown
SELECT C1.*, se=ROW_NUMBER() OVER (ORDER BY ts, coalesce(e,s), [Type] DESC)
FROM C1
)
seemed to fix it... though you may want to put the order of operations in a different sequence. (ie Type, then e/s order). In my opinion, this is the problem of using maths to perform logic rather than some other SQL comparison with meaning.

Having had my own personal method, where you simply order the login times, and logout times, then match the rows (using the row_number() to join them like an id), then looking for the gaps like you looked for orphaned login events... I compared them... and found that despite having less logic and order operations it took about the same time (very frustrating - it should be a more efficient operation, but SQL lacks the syntax)... but not one to give up, I tried a half way approach, borrowing from your method and came up with:

;WITH GroupedLogins AS
(
SELECT LoginTime, ROW_NUMBER() OVER (ORDER BY LoginTime) as LoginOrder
, nTime as LogoutTime, ROW_NUMBER() OVER (ORDER BY nTime) as LogoutOrder
FROM
(
SELECT EventTime as LoginTime
, [Event]
, LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime) as nTime
, LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime) as nEvent
FROM #Events
) a
WHERE [Event] = 0 and nEvent = 1
), OrderedEvents as
(
SELECT a.ts, a.ord, a.isLogout, ROW_NUMBER() OVER (ORDER BY a.ts, a.isLogout desc, a.ord) as TimeOrder
FROM GroupedLogins g
CROSS APPLY (VALUES (0, g.LoginOrder, g.LoginTime), (1, g.LogoutOrder, g.LogoutTime)) a(isLogout, ord, ts)
), GapData as
(
select e.ts as StartTime
, e.isLogout
, e.ord
, LEAD(e.ts, 1) OVER (ORDER BY e.TimeOrder) as nTime
, LEAD(e.isLogout, 1) OVER (ORDER BY e.TimeOrder) as nIsLogout
, LEAD(e.ord, 1) OVER (ORDER BY e.TimeOrder) as nOrd
from OrderedEvents e
)
select StartTime, nTime as EndTime
into #Results_4
from GapData
where isLogout = 1 and nIsLogout = 0
and nOrd = (ord + 1)

On my machine, it seems to take slightly more CPU time, but less execution time (more parallel I think)... and I found that using where, rather than case to eliminate data was better for plan estimation.

It effectively states that whenever we are to have a gap, there will be an equal number of login and logouts up to the time that is a logout with the next time being a login.

NOTE 1: This version will also give you 0-length gaps. But removing the "desc" on the "isLogout" for the "TimeOrder" row_number, will remove these.
NOTE 2: Technically you don't even need the TimeOrder column, just put the ordering from it into the LEAD functions in the GapData table... it's slightly faster again, but it makes understanding the whole thing easier to leave it this way.

I would also like to point out that the lack of a meaningful index on the event table may be a factor in the execution times, and I am sure the query could be made faster with an appropriate one... with perhaps an alternate query (I've seen cases where a simple sub-select was actually faster), but I'll leave that to others to play with.

Thanks for the article.

Subject: Query #6
Posted by: Antonín Lejsek
Posted on: Sunday, April 06, 2014 at 11:20 PM
Message: Hi,

I'd like to thank You for this article, I really enjoyed it. The idea of analyzing parentheses structure is brilliant, but there are places where implementation seems more complicated to me than it has to be. I rewrote the Query #6 to this:

WITH C1 AS(
select EventTime, [Event]
,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) as e
,ROW_NUMBER() OVER (ORDER BY EventTime, [event] desc) as se
FROM #Events
)
SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
FROM C1
where se = 2*e + [event] -1
GROUP BY e + [event]
having count(*)=2
ORDER BY StartTime;

As puzsol pointed out, there are zero-length gaps. Personally I see this as flaw and better be without them:

WITH C1 AS(
select EventTime, [Event]
,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) as e
,ROW_NUMBER() OVER (ORDER BY EventTime) as se
FROM #Events
)
SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
FROM C1
where se = 2*e + [event] -1
GROUP BY e + [event]
having MIN(EventTime) < MAX(EventTime)
ORDER BY StartTime;

Subject: Query #6
Posted by: puzsol
Posted on: Tuesday, April 08, 2014 at 5:30 PM
Message: Hi Antonin,

I might be missing something, but I think you have not grasped the full complexity of the problem Dwain proposed. As when I ran your query against the test harness it returned no rows.

What I don't see in your solution is something to cope with:
a) Dirty data - The test harness removes some logout events to create open-ended logins... just a few... but removing them changes your query from returning 56k rows to just one.

b) Users - The login and logouts only make sense in the context of a user, despite wanting gaps across all users, you must pair the login and logouts on a per-user basis, not purely time basis.

This is why query 6 is as complex as it is. You could simplify it, but not as simple as you state, and I cannot do a performance comparison in the current state, as it does not produce the correct results.

I found that doing the insert into a temporary table allowed me to compare rows against the presented solution and then did my best to account for any differences (which is how I found the missing order by clause in Dwain's solution). Please try this for yourself and let me know how you go.

Subject: Query #6
Posted by: puzsol
Posted on: Wednesday, April 09, 2014 at 12:09 AM
Message: Sorry, upon further reflection I was totally wrong... update to follow (perhaps).

Subject: Query #6
Posted by: puzsol
Posted on: Wednesday, April 09, 2014 at 12:52 AM
Message: Hi Antonin,

My appologies, as stated before, I was completely wrong in my comments on your post.

a) Query #6 was never intended to run against dirty data (my mistake)
b) Users do not matter for this part (no dirty data)

Your query is correct, and very fast.

One caveat, is that the two row_number() functions and maths can still have the same issue that Dwain's original one did... ie if you use the query:

;WITH C1 AS(
select EventTime, [Event]
,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime, ID asc) as e
,ROW_NUMBER() OVER (ORDER BY EventTime, [event] desc, ID desc) as se
FROM #Events
)
SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
--into #Results_3
FROM C1
where se = 2*e + [event] -1
GROUP BY e + [event]
having count(*)=2
ORDER BY StartTime;

Where I explicitly chose the wrong sub-ordering (ID asc in one, ID desc in the other), you get the WRONG rows returned.... that is, while the row_number function tends to process the rows in the same order, it is not guaranteed to... so you need to specify a sub-order for when they have the same Event and timestamp (not often), otherwise you leave yourself open for SQL implementation issues in the future.

Subject: Query #6
Posted by: puzsol
Posted on: Wednesday, April 09, 2014 at 2:43 AM
Message: For "Clean" data I still find, this to be faster (which I referenced in the first answer, but didn't include):

;with C1 as
(
SELECT e.EventTime as ts
, ROW_NUMBER() OVER (ORDER BY e.EventTime) as ord
FROM #Events e
where e.[Event] = 0
), C2 as
(
SELECT e.EventTime as ts
, ROW_NUMBER() OVER (ORDER BY e.EventTime) as ord
FROM #Events e
where e.[Event] = 1
)
select C2.ts as StartTime, C1.ts as EndTime
--into #Results_4
from C1
inner join C2 on C2.ord = C1.ord - 1
where C2.ts < C1.ts

With the following warnings:
1 - Don't use it on dirty data... the test harness was capable of producing overlapping logout-logins due to the fact that the NRN value can be negative sometimes (at least on my system), these can produce wrong results. Curiously the other versions don't have a problem with it.
2 - Don't add an order by clause - it kills the query performance as it does a loop instead of merge join - though adding the merge join hint seems to fix it somewhat. But it seems to produce the results in the right order without it... there's that SQL implementation I warned about...

Subject: .
Posted by: Antonín Lejsek
Posted on: Tuesday, April 15, 2014 at 8:55 PM
Message: This is the most elegant and fastest solution so far, hats off. I wonder why You didn't write it sooner. And You are absolutely right that my rewritten Query #6 is incorrect in the same way as the original Query #6 in the article.
The problem of finding gaps seems to consist of two separated parts
- data cleaning
- gap finding itself
While all solutions use the same technique for the cleaning anyway, coupling Your last query with cleaning seems to be the best. It just is not one query (at least I was not able to make it one query without speed impacts). I write it for completeness, hope I can, it is all Your code :-)

;WITH GroupedLogins AS
(
SELECT LoginTime, LogoutTime
FROM
(
SELECT [Event], LoginTime = EventTime
,nEvent = LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime)
,LogoutTime = LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
FROM Events
) a
-- Filter only on login events that are not "open" (no logout)
WHERE [Event] = 0 AND nEvent=1
)
SELECT * INTO #g FROM GroupedLogins

;WITH C1 as
(
SELECT e.LoginTime as ts
, ROW_NUMBER() OVER (ORDER BY e.LoginTime) as ord
FROM #g e
), C2 as
(
SELECT e.LogoutTime as ts
, ROW_NUMBER() OVER (ORDER BY e.LogoutTime) as ord
FROM #g e
)
SELECT C2.ts as StartTime, C1.ts as EndTime
INTO #Results_5
FROM C1
INNER JOIN C2 ON C2.ord = C1.ord - 1
WHERE C2.ts <= C1.ts --or C2.ts < C1.ts for removing zero-length

What I do not understand is, why the result of merge join on "ord" is not considered ordered by "ord", so "order by ord" results into another sorting. There has to be something I do not see.

 
