Calculating Gaps Between Overlapping Time Intervals in SQL

A number of real-life reporting tasks in SQL require a 'gaps and islands' analysis.  There are several techniques around that work, but finding ones that scale well makes for a tougher, but interesting, challenge.

When building a web site that requires login access, it is normal practice, and even a statutory requirement in some places, to keep a record of user logins.  For high-usage web sites, this log can be invaluable for identifying periods when the site is infrequently used, so that a suitable time for maintenance such as operating system patching can be scheduled.

Since you have no control over when authorized users log into your web sites, you’ll always have periods when more than one user is logged in at the same time.  Login and logout times will be staggered, and so the intervals when users are logged in will overlap.  Today we’ll ultimately be showing you how to find the gaps of inactivity, but to get there we will also show you a neat way to identify the periods where there is activity.

Although this may seem like a slightly contrived problem, there are plenty of times that you are faced with the sort of reporting where special SQL techniques are required. For example:

  • If you have a fleet of vehicles where events represent departure and return, you may wish to ask: during what periods is at least one of my vehicles out, and when are none of them being used?
  • In a manufacturing plant with multiple assembly lines where events are line starts and line stops, you can find out during what periods at least one line is in use or the periods when all lines are idle.
  • By adding a bit of PARTITIONing to these queries, you can handle the case of a hotel that has multiple types of rooms.  Then you can ask: during what times is at least one room of a given type occupied, and when are all rooms of that type unoccupied?

New features in SQL 2012 make short work of some of these gaps problems, so we will need SQL 2012 to run many of the examples, specifically those that relate to data clean-up.  These solutions are not limited to login/logout intervals; in fact, any defined interval between specific events will do.

The accompanying resource file: Sample Queries (2012).sql contains all of the sample queries shown in the following sections.  All SQL resource files are appended with either (2008) or (2012) to indicate which version of SQL they can be run in.

Problem Setup and Explanation

First we’ll create a table with some basic set-up data that represents three user logins.  The login periods will overlap and contain a gap to illustrate the essentials of what we are trying to do.
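
Below is a minimal sketch of that set-up script.  The table structure follows the conventions used throughout this article ([Event]=0 is a login, [Event]=1 is a logout); the times are illustrative and consistent with the narrative, but the authoritative script is in the resource file Sample Queries (2012).sql.

CREATE TABLE #Events
(
     ID         INT IDENTITY PRIMARY KEY
    ,UserID     INT NOT NULL
    ,EventTime  DATETIME NOT NULL
    ,[Event]    INT NOT NULL        -- 0 = login, 1 = logout
);

INSERT INTO #Events (UserID, EventTime, [Event])
VALUES (1, '2013-09-01 13:05', 0)   -- UserID=1: first session
    ,(2, '2013-09-01 13:15', 0)
    ,(1, '2013-09-01 15:00', 1)
    ,(3, '2013-09-01 16:00', 0)
    ,(2, '2013-09-01 17:30', 1)
    ,(3, '2013-09-01 19:10', 1)
    ,(1, '2013-09-02 11:05', 0)     -- UserID=1: second session
    ,(1, '2013-09-02 15:00', 1);

SELECT * FROM #Events ORDER BY EventTime;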

The raw sample data displayed by the final SELECT are as follows:
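
ID  UserID  EventTime                Event
--  ------  -----------------------  -----
 1       1  2013-09-01 13:05:00.000      0
 2       2  2013-09-01 13:15:00.000      0
 3       1  2013-09-01 15:00:00.000      1
 4       3  2013-09-01 16:00:00.000      0
 5       2  2013-09-01 17:30:00.000      1
 6       3  2013-09-01 19:10:00.000      1
 7       1  2013-09-02 11:05:00.000      0
 8       1  2013-09-02 15:00:00.000      1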

We can see that logins ([Event]=0) and logouts ([Event]=1) are evenly matched, meaning the logout time was recorded for each login.  Let’s take a look graphically at these four login sessions.

[Figure 1942-clip_image002.jpg: a timeline of the four login sessions, one color per UserID]

Each login ID is represented by a different color.  You can see that there are overlaps between the grouping of logins on the afternoon of 01 Sep and the single login by UserID=1 on 02 Sep.  From the data, we ultimately seek to expose the gap between the logout of UserID=3 (blue) and the second login of UserID=1 (2013-09-01 19:10 to 2013-09-02 11:05).

Pairing of Different, Sequential Events

The following solutions assume that concurrent logins by a single account are not allowed.  When your login/logout events are evenly paired, i.e., there are no missing logout endpoints, a solution such as the one I’ll be describing will group these events so that your query can place the start/end times of each session on a single row.

Sample Query #1: How to Pair up Logins with Logouts
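
A sketch of what this query looks like (the original is in the resource file):

SELECT UserID
    ,LoginTime  = MIN(EventTime)
    ,LogoutTime = MAX(EventTime)
FROM
(
    -- rn pairs each login with the logout that immediately follows it
    SELECT UserID, EventTime
        ,rn = (ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EventTime) + 1) / 2
    FROM #Events
) a
GROUP BY UserID, rn
ORDER BY UserID, LoginTime;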

Results from Sample Query #1:
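
With the illustrative data, each session now occupies a single row:

UserID  LoginTime                LogoutTime
     1  2013-09-01 13:05:00.000  2013-09-01 15:00:00.000
     1  2013-09-02 11:05:00.000  2013-09-02 15:00:00.000
     2  2013-09-01 13:15:00.000  2013-09-01 17:30:00.000
     3  2013-09-01 16:00:00.000  2013-09-01 19:10:00.000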

Note how, in the derived table, we’ve created a grouping column as (ROW_NUMBER()+1)/2, which is a well-known trick for pairing consecutive rows.  Here are the results in rn for the first four rows:
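
Because the division is integer division, consecutive row numbers collapse into pairs:

ROW_NUMBER()          1  2  3  4
(ROW_NUMBER()+1)/2    1  1  2  2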

Many times, usually due to incomplete implementation, web sites will not record or generate a logout event.  This clutters the data with spurious, unpaired logins.  Let’s add a couple of these to our data.
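
A sketch of that step (in the resource file’s version of the data these two rows carry IDs 1 and 10; a simple INSERT like the one below assigns them the next IDENTITY values instead):

INSERT INTO #Events (UserID, EventTime, [Event])
VALUES (3, '2013-09-01 11:00', 0)   -- an early login with no matching logout
    ,(2, '2013-09-02 16:00', 0);    -- a final login with no matching logout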

The sample data now appears as follows, where IDs 1 and 10 represent logins that are not paired with a corresponding logout event.

In order to pair logins with logouts in this case, a different approach is required.  Note that OUTER APPLY (rather than CROSS APPLY) is required to bring the final, unpaired login for UserID=2 into the result set.

Sample Query #2: Pairing Logins with no Corresponding Logout
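
A sketch of the approach (details may differ from the resource file):

SELECT e.UserID
    ,LoginTime  = e.EventTime
    ,LogoutTime = n.EventTime
FROM #Events e
OUTER APPLY
(
    -- The next event (if any) for this user after the current login
    SELECT TOP (1) x.EventTime, x.[Event]
    FROM #Events x
    WHERE x.UserID = e.UserID
    AND x.EventTime > e.EventTime
    ORDER BY x.EventTime
) n
WHERE e.[Event] = 0                 -- logins only
-- Keep logins whose next event is a logout, or that have no next event at all
AND ISNULL(n.[Event], 1) = 1
ORDER BY e.UserID, LoginTime;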

Since the execution plan of this query shows two Clustered Index Scans, compared to the execution plan of the first showing a single Clustered Index Scan, we’d expect the second query to be a bit slower than the first.  So this data clean-up comes at a cost. 

Results from Sample Query #2:
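
With the illustrative data (note the NULL LogoutTime for the final, unpaired login):

UserID  LoginTime                LogoutTime
     1  2013-09-01 13:05:00.000  2013-09-01 15:00:00.000
     1  2013-09-02 11:05:00.000  2013-09-02 15:00:00.000
     2  2013-09-01 13:15:00.000  2013-09-01 17:30:00.000
     2  2013-09-02 16:00:00.000  NULL
     3  2013-09-01 16:00:00.000  2013-09-01 19:10:00.000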

Unfortunately, further testing of this pairing method against a large row set showed that it is quite slow.  To find a better alternative, we’ll turn to the new LEAD analytic function in SQL 2012.  You’ll find that the following query returns exactly the same row set as the prior one, but performs much better (in fact, it takes about 10% of the time to run).

Sample Query #3: Using LEAD to Improve Performance (where some logins have no logout)
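
A sketch of the LEAD version, which returns the same rows as the query above:

SELECT UserID, LoginTime, LogoutTime
FROM
(
    -- LEAD peeks at each user's next event in a single pass over the table
    SELECT UserID
        ,LoginTime  = EventTime
        ,[Event]
        ,nEvent     = LEAD([Event], 1)   OVER (PARTITION BY UserID ORDER BY EventTime)
        ,LogoutTime = LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
    FROM #Events
) a
-- Keep logins whose next event is a logout, or that have no next event at all
WHERE [Event] = 0 AND ISNULL(nEvent, 1) = 1
ORDER BY UserID, LoginTime;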

It is quite easy to eliminate the rows with missing LogoutTimes in either of the above queries by adding a check for NULL LogoutTime to the WHERE clause.

It would also be possible to implement other business rules for cases where no logout time is recorded for a login event, for example:

  • Use the average duration of that user’s logins to predict an approximate LogoutTime.
  • Assume the user is still logged in, so substitute GETDATE() for the LogoutTime.

We will leave these as exercises for the interested reader.

Packing the Overlapping Login Intervals

Some time back, a good friend of mine pointed me to an article that describes a way to pack overlapping intervals.  That article is called Packing Intervals and it is by none other than SQL MVP and all-around guru, Mr. Itzik Ben-Gan.  When I first read that article, I was impressed by the elegantly simple explanation of how this highly imaginative query works, so I won’t try to inadequately reproduce it here.  What I can tell you is that it took me all of about three milliseconds (coincidentally, roughly the resolution of the SQL DATETIME data type) to realize that this was something I needed to retain in my tool box.  Let’s look at how we would apply this technique to our sample data, by first putting our Sample Query #3 (eliminating the spurious logins) into a Common Table Expression and then applying the Packing Intervals technique.

Sample Query #4: Pack the Overlapping Intervals down to non-overlapping Events
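
Here is a sketch of the shape of this query, following Mr. Ben-Gan’s technique (the CTE names match the narrative below; the resource file holds the original):

WITH GroupedLogins AS
(
    -- Sample Query #3's pairing, this time dropping the unpaired logins
    SELECT LoginTime, LogoutTime
    FROM
    (
        SELECT [Event], LoginTime = EventTime
            ,nEvent     = LEAD([Event], 1)   OVER (PARTITION BY UserID ORDER BY EventTime)
            ,LogoutTime = LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
        FROM #Events
    ) a
    WHERE [Event] = 0 AND nEvent = 1
),
C1 AS
(
    -- Unpivot each interval into a start (Type=1) and an end (Type=-1) event;
    -- s numbers the starts in time order, e numbers the ends in time order
    SELECT ts, [Type]
        ,s = CASE [Type] WHEN  1 THEN ROW_NUMBER() OVER (PARTITION BY [Type] ORDER BY ts) END
        ,e = CASE [Type] WHEN -1 THEN ROW_NUMBER() OVER (PARTITION BY [Type] ORDER BY ts) END
    FROM GroupedLogins
    CROSS APPLY (VALUES (LoginTime, 1), (LogoutTime, -1)) a (ts, [Type])
),
C2 AS
(
    -- Number all events in time order (starts before ends on a tie)
    SELECT C1.*, se = ROW_NUMBER() OVER (ORDER BY ts, [Type] DESC)
    FROM C1
),
C3 AS
(
    -- Keep only the events that open or close a packed interval: a start with
    -- no intervals already open, or an end that leaves no intervals open.
    -- Each surviving open/close pair then shares a group number.
    SELECT ts, grpnm = (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 + 1
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT StartTime = MIN(ts), EndTime = MAX(ts)
FROM C3
GROUP BY grpnm
ORDER BY StartTime;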

It is instructive to see how each successive CTE (ignoring the first because we’ve already seen what it does) returns a row set that brings us closer to our target of non-overlapping intervals.  Rather than being repetitive, I’ll refer you to the comments at each step (above).

Results from C1:

Results from C2:

Results from C3:

The results produced by this fascinating method are shown below.

Final Results from Sample Query #4:

If we compare these to the results from Sample Query #2, we find that they represent the packed intervals across all of the UserIDs.  You can also think of these as “islands” of time where users are logged into our system.

Before we move on to finding the gaps between these islands, let’s take a closer look at what’s happening in CTEs GroupedLogins and C1.  The only reason we have GroupedLogins to begin with is to remove the spurious logins, yet the work being done by C1 is to ungroup the very rows we just grouped.  This suggests that, if the spurious logins were not present, it might be possible to simplify the query, so let’s try that after first deleting the two rows we added to our original sample data.

Sample Query #5: Simplify the Packing Method by using our logins/logouts directly
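
A sketch of the simplified query: with no spurious logins to remove, the login/logout rows are already the interval start/end events, so GroupedLogins and the unpivot in C1 disappear.

WITH C1 AS
(
    -- Logins are interval starts (Type=1), logouts are ends (Type=-1)
    SELECT ts = EventTime
        ,[Type] = CASE [Event] WHEN 0 THEN 1 ELSE -1 END
        ,s = CASE [Event] WHEN 0 THEN ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
        ,e = CASE [Event] WHEN 1 THEN ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
    FROM #Events
),
C2 AS
(
    SELECT C1.*, se = ROW_NUMBER() OVER (ORDER BY ts, [Type] DESC)
    FROM C1
),
C3 AS
(
    SELECT ts, grpnm = (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 + 1
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT StartTime = MIN(ts), EndTime = MAX(ts)
FROM C3
GROUP BY grpnm
ORDER BY StartTime;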

Results from Sample Query #5:
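
With the illustrative data, the packed intervals are:

StartTime                EndTime
2013-09-01 13:05:00.000  2013-09-01 19:10:00.000
2013-09-02 11:05:00.000  2013-09-02 15:00:00.000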

These correspond to the packed intervals in our original data, including the depiction in our diagram.  Note that they are slightly different from the packed results we got from Sample Query #4, because the “spurious logins” that we removed resulted in slightly different overlapping intervals.

Converting Packed Intervals to the Corresponding Gaps

Once we have our packed intervals (“islands”), it is simple enough to add just a little bit of code to convert them into gaps.  We draw upon a previous Simple Talk article, The SQL of Gaps and Islands in Sequence Numbers, borrowing its CROSS APPLY VALUES method for converting islands to gaps.  We will add this technique to Sample Query #5 to arrive at:

Sample Query #6: Convert Packed Intervals into Gaps Between them
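
A sketch: Sample Query #5’s CTEs are retained unchanged, the packed intervals are collected in C4, and the CROSS APPLY VALUES bit at the end unpivots each island’s endpoints so that each island’s end can be paired with the next island’s start.

WITH C1 AS
(
    SELECT ts = EventTime
        ,[Type] = CASE [Event] WHEN 0 THEN 1 ELSE -1 END
        ,s = CASE [Event] WHEN 0 THEN ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
        ,e = CASE [Event] WHEN 1 THEN ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) END
    FROM #Events
),
C2 AS
(
    SELECT C1.*, se = ROW_NUMBER() OVER (ORDER BY ts, [Type] DESC)
    FROM C1
),
C3 AS
(
    SELECT ts, grpnm = (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 + 1
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
),
C4 AS
(
    -- The packed intervals (islands) from Sample Query #5
    SELECT StartTime = MIN(ts), EndTime = MAX(ts)
    FROM C3
    GROUP BY grpnm
)
SELECT GapStart = MIN(ts), GapEnd = MAX(ts)
FROM
(
    -- Unpivot the island endpoints; grouping rows 2+3, 4+5, etc. pairs each
    -- island's end with the next island's start, while the lone first and
    -- last rows fall out of the HAVING clause
    SELECT v.ts, rn = ROW_NUMBER() OVER (ORDER BY v.ts)
    FROM C4
    CROSS APPLY (VALUES (StartTime), (EndTime)) v (ts)
) a
GROUP BY rn / 2
HAVING COUNT(*) = 2
ORDER BY GapStart;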

Results from Sample Query #6:
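
With the illustrative data, the single gap is exactly the one we set out to find:

GapStart                 GapEnd
2013-09-01 19:10:00.000  2013-09-02 11:05:00.000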

Obviously, if we needed to handle spurious logins, the little bit of code we added after C4 could be applied just as easily to Sample Query #4.

Building a Test Harness

Now that we have a way to calculate gaps between overlapping intervals, we want to make sure that the code we’ve built will scale to large row sets in a reasonable fashion.  To do this, we must build a test harness with a realistic number of rows.

We can simulate the duration of a login by starting the login at a particular time of day and then assume that the user was logged in for a random period of time.  In SQL, we can generate uniform random numbers like this:
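
A sketch of the usual idiom, where CHECKSUM(NEWID()) yields a fresh pseudo-random integer for every row:

SELECT 10 + ABS(CHECKSUM(NEWID())) % (130 - 10 + 1);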

This will generate a uniformly-distributed random integer between 10 and 130, which could, for example, represent the number of minutes the user was logged in.  But we’re going to have a little fun and say that our logins are not based on a uniform distribution; rather, the time logged in will behave more like a normal distribution with a mean of 240 seconds and a standard deviation of 60 seconds.  To do this, we need a way to generate normally distributed random numbers.
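
Below is a minimal sketch of such a function, using the Box-Muller transform; the name and signature here are assumptions, and the original comes from the article referenced in the next paragraph.

CREATE FUNCTION dbo.RN_NORMAL
    (@Mean FLOAT, @StDev FLOAT, @URN1 FLOAT, @URN2 FLOAT)
RETURNS FLOAT
AS
BEGIN
    -- Box-Muller: two independent uniform deviates on (0, 1) in,
    -- one normally-distributed deviate out
    RETURN @Mean + @StDev * SQRT(-2. * LOG(@URN1)) * COS(2. * PI() * @URN2);
END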

The function presented above does this.  Generating Non-uniform Random Numbers with SQL provides an empirical demonstration that it does, and you will also find several other random number generators for other distributions therein.  To use this function, you must pass it two uniform random numbers on the interval 0 to 1.

The test data (1,000,000 rows) can be generated based on 500 days and 1000 UserIds as follows:
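
A sketch of the generator (the start date and some details are assumptions; the authoritative script is in the resource files):

DECLARE @Days INT = 500, @Users INT = 1000;

WITH Tally (n) AS
(
    -- @Days * @Users sequential integers
    SELECT TOP (@Days * @Users) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO #Events (UserID, EventTime, [Event])
SELECT u.UserID, ca.EventTime, ca.[Event]
FROM
(
    -- One login per user per day, at a random time of day
    SELECT UserID    = (n - 1) % @Users + 1
        ,LoginTime  = DATEADD(second
            ,ABS(CHECKSUM(NEWID())) % 86400
            ,DATEADD(day, (n - 1) / @Users, '2010-01-01'))
    FROM Tally
) u
CROSS APPLY
(
    -- Login duration in seconds: normal, mean 240, standard deviation 60.
    -- 1.0 - RAND(...) keeps the first uniform deviate away from exactly zero.
    SELECT Duration = dbo.RN_NORMAL(240, 60
        ,1.0 - RAND(CHECKSUM(NEWID())), RAND(CHECKSUM(NEWID())))
) d
CROSS APPLY
(
    VALUES (u.LoginTime, 0)                             -- the login event
        ,(DATEADD(second, d.Duration, u.LoginTime), 1)  -- the matching logout
) ca (EventTime, [Event]);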

Each UserID logs in once per day.  Rows resulting from the query can be increased in increments of 1,000,000 rows by changing @Days to 1000, 1500 and 2000.  Using Sample Query #6 with the above table generates about 30,000 packed intervals and consequently nearly the same number of gaps (actually exactly one less).  This can be demonstrated by running the SQL code stored in the resource file: Check Test Harness (2008).sql.

Since each of our queries is built upon a series of steps, we’ll break each step down and get timings on each.  Two additional test harness SQL files are provided:

  • Test Harness 1 (2008).sql – tests 2 queries which are like Sample Query #5 plus the additional code to convert those islands to gaps (Sample Query #6).
  • Test Harness 2 (2012).sql – tests 3 queries which are like Sample Query #3 (removing the logins with missing logouts), Sample Query #4 (to calculate the packed intervals) and finally with the additional code to convert those islands into gaps. 

The DELETE statement below appears in Test Harness 2 (2012).sql to remove a few logout events.
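
A sketch of that DELETE (the exact predicate in the resource file may differ), removing roughly 1 in every 300 logout events at random:

DELETE FROM #Events
WHERE [Event] = 1
AND ABS(CHECKSUM(NEWID())) % 300 = 0;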

This deletes about 1,667 logout rows out of 1,000,000 (about 0.17%).

Performance Results

Using Test Harness 1 (2008).sql (run in SQL 2008), we can demonstrate that the additional CPU and elapsed time incurred by adding the CROSS APPLY VALUES Islands-to-Gaps method on top of the interval packing is nearly insignificant.  Both solutions scale quite linearly and perform quite reasonably when there are no missing logouts.

[Chart 1942-clip_image004.gif: CPU time vs. row count for Pack Intervals and Interval Gaps, no missing logouts]

The result below at 4M rows for elapsed time (Interval Gaps less than Pack Intervals) can only be explained by an increasing amount of parallelism introduced by SQL Server into the query.

[Chart 1942-clip_image006.gif: elapsed time vs. row count for Pack Intervals and Interval Gaps, no missing logouts]

We see a similar story when reviewing the results for the case where there are missing logouts, using Test Harness 2 (2012).sql (run in SQL 2012).  Simply removing the logins with missing logouts takes slightly more than 50% of the CPU time, but when we layer calculating the gaps on top of the packing algorithm, that additional work is quite minimal.

[Chart 1942-clip_image008.gif: CPU time vs. row count with missing logouts]

Elapsed times tell a similar story, except that the time spent removing the open-ended logins does not comprise nearly as large a percentage of the overall time.  Nonetheless, both interval packing and calculating the gaps between the packed intervals scale quite linearly.

[Chart 1942-clip_image010.gif: elapsed time vs. row count with missing logouts]

Note that test results for the cases of no vs. missing logouts were run on different machines due to the need for SQL 2012 for the latter, so the CPU/Elapsed times recorded may not be comparable in terms of raw magnitude.  Test machines:

  • No Missing Logouts: Windows 7 Professional and SQL 2008 R2 (both 64 bit), Dell Inspiron laptop with Intel Core i5-2430M @ 2.4 GHz, 8GB RAM
  • Missing Logouts: Windows 7 Enterprise and SQL 2012 (both 64 bit), Lenovo laptop with Intel Core i5-3230M @ 2.60 GHz, 12GB RAM

Conclusion

The important concepts we have demonstrated in this article are:

  • Combining rows where login and logout events are paired.
  • A fast solution using the SQL 2012 analytic function LEAD for combining rows where login and logout events are paired, but some of the logout events are missing.
  • The excellent solution proposed by SQL MVP Itzik Ben-Gan for packing intervals, with modifications to support both of the above cases.
  • Using the CROSS APPLY VALUES approach to convert islands (packed intervals) to gaps (the periods of no login activity).
  • Finally, and very importantly, that these solutions seem linearly scalable to multi-million row sets.

Further Reading

Dealing with event intervals is a fascinating and challenging problem in SQL.  If you are interested in pursuing this further, I strongly recommend that you take a look at some of the links below.  They provide even more advanced, high-performing techniques for dealing with time intervals.

Thank you for your attention and we hope you found the information in this article interesting and ultimately useful.

  • puzsol

    SQL 2012 Version
    Hi Dwain,

    Having a stupid obsession with the islands and gaps problem, I played around with the SQL 2012 version, and found that there are two issues:

    1 – Some "gaps" are of 0 length… ok, so essentially they are gaps between the groups and your definition of the problem didn’t state what to do in this case… I figure it’s just a matter of the ordering of the type that will modify this behaviour to eliminate them (if that was desired)…. which leads me to the second issue:

    2 – On my machine the row_number() order for the C2 is inadequate, in that on some of my tests, I had a situation where two logout times were identical… their (se) numbers came out 14,15 while their (e) numbers were 7,6 respectively, resulting in a ‘gap’ in the wrong spot. This may be hard to reproduce due to machine configuration, test data etc… but for me, changing C2 to:
    C2 AS
    (
    — Add a row number ordered as shown
    SELECT C1.*, se=ROW_NUMBER() OVER (ORDER BY ts, coalesce(e,s), [Type] DESC)
    FROM C1
    )
    seemed to fix it… though you may want to put the order of operations in a different sequence. (ie Type, then e/s order). In my opinion, this is the problem of using maths to perform logic rather than some other SQL comparison with meaning.

    Having had my own personal method, where you simply order the login times, and logout times, then match the rows (using the row_number() to join them like an id), then looking for the gaps like you looked for orphaned login events… I compared them… and found that despite having less logic and order operations it took about the same time (very frustrating – it should be a more efficient operation, but SQL lacks the syntax)… but not one to give up, I tried a half way approach, borrowing from your method and came up with:

    ;WITH GroupedLogins AS
    (
    SELECT LoginTime, ROW_NUMBER() OVER (ORDER BY LoginTime) as LoginOrder
    , nTime as LogoutTime, ROW_NUMBER() OVER (ORDER BY nTime) as LogoutOrder
    FROM
    (
    SELECT EventTime as LoginTime
    , [Event]
    , LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime) as nTime
    , LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime) as nEvent
    FROM #Events
    ) a
    WHERE [Event] = 0 and nEvent = 1
    ), OrderedEvents as
    (
    SELECT a.ts, a.ord, a.isLogout, ROW_NUMBER() OVER (ORDER BY a.ts, a.isLogout desc, a.ord) as TimeOrder
    FROM GroupedLogins g
    CROSS APPLY (VALUES (0, g.LoginOrder, g.LoginTime), (1, g.LogoutOrder, g.LogoutTime)) a(isLogout, ord, ts)
    ), GapData as
    (
    select e.ts as StartTime
    , e.isLogout
    , e.ord
    , LEAD(e.ts, 1) OVER (ORDER BY e.TimeOrder) as nTime
    , LEAD(e.isLogout, 1) OVER (ORDER BY e.TimeOrder) as nIsLogout
    , LEAD(e.ord, 1) OVER (ORDER BY e.TimeOrder) as nOrd
    from OrderedEvents e
    )
    select StartTime, nTime as EndTime
    into #Results_4
    from GapData
    where isLogout = 1 and nIsLogout = 0
    and nOrd = (ord + 1)

    On my machine, it seems to take slightly more CPU time, but less execution time (more parallel I think)… and I found that using where, rather than case to eliminate data was better for plan estimation.

    It effectively states that whenever we are to have a gap, there will be an equal number of login and logouts up to the time that is a logout with the next time being a login.

    NOTE 1: This version will also give you 0-length gaps. But removing the "desc" on the "isLogout" for the "TimeOrder" row_number, will remove these.
    NOTE 2: Technically you don’t even need the TimeOrder column, just put the ordering from it into the LEAD functions in the GapData table… it’s slightly faster again, but it makes understanding the whole thing easier to leave it this way.

    I would also like to point out that the lack of a meaningful index on the event table may be a factor in the execution times, and I am sure the query could be made faster with an appropriate one… with perhaps an alternate query (I’ve seen cases where a simple sub-select was actually faster), but I’ll leave that to others to play with.

    Thanks for the article.

  • Antonín Lejsek

    Query #6
    Hi,

    I’d like to thank You for this article, I really enjoyed it. The idea of analyzing parentheses structure is brilliant, but there are places where implementation seems more complicated to me than it has to be. I rewrote the Query #6 to this:

    WITH C1 AS(
    select EventTime, [Event]
    ,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) as e
    ,ROW_NUMBER() OVER (ORDER BY EventTime, [event] desc) as se
    FROM #Events
    )
    SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
    FROM C1
    where se = 2*e + [event] -1
    GROUP BY e + [event]
    having count(*)=2
    ORDER BY StartTime;

    As puzsol pointed out, there are zero-length gaps. Personally I see this as flaw and better be without them:

    WITH C1 AS(
    select EventTime, [Event]
    ,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime) as e
    ,ROW_NUMBER() OVER (ORDER BY EventTime) as se
    FROM #Events
    )
    SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
    FROM C1
    where se = 2*e + [event] -1
    GROUP BY e + [event]
    having MIN(EventTime) < MAX(EventTime)
    ORDER BY StartTime;

  • puzsol

    Query #6
    Hi Antonin,

    I might be missing something, but I think you have not grasped the full complexity of the problem Dwain proposed, as when I ran your query against the test harness it returned no rows.

    What I don’t see in your solution is something to cope with:
    a) Dirty data – The test harness removes some logout events to create open-ended logins… just a few… but removing them changes your query from returning 56k rows to just one.

    b) Users – The login and logouts only make sense in the context of a user, despite wanting gaps across all users, you must pair the login and logouts on a per-user basis, not purely time basis.

    This is why query 6 is as complex as it is. You could simplify it, but not as simple as you state, and I cannot do a performance comparison in the current state, as it does not produce the correct results.

    I found that doing the insert into a temporary table allowed me to compare rows against the presented solution and then did my best to account for any differences (which is how I found the missing order by clause in Dwain’s solution). Please try this for yourself and let me know how you go.

  • puzsol

    Query #6
    Sorry, upon further reflection I was totally wrong… update to follow (perhaps).

  • puzsol

    Query #6
    Hi Antonin,

    My apologies; as stated before, I was completely wrong in my comments on your post.

    a) Query #6 was never intended to run against dirty data (my mistake)
    b) Users do not matter for this part (no dirty data)

    Your query is correct, and very fast.

    One caveat, is that the two row_number() functions and maths can still have the same issue that Dwain’s original one did… ie if you use the query:

    ;WITH C1 AS(
    select EventTime, [Event]
    ,ROW_NUMBER() OVER (PARTITION BY [Event] ORDER BY EventTime, ID asc) as e
    ,ROW_NUMBER() OVER (ORDER BY EventTime, [event] desc, ID desc) as se
    FROM #Events
    )
    SELECT StartTime=MIN(EventTime), EndTime=MAX(EventTime)
    --into #Results_3
    FROM C1
    where se = 2*e + [event] -1
    GROUP BY e + [event]
    having count(*)=2
    ORDER BY StartTime;

    Where I explicitly chose the wrong sub-ordering (ID asc in one, ID desc in the other), you get the WRONG rows returned…. that is, while the row_number function tends to process the rows in the same order, it is not guaranteed to… so you need to specify a sub-order for when they have the same Event and timestamp (not often), otherwise you leave yourself open for SQL implementation issues in the future.

  • puzsol

    Query #6
    For "Clean" data I still find, this to be faster (which I referenced in the first answer, but didn’t include):

    ;with C1 as
    (
    SELECT e.EventTime as ts
    , ROW_NUMBER() OVER (ORDER BY e.EventTime) as ord
    FROM #Events e
    where e.[Event] = 0
    ), C2 as
    (
    SELECT e.EventTime as ts
    , ROW_NUMBER() OVER (ORDER BY e.EventTime) as ord
    FROM #Events e
    where e.[Event] = 1
    )
    select C2.ts as StartTime, C1.ts as EndTime
    --into #Results_4
    from C1
    inner join C2 on C2.ord = C1.ord - 1
    where C2.ts < C1.ts

    With the following warnings:
    1 – Don’t use it on dirty data… the test harness was capable of producing overlapping logout-logins due to the fact that the NRN value can be negative sometimes (at least on my system), these can produce wrong results. Curiously the other versions don’t have a problem with it.
    2 – Don’t add an order by clause – it kills the query performance as it does a loop instead of merge join – though adding the merge join hint seems to fix it somewhat. But it seems to produce the results in the right order without it… there’s that SQL implementation I warned about…

  • Antonín Lejsek

    .
    This is the most elegant and fastest solution so far, hats off. I wonder why You didn’t write it sooner. And You are absolutely right that my rewritten Query #6 is incorrect in the same way as the original Query #6 in the article.
    The problem of finding gaps seems to consist of two separate parts
    – data cleaning
    – gap finding itself
    While all solutions use the same technique for the cleaning anyway, coupling Your last query with cleaning seems to be the best. It just is not one query (at least I was not able to make it one query without speed impacts). I write it for completeness, hope I can, it is all Your code 🙂

    ;WITH GroupedLogins AS
    (
    SELECT LoginTime, LogoutTime
    FROM
    (
    SELECT [Event], LoginTime = EventTime
    ,nEvent = LEAD([Event], 1) OVER (PARTITION BY UserID ORDER BY EventTime)
    ,LogoutTime = LEAD(EventTime, 1) OVER (PARTITION BY UserID ORDER BY EventTime)
    FROM Events
    ) a
    — Filter only on login events that are not "open" (no logout)
    WHERE [Event] = 0 AND nEvent=1
    )
    SELECT * INTO #g FROM GroupedLogins

    ;WITH C1 as
    (
    SELECT e.LoginTime as ts
    , ROW_NUMBER() OVER (ORDER BY e.LoginTime) as ord
    FROM #g e
    ), C2 as
    (
    SELECT e.LogoutTime as ts
    , ROW_NUMBER() OVER (ORDER BY e.LogoutTime) as ord
    FROM #g e
    )
    SELECT C2.ts as StartTime, C1.ts as EndTime
    INTO #Results_5
    FROM C1
    INNER JOIN C2 ON C2.ord = C1.ord - 1
    WHERE C2.ts <= C1.ts -- or C2.ts < C1.ts for removing zero-length

    What I do not understand is why the result of a merge join on "ord" is not considered ordered by "ord", so that "order by ord" results in another sort. There has to be something I do not see.

  • Ajain

    One of the best articles I have read on SQL. I have been working with SQL for a long time and sometimes struggle to make code fast and efficient, and the above methods can be a great help. Though almost all scenarios are covered in this article, I still want to understand how you would handle the time calculation when the In & Out times fall on different dates. Ex:

    1 2015-08-07 18:10:30 0 – Login
    1 2015-08-08 02:30:00 1 – Logout
    1 2015-08-08 17:45:00 0
    1 2015-08-09 02:10:20 1
    .
    .
    .
    .
    .
    .
    and so on…..

    Here, how would you calculate the overall hours for a day for user 1, and which date should be shown in the result (the login date or the logout date)?