Click here to monitor SSC
Dwain Camps

The SQL of Gaps and Islands in Sequences

25 July 2013

Some SQL problems are intriguing because, just when good methods emerge and are accepted, other alternative solutions are discovered.  The fun of exploring problems such as 'Gaps and Islands' is all the greater when we have a thorough test-harness to  try out the alternative solutions.

The word ‘Gaps’ in the title refers to gaps in sequences of values. Islands are unbroken sequences delimited by gaps. The ‘Gaps and Islands’ problem is that of using SQL to rapidly detect the unbroken sequences, and the extent of the gaps between them in a column.

Islands and gaps appear in all sorts of sequences, be they IDENTITY columns where some rows have been removed or dates that occur in sequence (but some are missing).  In all cases, the sequences do not contain duplicates. The ‘Gaps and Islands’ problem isn’t entirely an academic game, since a number of business processes demand some way of detecting gaps and islands in sequences. A typical example might occur in the express distribution business where a consignment has many packages numbered sequentially.  Typically you scan all packages when a consignment reaches the depot.  If packages are missing those represent the gaps.  So if you want to represent the event “consignment arrived at the depot” and list the package numbers that did arrive, you’d want to group them like 1-10, 15-18, etc. where those are the islands.

It is complex and interesting task to come up with a solution that performs and scales well.  Chapter 5 of SQL Server MVP Deep Dives, Gaps and Islands by Itzik Ben-Gan, is probably the best and most thorough explanation of the main solutions.  My interest was piqued by that article and I played about with the examples to try to gain a little better insight.  While doing so, I happened upon a rather different solution to the problem, and was sufficiently intrigued to develop a test harness to evaluate the way that the various algorithms performed and scaled.

Let’s dip our toe in the pond by looking at some sample data.

CREATE TABLE GapsIslands (ID INT NOT NULL, SeqNo INT NOT NULL);

 

ALTER TABLE dbo.GapsIslands ADD CONSTRAINT pk_GapsIslands PRIMARY KEY (ID, SeqNo);

 

INSERT INTO dbo.GapsIslands

SELECT 1, 1 UNION ALL SELECT 1, 2 UNION ALL SELECT 1, 5 UNION ALL SELECT 1, 6

UNION ALL SELECT 1, 8 UNION ALL SELECT 1, 9 UNION ALL SELECT 1, 10 UNION ALL SELECT 1, 12

UNION ALL SELECT 1, 20 UNION ALL SELECT 1, 21 UNION ALL SELECT 1, 25 UNION ALL SELECT 1, 26; 

 

SELECT * FROM dbo.GapsIslands;

We’ve thrown in an ID column because you may need to retrieve islands and/or gaps across more than one record grouping.  However, this value is not used in the basic examples, but will be used in the later performance test harnesses. This produces a table of sample data (on the left) and I’ve added the results we should expect for islands (in the middle) and gaps (on the right):

Results

Islands

Gaps

ID SeqNo

1  1

1  2

1  5

1  6

1  8

1  9

1  10

1  12

1  20

1  21

1  25

1  26

ID StartSeqNo EndSeqNo

1  1          2

1  5          6

1  8          10

1  12         12

1  20         21

1  25         26

ID StartSeqNo EndSeqNo

1  3          4

1  7          7

1  11         11

1  13         19

1  22         24

 

In our example, islands are contiguous groups of integers.  In a way, gaps are the inverse of islands, as they are the endpoints between the islands. 

It is interesting to note that there will always be one less row of gaps than islands.

Some Islands Solutions

In the SQL MVP Deep Dives book, the author provides four solutions to the islands problem.  From his timing results, it seems that two of these seem to be much better than the others, based on the measure of elapsed time only.  Let’s take a look at these two.

INSERT INTO dbo.GapsIslands

SELECT 1, 1 UNION ALL SELECT 1, 2 UNION ALL SELECT 1, 5 UNION ALL SELECT 1, 6

UNION ALL SELECT 1, 8 UNION ALL SELECT 1, 9 UNION ALL SELECT 1, 10 UNION ALL SELECT 1, 12

UNION ALL SELECT 1, 20 UNION ALL SELECT 1, 21 UNION ALL SELECT 1, 25 UNION ALL SELECT 1, 26 

 

PRINT 'Islands Solution #1 from SQL MVP Deep Dives';

WITH StartingPoints AS

(

    SELECT ID, SeqNo, ROW_NUMBER() OVER(ORDER BY SeqNo) AS rownum

    FROM dbo.GapsIslands AS A

    WHERE NOT EXISTS (

        SELECT *

        FROM dbo.GapsIslands AS B

        WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo - 1)

),

EndingPoints AS

(

    SELECT ID, SeqNo, ROW_NUMBER() OVER(ORDER BY SeqNo) AS rownum

    FROM dbo.GapsIslands AS A

    WHERE NOT EXISTS (

        SELECT *

        FROM dbo.GapsIslands AS B

        WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1)

)

SELECT S.ID, S.SeqNo AS start_range, E.SeqNo AS end_range

FROM StartingPoints AS S

JOIN EndingPoints AS E ON E.ID = S.ID AND E.rownum = S.rownum;

 

PRINT 'Islands Solution #3 from SQL MVP Deep Dives';

SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)

FROM (

    SELECT ID, SeqNo

        ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)

    FROM dbo.GapsIslands) a

GROUP BY ID, rn;

We have modified the code provided to support the ID column, which may reflect different user IDs, machine IDs, or whatever for your particular case.  When you run this code, you’ll find that the results precisely match the expected results for islands shown in the table above.  It is important to note that in at least the second case, the sequence numbers must be unique within an ID in order for the solution to work.

The following performance was achieved by these two solutions across 1,000,000 rows, with the index being essential to achieving these results.  This basically confirms the findings in the book: solution #3 is faster.

Islands

Test Harness – 1,000,000 Rows

Solution

CPU (ms)

Elapsed Time (ms)

Logical IOs

Islands #1

1545

1611

10568

Islands #3

733

807

2642

In the resources file, you may run the script: Islands Test Harness #1.sql to confirm these results for yourself and see how they may vary based on your machine’s configuration.

The Traditional Gaps Solutions

The gaps problem, in my mind at least, is intrinsically a bit more challenging than islands because, in effect you need to “make up” the data points that are the endpoints of each gap; meaning that by definition for the gaps, the endpoints of the gap do not already exist in the data.

As I thumb through my extensively dog-eared copy of the SQL MVP Deep Dives book, I find there are four solutions proposed for gaps.  Of these, there are 3 which seem to perform to relatively the same order of magnitude when it comes to elapsed time.

PRINT 'Gaps Solution #1 from SQL MVP Deep Dives';

SELECT ID, StartSeqNo=SeqNo + 1, EndSeqNo=(

    SELECT MIN(B.SeqNo)

    FROM dbo.GapsIslands AS B

    WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) - 1

FROM dbo.GapsIslands AS A

WHERE NOT EXISTS (

    SELECT *

    FROM dbo.GapsIslands AS B

    WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1) AND

        SeqNo < (SELECT MAX(SeqNo)

                 FROM dbo.GapsIslands B

                 WHERE B.ID = A.ID);

 

PRINT 'Gaps Solution #2 from SQL MVP Deep Dives';

SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1

FROM (

    SELECT ID, cur=SeqNo, nxt=(

        SELECT MIN(B.SeqNo)

        FROM dbo.GapsIslands AS B

        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo)

    FROM dbo.GapsIslands AS A) AS D

WHERE nxt - cur > 1;

 

PRINT 'Gaps Solution #3 from SQL MVP Deep Dives';

WITH C AS

(

    SELECT ID, SeqNo, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY SeqNo) AS rownum

    FROM dbo.GapsIslands

)

SELECT Cur.ID, StartSeqNo=Cur.SeqNo + 1, EndSeqNo=Nxt.SeqNo - 1

FROM C AS Cur

JOIN C AS Nxt ON Cur.ID = Nxt.ID AND Nxt.rownum = Cur.rownum + 1

WHERE Nxt.SeqNo - Cur.SeqNo > 1;

Once again a check of the results confirms that they are identical to those shown in the preceding table for gaps.  Test harness Gaps Test Harness #1.sql confirms that solution #1 from SQL MVP Deep Dives is the fastest.

Gaps

Test Harness – 1,000,000 Rows

Solution

CPU (ms)

Elapsed Time (ms)

Logical IOs

Gaps #1

903

810

23012

Gaps #2

5056

1522

3192926

Gaps #3

1591

1593

5278

An Alternative Gaps Solution

Since we have relatively a fast method of determining islands, perhaps it’s possible to convert that solution into one that will identify our gaps.  Let’s take a look at the islands results to see how it might be converted into gaps.

If we first insert two columns and fill the first with one less than the StartSeqNo and the second with one greater than the EndSeqNo, we can circle the numbers that represent the gaps.  Each pair of colored, circled numbers in the rightmost table perfectly represents our gaps.

This suggests to us that UNPIVOTing the islands might give us something to work with to arrive at the gaps.  There are at least three methods to un-pivot two columns: 1) using the SQL UNPIVOT keyword, 2) doing a UNION ALL between each of the columns and 3) using the CROSS APPLY VALUES (CAV) approach to UNPIVOT.  We’ll choose the latter because as this article: An Alternative (Better?) Method to UNPIVOT seems to demonstrate, that approach may be the swiftest.  We’ll also move the fastest islands bit of code magic into a Common Table Expression (CTE) so we can focus on our gaps solution.

WITH Islands AS

(

    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1

    FROM (

        SELECT ID, SeqNo

            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)

        FROM dbo.GapsIslands) a

    GROUP BY ID, rn

)

SELECT ID, SeqNo

    ,n=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)

    ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2

FROM Islands

CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)

ORDER BY ID, StartSeqNo;

We’ve thrown in a couple of row numbers to see if they might help us with the grouping we need.  These results are:

ID  SeqNo  n   m

1   0      1   0

1   3      2   1

1   4      3   1

1   7      4   2

1   7      5   2

1   11     6   3

1   11     7   3

1   13     8   4

1   19     9   4

1   22     10  5

1   24     11  5

1   27     12  6

We certainly have the numbers in SeqNo that represent the endpoints of our gap and we even have a grouping column (m) that will allow us to group our gap start/end points, which we arrived at by the totally simplistic method of dividing by two.  Now all we have to do is figure out a way to remove the first and last row from the result set!

Since all the gap endpoints of interest are in groups of two, this brings to mind the HAVING clause.  Let’s give that a try.

WITH Islands AS

(

    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1

    FROM (

        SELECT ID, SeqNo

            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)

        FROM dbo.GapsIslands) a

     GROUP BY ID, rn

)

SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)

FROM (

    SELECT ID, SeqNo

        ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2

    FROM Islands

    CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a

GROUP BY ID, m

HAVING COUNT(*) = 2

ORDER BY ID, StartSeqNo;

Inspection of the results set clearly shows we’ve achieved our expected results – the gaps!

Performance Comparison of Gaps Solutions

Performance always counts so let’s see how well this solution performs in a sufficiently large test harness versus the solutions proposed in the MVP Deep Dives book. 

In the two test harness scripts we mentioned previously, the code to generate a large test data set looks like this:

SET NOCOUNT ON

SET STATISTICS TIME, IO OFF

CREATE TABLE GapsIslands (ID INT NOT NULL, SeqNo BIGINT NOT NULL);

 

ALTER TABLE GapsIslands

 ADD CONSTRAINT pk_GapsIslands PRIMARY KEY (ID, SeqNo);

 

DECLARE @Records INT = 1000000;

 

-- Generate a sequence numbers table with 10 unique IDs for a total

-- of @Records rows, with small gaps around every 500 or so.

WITH Tally (n) AS

(

    SELECT TOP (1+CAST(FLOOR((@Records/10)/500.+@Records/10) AS INT))

        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

)

INSERT INTO dbo.GapsIslands

SELECT m, n

FROM (SELECT TOP 10 n FROM Tally) a(m)

CROSS APPLY Tally b

WHERE n % 500 <> m;

PRINT 'Rows in test harness: ' + CAST(@@ROWCOUNT AS VARCHAR(12));

 

-- Make the gaps wider towards the end

DECLARE @Gaps INT = 1

WHILE @Gaps <= 3

BEGIN

    WITH Gaps AS (SELECT TOP (@Gaps * 1000) ID, SeqNo FROM GapsIslands ORDER BY SeqNo DESC)

    UPDATE Gaps

    SET SeqNo = SeqNo * POWER(10, @Gaps) * (ID + 1);

   

    SELECT @Gaps = @Gaps + 1;

END

While reviewing this article, Peter Larsson (alias PESO) suggested an alternative using CROSS APPLY VALUES that may improve the speed.  So let’s take a look at that method also.

WITH cteSource(ID, Seq, Num)

AS (

   SELECT d.ID, f.Seq, f.Num

   FROM (

       SELECT ID,

              ROW_NUMBER() OVER (PARTITION BY ID ORDER BY MIN(SeqNo)) AS Grp,

              MIN(SeqNo) AS StartSeqNo,

              MAX(SeqNo) AS EndSeqNo

       FROM (

            SELECT ID, SeqNo,

                   SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) AS rn

            FROM dbo.GapsIslands

            ) AS a

       GROUP BY ID,rn

       ) d

   CROSS APPLY (

       VALUES (d.Grp, d.EndSeqNo + 1),(d.Grp - 1, d.StartSeqNo - 1)

       ) AS f(Seq, Num)

)

SELECT ID, MIN(Num) AS StartSeqNo, MAX(Num) AS EndSeqNo

FROM cteSource

GROUP BY ID, Seq

HAVING COUNT(*) = 2; 

Using the gaps solutions identified previously, a quick check of the results at 1,000,000 rows delivers these results (run Gaps Test Harness #2.sql).

Gaps

Test Harness – 1,000,000 Rows

Solution

CPU (ms)

Elapsed Time (ms)

Logical IOs

SQL MVP Deep Dives Gaps #1

967

827

22997

SQL MVP Deep Dives Gaps #2

5258

1653

3192923

SQL MVP Deep Dives Gaps #3

1840

1834

5278

CROSS APPLY VALUES – Islands to Gaps

718

933

2639

CAV – Islands to Gaps (by PESO)

640

711

2639

Elapsed times for the new CAV (Cross Apply Values) methods appear quite competitive, and both logical IOs and CPU time appear improved.  In a moment we’ll see if these results can be reproduced and how they scale.

An Alternative Islands Solution

If it is possible to convert Islands to Gaps, perhaps it is also possible to convert Gaps to Islands.  Let’s consider a data transformation as follows:

For this solution, we’ll take the end point and add one, while we subtract one from the start point.  We need to add a row, and that row will be the first and last sequence number in our series.  Finally, we’ll un-pivot and add a row number that groups each resulting sequence number in pairs (note that you could either add 1 to or subtract 1 from the row number before dividing by 2).

So now, skipping the intermediate explanatory step as it is quite similar to what we did before, we can arrive at the following code to calculate islands from gaps:

WITH Gaps AS

(

    SELECT ID, StartSeqNo=SeqNo + 1, EndSeqNo=(

        SELECT MIN(B.SeqNo)

        FROM dbo.GapsIslands AS B

        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) - 1

    FROM dbo.GapsIslands AS A

    WHERE NOT EXISTS (

        SELECT *

        FROM dbo.GapsIslands AS B

        WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1) AND

            SeqNo < (SELECT MAX(SeqNo)

                     FROM dbo.GapsIslands B

                     WHERE B.ID = A.ID)

)

    ,MinMax AS

(

    SELECT ID, MinSeqNo=MIN(SeqNo), MaxSeqNo=MAX(SeqNo)

    FROM dbo.GapsIslands

    GROUP BY ID

)

SELECT ID, MinSeqNo=MIN(SeqNo), MaxSeqNo=MAX(SeqNo)

FROM (

    SELECT ID, SeqNo, m=(ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)-1)/2

    FROM (

        SELECT ID, MinSeqNo=EndSeqNo+1, MaxSeqNo=StartSeqNo-1

        FROM Gaps

        UNION ALL

        SELECT ID, MinSeqNo, MaxSeqNo

        FROM MinMax) a

    CROSS APPLY (VALUES (MinSeqNo),(MaxSeqNo)) b (SeqNo)) b

GROUP BY ID, m;

The Gaps CTE above is the SQL MVP Deep Dives fastest (elapsed time) approach to gaps and the MinMax CTE is used to compute the additional row we’ll need to go from gaps to islands.

So how do these results compare to our fastest islands solutions at 1,000,000 rows?

Islands

Test Harness – 1,000,000 Rows

Solution

CPU (ms)

Elapsed Time (ms)

Logical IOs

SQL MVP Deep Dives Islands #1

1217

1256

10568

SQL MVP Deep Dives Islands #3

593

727

2641

CROSS APPLY VALUES – Gaps to Islands

1358

890

25690

The SQL MVP Deep Dives Islands solution #3 still wins the elapsed time race but our approach is a respectable showing, being solidly in second place in terms of elapsed time.  It appears that parallelism accounts for this positioning, as CPU for the CAV – Gaps to Islands solution exceeds the elapsed time.  These results may be reproduced by running Islands Test Harness #2.sql.

Scalability

Using each #2 test harness, we’ll generate test data from 200,000 to 2,000,000 rows and chart the results for comparison.  You’ll see that the final results appear to be a bit mixed.

For the CAV (Cross Apply Values) solutions for islands to gaps, their ranking appears as #1 for CPU across the board, although it is a slim lead and the convergence at 2M rows suggests that it may start lagging behind above that.  For elapsed time, the CAV solutions appear to be basically tied with MVP DD S#1 up to about 1.2M rows, but then they fall to second place ranking above that row count.  In logical IOs they are the clear winner across the board (the SQL MVP Deep Dive solution #2 was omitted due to scaling of the chart).  MVP DD S#3 does get good marks for logical IOs.

For the CAV (Cross Apply Values) solution of gaps to islands, it manages to achieve a #2 ranking for CPU above 1M rows but lagged slightly behind prior to that.  For elapsed time, above 1.8M rows it seems to have achieved the #1 ranking of the 3 solutions (below 1.8M it is #2), but our scaling tests don’t go above 2M rows so we’re not sure how long it would maintain that lead.  However on logical IOs, it is a clear #3 (and would probably remain so) regardless of the number of rows in the test harness.

It also appears that for the CAV gaps to islands solution, we lost the parallelism that resulted in the highly favorable elapsed time result, possibly because SQL had cached a non-parallelized execution plan at the lower row counts.

Ultimately the CAV (Cross Apply Values) approaches may offer benefits under some conditions but your choice should be made based on a careful analysis of which solution works best for your constraints of data rows, parallelism and the specific resource that you most wish to conserve (CPU, elapsed time or logical IOs).

A Google search for anything similar to an approach like this for calculating islands and gaps came up empty, but we’d love to hear if anyone has published anything on it before or has other improvements to suggest.  And we’d most certainly be thrilled to hear from our valued readers if this solution has been of benefit to them. Perhaps in SQL 2012 someone can come up with something faster utilizing a window frame.  We look forward to hearing about that too.

 Timing tests run for this article were done in SQL Server 2008 R2 on a Dell Inspiron Core i5 2.4 GHz with 8GB of memory. 

 Until next time

The Excel file with the results of the test, and the graphs, is attached, as are the four SQL source files for the tests. Links to them can be found at the head of this article

Dwain Camps

Author profile:

Dwain Camps has been a project manager for many years. Because performance of applications can be a critical success factor for projects, he has been evangelizing on the need to develop highly performing SQL. By mentoring and authoring articles on SQL, he hopes to train a future generation of software engineers on the right and wrong ways to deliver SQL code. He also has a special interest in developing solutions to complex, data intensive problems using high performance SQL because the declarative nature of SQL allows development of algorithmically unique solutions that procedural languages may not be capable of.

Search for other articles by Dwain Camps

Rate this article:   Avg rating: from a total of 12 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: A few sidenotes
Posted by: Torben Nehmer (not signed in)
Posted on: Monday, July 29, 2013 at 1:05 AM
Message: Hi,

first, a quite interesting article you wrote here. What I miss however, are these things:

First, you test scaling over a quite limited range, just one magnitude. However, things get really interesting if you add another (or even two) magnitudes. Getting Gaps in 200k rows is easy given potent hardware. Here, I have the problem to locate gaps of sizes between 2 and 2 Million in a dataset of approximately 400 million rows.

What I have been using (thanks to the MS Press T-SQL Querying book) is approximately this:

SELECT A.UNSIGNEDRECID + 1 AS UNSIGNEDSTARTRANGE,
ENDRANGE.UNSIGNEDRECID AS UNSIGNEDENDRANGE,
ENDRANGE.UNSIGNEDRECID - A.UNSIGNEDRECID AS AVAIL
FROM _RECIDS A
CROSS APPLY (
SELECT TOP 1 C.UNSIGNEDRECID - 1 AS UNSIGNEDRECID
FROM _RECIDS C WHERE C.DATAAREAID = A.DATAAREAID AND C.UNSIGNEDRECID > A.UNSIGNEDRECID) ENDRANGE
WHERE NOT EXISTS (SELECT * FROM _RECIDS B WHERE B.UNSIGNEDRECID = A.UNSIGNEDRECID + 1 AND A.DATAAREAID = B.DATAAREAID)
AND A.UNSIGNEDRECID < (SELECT TOP 1 D.UNSIGNEDRECID FROM _RECIDS D WHERE D.DATAAREAID = A.DATAAREAID ORDER BY D.UNSIGNEDRECID DESC)

This is the best solution I could find so far in terms of IO (which is the limiting factor in such large datasets). The actual gap search is a pair of inexpensive Table Scans over the base table plus a single quick index seek to find the end of the range. This scales relativly well even in the millions of gaps but is especially fast if you do incremental processing - I need only a few thousand gaps at a time starting at a specified location in the sequence.

The real key solution I a missing however is the new SQL 2012 way of life. Using window functions, the gaps problem gets really trivial and is downgraded to a single scan of the target table/index:

WITH SRC AS (
SELECT
DATAAREAID,
UNSIGNEDRECID,
LEAD(UNSIGNEDRECID) OVER (ORDER BY DATAAREAID, UNSIGNEDRECID) AS UNSIGNEDRECIDNEXT
FROM _RECIDS
)
SELECT UNSIGNEDRECID + 1 AS UNSIGNEDSTARTRANGE,
UNSIGNEDRECIDNEXT - 1 AS UNSIGNEDENDRANGE,
UNSIGNEDRECIDNEXT - UNSIGNEDRECID - 1 AS AVAIL
FROM SRC
WHERE UNSIGNEDRECIDNEXT - UNSIGNEDRECID > 1

The performance in both CPU-time, wall clock time and IO load is second to none.


Best regards,
Torben

Subject: A few sidenotes II
Posted by: Torben Nehmer (not signed in)
Posted on: Monday, July 29, 2013 at 1:59 AM
Message: Hi again,

I did a few more comparisons with the algorithms presented here.

While CAV – Islands to Gaps (by PESO) is indeed interesting from the basic idea (it requires only a single scan of the data), the amount of sorting and aggregating required actually is contraproductive if you need only the first 10k gaps (which I use as an example, as mentioned above).

PESOs Solution requires the Server to compute the complete resultset regardless what to do, moving around 7.5 GB of data in my test set with several hundred million rows. I cancelled the test run after 5 minutes, where the solution outlined above needs just two or three seconds to do the work.

MVP Deep Dives Gaps #3 exhibits the same problem when limited by a TOP statement albeit for other reasons (I am still waiting for a actual execution plan to deduce this).

Ultimately, it seems that the problem has quite a few facets depending on a) how big your dataset is and b) whether you always need the full results or just a window into the resultset.

The first point is of particular importance, I guess, as all algorithms scale well here if I limit the dataset to a million rows.

Best regards,
Torben

Subject: RE: A few side notes
Posted by: Dwain Camps (not signed in)
Posted on: Monday, July 29, 2013 at 2:47 AM
Message: Hi Torben,

Thanks for taking a look.

Yes, I did operate over a limited number of rows in my test harness. No excuses for that. For such large row sets as you are dealing with, testing any proposed solution is always an imperative and I apologize if I didn't give you sufficient results to infer scalability to those heights.

I wish I had SQL 2012 to use the solution you posted. I am aware of it but there are probably a few folks out there that haven't upgraded yet where this solution might be useful (I hope).

Subject: Looking for Gaps using SQL Server 2012 Lag function
Posted by: eklein@ahd.com (not signed in)
Posted on: Monday, July 29, 2013 at 9:09 AM
Message: I have no idea how this compares for performance with you examples, however the new LAG() function introduced in 2012 certainly is easier to read.


SELECT * FROM (
SELECT HDR_line , lag(hdr_line,1,0) OVER(ORDER BY hdr_line) AS lag_hdr
FROM dbo.Claim_hdr
) dt
WHERE hdr_line <> lag_hdr +1

Subject: non-problem looking for a human to give it significance
Posted by: Anonymous (not signed in)
Posted on: Tuesday, August 13, 2013 at 7:21 AM
Message: How about this? "sequences" should only be surrogate keys and never "human readable" references. To me this whole exercise is about a human wanting the computer to make sense. Use your business key to group your packages, let the computer use the sequence to group/separate its "packages".

Subject: non-problem looking for a human to give it significance
Posted by: Dwain Camps (not signed in)
Posted on: Tuesday, August 13, 2013 at 6:59 PM
Message: For a non-problem, there certainly is a lot of extant literature on solving it.

Whether by design or lack thereof, frequently there are times when it is necessary to do this silly kind of stuff.

Thankfully, it can be done and has been made a bit more intuitive in SQL 2012.

Subject: Not solution for similar case, where I deal with Time in HH:MM format
Posted by: pavan6e (view profile)
Posted on: Monday, September 23, 2013 at 8:38 AM
Message: I have a table to record some activity with start time and end time in HH:MM (24 hr clock) format. Can you please help me with the query to get the gaps?? I have tried as mentioned below.


SELECT ID, StartSeqNo=CONVERT(VARCHAR(5),(dateadd(mi, 15,(cur))),114), EndSeqNo=convert(varchar(5),datediff(MINUTE,1,nxt),114)
FROM (
SELECT ID, cur=starttime, nxt=(
SELECT top 1 endtime FROM dbo.ABC AS B
WHERE B.ID = A.ID AND B.endtime > A.starttime and personid='XYX'
and weekdate='2013-09-09 00:00:00.000' order by weekdate,starttime desc,endtime desc)
FROM dbo.ABC AS A where personid='XYX'
and weekdate='2013-09-09 00:00:00.000') AS D
WHERE cast(datediff(MINUTE,nxt,cur) as numeric(6,2))/60.00 >= 0;

This is not giving me the expected result. Can you please help me with this?

Subject: Looking for Gaps using SQL Server 2012 Lag function
Posted by: Dwain.C (view profile)
Posted on: Monday, November 18, 2013 at 5:13 PM
Message: Re: the LAG solution that eklein suggested, you might want to take a look at Wayne Sheffield's blog comparison. He tested LEAD against a couple of other solutions and I added CROSS APPLY VALUES (islands to gaps) to his test harness and also tried LAG. Note that I just found this blog post:

http://blog.waynesheffield.com/wayne/archive/2012/04/sql-server-2012-performance-test-gap-detection/

The LEAD solution was not a contender and I found that LAG had a similar performance characteristic in his test harness.

Subject: Not solution for similar case, where I deal with Time in HH:MM format
Posted by: Dwain.C (view profile)
Posted on: Monday, November 18, 2013 at 5:16 PM
Message: Regarding an equivalent Islands to Gaps solution for time, the answer is yes this will work.

The only difference is that you don't add/subtract one from the islands start/end points. You either add/subtract 0 or 3 milliseconds (for a DATETIME), depending on your intent.

To suggest a specific query for your case, I'd need you to supply some DDL and test data, and I'd be happy to do so.

Subject: RE: Looking for Gaps using SQL Server 2012 Lag function
Posted by: Dwain.C (view profile)
Posted on: Thursday, December 19, 2013 at 10:35 PM
Message: I ran across the following performance test of SQL LEAD as a gaps solution against a couple of others. LAG would be expected to perform similarly.

http://blog.waynesheffield.com/wayne/archive/2012/04/sql-server-2012-performance-test-gap-detection/

I've personally tested LAG and LEAD against the solutions I proposed here and they weren't even close performance-wise. Very nice, tight syntax though.


 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Highway to Database Recovery
 Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.