Often, the quickest way to determine whether you have missing data in a table such a ledger or journal is to see if there are gaps in the dates where one wouldn't expect them. But how do you do that in an emergency, particularly in a large table, when every minute counts?
The Task Defined.
Imagine this: The customer runs the application that you've designed, and cannot see any sign of business transactions over a time period where he'd expect to see them. Your manager asks you to investigate. He will almost certainly ask you to find all of the missing days or hours in a date range where no data has been recorded: The date range can be several years.
The situation may become even worse; missing data can have unpleasant consequences, and that fact will make both the customer and the boss very nervous.
As a result, you will be under stress: you have to reassure your boss and the customer, and you need to do that in a very limited amount of time.
In this article, we will discuss a few SQL, one-query techniques that can help you to find the missing dates.
Firstly, you will need to create and load an auxiliary table that will store the sequence of whole numbers. The volume of numbers in that table will be equal to the number of days in the year 2007 (see Listing1):
Secondly, generate some test data and load that data into a sample table. Then produce the gaps, using the next algorithm: delete dates, where the number of days since Jan 01, 1900, divided by 4 or 5 or 6 has zero in remainder.
The second preparation step can be done this way (see Listing2):
Now that we have the auxiliary table and the sample data, we can try out a few techniques.
Technique #1 –The simple list
The first technique is quite common and uses an outer join (see Listing3):
There is nothing wrong with the solutions shown in the Listing3. However, if you need to find the missing hours in the whole year or the missing days in a few years, these queries may produce a long list of dates, which would be soon become unwieldy and difficult to interpret.
Your manager would not appreciate being given this sort of report.
Therefore, you need to find another solution, which will represent the result in a way that is more convenient for reading and analyzing.
Technique #2 –finding the ranges
Think of the missing dates as being a number of gaps in the continuous sequence of the dates. Then, instead of generating each missing date, you can show just the beginning and the end of each gap. The code in the Listing4 demonstrates two variants of such a solution:
Both queries from the Listing 4 use self-join and produce the same result in approximately the same time. However, if you test the queries on a larger sample set; for ten years, for example, instead of one year, you will find that the second query runs much faster than the first one.
The reason for such a difference in performance is the way in which the queries deal with dates.
The first query uses an arithmetic operator (+) in order to add one day to the date. The second query for the same operation employs SQL Server built-in function.
The first query from the Listing 4 is typical in showing how the use of arithmetic operations in date/time calculations can decrease the queries’ performance.)
So far, we examined the technique #2 on relatively small data sets.
There are, however, the situations when the amount of data that needs to be processed will be significantly larger. For example, your system may collect the POS (point of sale) transactions from multi-chain department and grocery stores located in different countries and (or) time zones.
In that case, if you asked to find the missing minutes (seconds) in a few months or even years, you may find that the fastest (second) query from the Listing 4 is also inefficient.
Then, you will need to enhance your queries in order to get the acceptable execution time.
One possible enhancement would be to employ an auxiliary column, as shown in the Listing 5:
The queries from the Listing 5 are based on the idea of SQL Server MVP Erland Sommarskog (Stockholm).
All you need to do is to assign the order number to each row in the table and then to compare the values (they can be dates or numbers or even letters) in the adjacent rows, using the order numbers.
The first query in the Listing 5 uses the identity column in order to enumerate the rows in the table.
The second and third queries use ranking function ROW_NUMBER().
All three queries are much faster than the fastest (second) query from the Listing 4.
Note, that queries from the Listing 5 require column theDate (table sampleDates) to be sorted before assigning the order number. Since table sampleDates has a primary key constraint, the clustered index is also created on column theDate. That means you don’t need to sort the data in table sampleDates explicitly: clustered index will do that automatically.
One more thing that you should note is the incomplete results, returned by all the queries from the Listing 4 and Listing 5.
Indeed, Jan 01, Dec 30 and Dec 31 are missing dates in table sampleDates. However, they are not shown in the result as the gaps, because all the queries treat Jan 02, 2007 as the beginning and Dec 29, 2007 as the end of the sample data set.
In order to get the correct result, you need to insert the dates Dec 31, 2006 and Jan 01, 2008 into the table sampleDates.
Technique #3 –the Pivot table
One more technique presents missing dates in the form of the spreadsheet.
This type of output is one of the most attractive for the human eye and probably is the favorite document type of your boss.
The technique shown in the Listing 6, uses SQL Server 2005 PIVOT operator in order to produce tabular output:
The beauty of the query from the Listing 6 is not only in its cross-tab view. That query also shows days of the week, which is very convenient, when you need to analyze business (application) that runs not whole week, let us say 5-6 days per week. In that case, the missing Saturdays and Sundays will be considered as weekends, but not as the missing days.
How To Find The Missing Hours
This bonus example shows how to apply the above techniques to the case of missing hours
You can also use this paragraph as a reference, when you need to find missing minutes or seconds.
First, create a sample data set (see Listing 7):
The following script shows the solutions for missing hours (see Listing 8):
The first solution from the Listing 8 shows the missing hours as the beginning and the end of the gap.
The second solution shows the number of missing hours per each day
If this is not enough, the third solution will give you detailed information about missing hours.
The Task Completed: Here Are The Results
You found missing dates and sent the results to your boss. After short investigation, the problem was identified: some data has been deleted from database by mistake, during the application upgrade.
In that situation, everybody relied on you and once again, you rose to the challenge. You found the old backup, restored the database under another name and moved the missing data to the original database.
Here is the result of your efforts:
The customer is satisfied. The manager impressed by your skills. You can be proud of your achievements.
The source code to this article can be downloaded from the speech-bubble at the top of the article by clicking on 'TSQL Source Code'