Click here to monitor SSC
Av rating:
Total votes: 36
Total comments: 14


Phil Factor
Confessions of a DBA: My worst mistake
23 January 2012

Over the next few months, we'll be asking various well-known DBAs to describe their worst disaster caused by a mistake they made.  To kick off the series, we asked Phil Factor to confess. He came up with a classic: The mistaken belief that a backup WITH CHECKSUM guaranteed a good backup that could be restored, and the ensuing disaster.

I suspect that pretty well all serving DBAs will have memories that will make them cringe with embarrassment. Even the dignified figures that give such authoritative presentations at PASS will once have made mistakes that, I imagine, will now make them whistle at the memory. Although my most embarassing mistake is already documented as The Computer that Swore, my worst mistake was the result of assuming that a backup WITH CHECKSUM guaranteed a good backup that could be restored. It looked good but I didn’t reckon on the possibility of a data page that gets corrupted whilst in memory and then written to disk with a valid checksum. Also, I thought then that all pages had checksums; they don’t. Now, of course, I’ve sat through a couple of Paul Randal’s presentations in which he explains this, and other ways that data can be backed up even though there are corruptions. The WITH CHECKSUM is fine to catch some errors but you really need to catch all errors that would cause a bad restore. No, there is no substitute for checking that a database can be restored by actually restoring it and running DBCC CheckDB on it.

I was developing a database for a web startup. It was an ambitious system for doing wholesale commodity trading, and it had a complex data model, and a lot of data, originally held messily in a number of different ‘legacy’ systems. The process of sorting out this ‘donkey’s breakfast’ of data was a nightmare, but the time came that I was able to release to production a functional application that allowed data entry. It had, at that time, around 150 concurrent users, but the comms bandwidth throttled this enough to make it manageable. I set up full recovery with a weekly full backup with ten minute transaction log backups. We’re going back a few years, as you’ll appreciate when I tell you it was a backup onto DAT tape. I could cycle the backups to go back six weeks just to be super-sure, and the tapes were recycled.

When I first got the ‘SQL Server Fatal Error Table Integrity Suspect’ error, I ran DBCC CheckDB and determined that a couple of tables were corrupted. Nowadays, we get excellent advice of what to do, but then it was different. Books Online at that time incorrectly advised a restart but it didn’t help. We all have days like this, so thinking that some sudden hardware failure had struck the machine, I restored onto another, clean, machine from backup. As the helpful text on BOL put it ‘Restore from the latest known good backup’. It failed, leaving the database in ‘suspect’ mode. Then, of course, came the long slow process of going back through the backups with DBCC tools to work out where the corruption happened. I got to the one that I’d done six weeks ago; the oldest surviving backup. It wouldn’t restore. I had no ‘known good’ backups. Quite the contrary, I knew I hadn’t. The data corruption had been there for some time, festering and growing until the point it triggered the fatal error

I scratched my head. It was an odd phrase ‘the last known, good backup’. How do you know for certain if you don’t do a restore? I’d naively assumed that I’d guaranteed that it would restore by backing up with checksum. Wrong. The only way that one could ‘know’ is to restore it, and then run DBCC CheckDB on it. It is an interesting thought, and no backup system will save you from having to do this task.

This is the sort of moment in one’s professional career that one briefly considers donning a false beard and immediately setting off for Australia under an assumed name. It was certainly a painful moment having to tell the company directors that the database would be offline for a while. There were not only traders but a lot of people inputting information. The painful process of determining what was wrong and putting it right is a fascinating tale but not relevant to this story. Suffice it to say that a skeletal service was restored within a day, and the database burst back into full life after two working days and a weekend. A hardware failure had, undetected, gradually increased the corruption of the data pages until it caused the fatal error.

I cannot really offer an excuse. The best I can do is to say that I didn’t have the time to do a routine checkDB on a restored backup, which was probably true, but nowadays one can script this check, and there are now third-party tools (e.g. SQL Virtual Restore) that can cut down on the time required by running a DBCC CheckDB on what is, in reality, a backup rather than a restored database.

Even with the help of Powershell, SMO and all the third-party tools, those DBAs with, say, a hundred production servers in their care will soon work out that the availability of time and hardware doesn’t permit a verification of every backup. Tom Larock came up with a useful technique of using statistical sampling to do the best possible attempt to run comprehensive checks.

One sideline in this story is the skeletal service. A valuable safety-net in the case of a major disaster is to be able to quickly provide a service that is sufficient to let the business continue. Each database is different, and it takes planning and rehearsal,  but I generally find that a small restricted subset of the service can provided quickly from a 'skeletal' database with only an essential subset of the data. I always have scripts and data standing by on the server ready for this remote eventuality, and fortunately did then.

Whatever else one should take from this story, the one I hope you’d take is the idea that the only high-availability strategy that makes sense is a ‘Restore’ strategy, rather than a Backup strategy.



This article has been viewed 11596 times.
Phil Factor

Author profile: Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 25 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career. See also :

To translate this article...

Search for other articles by Phil Factor

Rate this article:   Avg rating: from a total of 36 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: No Room At The Inn
Posted by: BuggyFunBunny (view profile)
Posted on: Friday, January 27, 2012 at 1:51 PM
Message: -- No, there is no substitute for checking that a database can be restored by actually restoring it and running DBCC CheckDB on it.

I've worked big places and small ones. The smaller ones were in the habit of buying the teeniest machine possible to run the business, most often with VAR/ISP all-in-one packages driven by the GL. They never had enough machine to restore to. Not a fun way to live, and no amount of pleading would get them to spring for enough machine.

Subject: Confessions of a DBA: My worst mistake
Posted by: Glenn (view profile)
Posted on: Thursday, February 02, 2012 at 2:06 AM
Message: Phil
Would your scenario of undetected corruptions not have been avoided by doing a CheckDB on the "live" database on a regular basis? We always run CheckDB before doing a backup. We then know at least that the database is ok. Even though as you say it does not guarantee the integrity of the backup, it is highly unlikely that all backups for several weeks would be corrupt. From what I understand from your article, it wasn't really the backups that were a problem for you, the problem was that you were continuing to back up a corrupt database – or am I missing the point?
Glenn

Subject: CheckDB
Posted by: Phil Factor (view profile)
Posted on: Thursday, February 02, 2012 at 3:03 AM
Message: @Glenn
Agreed,Use CheckDB whenever opportunity presents itself, although I usually do it on a test-restore server as it then doesn't matter how long it takes (my databases tend to get big). My only excuse for not having done it before backing up for those six weeks was the pressure that both I and the Server were under at the time, and the silly idea that the 'CHECKSUM' option would have spotted any serious corruption. I could have set it to run as a weekend task before the backup. OK I goofed, but then this is a confession! You are right that I was continuing to back up a corrupt database.

Subject: Check DB
Posted by: Anonymous (not signed in)
Posted on: Thursday, February 02, 2012 at 5:52 AM
Message: Phil, you would look good with a beard, but Australia is full of criminals, you wouldn't want to live there. :)

Unfortunately I work in a place without a dedicated DBA and no budget to hire one, so as a developer, I have to keep up with our 10 or so database servers with 100+ databases. Unfortunately #2 is that I'm expected to spend my day developing (which I can bill at a higher rate) and not time making sure that everything is OK. Where would I find a good article on automating such a process so that it didn't take away what my company values as "time too valuable" to keep everything in good working order?

Subject: Re: Check DB
Posted by: Phil Factor (view profile)
Posted on: Thursday, February 02, 2012 at 6:47 AM
Message: @Anonymous

Without a doubt, I'd recommend Ola's scripts for automating all the DBA maintenance chores, including backup. It is described here by Brad.
http://www.simple-talk.com/sql/database-administration/automate-and-improve-your-database-maintenance-using-ola-hallengrens-free-script/

I do test restores via the SQL Backup GUI, but it can be done in TSQL, or by using PowerShell and SMO, pretty easily.

Subject: 3rd party virtual tools
Posted by: Jonlee (not signed in)
Posted on: Thursday, February 02, 2012 at 7:10 AM
Message: @Phil, nice article btw.

Would you say using Idera's virtual database (other products out there) constitutes as proving your have a successful restore strategy?

Thanks

Subject: Re: 3rd party virtual tools
Posted by: Phil Factor (view profile)
Posted on: Thursday, February 02, 2012 at 7:30 AM
Message: @JonLee: I would certainly do so if I'd tried using it for this! My understanding was that Idera weren't recommending it for this purpose but I'm probably out of date; if so, apologies! Their site says it 'supports read, write and DBCC maintenance commands.'
Wes Brown's review here (over a year ago) mentions the initial problems  http://sqlserverio.com/2010/09/21/software-review-ideras-virtual-database/ 
... and check out the section on CheckDB.
'You can’t run a DBCC CHECKDB on it. Since DBCC uses sparse files and snapshots to get a consistent look at the data it can’t run on a SQLvdb. SQLvdb is also using sparse files to do some of the things it does. What’s worse is the DBCC command just hangs out and looks like it is running OK. I let it run for a couple of hours before killing it.'

Subject: Re: 3rd party virtual tools
Posted by: Jonlee (not signed in)
Posted on: Thursday, February 02, 2012 at 2:51 PM
Message: Thank you for your response.

It does support checkdb, but only ever tried on one of our small DBs and not our 1Tb. It is however still a little buggy but is well on its way to providing quick way in recovering deleted data quickly (and no I dont work for idera ;-)).

It was only the other day that I was re-reading Paul's myth busters... "you should always plan a good backup strategy", he answered "no"... "You should plan a restore strategy..". This is one thing that some people forget. Me included in my youth!


Subject: Automated restore and consistency checks
Posted by: Anonymous (not signed in)
Posted on: Monday, February 06, 2012 at 7:33 AM
Message: A quick Google search turned up SQL Verify. Anybody tried this?

An interesting approach: generate a checksum of your backup set, restore your backup set and run consistency checks on it. If everything turns out ok, any subsequent copies you make of the backup set can be verified by just comparing file checksums, instead of running the restore/consistency check cycle repeatedly.

Subject: RESTORE VERIFYONLY
Posted by: Anonymous (not signed in)
Posted on: Monday, February 06, 2012 at 3:30 PM
Message: Thanks for the article!

You said "no substitute for checking that a database can be restored by actually restoring it and running DBCC CheckDB". Is a DBCC CHECKDB followed by a backup with RESTORE VERIFYONLY not good enough?

Thanks!

Subject: RESTORE VERIFYONLY
Posted by: Anonymous (not signed in)
Posted on: Monday, February 06, 2012 at 10:41 PM
Message: Is a DBCC CHECKDB followed by a backup with RESTORE VERIFYONLY not good enough?

Apparently not, according to the guys selling SQL Verify (http://www.yohz.com/products_sqlverify_restoreverifyonly.htm). Certainly food for thought.

Subject: SQL Backup 7
Posted by: Brent McCracken (not signed in)
Posted on: Tuesday, February 07, 2012 at 2:41 AM
Message: Red Gate SQL Backup 7 has the scheduling capability for restores and performing a checkdb at the end of the restore

Subject: Testing the restores.
Posted by: Phil Factor (view profile)
Posted on: Tuesday, February 07, 2012 at 12:32 PM
Message: It may be that backups are now so incredibly reliable that one can argue that the CheckDB of the live system is enough. I just can't accept that, though the risk is much lesser now. It wasn't long ago that, with DAT drives, and MO drives, that this definitely wasn't the case, and I've experienced several faulty restores whilst using dodgy media. Since this nasty incident that I've described, I've been rather keen on doing the restore+CheckDB check, and several times it has failed for various reasons, mostly due to faulty media.

Subject: antipodean DBA
Posted by: SQL Ferret (view profile)
Posted on: Sunday, February 12, 2012 at 12:52 AM
Message: Wait! I wear a false beard and I moved to Australia!

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

The Parodist: A SQL Server Application
 Every year, we ask Phil Factor to celebrate the holiday season with an article on SQL Server... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk