Grant Fritchey

Backups, What Are They Good For?

15 May 2012

Pixar recently confessed, in an engaging video, that Toy Story 2 was almost lost due to a bad backup, but sometimes there is no 'almost'. Grant Fritchey casts a sympathetic eye over some catastrophic data losses, and gives advice on how to avoid what he has termed an RGE (résumé generating event).

Edwin Starr asked a slightly different question and came up with the answer “Absolutely nothing” (and yes, one could argue with that in context with the original question, but I’ll leave that for the experts). While I’m butchering quotes and stretching similes down to a single atom in width, I’ve also heard, “Backups? We don’t need no backups.”

To both these, I have to respond, “I beg to differ.” And I’m not alone. Don’t believe me? Well, assuming you can track them down, I’d suggest asking the people who used to work at these various businesses what they think about the idea of setting up a thorough, tested, and monitored backup plan.

Note, I’m not criticizing any of these people. They made mistakes. The Gods know I make mistakes almost minute by minute, so I’m not throwing stones here. But we have beautiful, perfect lessons to learn from the mistakes of others, so let’s learn.

ma.gnolia

Ma.gnolia was a small, but well regarded, site for aggregating your links so that you had a common repository between different machines and different environments. It’s a great idea. Anyway, the company had a MySQL database of around 500GB. There was a backup process in place, which was, from what I can tell, to stream the MySQL data file to a second server.

What Happened

As this article from Wired makes out, ma.gnolia was suffering from data corruption in the MySQL database. This had been an ongoing issue. It got progressively worse, as these things do, and then one day… their world stopped. They went to the backup, but the file copy had copied all the corruption too. They tried to recover through a disk recovery service, but all they ever got back was corrupted files. The database was dead, and so was the company.

What Could Have Happened

They had data corruption.

I’ve worked for a startup, so I know how this goes. Things were working well enough. The data corruption only hurt a few people or a few pieces of data and therefore was no reason to stop. Forward, faster, that’s the motto for startups.

The problem is, data corruption gets worse. It just does. Sooner or later, you’re not looking at losing a page of data, you’re looking at losing a database, which is what happened. If you identify corruption in the system, you need to fix it immediately. If it means downtime, it means downtime. You’re in a “pay me a little now or pay me a lot later” situation. Take the hit. Fix the issue.
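
To make that concrete: in SQL Server (ma.gnolia was on MySQL, but the idea is the same there), a regular consistency check is a single command run on a schedule, with the database name below standing in for your own. Any error it reports is a drop-everything problem, not a backlog item.

    -- Check the whole database for allocation and structural corruption.
    -- Treat any error in the output as an emergency, not a ticket for next sprint.
    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;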

Further, they seemed to have not tested their backup. There was an implicit assumption that since the backup was in place, they had no worries. But you have to test your backups. You must. If you haven’t checked your backups, then you don’t know what’s there. If you have corruption in your production system, I’m willing to put money down you have corruption in your backups. Further, since this evidently wasn’t an actual backup, but rather more like a mirror, it’s doubly an issue.

If they had been testing the backups they would have known that the corruption was getting copied over. They could have tried fixing things on the backup server to see what recovery path was open to them, if any. At the very least they would have known that they couldn’t rely on their backups and could have done something else, anything else.
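
In SQL Server terms, "testing the backup" looks something like the following sketch: restore the latest backup to a scratch database on another server and run a consistency check against the copy. The paths and logical file names here are placeholders; use whatever your backup actually contains.

    -- Restore the most recent backup to a throwaway database on a test server...
    RESTORE DATABASE YourDatabase_Verify
    FROM DISK = N'\\backupserver\sql\YourDatabase_Full.bak'
    WITH MOVE N'YourDatabase'     TO N'D:\Verify\YourDatabase_Verify.mdf',
         MOVE N'YourDatabase_log' TO N'D:\Verify\YourDatabase_Verify.ldf',
         RECOVERY;

    -- ...then check it. If the corruption made it into the backup, this is where you find out.
    DBCC CHECKDB (N'YourDatabase_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;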


CouchSurfing

CouchSurfing is a networking site (still in operation, but without the original data) that allows you to arrange a place to sleep for little to no cost. Another great idea. This is another startup running on a MySQL database. Once more, they were running a backup process that entailed copying the files, not running any type of backup operation.

What Happened

A DROP TABLE statement was issued. On production. Against their most important tables. All of them. All at once. Oops.

So, they went to the backups, only to find partial backups of parts of the file system, but not of all the tables in the database. The company was out of business. It was so popular with its users that it was eventually resuscitated and rebuilt from scratch, but the original database was destroyed.

What Could Have Happened

I know this is only the second example, but hopefully you’re beginning to sense a pattern. You need to test your backups. You can’t simply trust that you have a backup system in place, especially since it’s entirely possible that the choices that were made won’t lead to a good recovery of your database. You need to know that you can restore the databases that define your business. The only way to know for sure is to test them.

You also need to get away from the idea that copying files is the same thing as a backup. Database systems are complex pieces of software with multiple moving parts. I don’t know the details of backups in MySQL, but clearly it’s possible to miss files that define the database. Same thing in SQL Server. Not to mention that SQL Server locks its files, so either your backup system will skip them, or it will copy them but they won’t be transactionally consistent. This means that when you go to restore them, it’ll fail just as if you had data corruption. Figure out how backups work within your RDBMS and perform them appropriately. If you’re going to use a file copy system, two things: you need to test restores from it just like you would from a database backup process, and it needs to be transactionally aware.
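
For SQL Server, "appropriately" means using the engine’s own backup commands rather than copying .mdf files around. A minimal sketch, with placeholder names and paths:

    -- A real, transactionally consistent backup, with page checksums verified as it is written.
    BACKUP DATABASE YourDatabase
    TO DISK = N'D:\Backups\YourDatabase_Full.bak'
    WITH CHECKSUM, INIT;

    -- A quick sanity check that the file is readable and the checksums still hold.
    -- Not a substitute for a test restore, but it catches the obvious failures.
    RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\YourDatabase_Full.bak'
    WITH CHECKSUM;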

Don’t do this, and you’re left with nothing, no couch, no surfing, no TV Party.


Danger

Lest you think this is all about small startups, let’s talk about Danger, or, as it’s known on the web, Microsoft/Danger. Yes, you read that correctly: Microsoft failed a backup process. It can, and does, happen to large companies as well as small ones.

The Danger Data Center was used by the T-Mobile Sidekick phone as an off-the-phone backup for personal data. I have not been able to identify what type of database system we’re talking about here. It may have been SQL Server, but according to several stories the applications and structures had not yet been fully brought into Microsoft’s methods and practices. There were backups in place.

What Happened

It’s not completely clear. It was reported by many sources that the service went away over a weekend. An announcement was made that anyone who powered off their phone, or who had already powered it off, lost all data. Then, a couple of weeks later, the data was recovered from older backups, so there was some data loss, but not all of it. Officially, Microsoft lists it as a service outage, not a data loss, not nothing.

The rumor, which was reported widely by the technology media, is that there was a SAN upgrade going on, managed by Hitachi. Hitachi allegedly switched everything over to a new SAN, but then it went offline. Someone asked for backups, but, according to reports, no one had taken any before starting the process. The full story isn’t clear from what I can find.

What Should Have Happened

If you’re performing an operation on your production system, step one must be: run new backups. Always. You can’t go wrong being paranoid about your production data. I know that you’ve performed the SAN migration 1000 times in the last week. I know you wrote the manual on it. I know you wrote a book on it. And I know that stuff happens, so I’m going to run a backup now, OK? I’m probably also going to take a copy of last night’s backup and the log backups for the last 24 hours too.
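
In SQL Server, that pre-maintenance backup doesn’t even have to disturb the regular backup chain. Something like this sketch, with placeholder names and paths, run immediately before the risky work starts:

    -- An out-of-band full backup. COPY_ONLY means it won't interfere with
    -- the existing differential and log backup chain.
    BACKUP DATABASE YourDatabase
    TO DISK = N'D:\Backups\PreMigration\YourDatabase_PreSANMigration.bak'
    WITH COPY_ONLY, CHECKSUM, INIT;

    -- And a log backup, so you can recover right up to the point the work began.
    BACKUP LOG YourDatabase
    TO DISK = N'D:\Backups\PreMigration\YourDatabase_PreSANMigration.trn'
    WITH CHECKSUM, INIT;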

It’s that simple. Your data is only as good as your last backup.

Oh, which brings up one more thing: why did it take more than two weeks to recover? I’m guessing here, so assume this is absolute speculation, not knowledge. They perhaps didn’t know where the backups were kept and it took them a while to locate and validate them. Possibly. They might never have done a restore with any of the systems before, so there was a lot of trial and error and fumbling until everything was back online. You need to practice recovering your systems so that you can do it when you’re under pressure.
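
Practicing means actually walking the restore chain, not just knowing one exists. In SQL Server it looks roughly like the sketch below (file names are placeholders); if you’ve never typed these commands under calm conditions, you don’t want the first time to be at 3 a.m. with the business offline.

    -- Restore the full backup, leaving the database ready to accept log backups.
    RESTORE DATABASE YourDatabase
    FROM DISK = N'D:\Backups\YourDatabase_Full.bak'
    WITH NORECOVERY, REPLACE;

    -- Apply the log backups in order; only the last one brings the database online.
    RESTORE LOG YourDatabase
    FROM DISK = N'D:\Backups\YourDatabase_Log_01.trn'
    WITH NORECOVERY;

    RESTORE LOG YourDatabase
    FROM DISK = N'D:\Backups\YourDatabase_Log_02.trn'
    WITH RECOVERY;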


JournalSpace.com

This was (and is again; they’re back, reincarnated after selling the domain name) a blog hosting service. They were running, I think, SQL Server (certainly some type of SQL database). According to several reports, including this one from TechCrunch, there was nothing even approaching a backup in place. Instead, they were relying on the fact that they had a RAID array as their “backup”.

What Happened

Through some series of circumstances, entire databases were dropped. RAID, which protects against HARDWARE failure, not data loss, simply did what it was supposed to do and dropped the databases on the mirrored drives, as instructed. Poof. Gone Daddy Gone.

What Should Have Happened

Oh, I don’t know, maybe if they had SET UP A BACKUP PROCESS instead of assuming that hardware redundancy was the same thing as having a backup. For those unclear about the issue, it isn’t. You need to have a backup. Further, you should have a backup that goes offsite. Buildings burn/flood/fall/collapse/get hit by meteors/get stepped on by Godzilla/lose power/have roofs collapse (sometimes not even from Godzilla). And it’s not like, in this case, there wasn’t a warning. You need to take a backup of your data, and you need to have a copy of that backup somewhere other than your building. That’s assuming that your entire business is built around the data. If it’s completely unimportant (your collection of LOLCat pictures for example), then sure, maybe take a backup, maybe not. But if your business is built around that data, protect it. Back it up, or watch it go away.
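
One way to get that second copy at backup time in SQL Server is the MIRROR TO clause, which writes the same backup to two destinations at once. It requires Enterprise Edition; if you don’t have that, take the backup locally and copy the file offsite afterwards. Paths here are placeholders:

    -- Write the backup to local disk and, at the same time, to a share in another building.
    -- MIRROR TO requires Enterprise Edition and WITH FORMAT.
    BACKUP DATABASE YourDatabase
    TO DISK = N'D:\Backups\YourDatabase_Full.bak'
    MIRROR TO DISK = N'\\offsite-server\sql-backups\YourDatabase_Full.bak'
    WITH FORMAT, CHECKSUM;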

Sorry, but this one bothers me. It’s not just that a company apparently made a mistake; a hosting company was also involved, and they made a mistake too. This many people shouldn’t make this many mistakes.


Pixar

This isn’t the same thing at all. Instead of a database, we’re talking about files, but the story is so perfect an illustration of my entire point that I have to include it.

What Happened

This one is a classic. Someone basically issued ‘DELETE *.*’ to their file system… the one that stored the movie under production at the time, Toy Story 2. When they went to their backups, they found that they had been failing for over a month. This was late in production and they would have to recreate at least a year’s worth of work to get everything back online.

Luckily, for all of us, the head geek had been copying the movie to her home computer to show her kids. So she had it, complete and intact.

What Should Have Happened

We’re back to validating your backups. You simply can’t assume that the backup is good. You have to get some method of validation. If you don’t validate the backups, you’re left with nothing but a ball of confusion.


Summary

The score is two dead for certain, one resurrected and one “nothing to see here, move along.” However, each of these was a completely catastrophic data loss which could have been prevented with a few simple steps:

  • Take a backup of your databases
  • Validate your backups through testing
  • Validate your databases through consistency checks
  • Move a copy of the backups offsite
  • Test your restore process

Yes, all of this is extra labor, but none of it is difficult or unknown. There are studies showing that small-to-medium businesses are not prepared and will, in fact, go out of business due to data loss (although the precise numbers are absolutely in dispute). This information is available to businesses, insurance companies and data professionals. In short, you have no excuses. If you’re not doing backups, testing backups, and monitoring backups, you’re just waiting for a résumé generating event (RGE) to occur.


Grant Fritchey

Author profile:

Grant Fritchey, SQL Server MVP, works for Red Gate Software as Product Evangelist. In his time as a DBA and developer, he has worked at three failed dot-coms, a major consulting company, a global bank and an international insurance & engineering company. Grant volunteers for the Professional Association of SQL Server Users (PASS). He is the author of the books SQL Server Execution Plans (Simple-Talk) and SQL Server 2008 Query Performance Tuning Distilled (Apress). He is one of the founding officers of the Southern New England SQL Server Users Group (SNESSUG) and its current president. He earned the nickname “The Scary DBA.” He even has an official name plate, and displays it proudly.


Subject: Multiple offsite backups
Posted by: timothyawiseman@gmail.com (view profile)
Posted on: Tuesday, May 15, 2012 at 1:24 PM
Message: Another great article, and I rather like the case studies.

You do mention keeping a copy of the backups offsite, which I think is wise. But I think it's worth highlighting that there is value in having at least two sets of offsite backups. Physical media is subject to corruption, especially if exposed to any kind of extremes during transit. For anything truly mission critical, it makes sense to have at least two offsite copies.

Also, I find it hard to think of a collection of LOLCat pictures as completely unimportant data.

Subject: re: Multiple offsite backups
Posted by: Grant.Fritchey (view profile)
Posted on: Thursday, May 17, 2012 at 8:24 AM
Message: Can't argue with you at all. It's hard to be too paranoid depending on how important this data is to your company.

And lolcats are very unimportant. Loldogs though... I have multiple copies in several different locations.

Subject: Danger failure
Posted by: Geoff Hiten (SQLCraftsman) (not signed in)
Posted on: Friday, May 18, 2012 at 7:15 AM
Message: Grant,

It was Oracle, Sun, and Linux that crashed and took out Danger. Microsoft didn't want to admit they were running non-MS technologies, but that was what Danger ran when they got bought.
http://www.tgdaily.com/software-features/44329-oracle-linux-sun-to-blame-for-sidekick-danger-data-loss

Subject: Backups
Posted by: dba-one (view profile)
Posted on: Friday, May 25, 2012 at 7:25 AM
Message: As a DBA, I understand without having to be told that backups are one of the fundamentals of my occupation. It may be understandable if you fail in any particular task but it is not understandable or acceptable to fail when it comes to backup and restore.

On top of SQL Server based backups or perhaps a 3rd party tool I have never assumed a network admin or anyone else for that matter would assure something made it to tape, off site, etc. If you are ultimately responsible for something never think that someone else is doing their part.

Mistakes will happen but not having a dependable backup isn't a mistake. It is the result of either stupidity, laziness or both.

Subject: Validation
Posted by: DougTucker (view profile)
Posted on: Monday, May 28, 2012 at 10:39 AM
Message: Message received - gotta have real backups!

Could you please elaborate on the process of validating backups and validating databases?

Subject: re: Validation
Posted by: Grant.Fritchey (view profile)
Posted on: Monday, June 04, 2012 at 8:14 AM
Message: I actually have another article all about testing backups here on Simple-Talk.

 
