Pixar recently confessed, in an engaging video, that Toy Story 2 was almost lost due to a bad backup, but sometimes there is no 'almost'. Grant Fritchey casts a sympathetic eye over some catastrophic data losses, and gives advice on how to avoid what he has termed an RGE (résumé generating event).
Edwin Starr asked a slightly different question and came up with the answer “Absolutely nothing” (and yes, one could argue with that in context with the original question, but I’ll leave that for the experts). While I’m butchering quotes and stretching similes down to a single atom in width, I’ve also heard, “Backups? We don’t need no backups.”
To both these, I have to respond, “I beg to differ.” And I’m not alone. Don’t believe me? Well, assuming you can track them down, I’d suggest asking the people who used to work at these various businesses what they think about the idea of setting up a thorough, tested, and monitored backup plan.
Note, I’m not criticizing any of these people. They made mistakes. The Gods know I make mistakes almost minute by minute, so I’m not throwing stones here. But we have beautiful, perfect, lessons to learn from the mistakes of others, so let’s learn.
Ma.gnolia was a small, but well regarded, site for aggregating your links so that you had a common repository between different machines and different environments. It’s a great idea. Anyway, the company had a database of around 500 GB on MySQL. There was a backup process in place which, from what I can tell, streamed the MySQL data file to a second server.
As this article from Wired makes out, ma.gnolia was suffering from data corruption in the MySQL database. This had been an ongoing issue. It got progressively worse, as these things do, and then one day… their world stopped. They went to the backup, but the file copy had copied all the corruption too. They tried to recover through a disk recovery service, but all they ever got back was corrupted files. The database was dead, and so was the company.
What Could Have Happened
They had data corruption.
I’ve worked for a startup, so I know how this goes. Things were working well enough. The data corruption only hurt a few people or a few pieces of data and therefore was no reason to stop. Forward, faster, that’s the motto for startups.
The problem is, data corruption gets worse. It just does. Sooner or later, you’re not looking at losing a page of data, you’re looking at losing a database, which is what happened. If you identify corruption in the system, you need to fix it immediately. If it means downtime, it means downtime. You’re in a “pay me a little now or pay me a lot later” situation. Take the hit. Fix the issue.
Further, they seem not to have tested their backup. There was an implicit assumption that since the backup was in place, they had no worries. But you have to test your backups. You must. If you haven’t checked your backups, then you don’t know what’s there. If you have corruption in your production system, I’m willing to put money down that you have corruption in your backups. Further, since this evidently wasn’t an actual backup, but rather more like a mirror, it’s doubly an issue.
If they had been testing the backups they would have known that the corruption was getting copied over. They could have tried fixing things on the backup server to see what recovery path was open to them, if any. At the very least they would have known that they couldn’t rely on their backups and could have done something else, anything else.
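As a rough illustration of what a restore test might look like, here is a minimal Python sketch. It uses SQLite from the standard library purely as a stand-in because it runs anywhere; it is not what ma.gnolia ran, and the function name `verify_backup` and its `expected_tables` parameter are my own invention. The idea carries over to any RDBMS: restore the backup somewhere disposable, run the engine's consistency check, and confirm the tables you care about actually came back.

```python
import sqlite3
import tempfile
from pathlib import Path


def verify_backup(backup_path, expected_tables):
    """Restore a backup into a scratch database and return a list of problems.

    An empty list means the restore worked, the consistency check passed,
    and every expected table is present. (Hypothetical helper, SQLite only.)
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch_path = Path(scratch_dir) / "restore_test.db"
        backup = sqlite3.connect(str(backup_path))
        scratch = sqlite3.connect(str(scratch_path))
        try:
            # The "restore": copy the backup into a throwaway database.
            backup.backup(scratch)

            # Ask the engine whether the restored data is internally consistent.
            result = scratch.execute("PRAGMA integrity_check").fetchall()
            if result != [("ok",)]:
                problems.append(f"integrity check failed: {result}")

            # Confirm the tables that define the business actually came back.
            rows = scratch.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            ).fetchall()
            missing = set(expected_tables) - {name for (name,) in rows}
            if missing:
                problems.append(f"missing tables: {sorted(missing)}")
        except sqlite3.Error as exc:
            # Raised when the backup file is unreadable or not a database at all.
            problems.append(f"restore failed: {exc}")
        finally:
            scratch.close()
            backup.close()
    return problems
```

Run something like this on a schedule and treat a non-empty result as an incident, not a curiosity. On SQL Server the equivalent would be a restore to a test server followed by a consistency check.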
- Ma.gnolia founder discusses site outage and data loss
- Ma.gnolia Effect: Should we trust the clouds (yes, but listen to this man)
- Ma.gnolia’s Bad Day
- Ma.gnolia Data is Gone For Good
CouchSurfing is a networking site (still in operation, but without the original data) that allows you to arrange a place to sleep for little to no cost. Another great idea. This is another startup running on a MySQL database. Once more, they were running a backup process that entailed copying the files, not running any type of backup operation.
A drop table statement was issued. On production. Against their most important tables. All of them. All at once. Oops.
So, they went to the backups, only to find that there were only partial backups from parts of the file system, not from all the tables in the database. The company was out of business. It was so popular with its users, though, that it was resuscitated and rebuilt from scratch, but the original database was destroyed.
What Could Have Happened
I know this is only the second example, but hopefully you’re beginning to sense a pattern. You need to test your backups. You can’t simply trust that you have a backup system in place, especially since it’s entirely possible that the choices that have been made might not lead to a good recovery of your database. You need to know that you can restore the databases that define your business. The only way to do this for sure is to test them.
You also need to get away from the idea that copying files is the same thing as a backup. Database systems are actually complex pieces of software with multiple moving parts. I don’t know the details about backups in MySQL, but clearly it’s possible to miss files that define the database. Same thing in SQL Server. Not to mention that SQL Server locks files, so either your backup system will skip them, or it will copy them, but they won’t be transactionally consistent. This means when you go to restore them, it’ll fail like you have data corruption. Figure out how backups work within your RDBMS and perform them appropriately. If you’re going to use a file copy system, two things apply: you need to test restores from it just like you do from a database backup process, and it needs to be transactionally aware.
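To make the distinction between a file copy and a backup concrete, here is a small Python sketch. It uses SQLite only because it ships with Python; MySQL and SQL Server each have their own backup commands, which I'm not reproducing here. The function name `engine_backup` is hypothetical. The point is that the copy goes through the database engine, which knows about its own locks and transactions, rather than through the file system, which doesn't.

```python
import sqlite3


def engine_backup(source_path, backup_path):
    """Back up a SQLite database through the engine's online backup API.

    Hypothetical helper: the engine copies pages under its own locking, so
    the result is a transactionally consistent snapshot rather than whatever
    bytes happened to be on disk at the moment a file copy ran.
    """
    source = sqlite3.connect(str(source_path))
    target = sqlite3.connect(str(backup_path))
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()
```

A backup taken this way still needs a test restore before you can rely on it.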
Don’t do this, and you’re left with nothing, no couch, no surfing, no TV Party.
Lest you think this is all about small startups, let’s talk about Danger, or as it’s known on the web Microsoft/Danger. Yes, you read that correctly, Microsoft failed a backup process. It can, and does, happen to large companies as well as small ones.
The Danger Data Center was used by the T-Mobile Sidekick phone to act as an off-the-phone backup for personal data. I have not been able to identify what type of database system we’re talking about here. It may have been SQL Server but according to several stories the applications and structures had not yet been fully taken into Microsoft methods and practices. There were backups in place.
It’s not completely clear what happened. It was reported by many sources that the service went away over a weekend. An announcement was made that anyone who powered off their phones, or who had powered off their phones, lost all data. Then, a couple of weeks later, the data was recovered from older backups, so there was some data loss, but not all. Officially, Microsoft lists it as a service outage, not a data loss, not nothing.
The rumors, which were reported widely by the technology media, are that there was a SAN upgrade going on, managed by Hitachi. Hitachi allegedly switched everything over to a new SAN, but then it went offline. Someone asked for backups but, according to reports, no one took any before starting the process. The full story isn’t clear from what I can find.
What Should Have Happened
If you’re performing an operation on your production system, step one must be, run new backups. Always. You can’t go wrong being paranoid about your production data. I know that you’ve performed the SAN migration 1000 times in the last week. I know you wrote the manual on it. I know you wrote a book on it. And, I know that stuff happens, so I’m going to run a backup now, OK? I’m probably also going to take a copy of last night’s backup and the log backups for the last 24 hours too.
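One way to make “run new backups first” hard to skip is to wrap the risky operation so that it simply won’t start without a fresh backup in hand. The following Python sketch shows the shape of such a guard. Everything named here (`run_with_fresh_backup`, `take_backup`, `verify`, `NoFreshBackupError`) is hypothetical; in real life `take_backup` would call whatever your RDBMS provides, and `verify` would ideally be a proper test restore.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


class NoFreshBackupError(RuntimeError):
    """Raised when a risky operation is attempted without a usable new backup."""


def run_with_fresh_backup(
    take_backup: Callable[[], Path],
    verify: Callable[[Path], bool],
    operation: Callable[[], T],
) -> T:
    """Take and verify a new backup, then (and only then) run the operation."""
    backup_path = take_backup()

    # A backup that is missing or empty is no backup at all.
    if not backup_path.is_file() or backup_path.stat().st_size == 0:
        raise NoFreshBackupError(f"no usable backup at {backup_path}")

    # Delegate the real check (ideally a test restore) to the caller.
    if not verify(backup_path):
        raise NoFreshBackupError(f"backup at {backup_path} failed verification")

    return operation()
```

The design choice is deliberate: the migration is passed in as a callable, so there is no code path that reaches it without going through the backup step.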
It’s that simple. Your data is only as good as your last backup.
Oh, which brings up one more thing, why did it take two weeks plus to recover? I’m guessing here, so assume this is absolute speculation, not knowledge. They perhaps didn’t know where the backups were kept and it took them a while to locate them and validate them. Possibly. They might never have done a restore with any of the systems before, so there was a lot of trial and error and fumbling until everything was back online. You need to practice recovering your systems so you can do it when you’re under pressure.
- What Caused the Sidekick Fail?
- Sidekick Data Returns
- T-Mobile and Microsoft/Danger data loss is bad for the cloud
- Microsoft and Danger to Blame for Sidekick Data Loss
This was (and is again, they’re back, reincarnated after selling the domain name) a blog hosting service. They were running, I think, SQL Server (for sure, some type of SQL database). According to several reports, including this one from TechCrunch, there was nothing even approaching a backup in place. Instead, they were relying on the fact that they had a RAID array as their “backup”.
Through some series of circumstances, entire databases were dropped. RAID, which is meant to back up HARDWARE, not data, simply did what it was supposed to do and dropped the databases on the mirrored drives, as instructed. Poof. Gone Daddy Gone.
What Should Have Happened
Oh, I don’t know, maybe if they had SET UP A BACKUP PROCESS instead of assuming that hardware redundancy was the same thing as having a backup. For those unclear about the issue, it isn’t. You need to have a backup. Further, you should have a backup that goes offsite. Buildings burn/flood/fall/collapse/get hit by meteors/get stepped on by Godzilla/lose power/have roofs collapse (sometimes not even from Godzilla). And it’s not like, in this case, there wasn’t a warning. You need to take a backup of your data, and you need to have a copy of that backup somewhere other than your building. That’s assuming that your entire business is built around the data. If it’s completely unimportant (your collection of LOLCat pictures for example), then sure, maybe take a backup, maybe not. But if your business is built around that data, protect it. Back it up, or watch it go away.
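Getting a copy offsite doesn’t have to be elaborate. As a minimal Python sketch, assuming the offsite location is reachable as a directory (a mounted network share, say; the function names `copy_offsite` and `sha256_of` are placeholders of mine), you copy the file and then confirm that the checksum of the copy matches the original. A matching checksum only tells you the copy is faithful. It says nothing about whether the backup itself will restore.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(backup_file, offsite_dir):
    """Copy a backup to an offsite directory and verify the copy by checksum.

    Hypothetical helper: offsite_dir stands in for wherever your second
    location really is.
    """
    backup_file = Path(backup_file)
    offsite_dir = Path(offsite_dir)
    offsite_dir.mkdir(parents=True, exist_ok=True)

    target = offsite_dir / backup_file.name
    shutil.copy2(backup_file, target)

    # Re-read both files so a truncated or damaged copy is caught now,
    # not on the day you need it.
    if sha256_of(backup_file) != sha256_of(target):
        target.unlink()
        raise OSError(f"checksum mismatch copying {backup_file} to {target}")
    return target
```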
Sorry, but this one bothers me. It’s not just that a company apparently made a mistake; a hosting company was also involved and they also made a mistake. This many people shouldn’t make this many mistakes.
This isn’t the same thing at all. Instead of a database, we’re talking about files, but the story is so perfect an illustration of my entire point that I have to include it.
This one is a classic. Someone basically issued ‘DELETE *.*’ to their file system… the one that stored the movie under production at the time, Toy Story 2. When they went to their backups, they found that the backups had been failing for over a month. This was late in production and they would have to recreate at least a year’s worth of work to get everything back online.
Luckily, for all of us, the head geek had been copying the movie to her home computer to show her kids. So she had it, complete and intact.
What Should Have Happened
We’re back to validating your backups. You simply can’t assume that the backup is good. You have to get some method of validation. If you don’t validate the backups, you’re left with nothing but a ball of confusion.
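Backups that fail silently for a month are a monitoring problem as much as a backup problem. A check as small as the following Python sketch, run on a schedule, would at least notice that nothing new has shown up. The directory layout, the `*.bak` pattern, the 26-hour threshold and the function names are all assumptions for illustration, and a fresh file is still no substitute for a test restore.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir, pattern="*.bak", now=None) -> Optional[float]:
    """Return the age in hours of the newest non-empty backup, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [
        p.stat().st_mtime
        for p in Path(backup_dir).glob(pattern)
        if p.is_file() and p.stat().st_size > 0  # ignore zero-byte "backups"
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def check_backups(backup_dir, max_age_hours=26.0, pattern="*.bak", now=None) -> str:
    """Return a one-line status suitable for feeding to an alerting system."""
    age = newest_backup_age_hours(backup_dir, pattern, now)
    if age is None:
        return "CRITICAL: no non-empty backups found"
    if age > max_age_hours:
        return f"CRITICAL: newest backup is {age:.1f} hours old"
    return f"OK: newest backup is {age:.1f} hours old"
```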
The score is two dead for certain, one resurrected and one “nothing to see here, move along.” However, each of these was a completely catastrophic data loss which could have been prevented with a few simple steps:
- Take a backup of your databases
- Validate your backups through testing
- Validate your databases through consistency checks
- Move a copy of the backups offsite
- Test your restore process
Yes, all of this is extra labor, but none of it is difficult or unknown. There are studies that show that small-to-medium businesses are not prepared and will, in fact, go out of business due to data loss (although the precise numbers are absolutely in dispute). This information is available to businesses, insurance companies and data professionals. In short, you have no excuses. If you’re not doing backups, testing backups, and monitoring backups, you’re just waiting for a résumé generating event (RGE) to occur.
- SQL Server Backup Crib Sheet
- Confessions of a DBA, my worst mistake
- Grant Fritchey's SQL Server Howlers