To Fly, To Serve, To Fry Your Servers

So, the story goes that an Ops engineer walked into a data center with the necessary pass, a cheery wave and a ‘good morning’. Shortly afterwards, he made history. At around 8.30am, British Airways’ entire communications system went down at the height of the May bank holiday, forcing them to cancel flights from the UK’s two main airports on the busiest weekend of the year. The disruption continued for days, with about 600 flights cancelled, 75,000 people stranded, and an estimated cost to BA of well over £100 million.

The engineer in question, we are told by BA, was a contractor doing maintenance work at Boadicea House, one of two BA data centers near Heathrow Airport. Apparently, he disconnected a UPS (uninterruptible power supply) for about 15 minutes but then returned power to the servers in an “uncontrolled fashion”. He may have “interrupted the automatic switchover sequence” between backup and generator power supplies, causing a surge that damaged servers and shut down the entire data center.

The details of what happened are still sketchy, and the investigation so far has focused on human error. “The engineer was authorized to be on site but not to do what he did,” said Willie Walsh, chief executive of BA’s parent company, IAG.

The corporate instinct for self-preservation has obscured the shyer instinct to reveal the entire truth, so we can only speculate. However, if you put a person in a position where they can accidentally cause such calamity, then that’s not “human error” but catastrophic process failure. Normal operational practice should make it impossible for a single mistake during a power restart procedure to have such catastrophic consequences.

Countless questions have been raised about lack of investment, poor testing, and the state of the hardware at BA’s aging data center, and particularly about the failure of their automatic failover and disaster recovery systems. Why did a failure at one data center have such a drastic and long-lasting impact around the world? After all, companies like Google can switch out entire data centers within seconds, as a matter of routine.

There should have been instantaneous failover to one of BA’s neighboring Heathrow data centers, Comet House or Cranebank. Instead, the power surge at Boadicea House may have corrupted data that was then synchronized to the secondary data center, causing a subsequent failure there too. Was BA’s disaster recovery plan properly tested? Many companies test failover only for certain applications, and only as part of a controlled, staged process, neglecting to test a real, uncontrolled shutdown. And once the automatic failover had failed, why couldn’t BA fall back on the standard procedure of a manual failover from backups to one of their remote data centers?
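
A backup or standby that has never actually been restored is a hope, not a plan, and the only way to know the difference is to drill it routinely. As a minimal sketch of that idea (not BA’s setup: the backup location, table name and row threshold below are hypothetical, and a file copy stands in for whatever your platform’s real restore step is), a scheduled job like this restores the latest backup into scratch space and checks both its structural integrity and a basic business-level sanity condition:

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical locations and thresholds -- substitute your own.
BACKUP_DIR = Path("/var/backups/critical-db")
MIN_EXPECTED_ROWS = 1_000  # sanity floor for a key table


def latest_backup(backup_dir: Path) -> Path:
    """Pick the most recent backup file by modification time."""
    backups = sorted(backup_dir.glob("*.sqlite3"), key=lambda p: p.stat().st_mtime)
    if not backups:
        raise RuntimeError(f"No backups found in {backup_dir}")
    return backups[-1]


def restore_and_verify(backup_file: Path) -> None:
    """Restore the backup into scratch space and run basic checks against it."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored.sqlite3"
        shutil.copy2(backup_file, restored)  # the 'restore' step for a file-based backup

        conn = sqlite3.connect(restored)
        try:
            # Structural check: does the restored file pass SQLite's own integrity check?
            (status,) = conn.execute("PRAGMA integrity_check").fetchone()
            if status != "ok":
                raise RuntimeError(f"Integrity check failed: {status}")

            # Business check: is the data plausibly complete? ('bookings' is a made-up table.)
            (rows,) = conn.execute("SELECT COUNT(*) FROM bookings").fetchone()
            if rows < MIN_EXPECTED_ROWS:
                raise RuntimeError(f"Only {rows} rows in bookings; expected at least {MIN_EXPECTED_ROWS}")
        finally:
            conn.close()

    print(f"Restore drill passed for {backup_file.name}")


if __name__ == "__main__":
    restore_and_verify(latest_backup(BACKUP_DIR))
```

In a production environment the restore step would be a database restore or a standby promotion rather than a file copy, but the shape is the same: restore somewhere disposable, on a schedule, and alarm loudly the moment the drill fails.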

Proper disaster recovery planning and testing involves substantial risk, cost and manpower, but it is the only way to be sure that recovery will actually work when you need it to. Even in the face of the most unpredictable circumstances, your organization is obliged to ensure that essential business processes can continue working. BA failed to do that.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.

  • callcopse

    Whilst it may make me look like one of those people who slows down to look at crashes on motorways, I’d love to be a fly on the wall in some of the subsequent meetings.

    In all seriousness I would take it as a really good move for BA to publish a sanitized blog or similar putting forth the technical lessons they have learned from this experience. It might even be really good PR.

    • Gina Taylor

Hi @callcopse, congratulations on winning this week’s commentary competition! Drop an email to newsletter@simple-talk.com and we’ll arrange your gift card.

  • Peter Schott

I worked for an org that had an annual off-site data recovery test. One of our admins went to an off-site location, took our tapes, install media, and keys, and had to get our entire solution working again as quickly as reasonably possible. Because this was practice, it was done during business hours rather than as a scramble. It did point out issues with our RTO process, which were addressed after his return. Most times the process went smoothly, so few, if any, changes were needed.

More recently our process has been around being able to build all of our apps and deploy them from source. That worked well, leaving database recovery as the main concern. Off-site backup copies help quite a bit there, with the remaining worry being validation of those backups. We’re slowly moving those over to Azure SQL so backups will be handled by Azure.

    That leaves the main concern being poorly written queries or malicious code. We run authorized ad-hoc scripts through a small team to validate that they’ll do what we expect and nothing more. We audit all other access and are able to restore in the case of something bad getting through.

  • Keith Rowley

    Was it here I read about a company that regularly and randomly crashed their different data centers etc just to prove they were reliable? (Maybe Netflix.)

This seems both extreme and essentially the only really reliable way to verify this.

  • TodConover

There should be an app for that. I’m not kidding. Our industry is way bad at disaster recovery and everything like it… like regular old backups, version control, and just recalling what we did this morning. And it shouldn’t be so. Why not a standard, a methodology, an operating system, a development environment that protects against these things? All you need is a system that remembers. Hey isn’t that what all computer apps do? Why isn’t my every keystroke recorded, such that recovery is possible? And while we’re at it, why isn’t every insert, file move, and deletion recorded? And if we know what, then we can know when and who. And if we know all that then it can be coordinated with every other what, when and who in the world so that, if need be, we can recall it and get back to any place and time we want. Is that so hard?

  • willliebago

    Seems like another typical company that doesn’t want to pay for something that directly adds cost and does not contribute to revenue.
    How do we convince management that this technical debt will eventually come due?

  • Matt

    I am constantly amazed at the number of companies in Australia that regard DR processes as purely optional. I have been with only one company that took DR seriously and, when a major problem hit (floods), the site was in a flood zone and we had to rip it out prior to the floods hitting (and the water never made it past the front door).
Government “projects” all want DR, but often confuse “high availability” with DR as if they were the same thing, and promptly cut the funding for the DR segment when the costs start mounting too high.