Failure in any software or hardware is inevitable over time. Failures will occur in your own code, in the services your solution relies on, or in the hardware on which it runs. How, then, do you design your solutions to be resilient, with failure in mind? In the world of services and software, we often talk about failures in terms of downtime, crashes and interruptions in availability. We even have hard metrics defined in the form of Service Level Agreements (SLAs) that indicate just how much failure is acceptable for a particular service. Some SLAs, for example, will state that a site or service will have a required ‘uptime’ of 99.95%. How, then, do you define what constitutes a failure, or what falls within that 0.05%? More importantly, how do you deal with it?
There are several techniques to deal with failures that minimize their effect on the service.
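Before looking at those techniques, it can help to see what a figure such as 99.95% means in practice. The following is a minimal sketch of the arithmetic; the function name is my own, and real SLAs define their measurement period and exclusions in their own terms.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime that an uptime target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# A 99.95% target measured over a 30-day month and over a 365-day year.
monthly = allowed_downtime_minutes(99.95, 30)
yearly = allowed_downtime_minutes(99.95, 365)
print(f"30 days: {monthly:.1f} min, 365 days: {yearly / 60:.2f} h")
```

In other words, that 0.05% amounts to roughly 21.6 minutes in a 30-day month, or a little under four and a half hours over a year.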
Accessible vs. Available
The two words ‘accessible’ and ‘available’ are often used interchangeably, but they are certainly different. A website can be accessible from the internet, meaning that you can reach the server that it is hosted on, even though it may not be available to process your request. Admittedly, this distinction would seem a bit trite to any consumer of this website. If they can’t actually use the website, do they care that they can at least reach it? The answer is that it is likely their experience during the failure that matters. Do they get the yellow and white ASP.NET Screen of Death? Do they see a simple browser-supplied HTTP 500 error page, or do they see a page showing that the system is currently experiencing difficulty and will be back up soon? The page could even display some static marketing information about the website or company to keep them occupied, and perhaps an update on what is happening with the service.
Degrade, but Don’t Fail
Several years ago I was working at a company where one of the developers was elected to be the main test resource for a line-of-business application project. The other developers were flabbergasted when the developer-turned-tester brought them over to his cube and showed them an ASP.NET error screen from the application. It turned out that, while he was performing an operation in the application, he had reached over and pulled the network cable from the development database server. The application code didn’t care for this at all, but the tester was proving the point that there are a lot more possible failure points than the ones you have caught gracefully as exceptions in your code. Without access to the data, the application really couldn’t accomplish much at all; however, the error screen wasn’t the experience that the tester felt was appropriate.
In the early days of Twitter, users were greeted on many occasions with the ‘Fail Whale’ image, indicating that Twitter was aware of a service problem. The service was accessible, but you couldn’t tweet or read other tweets. Twitter had decided to degrade the user experience, but not fail completely.
Another example of choosing to degrade the service rather than to fail entirely can be found in the design of the NASA Space Shuttle. The Shuttle had four general-purpose synchronized computers that were used to help fly the orbiter. If one of these failed, the shuttle could still maintain operations easily, because there remained the ability to verify results between the three remaining synchronized computers. Actually, a fifth computer was on board that used a completely different set of software and which could only help on ascent and descent. That fifth computer served as a backup for a much-reduced, but vital, functionality. Instead of the very expensive approach that the Apollo program took, in which the quality level had to be extremely high for all of the parts and systems, the shuttle was designed with less expensive parts and used redundancy to help ensure reliability. Sound familiar? Using redundancy is how most Cloud Computing providers aim to give you the ‘High Availability’ for their own services.
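Back in the world of web applications, the tester’s unplugged database cable suggests what ‘degrade, but don’t fail’ might look like in code. This is only a sketch under assumptions: the `Response` type, `handle_request` and the wording of the maintenance page are all hypothetical, and a real site would hook this into its web framework’s error handling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: str


# Hypothetical static page shown while a dependency is down.
MAINTENANCE_PAGE = "We are experiencing difficulties and expect to be back soon."


def handle_request(load_page: Callable[[], str]) -> Response:
    """Serve the real page, or a friendly degraded page if a dependency fails."""
    try:
        return Response(200, load_page())
    except (ConnectionError, TimeoutError) as exc:
        # Record the detail for the operators, but never show it to the user.
        print(f"dependency failure, serving degraded page: {exc!r}")
        return Response(503, MAINTENANCE_PAGE)


def broken_database_page() -> str:
    # Stand-in for a page whose database has just become unreachable.
    raise ConnectionError("database unreachable")
```

Calling `handle_request(broken_database_page)` returns the maintenance page with a 503 status rather than letting the raw error reach the browser.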
To be able to switch over to a degraded experience level for your service, you’ll need to know that failures are occurring. This requires a good understanding of the health of your application. The details of configuring diagnostics and setting up health monitoring are outside the scope of this article, but there are plenty of articles on Windows Azure Diagnostics, and many third-party providers can help monitor your application.
If you are using Microsoft Azure as your platform, a new endpoint-monitoring preview feature has recently been incorporated into the portal in order to track accessibility, or you can look at third-party options such as Pingdom.com. Beyond what you get natively from Windows Azure Diagnostics for general logging and for gathering metrics, there are also third-party services available to help track the log data and let you know what’s going on, for example Azure Watch and NewRelic.
Of course, a good health-monitoring system isn’t just about detecting failures; it also helps you to avoid them by telling you when you need to react to all sorts of events. For example, you may need to scale portions of your solution to handle spikes in your load or processing so that your system doesn’t fail due to overload. It also allows you to detect when you need to move to the degraded service level mentioned above. Make sure that, no matter which methodology you use to monitor your application, you can define points where you need to react, preferably via automated actions.
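Defining ‘points where you need to react’ can be as simple as pairing a metric with a threshold and an action. The sketch below assumes hypothetical metric names and thresholds; in practice the numbers would come from your monitoring system and the actions would call your scaling or configuration scripts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    metric: str
    threshold: float
    action: Callable[[], str]


def evaluate(metrics: dict[str, float], rules: list[Rule]) -> list[str]:
    """Run the action of every rule whose metric is at or above its threshold."""
    results = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value >= rule.threshold:
            results.append(rule.action())
    return results


# Hypothetical rules: the metric names and limits are illustrative only.
rules = [
    Rule("queue_length", 1000, lambda: "scale out workers"),
    Rule("error_rate", 0.05, lambda: "switch to degraded mode"),
]
```

With these rules, a reading of `{"queue_length": 1500, "error_rate": 0.01}` triggers only the scale-out action.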
Reduce the Impact of Failures
The whole point of this article is to put you into the mindset that failure will happen. It’s just a matter of time. Since we know that failure will occur at some point, what can we do beyond monitoring our application to reduce the impact of failures on our systems? As it turns out, there are some patterns we can implement in our designs to help mitigate failures.
Dealing with Transient Errors
If you look at a lot of code that ‘handles exceptions’, you’ll likely see a try/catch block that logs the error message and then lets the exception bubble up. With this approach, your system will certainly know that a failure occurred, because you logged it. However, it does nothing to actually deal with that exception. Minor transient errors are common in distributed applications, which can use many components to service a request, such as a database call, queue processing, or just accessing a web service to get data. Maybe there was a brief network outage, or maybe the other service was temporarily overloaded. No matter what the cause, your code can be made more resilient by adding some retry logic when dealing with a distributed component. Whenever a retry succeeds, it hides a failure in a dependent component from the rest of your system.
Tools such as the Transient Fault Handling Block make it easy to incorporate this type of retry logic. You just need to define a good retry policy for each distributed component or remote resource that you will be using. The retry policy should make sense for that resource. For example, it could be a good strategy to retry a call to your distributed cache once, before going straight to the persisted store to get the data. Define these retry policies and monitor how often they get exercised over time. This can help you detect possible trends in bottlenecks within your components and dependent services.
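As a rough illustration of such retry logic, here is a minimal sketch of a retry helper with exponential backoff. The helper, its parameter names and the choice of which exceptions count as transient are my own assumptions; it is not the API of any particular library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.1,
          transient: tuple = (ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == attempts:
                raise  # Out of retries: let the caller see the failure.
            # Wait 0.1s, 0.2s, 0.4s, ... so an overloaded service can recover.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only the exception types listed in `transient` are retried; anything else, such as a validation error, bubbles up immediately, because retrying it would not help.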
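The cache example might be sketched as follows. This assumes hypothetical `cache_get` and `store_get` callables that raise `ConnectionError` when their resource is unreachable, and it uses a simple in-memory counter as a stand-in for whatever metrics system you use to watch how often the policy is exercised.

```python
from collections import Counter
from typing import Callable

# Stand-in for a real metrics system: counts how often each path is taken.
retry_counts: Counter = Counter()


def read_through(key: str, cache_get: Callable[[str], str],
                 store_get: Callable[[str], str], cache_retries: int = 1) -> str:
    """Try the cache, retry it cache_retries times, then fall back to the store."""
    for attempt in range(cache_retries + 1):
        try:
            return cache_get(key)
        except ConnectionError:
            if attempt < cache_retries:
                retry_counts["cache_retry"] += 1
    # The cache is still failing after its retries: go to the persisted store.
    retry_counts["cache_fallback"] += 1
    return store_get(key)
```

A steadily climbing `cache_fallback` count would be one of the trends worth investigating.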
Introduce Request Buffering
Synchronous, long-running operations are the bane of scalability. If your solution is designed to receive a call, process it synchronously and only then return the result, then the ability of that solution to scale will rely heavily on how quickly the requests can be resolved. If the components that are actually servicing the requests get overwhelmed and become a bottleneck, your system will perform poorly at best, and at worst will simply be unusable.
The difference between synchronous processing and request-buffering is a lot like the difference between going to a store and shopping online for the same product. If you go to an IKEA store and find a table that you want, you will then have to stand in line waiting to check out. Depending on how busy the store is at that time, this can lead to a lot of bottlenecks around the cash registers. Some people may even just give up waiting and go home. The efficiency of processing purchases at the store relies on how fast they can check each customer out. Compare this with shopping at IKEA online, where you can order that same table, and the processing of your transaction is done behind the scenes whilst the goods are delivered to you later. They can certainly serve many, many more customers online than they can at one of their stores.
By introducing request-buffering, you can reduce the reliance on the speed at which the system can process requests. Request-buffering is often implemented by decoupling the submission of the request from the processing of that request. A very common example is the submission of an order being decoupled from the processing of that order via a message queue. For instance, a user submits an order on the website and the order is placed into an order queue. The request made against the website is fast because the delivery of the message to the queue takes very little time compared to the time taken to process the order. All of the orders that are submitted are then buffered in the queue until a component on the back-end can process them. While the notion of using a queue for order processing is pretty obvious, there can be other places in your solution in which request-buffering might be less obvious: for example, using a queue to deal with posting simple comments, or using asynchronous page-processing in your website.
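The order example can be sketched in a few lines. This uses Python’s in-process `queue.Queue` and a thread purely as a stand-in for a real message queue service and a back-end worker role; the function names and the order format are hypothetical.

```python
import queue
import threading

order_queue: queue.Queue = queue.Queue()
processed: list[str] = []


def submit_order(order: dict) -> str:
    """Front end: enqueue and return at once, without waiting for processing."""
    order_queue.put(order)
    return f"order {order['id']} accepted"


def worker() -> None:
    """Back end: drain the queue at its own pace until a None sentinel arrives."""
    while True:
        order = order_queue.get()
        if order is None:
            break
        processed.append(order["id"])  # Stand-in for slow order processing.


def run_demo() -> list[str]:
    thread = threading.Thread(target=worker)
    thread.start()
    for order_id in ("A1", "A2", "A3"):
        submit_order({"id": order_id})
    order_queue.put(None)  # Tell the worker to stop once the queue is drained.
    thread.join()
    return processed
```

The point of the design is that `submit_order` takes the same small amount of time whether the worker is idle, busy or temporarily stopped; only the delay before the order is processed changes.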
There is an entire design approach called ‘Command Query Responsibility Segregation’ (CQRS) which helps to segregate your application into commands (“Save this”) and queries (“get me that”). The folks at Microsoft Patterns and Practices worked with several development community leaders that use CQRS in their own solutions to help provide a wealth of information on the pattern.
Using request-buffering can lessen the impact of failures in the back-end processing because the consumers are shielded from those failures. The system can then retry any failed processing commands without directly affecting the users.
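One way to picture that retrying is a worker that puts failed commands back on the queue and sets them aside after a few attempts. This is only a sketch: the `(command, attempts)` tuple format, the attempt limit and the ‘dead letter’ list are assumptions of mine, although many queueing systems offer a similar facility.

```python
from collections import deque
from typing import Callable


def drain(commands: deque, handler: Callable[[str], None],
          max_attempts: int = 3) -> list[str]:
    """Process queued (command, attempts) pairs; re-queue failures, and
    set a command aside once it has failed max_attempts times."""
    dead_letters = []
    while commands:
        command, attempts = commands.popleft()
        try:
            handler(command)
        except Exception:
            if attempts + 1 < max_attempts:
                commands.append((command, attempts + 1))  # Try again later.
            else:
                dead_letters.append(command)  # Needs human attention.
    return dead_letters
```

The user who submitted the command never sees the intermediate failures; they only matter if the command ends up in the dead-letter list.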
Introduce Capacity Buffering
Although it is great to decouple processing when the request is submitting an order, what happens if you need to serve data back to the consumer? Obviously the data has to come from somewhere and it is possible that the bottleneck suddenly moves to the process of reading data rather than writing it. To help with this we can use capacity-buffering options such as caching.
Capacity-buffering is a lot like the way that water utilities supply water to housing. There is often some sort of buffer in place such as a water tower. Water is pumped into the tower and stored until someone requests it by turning on their tap. The water is then delivered via the water lines from the tower rather than directly from the well. If the system went straight to the well then, if everyone turned on their taps at the same time, the system wouldn’t have the capacity to provide water to everyone. Since the water is coming out of the water tower, which has a much higher capacity to feed consumers, even small downtimes of the pumps into the tower will not affect the performance of the overall system.
Once you have identified the data in your solutions that can benefit from caching, and have introduced a caching layer, make sure that you also add some metrics, where you can, to check that you are caching the right things. Many caching layers have options to page out cached data, or expire it once it becomes stale. If you notice that a lot of your data is falling out of the cache in this manner before multiple requests have read it, then caching that particular type of data could be adding overhead rather than providing a buffer.
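As an illustration of the kind of metric I mean, here is a minimal sketch of a time-limited cache that counts hits, misses and entries that expired without ever being read. The class is hypothetical and detects expiry only when a key is next requested; real caching layers expose their own counters.

```python
import time
from typing import Callable, Optional


class MeteredCache:
    """TTL cache that counts hits, misses, and entries that expired unread."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> [value, expires_at, read_count]
        self.hits = self.misses = self.expired_unread = 0

    def put(self, key, value) -> None:
        self.entries[key] = [value, self.clock() + self.ttl, 0]

    def get(self, key) -> Optional[object]:
        entry = self.entries.get(key)
        if entry is not None and self.clock() >= entry[1]:
            # Stale: drop it, and note whether anyone ever read it.
            if entry[2] == 0:
                self.expired_unread += 1
            del self.entries[key]
            entry = None
        if entry is None:
            self.misses += 1
            return None
        entry[2] += 1
        self.hits += 1
        return entry[0]
```

A high `expired_unread` count relative to `hits` suggests that the data in question is costing more to cache than it saves.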
You can also help to reduce the impact of failures by using capacity-buffering. If the persisted store, or a data source that you are dependent upon, is having failures, then your solution can still choose to serve results from the cache until the resource comes back online.
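A minimal sketch of that fallback follows. To keep the failure path visible it leaves out expiry and always tries the store first, so it is a ‘last known good’ wrapper rather than a full cache; the class name and the `(value, is_stale)` return shape are my own assumptions.

```python
from typing import Callable


class LastKnownGood:
    """Remember the last value read for each key, and serve it if the store fails."""

    def __init__(self, load: Callable[[str], str]):
        self.load = load
        self.last_known = {}

    def get(self, key: str) -> tuple[str, bool]:
        """Return (value, is_stale)."""
        try:
            value = self.load(key)
        except ConnectionError:
            if key in self.last_known:
                return self.last_known[key], True  # Degraded, but still serving.
            raise  # Nothing remembered for this key: the failure must surface.
        self.last_known[key] = value
        return value, False
```

Returning the `is_stale` flag lets the caller decide whether to tell the user that the data may be out of date.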
Introduce Dynamic Resource Addressing
Addresses are important when distributing components. When services and resources are moved around, make sure you have provided a mechanism to easily change their addresses for consumers or users of those resources. For example, make sure that your storage account URI or connection string is easily changed so that, if you need to switch over to your own secondary account, you can do so quickly. The same is true of queue paths and service URIs. This can even extend up to your DNS entries so that, in response to a complete system failure, you can fail over to another platform or data center.
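In code, this mostly means never hard-wiring an address where it cannot be overridden. The sketch below looks addresses up from the environment, with a flag to switch to a secondary; the setting names and the `example.invalid` URIs are placeholders of my own, not real endpoints.

```python
import os

# Placeholder defaults; any value set in the environment takes precedence.
DEFAULTS = {
    "ORDERS_QUEUE_URI": "https://primary.example.invalid/queues/orders",
    "ORDERS_QUEUE_URI_SECONDARY": "https://secondary.example.invalid/queues/orders",
}


def resolve(name: str, use_secondary: bool = False, env=os.environ) -> str:
    """Return the address for a resource, preferring the environment's value."""
    key = f"{name}_SECONDARY" if use_secondary else name
    return env.get(key, DEFAULTS[key])
```

Because every consumer asks `resolve` for the address, switching to the secondary becomes a configuration change rather than a redeployment.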
Once failures are detected, you need to react to them. If you choose to get notifications of errors and decide to use manual intervention as your response, you are going to see a much higher recovery time than if the responses are automated via scripts or code. Any time that a human becomes involved, you are increasing your recovery time.
You can often reduce the recovery time from failures by creating scripts to automate various tasks for your solution. For example, if you have a script that will clear out your caches and prime them from persisted data, this can be used to reliably and consistently recover from unexpected downtime. In some cases, these types of scripts can be leveraged along with health-monitoring systems to automatically react to failures, again reducing the recovery time. At the very least, a script that can put your solution into a reduced-functionality mode buys you time to decide how to react to larger failures.
Although automation can help you mitigate some issues, automated responses can get away from you because of bugs in their logic, or simply because of unexpected behavior in the system that repeatedly executes them. Because of this, make sure that a human is alerted at some point to review the problems and the automated responses. If you have a script that will automatically scale your system up in response to an increase in load, it is important that, at some point, a threshold is met where a human makes the decision to continue to scale. Involving a human may increase your response time, but it can also save you from a cascading failure and increased cost.
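The scaling example might look something like this. It is a sketch under assumptions: the target load, the automatic limit of ten instances and the one-instance step are all hypothetical numbers that you would tune for your own system.

```python
def next_instance_count(current: int, load_per_instance: float,
                        target_load: float = 0.7, auto_limit: int = 10,
                        human_approved: bool = False) -> tuple[int, bool]:
    """Return (new_count, needs_human). Scale freely only up to auto_limit."""
    if load_per_instance <= target_load:
        return current, False
    desired = current + 1
    if desired > auto_limit and not human_approved:
        return current, True  # Hold at the limit and alert someone.
    return desired, False
```

Below the limit the script acts on its own; at the limit it stops and asks, which is what keeps a runaway feedback loop from scaling, and billing, without bound.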
Plan for Failure
As part of the process of designing your system, you need to identify possible points of failure. For each point of failure, assess the level of risk, and determine if, and how, you will deal with it. Think about NASA’s recent landing of the Curiosity rover on Mars. That landing process was extremely complex, with any number of things that could go wrong. Each one was reviewed and analyzed. A conscious decision was then made to either deal with the problem, or accept the risk. In some cases the full cost of the mission, $2.5 billion, was risked because the cost to provide a backup was either deemed too high, or simply impossible to manage.
Try as you might, you won’t be successful in rooting out all the potential points of failure in your solution. Some surprise will always come up. For example, even though redundancy does increase your resilience and reliability, it does not completely save you. If there is a bug in the software, or a fault in the hardware used, it is possible that all instances will have the same failure at the same time. Look back at the Space Shuttle example that I mentioned earlier. Remember that it had four computers all running the same software. During a simulation in which the astronauts were practicing a trans-Atlantic abort sequence, all four computers crashed. Since the computers were all synchronized, they all hit the same bug in the code, an uninitialized counter, at the same time and simply crashed. Had this been a real flight, the outcome could have been disastrous. Fortunately, they had the fifth backup machine that would have been able to help them land! In complex systems, even redundancy won’t save you from edge cases or low-quality code. Most of the time, cloud systems are not running in such a synchronized manner, and so such a catastrophic failure is not likely. Just be aware that edge cases, and parts of the system that are not normally executed, need to be tested as much as, or more than, the parts that are more frequently used.
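The ‘deal with it or accept the risk’ decision can be made a little more concrete with a simple failure register. The sketch below uses one common rule of thumb, comparing expected yearly loss against the yearly cost of mitigation; the field names and the rule itself are my own simplification, and real assessments weigh far more than two numbers.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    yearly_probability: float  # Estimated chance of occurring in a year (0..1).
    impact_cost: float         # Estimated loss if it happens.
    mitigation_cost: float     # Estimated yearly cost to guard against it.

    @property
    def expected_loss(self) -> float:
        return self.yearly_probability * self.impact_cost


def decide(point: FailurePoint) -> str:
    """Mitigate only when the expected loss exceeds the cost of mitigating."""
    return "mitigate" if point.expected_loss > point.mitigation_cost else "accept"
```

Whatever rule you use, the value lies in writing each decision down so that an accepted risk is a conscious choice rather than an oversight.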
Learn from Mistakes, Yours and Others’
It is said that we don’t live long enough to learn only from our own mistakes, and that’s definitely true. You need to learn from the mistakes of others too. When truly unexpected things happen, it is worth taking the time to really dig deep into what went wrong and then determine a course of action to help mitigate the issue in the future. You’ll notice that, after significant outages in both Windows Azure and Amazon EC2/S3, the vendors publish a root cause analysis. Read these and become familiar with them. Ask yourself whether this is something that can happen to your code. If your solution is based on some of those services (whether you were affected or not), what should you do if that happens to you? If you have users or customers, it’s probably best to be quick and very forthright about what happened. Share your root cause analysis with them, especially how you plan to mitigate the issue in the future.
In February of 2012 Windows Azure suffered a severe outage in many of its services, including the management API. The issue boiled down to a simple code error in the calculation of dates for a certificate. Someone did something bad in code. It might be fun to laugh and say, “wow, can’t they get date math right?”, but then again, did you do a sweep of your own code looking for the same possible problem? I think you’d be really surprised what little gems haunt even the code of “senior” developers. When you see someone has made a mistake in code, in a design, etc., learn from it and make sure you won’t suffer the same fate.
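To make ‘date math’ concrete, here is one classic form of the mistake: adding one to the year of a leap day, which is the kind of problem Microsoft’s own analysis of that outage described. The function names below are mine, and falling back to 28 February is only one possible policy; the right answer depends on what the date is used for.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Buggy: assumes the same month and day exist in the following year."""
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Fall back to 28 February when the start date is a leap day."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only 29 February can fail here, because the next year has no such day.
        return start.replace(year=start.year + 1, day=28)
```

The naive version works on every day of the year except 29 February, when it raises a `ValueError`. That is exactly the sort of path that rarely gets exercised in testing.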
There are many resources on the internet on the topic of building resilient and highly available solutions. I suggest searching out many of these and learning as much as you can. If you are building Windows Azure solutions, a good place to start is the article “Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services”. Also, find as many articles and presentations as you can on how other companies have architected their solutions.
The best advice I can give you right now for dealing with failure is to actually deal with it. Have plans in place. Do failure assessments on your designs to find all the holes and possible points of failure. Then, just like NASA, assess each issue and decide how you would fix it (if you even can) and how much effort that would take. Then have a risk vs. cost discussion in order to decide whether or not you will address it in your solution. Finally, document the recovery plans for dealing with the failure.
This article was inspired by a half-day workshop that Brent Stineman and I created for the 2013 CodeMash conference. The content of the workshop focused on the theme of scalability and resilience of cloud applications written on the Windows Azure platform.