I’ve written introductions to both the Azure Table Storage and Azure Blob Storage services, in which I briefly mentioned the options for data redundancy that the service offers. I didn’t, however, elaborate on their significance, or cover the full range of redundancy options available. I’ve had many conversations over the years about these redundancy options, and I’ve been surprised by just how many people equate them with a backup.
Let me be perfectly clear here: The redundancy options provided for an Azure Storage account currently are for Durability and High Availability. They are not designed to provide, by themselves, either disaster recovery or a way to restore data.
When you create a storage account, you need to decide how your data will be stored. Depending on where and how you create the account, the options go by different names. If you create the account via the current Azure portal (http://manage.windowsazure.com) you’ll be prompted to select a “Replication mode”. If you create it from the newer Azure Preview portal (http://portal.azure.com) the options are called “Pricing Tiers”. If you create the account directly using the REST-based management API, it is referred to as an “Account Type”. The terminology of cloud storage has shifted to accommodate changing ideas about replication, pricing and tiered performance, but these terms all boil down to choices in the way that data is replicated. All the options in Azure storage store your data in triplicate, but vary in where those copies are located and how many additional copies there are.
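As a quick reference, here’s a minimal Python sketch mapping the “Account Type” strings used by the REST management API to the redundancy option each one selects. The account-type strings match the management API at the time of writing; the copy counts reflect the documented behaviour (three replicas locally, plus three more in the paired region for the geo-redundant options), which the sections below describe in detail.

```python
# Sketch: the four redundancy options, keyed by the REST management
# API "Account Type" string. The portals use different labels
# ("Replication mode" / "Pricing Tier") for the same choices.
ACCOUNT_TYPES = {
    "Standard_LRS":   {"name": "Locally Redundant Storage (LRS)",           "copies": 3, "geo_replicated": False},
    "Standard_ZRS":   {"name": "Zone Redundant Storage (ZRS)",              "copies": 3, "geo_replicated": False},
    "Standard_GRS":   {"name": "Geo-Redundant Storage (GRS)",               "copies": 6, "geo_replicated": True},
    "Standard_RAGRS": {"name": "Read Access Geo-Redundant Storage (RA-GRS)", "copies": 6, "geo_replicated": True},
}

def total_copies(account_type: str) -> int:
    """Return how many replicas of the data the platform maintains."""
    return ACCOUNT_TYPES[account_type]["copies"]
```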
Before we start talking about the differences between these replication options, let’s first review Azure Regions. You can see a list of Azure Regions on the main Azure website with an overview of where they are. When you create an Azure storage account, you indicate where you wish the account to be based. This is the primary location of the data but, depending on the replication option that you select, the data may also exist in other regions. Any access to the data at the primary location is guaranteed to be consistent, because writes do not return until consistency can be ensured. The maintenance of three replicas of the data is a big part of what makes the data highly available.
- Locally Redundant Storage (LRS) is the base option: all three replicas of your data exist within the same facility, within the region you selected as the account location. The replicas are kept in sync, which guarantees consistency. If a replica is found to be corrupt, or the hardware hosting it has problems, the platform will self-heal so that there are, once again, three copies.
- Zone Redundant Storage (ZRS) provides a more resilient way of locating data than LRS by spreading replicas across multiple facilities within the same region, and possibly even across regions. At present, ZRS is limited to supporting only the storage of Block Blobs. You can’t host your Virtual Hard Drives (page blobs), Table storage, Files or Queues out of a ZRS account. To learn more about the differences in Page and Block blobs you can read my Introduction to Blob Storage article.
- Geo-Redundant Storage (GRS) improves the durability of the account by also replicating the data asynchronously to another, geographically-distant region. So, not only is your data stored in triplicate in the primary location, but also in triplicate at the secondary region. GRS is not limited to block blobs as ZRS is, but instead covers all storage features. You can’t, however, control the whereabouts of the secondary location of your data. Microsoft has decided for you by pairing distant regions that are usually within the same geo-political area: your primary location determines the secondary one. Only the Brazil region pairs with a secondary outside its own country (it is paired with South Central US).
- The Read Access Geo Redundant Storage (RA-GRS) option provides the same redundancy as GRS, but also provides a read-only endpoint at the secondary location so that you can inspect, but not alter, the data. Both GRS and RA-GRS perform the replication asynchronously from the primary to the secondary, so there can be a delay before the data arrives at the secondary location. There is currently no SLA on how fast this replication occurs but, in practice, the data tends to be available at the secondary within a few minutes.
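For RA-GRS accounts, that read-only endpoint follows a predictable naming pattern: the account name gains a “-secondary” suffix. A minimal sketch (blob endpoints shown; the table and queue endpoints follow the same pattern):

```python
def primary_blob_endpoint(account: str) -> str:
    """The normal read/write endpoint for a storage account."""
    return f"https://{account}.blob.core.windows.net"

def secondary_blob_endpoint(account: str) -> str:
    """The read-only secondary endpoint exposed by RA-GRS accounts:
    the account name with a '-secondary' suffix."""
    return f"https://{account}-secondary.blob.core.windows.net"
```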
These options are described in detail in the documentation on the Azure website, in an article called “Azure Storage Redundancy Options”. The documentation also includes a table of the primary and secondary pairs.
What do you get ‘out of the box’?
These options for redundancy provide Durability and High Availability for your data. With them, you are reasonably assured that your data is protected from hardware failures and from updates to the platform. If hardware fails within a datacenter, the platform self-heals, ensuring that requests for the data continue to be answered.
Don’t get me wrong. Durability and High Availability are extremely important and a big part of why it is a good idea to use cloud storage. Sometimes this is sufficient insurance for your data, but in many cases, I don’t think the redundancy options are enough.
So what isn’t covered?
With all the replicas, it feels like the data is pretty safe. After all, that could be at least one more copy of the data than many people have of their production SQL Server databases. Best of all, it’s dead simple to enable: all you need to do is set the type of storage account, and you have the flexibility to “grow” from LRS to GRS or RA-GRS as needed. I’ve been surprised by how many people are completely happy with the options ‘out of the box’ for their data in Azure storage. In some cases they are right, and the high availability these options bring is sufficient for their requirements. The problem is that all these replicas are just that: an exact copy of the data.
In the same way that replication in SQL Server isn’t a backup, these redundancy options in Azure Storage are also not a form of backup. Imagine what would happen if some bit of buggy code deleted or modified a bunch of data in one of your Azure Tables. Since the replicas are copies, the repercussion of that bug is that you’ve now just lost that data.
Without a backup, there are several types of disaster against which you are unprotected.
- File or data corruption – If the file or data itself becomes corrupted, due to a bug in your code for example, then that data is lost as the corruption is copied to all the replicas.
- Accidental Deletion of files, data, or the entire storage account.
- Losing control of the account to someone else – if someone gets rights to your subscription or account they can do whatever they’d like to it, including deleting it. Code Spaces was ruined because someone got control over the administrative functions of their cloud resources.
- A storage outage within a region, or a disaster affecting an entire region
That last one might raise a few eyebrows. Some might argue that the redundancy options provide some disaster-recovery capability, particularly the GRS and RA-GRS options. I can certainly see their viewpoint, but I disagree unless you supplement the options with some additional work of your own: they do not, alone, help with disaster recovery. Microsoft does have the ability to fail over from the primary location to the secondary for GRS and RA-GRS accounts; however, that ability rests with Microsoft. You can’t call up and ask them to perform a failover on your account; it is up to them to decide whether that failover is needed. If the Storage Service is suffering an outage in a region then, even if you have data replicated via RA-GRS, you can’t do anything more than have your own code shift to reading the data from the RA-GRS secondary.
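Shifting your own reads over to the secondary can be sketched as a simple fallback. The fetch functions are injected here so the sketch stays self-contained; in real code they would be requests against the primary and “-secondary” endpoints. Note that the caller is told which endpoint answered, since data read from the secondary may be stale:

```python
def read_with_fallback(read_primary, read_secondary):
    """Try the primary endpoint first; if it is unavailable, fall
    back to the read-only RA-GRS secondary. Returns the data plus
    which endpoint served it, so the caller knows the result may
    be stale when it came from the secondary."""
    try:
        return read_primary(), "primary"
    except ConnectionError:
        return read_secondary(), "secondary"
```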
This sounds rather dire but, if you think about it, deciding when to switch over would not be an easy call even if you could flip the switch yourself. The data is replicated asynchronously, so flipping the switch is likely to mean losing data that was in flight, or on the primary but not yet at the secondary. How long are you willing to wait before flipping the switch, when the issue might be resolved at any moment? And afterwards, if the primary is brought back online, switching back would have repercussions of its own. Since many applications reside in the same location as the storage they use, failing over is also likely to add latency and bandwidth charges to your solution, unless you deploy the solution to the secondary region as well. Now imagine making that call for a whole set of customers at the same time, as Microsoft would have to. It’s currently just not as cut and dried as we would like it to be.
What are your options?
Now that the hellfire and brimstone part of the article is out of the way we can talk about paths to redemption. There are several options you can take to help ensure that your data is not only backed up, but also put yourself in a position to be able to recover from many horrible, doomsday scenarios. These options aren’t rocket science or magical in any way but they all require some additional work on your part. Bear in mind that you may need to recover the data services to a different region than the original location of the account.
In some circumstances you may find that the data itself can be recreated very easily. If this is the case then, technically, you probably already have a backup; it’s just in a different form than a storage account. For example, if a company used blob storage to store users’ data as JSON documents for fast lookup, they could rebuild that data, if necessary, from a relational database which was backed up. They would need a script or process to pull the data, create the JSON documents and upload them all again when needed.
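A rebuild script along those lines might look like the following sketch, using an in-memory SQLite database to stand in for the backed-up relational store. The table and column names, and the blob naming scheme, are hypothetical:

```python
import json
import sqlite3

def rebuild_user_documents(conn):
    """Recreate the JSON documents that would normally live in blob
    storage from the relational system of record. Returns a mapping
    of blob name -> JSON payload, ready to re-upload. The 'users'
    table and its columns are hypothetical examples."""
    docs = {}
    for user_id, name, email in conn.execute(
            "SELECT id, name, email FROM users"):
        docs[f"users/{user_id}.json"] = json.dumps(
            {"id": user_id, "name": name, "email": email})
    return docs
```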
Likewise, you may have opted for a true backup of the data. This may be a backup in a location that has insufficient bandwidth to handle having live traffic directed at it. For example, files may be stored on-premises in a file share, or in Amazon Glacier storage, where accessing them can be costly. You can copy the data from the live site to your backup location, and then provide the facility to restore those blobs when needed.
Think about the data you are placing in Azure storage and decide how easy it would be to replace, restore or recreate. What recovery time are you constrained by, and is that an acceptable timeframe? If it would take you two days to replace all of your storage data, relying on that type of recovery might not be feasible.
Full/Incremental Backup Process
You could use scripts to perform a regular backup of your account to another location. Command-line tools like AzCopy, from the Storage team at Microsoft, make this possible for blob, table and some file data (exporting table and Azure Files data is currently in preview). Other options include paying for a backup service, like the Redgate Cloud Services offering, or writing your own tool to move the data.
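If you write your own tool, the core of an incremental backup is just a comparison of last-modified times between the live account and the backup location. Here’s a minimal sketch with plain dictionaries standing in for the two storage listings:

```python
def incremental_backup(source, backup):
    """Copy only blobs that are new, or changed since the last run.
    `source` and `backup` are dicts of blob name -> (last_modified,
    data), standing in for real storage listings. Returns the names
    of the blobs that were copied this run."""
    copied = []
    for name, (modified, data) in source.items():
        if name not in backup or backup[name][0] < modified:
            backup[name] = (modified, data)
            copied.append(name)
    return copied
```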
You can also build into your solution a system whereby, when data is added to your storage account, it is also sent to another location for storage there. In solutions that rely on messaging to process data, for example a system designed around the CQRS pattern, it takes just one additional step in the processing to store a copy in a completely separate account, in a different location, or with another cloud provider.
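In a messaging-based solution, that extra step can be as small as one more write in the message handler. A sketch, with plain dictionaries standing in for the two storage clients and a hypothetical message shape:

```python
def handle_save_message(message, primary_store, backup_store):
    """Process a 'save' message, writing the payload both to the
    primary storage account and to an independent backup location.
    The message shape and the dict-based stores are hypothetical
    stand-ins for real queue messages and storage clients."""
    key, payload = message["key"], message["payload"]
    primary_store[key] = payload   # the normal write
    backup_store[key] = payload    # the one additional step
```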
If you perform these full or incremental backups as a sync or replication of the data in your storage account, then you’ll be no better off than if you had just relied on the built-in geo-replication. These backups should give you the ability to restore your services to a point in time. It is up to you to decide how long you keep these backups, and how far back in time you should be able to restore data.
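Restoring to a point in time then becomes a matter of picking the most recent backup taken at or before the target moment. A sketch, using ISO-format date strings as hypothetical backup identifiers (ISO dates compare correctly as strings):

```python
def pick_restore_point(backups, target):
    """Given a collection of backup timestamps and a target point in
    time, choose the most recent backup taken at or before the
    target. Returns None if no backup is old enough."""
    candidates = [b for b in backups if b <= target]
    return max(candidates) if candidates else None
```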
The Azure Storage service is a great tool to have in your toolbox. It is the underpinning of a good deal of the services offered by Azure. As with any data store, it is important to know what you are getting when you choose your cloud storage options, and what additional precautions you may need to take. Everyone’s circumstances and requirements are going to be different so, armed with the knowledge of what’s available, it pays to analyze your own data and service requirements and make the best decisions you can.