It is wrong to assume that moving to a managed cloud platform means never having to be concerned with upgrades to the operating system. It is important to be aware of how these updates are applied, and how they can affect the availability and maintainability of your cloud-based applications.
Windows Azure isn’t actually a single operating system (OS), but is composed of several different OSs all working together. It is important to know the relationship between these OSs and how they interact with the development tools you are using whilst you are maintaining your Windows Azure solutions. This article is not just for the DevOps and Infrastructure folks out there; developers should read this too.
Just how many OSs are there?
Windows Azure is a virtualized environment running on a customized Hyper-V platform. Instead of spending a great deal of money on specialized equipment, the data center design is based on using commodity hardware together with a self-healing platform so as to create a resilient system. As such, the physical servers that are used within the Windows Azure Data Centers vary little in their configuration. The virtual machines to which you deploy, for Cloud Services and Windows Azure Virtual Machines, run on a partition on one of these physical servers.
The main root partition of the server runs an operating system that is known as the host OS and is maintained by Microsoft. This host OS is responsible for managing the resources of the server, and for running the Windows Azure Agent that is used to communicate with the Windows Azure Fabric Controller. The Fabric Controllers monitor and control large segments of the data centers. As and when deployments are performed, the rest of the resources on the physical servers are carved up into one or more child partitions. These child partitions run their own OS, referred to as the guest OS. Given the multi-tenant nature of the virtualized platform, which serves several customers at the same time, the guest OS partitions have no direct access to each other or to the host OS; they can reach one another only over public endpoints or when they belong to the same deployment. The host OS manages the guest partitions and controls some of the interactions that the guest virtual machines have with the other systems within the data center.
For Cloud Services, the virtual machines run a slightly modified installation of the Windows Server operating system that is optimized to run within the Windows Azure environment. You can control the version of Windows Server that is deployed for the virtual machines within your Cloud Service via configuration. Specifically, this is handled by the osFamily attribute in the Service Configuration file (.cscfg). An OS family represents the version of Windows Server to be used as the guest OS. The currently supported options are:
To further scope the guest OS that is deployed, there is also an osVersion attribute. This is a little confusing because I believe that most people interpret a version to mean ‘2008’ or ‘2012’, which is closer to the general concept of an OS family. Instead, the OS version, in the context of Windows Azure, maps to a specific guest OS release. The release is made up of the base operating system together with various security patches and bug fixes. You may see a guest OS described as 3.2 rel2. This means that it is running OS family 3 (Windows Server 2012) at OS version 2 within that family. The “rel2” indicates that a bug or issue was found with that guest OS version and a patched build of the same version was released, which is known as ‘re-releasing’ the guest OS version. If this happens, your service will be updated to the new release of that guest OS version, and you can’t then elect to run on the earlier release of that version. Usually, re-releases occur because a bug is found within the release or because more security patches need to be added to mitigate critical security flaws.
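To make the naming scheme concrete, here is a minimal Python sketch that splits a label such as "3.2 rel2" into its parts. The function name, the regular expression, and the choice to report an original release as rerelease 1 are my own assumptions for illustration; this is not an official parser, and Microsoft may format labels differently elsewhere.

```python
import re
from typing import NamedTuple


class GuestOsRelease(NamedTuple):
    family: int     # OS family, e.g. 3 for Windows Server 2012
    version: int    # guest OS version within that family
    rerelease: int  # assumption: 1 means the original release, 2 means "rel2"


# Matches labels such as "2.16" or "3.2 rel2" (assumed format).
_LABEL = re.compile(r"^\s*(\d+)\.(\d+)(?:\s+rel(\d+))?\s*$", re.IGNORECASE)


def parse_guest_os_label(label: str) -> GuestOsRelease:
    """Split a guest OS label into family, version, and rerelease number."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised guest OS label: {label!r}")
    family, version, rerelease = match.groups()
    return GuestOsRelease(int(family), int(version), int(rerelease or 1))


print(parse_guest_os_label("3.2 rel2"))
```

Keeping the rerelease as a separate field reflects the point made above: a rerelease replaces the same guest OS version rather than creating a new one.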
When you are configuring your Cloud Service, you can simply elect to be automatically updated to new guest OS versions as they are released. To do this, you set the osVersion to an asterisk “*”. The Windows Azure platform will then update your virtual machines to the new guest OS versions for you. This is considered to be a ‘best practice’ by Microsoft because you then do not need to worry about patching your servers: the platform simply takes care of the update for you. One caution I would give is that electing to be automatically updated is somewhat similar to turning on Windows Update on the servers within your own data centers. This means that security patches and other updates could be automatically installed that are incompatible with the code running in your applications. If you normally apply security patches in your test/staging systems before rolling them out to production, in order to test for incompatibilities, then you are more likely to want to set the osVersion for your Cloud Services manually so as to match your current process. This, of course, means that you lose some of the benefits of a Platform as a Service (PaaS), since you will be responsible for keeping track of the different guest OS versions (more on that later); however, it does give you some time to test your solutions on new guest OS versions before they are applied.
The osFamily and osVersion attributes are settings that cover all the roles within your Cloud Service, so you can’t configure a web role to run 2008 R2 while having a worker role run 2012 within the same Cloud Service deployment. It is possible that differing OS families and versions might exist in the same deployment during an upgrade where you have changed the configuration to use something different; however, this will only last as long as it takes to update all of the instances within your deployment.
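As a rough illustration of where these two attributes live, the following Python sketch reads osFamily and osVersion from the root element of a Service Configuration file and reports whether the deployment has opted into automatic guest OS updates. The sample file is hypothetical: the service name, role name, and instance count are made up, and the helper function is my own rather than part of any Windows Azure tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical .cscfg content; the service and role names are placeholders.
SAMPLE_CSCFG = """
<ServiceConfiguration serviceName="ExampleService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"
    osFamily="3" osVersion="*">
  <Role name="ExampleWebRole">
    <Instances count="2" />
  </Role>
</ServiceConfiguration>
""".strip()


def guest_os_settings(cscfg_text: str) -> dict:
    """Return the guest OS settings found on the root element of a .cscfg."""
    root = ET.fromstring(cscfg_text)
    os_version = root.get("osVersion")
    return {
        "os_family": root.get("osFamily"),
        "os_version": os_version,
        # An asterisk means the platform picks up new guest OS versions itself.
        "auto_update": os_version == "*",
    }


print(guest_os_settings(SAMPLE_CSCFG))
```

Because both attributes sit on the root element rather than on a Role element, the sketch also shows why the setting applies to every role in the deployment.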
Finally, you can only upgrade automatically to the next OS version for your Cloud Services, not to the next OS family. So you can’t, say, tell the system to automatically update your Cloud Service from Windows Server 2012 to Windows Server vNext when that comes out.
By contrast, with Windows Azure Virtual Machines, which is Microsoft’s Infrastructure as a Service (IaaS) offering, the customer selects the guest OS. At the time of writing, you can select various versions of Windows Server or even some distributions of Linux as your guest OS. As with any IaaS product, you have sole responsibility for maintaining the OS. The concept of a guest OS family or version is not the same for these virtual machines. In the portal, when you select an image from the gallery, you may see a reference to the OS Family of a Virtual Machine that simply means either Windows or Linux. Also, when you elect to provision an image from the gallery, you may be prompted for the version of the image that you’d like to deploy. This is somewhat similar to the idea of the guest OS version, in that each image in the gallery may have multiple versions that were released on different dates and may contain different patches and features. For your Windows-based Windows Azure Virtual Machines, you can turn on automatic Windows Updates, control patching through other software such as System Center, or apply patches manually yourself.
How do updates get applied?
When you tell people that their Cloud Services can be automatically upgraded to include the newest patches, most of them immediately think that Windows Update is turned on for the virtual machines. No, Windows Update isn’t running on the Cloud Service virtual machines set to auto-update. What actually happens is that the Guest OS is walked through almost the same procedure that occurs when you roll out your own code using in-place upgrades. The instance being updated is taken out of the load balancer, if public traffic is routed to it, and asked to shut down. The OnStop method of the RoleEntryPoint class for Web and Worker roles is called so that you can tidy up cleanly before the instance is shut down.
After the instance is shut down, the OS base image that is used for the virtual machine is fully replaced by a new image that has the updated Guest OS on it, complete with the new patches already installed. The virtual machine is then spun back up and your code is redeployed just as if it were a new instance being brought online. By approaching the update in this manner, Microsoft removes any concerns about an instance not updating correctly with Windows Update, and can also ensure that the base of every instance of the same Guest OS is the same for Cloud Services.
Also, just as when you are performing an in-place upgrade, the instances are taken down in a manner that respects the upgrade domain structure that was put in place to ensure the SLAs are met. If you aren’t familiar with upgrade domains, they are the way that Windows Azure segments the instances within your defined roles so that an upgrade has less impact on the overall solution. The instances within a given role are evenly spread across one or more upgrade domains. By default there are five upgrade domains, and you can have as many as 20. You can set the upgradeDomainCount attribute in the Service Definition file to control this. For example, if you have sixteen instances of a given web role deployed with the default of five upgrade domains, the instances would be divided across those five domains to yield four upgrade domains containing three instances each and one upgrade domain containing four instances.
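The arithmetic in that example can be sketched in a few lines of Python. I am assuming a simple round-robin assignment of instances to upgrade domains, which reproduces the even spread described above; the platform's actual placement logic is not documented here, and the function name is my own.

```python
from typing import List


def spread_instances(instance_count: int, upgrade_domain_count: int = 5) -> List[int]:
    """Return the number of instances that land in each upgrade domain.

    Assumes round-robin placement: instance i goes to domain i % domain count.
    """
    if not 1 <= upgrade_domain_count <= 20:
        raise ValueError("upgrade domain count must be between 1 and 20")
    if instance_count < 0:
        raise ValueError("instance count cannot be negative")
    sizes = [0] * upgrade_domain_count
    for instance in range(instance_count):
        sizes[instance % upgrade_domain_count] += 1
    return sizes


print(spread_instances(16))  # [4, 3, 3, 3, 3]
```

With sixteen instances and five domains, one domain holds four instances and the other four hold three each, which matches the example in the text.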
When a Guest OS update is applied to the service above, all of the instances in upgrade domain 0 are taken down and updated, then brought back online. After this, upgrade domain 1 would be updated, and so on until all instances within all the upgrade domains are updated. If you have more than one role type in your Cloud Service, they are distributed evenly across the same upgrade domains. This means that you might have both web and worker role instances within upgrade domain 0. Since the upgrade domains are walked for a Guest OS update, as long as you have at least two instances of a given role running, your solution will not be completely offline during the upgrade.
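To see why two instances per role matter, here is a small Python simulation of the walk. For each upgrade domain in turn, it reports how many instances of each role remain online while that domain is down. It assumes every role is assigned to domains round-robin starting at domain 0; that assumption, the role names, and the function name are mine and are only meant to illustrate the idea.

```python
from typing import Dict, List


def walk_upgrade_domains(
    role_counts: Dict[str, int], upgrade_domain_count: int = 5
) -> List[Dict[str, int]]:
    """Return, per upgrade domain, the instances of each role still online."""
    steps = []
    for domain in range(upgrade_domain_count):
        online = {}
        for role, count in role_counts.items():
            # Instances domain, domain + n, domain + 2n, ... live in this domain.
            in_domain = len(range(domain, count, upgrade_domain_count))
            online[role] = count - in_domain
        steps.append(online)
    return steps


for number, online in enumerate(walk_upgrade_domains({"web": 2, "worker": 1})):
    print(f"upgrade domain {number} down -> online: {online}")
```

In this run the web role never drops below one instance, whereas the single worker instance is completely offline while upgrade domain 0 is being updated.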
As the platform walks the upgrade domains, it generally waits until the role instances reach a healthy state before moving on to the next upgrade domain. It is important to note that, in order to make sure that the system doesn’t become unstable or stuck on a given upgrade domain, it waits up to 15 minutes from when the instances start their Startup tasks. Reaching this point shows that the OS update itself and the deployment of the code were both successful, because code provided by the customer (you) has started to run. Keep in mind how long your Startup tasks and role OnStart code take. If, combined, they take more than 15 minutes, then you could see an even greater reduction of resources during upgrades, because some machines are still initializing with your tasks while others are being taken down for upgrade.
These upgrades occur regularly because Guest OSes are generally published at least once a quarter, and sometimes more frequently if critical issues are found that need to be patched. You can’t really control when an upgrade will occur on your application if you have the automatic OS version updates configured. They will occur as the data center rolls them out and gets to the instances that are running your code. You might think that you get more control over when updates occur by locking into a specific guest OS version, and you are right to a degree. Even if you have explicitly set your guest OS version your instances can still be taken offline for a rerelease update of the guest OS version you are on, or for updating the host OS.
For instances of Windows Azure Virtual Machines, you are in charge of making sure they get the updates and patches, since you control the guest OS completely; however, remember that the guest OS is a child partition on a machine whose root partition runs the host OS, which also gets updated periodically. When the host OS is updated, all virtual instances running under that host, Cloud Service and Virtual Machine alike, are taken offline during the update. The base image for the host OS is replaced, just as it is when a Cloud Service guest OS is updated, so that the base image is clean.
If you have only one instance of a given type of server, it can be taken offline for host OS updates, or for hardware failures for that matter. This is something to keep in mind when you run Windows Azure Virtual Machines, especially for special services such as DNS servers, databases and AD controllers. You need to think about redundancy for these specialized servers in much the same way that you plan out your own data center. That level of planning is beyond the scope of this article, but Windows Azure Virtual Machines have Availability Sets for much the same reason that Cloud Services have upgrade domains. I recommend reading up on Availability Sets in the online documentation.
How does this affect developers?
If you are a developer, you may be wondering how the Guest OS is relevant to you. The goal of Platform as a Service (PaaS) is simply to provide you with a managed location to run your code, right? Well, changes to the platform can definitely affect your solutions. The Guest OS for Cloud Services represents a large part of the platform because it is the container for your code. The Windows Azure SDK that you use is supported only for specific versions of the Guest OS, so it falls to developers to ensure that they use the right SDK for the Guest OS that is the target of the deployment.
If you opt for allowing the Guest OS to be automatically updated, be aware that you may find that the next Guest OS release might not support the SDK that your code was compiled against, though this is unlikely. For example, if you were set to auto-update your guest OS and you are running on OS Family 2 at the time of this article, the newest Guest OS is 2.16 which shipped on July 16th, 2013 and supports the SDK 1.3 and higher. Just to give some perspective, the 1.3 SDK shipped in November of 2010. If your solution is one that will not see a lot of new features or bug fixes then you may want to lock your OS Guest version so that you don’t “age out” of the supported SDK, but in general, solutions that see even annual updates in the code base should easily be able to keep up with the Guest OS.
Although the example of the 2.16 Guest OS that I’ve mentioned is documented to support the 1.3 SDK and higher, it was only specifically tested against 2.0 and 1.8 versions of the SDK. Each new guest OS is tested against the current and previous SDK at the time the Guest OS is released. Support for previous SDKs is detailed in the online documentation.
How is retirement of OSes handled?
Eventually, Guest OS versions, and even OS families, will retire. As technology changes, and new versions of the Windows Server operating system emerge, the older systems will be deprecated. Microsoft has defined a retirement policy, and it is a good idea to be aware of it. For the most part it can be summarized as this:
- The two latest OS families are supported. When a new OS family is released, customers have a year to update their solutions to one of the two supported OS families. For example, OS Family 1, the one based on Windows Server 2008 SP2 with .NET 3.5 and 4.0 (but not 4.5), is already being retired. The retirement started on June 1st 2013 and will be complete on June 1st 2014.
- The two latest guest OS versions within an OS family are supported. When a new guest OS version is released, customers have 60 days to update to one of the supported versions. If you have elected to set your guest OS version explicitly, you will need to test your solution against the newer versions and perform your upgrade before these 60 days are up. At the end of the 60 days, the deployments will be updated to a supported guest OS version.
- As mentioned earlier, the two latest SDK versions will be supported. When a new version of the SDK ships customers have up to a year to upgrade their solutions to one of the supported SDKs.
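The time windows in this summary are easy to turn into calendar dates. The sketch below is only date arithmetic under my reading of the summary above (60 days after a new guest OS version ships, one calendar year after an OS family retirement begins); the function names are hypothetical, and the authoritative dates are the ones Microsoft publishes.

```python
from datetime import date, timedelta

GUEST_OS_VERSION_WINDOW = timedelta(days=60)


def version_upgrade_deadline(new_version_released: date) -> date:
    """Date by which a pinned deployment should move to a supported version."""
    return new_version_released + GUEST_OS_VERSION_WINDOW


def family_retirement_complete(retirement_started: date) -> date:
    """One calendar year after the retirement of an OS family begins."""
    try:
        return retirement_started.replace(year=retirement_started.year + 1)
    except ValueError:
        # February 29th has no counterpart in the following year.
        return retirement_started.replace(
            year=retirement_started.year + 1, month=2, day=28
        )


# OS Family 1: retirement began on June 1st 2013.
print(family_retirement_complete(date(2013, 6, 1)))  # 2014-06-01
# A guest OS version that shipped on July 16th 2013.
print(version_upgrade_deadline(date(2013, 7, 16)))   # 2013-09-14
```

A reminder built on dates like these could be added to a release calendar so that testing against the new guest OS starts well before the window closes.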
Some enterprise IT organizations may be a little shocked at the pace of this support policy if they are used to keeping a Windows Server version around for many years; however, new versions of the Windows operating system have historically been fairly spread out. Also, you have a year beyond the start of an OS family’s retirement to test your solution and upgrade as necessary.
If you are controlling the guest OS versions manually, you’ll need to stay on top of things and be prepared to test and upgrade a few times a year. For the most part, this type of upgrade should mean simply testing your solution on machines running the new Guest OS to verify there are no issues, and then updating the osVersion attribute in your product configurations to perform the upgrade.
When a new SDK is released, you may need to make code changes. This is likely to require a longer turnaround cycle than testing a new guest OS version. When a new SDK version ships, you have up to a year to upgrade to one of the supported SDKs. For this reason, it is wise to have some good regression test suites available.
You can manually monitor releases by checking the online documentation of supported OS family, OS guest and SDK versions which is updated when new versions are shipped. Another way to keep up to date is to subscribe to the RSS feed of guest OS versions at http://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates. This feed will announce when changes are made to current releases and even when corrections are made to the information posted online.
You haven’t completely escaped the task of dealing with upgrades and patches by electing to move to the cloud. These upgrades are necessary to ensure both that your solutions are sitting on top of secure and reliable systems and that the platform is incorporating new technologies and innovations as they emerge. It is important to be aware of how these updates are applied, and how they can affect the availability and maintainability of your cloud-based applications.