We recently did a lot of work to improve blob transfer rates in Azure Explorer and Azure Management Studio. The majority of this work was carried out in a library shared by both tools, and it led us to improve the library’s integration testing. In the past these tests had only been run manually during development; there was no automated framework for running them against each build of the library. We wanted to fix that for a couple of reasons.

There are subtle differences in behaviour between a live Azure storage container, the storage emulator and different versions of the Azure SDK (which affect the storage emulator). Continuously testing against these configurations means that further development on our blob transfer library can be carried out without performance regressions or bugs going unnoticed. We’ll also get quicker feedback about changes to the Azure API made by Microsoft during their ongoing development of Azure.

The problem

Integration tests would normally run directly on our TeamCity build agents. However, these tests needed the storage emulator running locally, which would tie up memory and impact build times for other components (since our build machines are partitioned VMs).

We’d also be forced to manually update the configuration of these machines whenever we wanted to test against a new version of the Azure SDK and generally burden them with resources that weren’t relevant to the majority of their duties.

To avoid these problems, the general architecture would need to change to something like this:

Figure 1: Integration Test Flow

Figure 2: Integration Test Flow Details

When a new build is detected, the build server deploys a controller app to a build agent. The server also copies the product library under test, the test fixture library and some configuration information (for example: “run the integration tests against v1.3.0.119 of the blob transfer library using v2.3 of the Azure SDK”).

The controller app uploads the product library and the test fixture to blob storage then creates a message from the configuration information. The controller app puts this message on the service bus then waits for a response.

This message is picked up by a worker running on an Azure VM. The worker downloads the product library and test fixture and initiates the test run. When the test run finishes, the worker sends a message back to the build agent (for example: “test run complete, the results file is called results-v1.3.0.119-SDKv2.3.xml”).

Once the agent downloads this file, the server parses it and the results can be displayed in the TeamCity UI. Writing the apps sitting at either end of the service bus was reasonably trivial; by far the trickiest part was automating the state and behaviour of the VM.

Solution 1

The most obvious solution to fit these requirements was to use a cloud service. Initially, however, it didn’t seem like this would be possible: the VM needed the storage emulator installed, and traversing its install wizard programmatically would be fraught with difficulty.

Solution 2

An alternative was to create a VM, install everything that the tests need and snapshot the VM in this state. We’d also need to automate the VM’s start up procedure or use a Windows Service to start the queue listener and storage emulator (otherwise we’d need to do this manually every time the VM was reallocated or a new instance added).

Using Windows services and scheduled tasks was tricky to get working; there’s a lot of subtlety to the access permissions for remote processes, and there appeared to be differences in start-up behaviour between the VM’s first deployment and subsequent reboots. The feedback cycle for debugging problems was a major time sink because we had to wait for the VM to reboot or redeploy each time.

Instead of trying to coordinate several start-up tasks, we had much more success with a single scheduled task that ran a PowerShell script which handled everything. The logging was much more granular, it was easy to run and debug the script locally, and the end result was much more robust.
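As a rough sketch of what this looks like (the task name, file paths and emulator location below are illustrative, not our actual script):

# Registered once while preparing the image: run the bootstrap script at every boot.
schtasks.exe /Create /F /SC ONSTART /RU SYSTEM /TN "TestWorkerBootstrap" `
    /TR "powershell.exe -ExecutionPolicy Bypass -File C:\Bootstrap\Start-TestWorker.ps1"

# Start-TestWorker.ps1 -- one script handles the whole start-up sequence and logs
# each step, so failures are easy to diagnose from the transcript.
Start-Transcript -Path "C:\Bootstrap\bootstrap-$(Get-Date -Format yyyyMMddHHmmss).log"

Write-Output "Starting the storage emulator"
# The emulator executable and its path depend on which SDK version is installed.
& "C:\Program Files\Microsoft SDKs\Windows Azure\Emulator\csrun.exe" /devstore:start

Write-Output "Downloading and starting the test run worker"
# (the table storage lookup used for this step is sketched in the next section)

Stop-Transcript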

An aside on resources in blob storage

One of these start-up tasks was to download the worker app from blob storage. Snapshotting the VM with the worker app installed was a bad idea because we’d inevitably want to make changes to the app over time, which would mean taking a new snapshot each time.

Downloading the app directly from blob storage would involve hardcoding the blob’s URL into the PowerShell script. Changes made to the worker app would mean either changing this URL in the script (then creating a new snapshot) or deleting the old version in blob storage and replacing it with the new version. This versioning would either need to be maintained manually or new build tasks added to handle updating blob storage. This scales poorly and has a lot of room for human error.

The approach we chose was for the PowerShell script to read a pointer to the latest version of the worker app’s blob from table storage. Changing this pointer in table storage meant we could control, from outside the VM, which version of the app was downloaded. Updating the worker app would require no new snapshots and would leave a clear record of which version was in use.
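A minimal sketch of that lookup, assuming the storage client library is available on the VM and a table whose rows point at worker versions (the table name, keys and the BlobUrl property are invented for illustration):

# Load the Azure storage client library (path is an assumption).
Add-Type -Path "C:\Bootstrap\Microsoft.WindowsAzure.Storage.dll"

$connectionString = Get-Content "C:\Bootstrap\storage-connection-string.txt"
$account = [Microsoft.WindowsAzure.Storage.CloudStorageAccount]::Parse($connectionString)
$table   = $account.CreateCloudTableClient().GetTableReference("WorkerVersions")

# The 'pointer' is a well-known row whose BlobUrl property names the blob
# holding the latest worker app (partition and row keys are illustrative).
$op      = [Microsoft.WindowsAzure.Storage.Table.TableOperation]::Retrieve("worker", "latest")
$entity  = $table.Execute($op).Result
$blobUrl = $entity.Properties["BlobUrl"].StringValue

# Download whichever version the pointer currently names (unpacking omitted).
(New-Object System.Net.WebClient).DownloadFile($blobUrl, "C:\Bootstrap\TestRunWorker.zip")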

Solution 2 continued

We had a working suite of tests at this point, but there were enough issues to make it feel too inflexible and fragile to be a long term solution:

  • Updating the version of the test run worker app would mean uploading it to blob storage, updating table storage to point at the new version, then redeploying all the VM instances so they downloaded it.
  • The VM snapshots needed to be stored somewhere and maintained; adding a VM with a new version of the Azure SDK would mean creating a new VM with it installed and snapshotting the state. Old VM images would also need to be updated if new tests required configuration changes.

Coordinating the interactions between this many subsystems without explicit version history would inevitably lead to problems that would be difficult to debug.

Solution 3

Initially we’d dismissed using a cloud service for the integration tests because attempting to automate the installation of the Azure SDK seemed like opening a can of worms. However, with the discovery of Microsoft’s Web Platform Installer (WPI) command line interface, it turned out to be quite simple.

The WPI has a command line tool which, like the GUI version, can download and install Microsoft components; unlike the GUI, it can perform installs silently, which is ideal for our needs. For example:

> WebpiCmd.exe /Install /Products:"Windows Azure SDK - 2.2" /AcceptEula

This will download and install v2.2 of the Azure SDK without any further input.

Figure 3: WPI Command Line install

It’s possible to bundle the WPI (and any other resources) into the cloud service’s deployment package by using role content folders. Files included in a content folder are deployed to the VM alongside the hosted service application when the service is published. Including resources inside the hosted service project like this is much neater: everything is encapsulated and version controlled.
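One way to wire the install up (a sketch under our assumptions rather than a definitive recipe; the service, role and script names are illustrative) is an elevated start-up task in ServiceDefinition.csdef that runs a small script shipped in the role content, which in turn invokes the WebpiCmd line shown above:

<?xml version="1.0" encoding="utf-8"?>
<!-- ServiceDefinition.csdef (trimmed): an elevated start-up task installs the SDK
     before the role's worker process starts handling test run messages. -->
<ServiceDefinition name="IntegrationTestRunner"
                   xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WorkerRole name="AzureSDKv22Role" vmsize="Small">
    <Startup>
      <Task commandLine="InstallAzureSdk.cmd" executionContext="elevated" taskType="simple" />
    </Startup>
  </WorkerRole>
</ServiceDefinition>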

The role content for our test runner service contained both the WPI and the NUnit command line runner (the framework our tests use). Role content folders are usually specific to a single role, but when we came to add a second role (for a different version of the Azure SDK) it didn’t make sense to have two copies of the WPI and NUnit runner in source control.

By manually editing the ccproj file it’s possible to ‘share’ content between roles. In this example, the SharedRoleContent folder contains the WPI and NUnit executables, while the AzureSDKv22RoleContent and AzureSDKv23RoleContent role content folders only contain their respective diagnostic configuration files. The two role-specific folders are auto-generated when each role is created; the shared content folder was copied into the solution root directory.

Here we specify which files in our solution directory are role content. Currently only the diagnostics.wadcfg files are associated with a role so only these would be deployed to the VM when the service is published.
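The relevant fragment of the ccproj looks something like this (a sketch rather than the verbatim project file; the exact item names Visual Studio generates can vary between SDK versions):

<!-- Files in each role content folder are listed as items in the ccproj.
     So far only the role-specific diagnostics files are included. -->
<ItemGroup>
  <Content Include="AzureSDKv22RoleContent\diagnostics.wadcfg" />
  <Content Include="AzureSDKv23RoleContent\diagnostics.wadcfg" />
</ItemGroup>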

To bundle the SharedRoleContent folder in with both of our roles, we need to override the BeforeAddRoleContent target as well:
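Again as a sketch (the role names are inferred from the content folder names; AzureRoleContent items with RoleName and Destination metadata are how the Azure build targets pick up extra package content):

<!-- Hook the Azure packaging targets so the shared folder is copied into the
     package for both roles (Destination is the sub-folder it lands in on the VM). -->
<Target Name="BeforeAddRoleContent">
  <ItemGroup>
    <AzureRoleContent Include="SharedRoleContent">
      <RoleName>AzureSDKv22Role</RoleName>
      <Destination>SharedRoleContent</Destination>
    </AzureRoleContent>
    <AzureRoleContent Include="SharedRoleContent">
      <RoleName>AzureSDKv23Role</RoleName>
      <Destination>SharedRoleContent</Destination>
    </AzureRoleContent>
  </ItemGroup>
</Target>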

Role content folders can be quite flexible; it’s even possible to generate content for a role dynamically (if the content is only known at build time).

Using a cloud service greatly simplifies the system and yields a range of other benefits:

  • The state is completely captured by version control, making debugging issues much simpler.
  • Different versions of the SDK, the test run worker and the start-up procedure are all inextricably linked, which rules out accidentally combining incompatible versions.
  • It’s possible to see which version of the service is currently deployed, and changes can be tested simply by deploying to a staging environment.
  • The service can also be deployed to any VM, so there are no disk images to create, deploy or maintain.
  • Scaling is trivial: if we need test results from multiple builds in quick succession, we can spin up more instances.
  • Running tests against a different version of the Azure SDK is simple: add a new role that supplies different arguments to the WPI, and the desired version of the SDK is installed on that role’s instances. Note that there is a limit to the number of roles allowed in a cloud service, but we weren’t hitting that limit.

Being able to utilise the Web Platform Installer removed the barrier to using a cloud service, which allowed us to build a much more robust and sustainable test framework.

When using Azure to solve real world problems there are usually several options available. By taking two of them through to full implementations, we were able to choose the solution that best fit our requirements. Although it felt like a step backwards to start building the cloud service when we already had a working (but fragile) solution based on VM automation, the time and effort saved by the more appropriate solution made it a sensible decision.