22 March 2012

Windows Azure from a Data Perspective

Before creating a data application in Windows Azure, it is important to make choices based on the type of data you have, as well as the security and the business requirements. There are a wide range of options, because Windows Azure has intrinsic data storage, completely separate from SQL Azure, that is highly available and replicated. Your data requirements are likely to dictate the type of data storage options you choose.

Windows Azure is a Platform as a Service (PaaS), which means that you’re able to use it to run code and store data. As such, the possibilities of what you can do with it are wide open. You can also use the various components separately or together. This makes Windows and SQL Azure interesting choices for a data-centric application.

Note: If you’re new to Windows Azure, you can learn more about it in this video: http://channel9.msdn.com/Events/TechEd/Europe/2010/COS322

Architecture Fundamentals

As you read through this article, it’s important to keep in mind that the Web and Worker Roles in Windows Azure (the VMs that run code) are stateless – meaning that they can go down and pop up somewhere else. The code you run on these types of Roles should be stateless as well, meaning that it doesn’t keep anything in memory or on the local disk that is needed for the successful completion of the program.

Storage, however, is another matter. Storage does not disappear from one location and move to another. It remains available at all times. In fact, Windows Azure storage has High Availability (HA) and Disaster Recovery (DR) attributes built in. When you create a storage object (more on that in a moment) it is stored in a particular datacenter. That object is replicated three times in the same datacenter, and then all three copies are replicated again to a geographically separate datacenter. These datacenters remain in a single geo-political region, so for instance data in the Americas region does not automatically replicate to Europe or Asia. You can store your data in more than one region if you need to; it just won’t replicate between regions automatically, outside of your control.

When you develop using Windows Azure, you pull down a free Software Development Kit (SDK) that creates an emulator for the various parts of Windows Azure on your PC. The SDK emulates the Compute functions (the Roles) and uses SQL Server Express to emulate Windows Azure storage. That way you can develop the code without accessing the Windows Azure environment on the web. There are some differences between the two environments, which I’ll cover in another article.

When you’re ready to deploy the application for others to start using it, you create a Storage Account in the Windows Azure Portal. This is a billing instrument, based on the amount of storage you use, the amount of data that flows out (ingress of data is free) and a storage access fee. You can find all the pricing numbers on the windowsazure.com site.

When you create that storage account, you’ll get a set of keys so that your code can create and access the data objects. Those keys can be used in everything from on-premises code, to Windows Azure applications running .NET, Java, C++ and other language stacks, all the way to web-browser calls using REST application architectures.
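
As an illustration only, here is a minimal sketch of how code might use those keys, assuming the Windows Azure SDK for Python; the account name, key, container and blob names are placeholders, and the exact method names may differ between SDK versions.

from azure.storage import BlobService

# The account name and one of its two access keys, as issued by the portal.
ACCOUNT_NAME = "mystorageaccount"        # placeholder
ACCOUNT_KEY = "<primary-access-key>"     # placeholder

blob_service = BlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# Create a private container and write a small object into it.
blob_service.create_container("appdata")
blob_service.put_blob("appdata", "hello.txt", b"Hello from on-premises code",
                      x_ms_blob_type="BlockBlob")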

There are three kinds of storage in Windows Azure: BLOBs, Tables and Queues. SQL Azure is under the umbrella of Windows Azure, but acts not only as storage but also as a relational engine for processing data. I’ll cover it as a separate type of storage in a moment.

Note: The Compute Roles (Web and Worker) have local storage that you can also use, but since the Roles are stateless, this storage should not be used for anything more than temporary data. I will not include local storage in this discussion.

Architecting a Data Application

With those basics in mind, you can begin to design your application. I break data applications into two broad types: in-line systems and data-movement systems.

In in-line data systems, the data is not moved from a source to a destination. You simply create the data, and then access it later from one or more systems. This type of application is what you might see in a typical On-line Transaction Processing (OLTP) system. This might suggest that there isn’t a lot of room here for a distributed system or analysis, but Complex Event Processing (CEP) systems like StreamInsight can actually read the data at its source and act on it without storing it again.

In data-movement systems, the data is transferred from a source system to another system for analysis. This is typical of an On-line Analytical Processing (OLAP) system. During, or perhaps after, the data movement, the original data is often changed, aggregated, cleansed and so on.

Windows Azure handles both of these designs. You can certainly architect systems that accept data from an input (human or machine) and store the data in various configurations of storage. And you can also take the data from one storage location and place it into another for analysis. CEP options are also available, as are Reporting Services and Hadoop on Azure for processing the data. As of this writing, these latter two offerings are in Community Technology Preview (CTP).

In any case, my approach with almost any application is to start with the data. My personal belief is that all computing is simply the re-arranging of data, so it follows that you need a process to figure out how that data will be stored and processed, along with its other attributes. In some cases a developer will simply use the storage he or she has available – a particular Relational Database Management System (RDBMS) or a file store, or a “NoSQL” data platform, regardless of the data they are working with. I think this is a mistake – the data needs to be examined on its own first, and then the proper storage and processing mechanism should be applied to it. To that end, I use the following process to create the application.

Classify the Data

The first step should be to classify the data the application works with. This might be data that currently resides in another system or data that the system will create and manipulate.

Data classification is similar to designing an RDBMS data model. I normally start with the purpose of that datum – for instance, if I want to collect the user’s name, that decision drives many of the other choices, such as data type and even security. Even the collection method matters at this stage. If the name is passed to the application from a validated source, then enforcing a particular format or data type isn’t as important; I might be able to choose Programmatic Referential Integrity (PRI) over Declarative Referential Integrity (DRI), and so on.

Classifying data should always include setting the data type. Once again the meaning of the data is important, especially for numeric or date and time data. This is not only a data integrity issue, but a computation issue for the system or systems that will process it later.

The data at this point may or may not group into a relational model. It might simply be a “bag of data”, including all de-normalized tuples. Classifying the data allows me to create a CRUD matrix (Create, Read, Update and Delete), a Business Continuity model, a security and privacy profile and so on.

Choose a Consistency Model

This stage begins to determine the storage options for the application, and will help in the choices for Windows Azure storage as well. Here you define the type of consistency you need. While there are several nuanced variations, there are two basic options: immediately consistent or eventually consistent. To oversimplify a bit, immediately consistent data is more transactionally secure, does not allow “dirty reads” where a datum might have been changed since the last time it was read, and does not scale infinitely. Eventually consistent data, on the other hand, biases towards more available reads, and might allow “dirty read” operations. There are ways to accommodate each type of consistency model, of course, but these definitions work for now. In many cases, OLTP systems require immediately consistent models, and analytical workloads can tolerate eventually consistent models.

Why is this important? There is a cost to immediately consistent data, both in speed and scale. When it’s absolutely required you should use it, but only for the data that needs it. This means that a developer might think more about breaking apart a given tuple than just storing the entire row in a single type of storage, perhaps in an RDBMS.

Determine the Data Path

The next step is to determine where the data originates, and where it is eventually used. This is important for Windows Azure, because there are use cases where any “cloud” provider doesn’t fit. For instance, if you have a petabyte of data that needs to transfer to another system daily, no remote storage is really practical over current Internet connections.

The point here is to lay out where the data starts and its path to the end use. If the data is “born” in Windows Azure, movement is quite simple among the various data storage options. If the data exists on-site and needs to be moved one time, Windows Azure storage still fits. If the data needs to be moved once and then updated periodically with small changes, it is still an option.

The data path also needs to include the systems that need to interact with the data, since those systems are impacted by data types and consistency models. One decision affects another, hence the need for designing the data first.

Define the Data Security

Last, but certainly not least, is the question of the security of a given datum. I say that because it’s common to speak of an entire data set as “secure” or as “Personally Identifiable Information” (PII). But in the military I learned that classifying a given document or set of documents as “Secret” carried a huge cost for storing, protecting and disposing of the data. We broke documents apart all the time into the parts that needed that level of security and the parts that didn’t.

Many developers qualify an entire data set as PII simply because a name and a government ID are associated with it. Pulling those two items out of the data renders the bulk of it non-sensitive. The data can be joined back up with the two PII elements later when needed – which makes the whole process much more secure, and could benefit even current on-premises systems today.
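
A minimal sketch of that idea in Python (the field names and the two stand-in stores are purely illustrative, not a prescribed design) might look like this:

import uuid

PII_FIELDS = {"name", "government_id"}   # the fields treated as sensitive

def split_record(record):
    """Return (surrogate_key, pii_part, bulk_part) for one record."""
    key = str(uuid.uuid4())              # surrogate key ties the two halves together
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    bulk = {k: v for k, v in record.items() if k not in PII_FIELDS}
    return key, pii, bulk

def join_record(key, pii_store, bulk_store):
    """Re-assemble the full record only for an authorized caller."""
    merged = dict(bulk_store[key])
    merged.update(pii_store[key])
    return merged

# Dictionaries stand in for the two stores; in practice the PII half might live
# in a tightly controlled relational database and the bulk in cheaper storage.
pii_store, bulk_store = {}, {}
key, pii, bulk = split_record({"name": "Jane Doe", "government_id": "123-45-6789",
                               "city": "Redmond", "category": "Customer"})
pii_store[key], bulk_store[key] = pii, bulk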

The point here is to ensure that you know the security profile for each datum, and apply the appropriate access control for each of those. In some cases that may mean the data is encrypted, broken apart as described above, or kept out of the cloud entirely.

Windows Azure Storage Options

With the data classified and qualified, you can now make intelligent decisions about which storage options you want to use. You have several to choose from based on the information you created in that process. You may find that even within a given process you store the same data more than once – storage isn’t that expensive, and if it makes the system more scalable and responsive it’s worth the cost to write it more than once. The key is the design, which you can create intelligently because you took the time up front to classify the data.

SQL Azure

SQL Azure is a form of SQL Server offered as a database service. This means that you don’t start at the traditional RDBMS instance level; you work only with the artifacts that the service provides. After you have created your service account, you can create one or more traditional SQL Server databases to work with. You are billed by the size of the database per month, plus the data egress.

Within a SQL Azure database, you have all the familiar objects to work with – schemas, tables and so on. You’re also able to use multiple indexes – in fact, it’s a requirement that you have a clustered index on every table you create. This makes SQL Azure a natural choice for data that needs a high degree of consistency, strong typing, and declarative referential integrity.

Where SQL Azure may not fit as well is for large binary data, since (as of this writing) the database size is limited to 150 gigabytes. However, you’re able to work around this limit by using sharding logic in your code, or by using the built-in Federations, to scale the data horizontally.
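
To show what “sharding logic in your code” can mean, here is a minimal sketch that routes a key to one of several databases; the database names are placeholders, and Federations would replace this hand-rolled routing with built-in support.

import hashlib

SHARD_DATABASES = ["orders_shard0", "orders_shard1", "orders_shard2"]  # placeholders

def shard_for(customer_id):
    """Map a customer ID deterministically to one of the shard databases."""
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return SHARD_DATABASES[int(digest, 16) % len(SHARD_DATABASES)]

# The application then opens its connection against whichever database the key maps to.
print(shard_for(42))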

Since SQL Azure uses encrypted communications, works across a Control Fabric, copies your data three times in a consistent fashion, and is multi-tenant, you should not expect the same speeds as an on-premises SQL Server. If sub-second latencies are required for the data transfer, or the application is on-premises and requires a high level of speed, SQL Azure may not be a fit.

However, if the application needs to store data with ACID properties, is located in Windows Azure, and requires multiple indexes, then SQL Azure is a choice for the application’s data. You can access the data in SQL Azure using ODBC and ADO.NET classes, and manage it with traditional SQL Server tools. You can also use SQL Server Integration Services, the bcp program from SQL Server, or Data Sync to move data in and out of SQL Azure.
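
For example, a minimal sketch of reaching a SQL Azure database over ODBC from Python with pyodbc might look like the following; the server, database, login and driver names are placeholders, and SQL Azure expects the user@server login form and an encrypted connection.

import pyodbc

conn_str = (
    "Driver={SQL Server Native Client 10.0};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydatabase;"
    "Uid=mylogin@myserver;"
    "Pwd=<password>;"
    "Encrypt=yes;"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.tables;")   # a simple smoke-test query
for row in cursor.fetchall():
    print(row.name)
conn.close()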

Blobs

Windows Azure Blob storage comes in two types: Block and Page. A Block Blob is optimized for streaming data, and its data can be sent in blocks of up to 4 megabytes. You’re able to upload the blocks of a Block Blob before the blob is committed, which can factor into an architecture’s speed versus consistency needs. A single Block Blob can hold up to 200 gigabytes. Block Blobs are best used for sequential access.

A Page Blob can be up to 1 terabyte in size, and is better suited for random access. You can carve the pages down to smaller than 1 terabyte, of course, and have multiple pages. Both Page and Block Blobs are secured using your access key, and you can make the objects public or private. Normally the program calls out to the Blob data on behalf of a qualified user request, although videos, pictures and so on can simply be made public for access by a user or program.   
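
As a sketch of the block-by-block upload described above (again assuming the Windows Azure SDK for Python; the container, file and method names are illustrative), a large file might be staged in 4 megabyte blocks and then committed in one call:

import base64
from azure.storage import BlobService

blob_service = BlobService(account_name="mystorageaccount",   # placeholder
                           account_key="<access-key>")        # placeholder

BLOCK_SIZE = 4 * 1024 * 1024   # blocks are sent in chunks of up to 4 MB

block_ids = []
with open("backup.bak", "rb") as source:
    index = 0
    while True:
        chunk = source.read(BLOCK_SIZE)
        if not chunk:
            break
        block_id = base64.b64encode("block-{0:06d}".format(index).encode("utf-8")).decode("utf-8")
        blob_service.put_block("uploads", "backup.bak", chunk, block_id)
        block_ids.append(block_id)
        index += 1

# Nothing is visible to readers until the block list is committed.
blob_service.put_block_list("uploads", "backup.bak", block_ids)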

Windows Azure Blob storage can also make use of a Content Delivery Network (CDN). A CDN is a series of storage locations in datacenters closer to the users, which act as a cache. You can simply mark a particular Blob as “CDN Enabled” and, when a user requests the data, it will be transferred to the user from the main datacenter and then “edged out” to a location closer to them. Any subsequent calls for that data will come from the edge location. A timeout value you set controls how long the data stays in the edge location before it ages out.

Windows Azure Blobs can store more than just binary data. To be clear, data in computer systems is based on a binary system, so technically all computation and storage is binary; what I mean here is that the information is not a text or ASCII representation of text. That being said, there is no reason you can’t store ASCII data in a Blob – in fact, this is a common practice for large XML files and other textual information.

Tables

Table storage within Windows Azure is not the same as a relational Table object in SQL Azure. In SQL Azure, Tables are made up of columns that are the same in every row. In Windows Azure Tables, the structure is closer to a “NoSQL” offering, or a C#/Java hashtable. It’s a key-value pair system, where you identify a “row” by its key (in practice, a Partition Key plus a Row Key). The key is also used in Windows Azure Tables to physically partition the data in the datacenter.

An example is useful to make this type of storage more understandable. Even if you are not familiar with an RDBMS table, you can think of a spreadsheet example of a record of people (this would not be a heavily normalized example):

Name          Street            City, State, Zip       Category
John Smith    123 Here Street   Covington, WA 98042    Customer
Jane Doe      234 There Ave     Redmond, WA 98052      Customer
Jim Johnson   345 Over Blvd     Covington, WA 98042    Business Owner

The same data represented in a key-value pair system might look like this:

Key           Values
John Smith    Street: 123 Here Street
              CityStateZip: Covington, WA 98042
              Category: Customer
Jane Doe      Street: 234 There Ave
              CityStateZip: Redmond, WA 98052
              Category: Customer
Jim Johnson   Street: 345 Over Blvd
              CityStateZip: Covington, WA 98042
              Category: Business Owner

The advantage is that the key-value pair system can also look like this – something not structured the same way in an RDBMS:

Key           Values
John Smith    Street: 123 Here Street
              CityStateZip: Covington, WA 98042
              Category: Customer
Wellingtons   Company Code: 123456
              Category: Customer
              Location: West Region
              Standing: Paid up
              EmployeeCount: 2350
Wetsuit       ProductID: 34256789E
              Category: Inventory Item

Table storage in Windows Azure is immediately consistent, although transactions are not directly supported outside of a single Table call. There are ways to ensure a higher level of consistency, but the best use-case for Table storage is the sheer scale of the data.

You can access Table storage using REST APIs or the APIs in the Windows Azure SDK, and you can also use LINQ. Table data is also often queried and returned using JSON.
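
A minimal sketch with the Windows Azure SDK for Python (the table, account and property names are illustrative) shows both the Partition Key/Row Key addressing and the fact that entities in the same table can carry different properties:

from azure.storage import TableService

table_service = TableService(account_name="mystorageaccount",  # placeholder
                             account_key="<access-key>")       # placeholder

table_service.create_table("contacts")

# Two entities in the same table with different property sets.
table_service.insert_entity("contacts", {
    "PartitionKey": "Customer", "RowKey": "John Smith",
    "Street": "123 Here Street", "CityStateZip": "Covington, WA 98042"})
table_service.insert_entity("contacts", {
    "PartitionKey": "Customer", "RowKey": "Wellingtons",
    "CompanyCode": "123456", "EmployeeCount": 2350})

# Query back everything in the "Customer" partition.
for entity in table_service.query_entities("contacts", "PartitionKey eq 'Customer'"):
    print(entity.RowKey)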

Queues

Windows Azure Queues are a type of Table storage – but can grow to 100 terabytes in size. They are used for passing messages in a distributed system, and although they share some similarities with Microsoft Message Queuing (MSMQ), they are not quite the same. For instance, a message might be delivered to your code more than once, so your application logic needs to handle that gracefully. Like any message-passing system, Queues are not intended for permanent storage, and users do not normally access Queues directly, only programs.

Although Queues are not a permanent method of storing data, they are useful for data that has a short lifespan.
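
A minimal sketch of a worker that tolerates duplicate delivery, assuming the Windows Azure SDK for Python (the queue and method names are illustrative), might look like this:

from azure.storage import QueueService

queue_service = QueueService(account_name="mystorageaccount",  # placeholder
                             account_key="<access-key>")       # placeholder

queue_service.create_queue("workitems")
queue_service.put_message("workitems", "resize-image:42")

processed = set()   # in a real worker this would live in durable storage

for message in queue_service.get_messages("workitems"):
    if message.message_id not in processed:
        # ... do the actual work here ...
        processed.add(message.message_id)
    # Delete the message so it is not delivered again after its visibility timeout.
    queue_service.delete_message("workitems", message.message_id, message.pop_receipt)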

As you can see, you have several choices for setting up your data layout using Windows and SQL Azure. The key is to approach the design of the application from the classification of the data. Then pick the technology that meets the requirements you have.

Buck Woody


  • Chris Adkin

    analytics / querying requirements
    Hi Buck,

    This is a very good article. Regarding the choice of storage, are the analytics/querying requirements for the data something that should be seriously considered? For example, if you have complex reporting requirements, running such reports against Table storage would require lots and lots of hand-rolled code, which could possibly be avoided if the data was stored under the relational engine.

    Regards,

    Chris

  • Anonymous

    Object Replication
    You reference the fact in the article that the object is replicated 3 times in the same datacenter and these 3 replications are then again replicated to a geographically separate datacenter. From the reading and research I have done, I understand the replication in the same datacenter occurs as part of the purchase, however the replication to another datacenter is an additional option that one must pay for. Is this true or has Microsoft changed this recently?