Buck Woody

Windows Azure from a Data Perspective

22 March 2012

Before creating a data application in Windows Azure, it is important to make choices based on the type of data you have, as well as the security and the business requirements. There is a wide range of options, because Windows Azure has intrinsic data storage, completely separate from SQL Azure, that is highly available and replicated. Your data requirements are likely to dictate the type of data storage options you choose.

Windows Azure is a Platform as a Service (PaaS), which means that you’re able to use it to run code and store data. As such, the possibilities of what you can do with it are wide open. You can also use the various components separately or together. This makes Windows and SQL Azure interesting choices for a data-centric application.

Note: If you’re new to Windows Azure, you can learn more about it in this video: http://channel9.msdn.com/Events/TechEd/Europe/2010/COS322

Architecture Fundamentals

As you read through this article, it’s important to keep in mind that the Web and Worker Roles in Windows Azure (the VMs that run code) are stateless – meaning that they can go down and pop up somewhere else. The code you run on these types of Roles should be stateless as well, meaning that it doesn’t store data in memory or on the local disk that is needed for the successful completion of a program.

Storage, however, is another matter. Storage does not disappear from one location and move to another. It remains available at all times. In fact, Windows Azure storage has High Availability (HA) and Disaster Recovery (DR) attributes built in. When you create a storage object (more on that in a moment) it is stored in a particular datacenter. That object is replicated three times in the same datacenter, and then all three copies are replicated again to a geographically separate datacenter. These datacenters remain in a single geo-political region, so for instance data in the Americas region does not automatically replicate to Europe or Asia. You can store your data in more than one region if you wish; it just won’t replicate between regions automatically, outside of your control.

When you develop using Windows Azure, you pull down a free Software Development Kit (SDK) that creates an Emulator for the various parts of Windows Azure on your PC. The SDK emulates the Compute functions (the Roles), and uses SQL Server Express to emulate Windows Azure storage. That way you can develop the code without accessing the Windows Azure environment on the web. There are some differences between the two environments, which I’ll cover in another article.
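As a concrete illustration, here is a minimal Python sketch of switching between the emulator and a real storage account by swapping endpoints and credentials. The development-storage account name, key and local ports are the documented emulator defaults; the cloud-side account name and key are placeholders you would replace with your own.

```python
# Minimal sketch: point code at the local storage emulator or at a real
# storage account. The devstoreaccount1 name, key and ports are the
# documented emulator defaults; "mystorageaccount" is a placeholder.
USE_EMULATOR = True

if USE_EMULATOR:
    ACCOUNT = "devstoreaccount1"
    ACCOUNT_KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
    BLOB_ENDPOINT  = "http://127.0.0.1:10000/devstoreaccount1"
    QUEUE_ENDPOINT = "http://127.0.0.1:10001/devstoreaccount1"
    TABLE_ENDPOINT = "http://127.0.0.1:10002/devstoreaccount1"
else:
    ACCOUNT = "mystorageaccount"          # placeholder account name
    ACCOUNT_KEY = "<your access key>"     # taken from the portal
    BLOB_ENDPOINT  = "https://%s.blob.core.windows.net" % ACCOUNT
    QUEUE_ENDPOINT = "https://%s.queue.core.windows.net" % ACCOUNT
    TABLE_ENDPOINT = "https://%s.table.core.windows.net" % ACCOUNT

print(BLOB_ENDPOINT)
```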

When you’re ready to deploy the application for others to start using it, you create a Storage Account on the Windows Azure Portal. This is a billing instrument, which is based on the amount of storage you use, the amount of data that flows out (ingress of data is free) and a storage access fee. You can find all the pricing numbers on the windowsazure.com site.

When you create that storage account, you’ll get a set of keys so that your code can create and access the data objects. Those keys can be used in everything from on-premises code, to Windows Azure applications running .NET, Java, C++ and other language stacks, all the way to web-browser calls using REST application architectures.
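For example, here is a minimal Python sketch of reading a blob over plain REST. The account, container and blob names are hypothetical, and the container is assumed to allow public (anonymous) read access; reaching private data would additionally require signing the request with one of your account keys or appending a Shared Access Signature.

```python
# Minimal sketch: fetch a blob anonymously over REST. The URL below is a
# hypothetical blob in a container assumed to have public read access.
import requests

url = "https://mystorageaccount.blob.core.windows.net/public-container/report.xml"
response = requests.get(url)
response.raise_for_status()   # 403/404 if the blob or access level is wrong
print(len(response.content), "bytes downloaded")
```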

There are three kinds of storage in Windows Azure: BLOBs, Tables and Queues. SQL Azure is under the umbrella of Windows Azure, but acts not only as storage but also as a relational engine for processing data. I’ll cover it as a separate type of storage in a moment.

Note: The Compute Roles (Web and Worker) have local storage that you can also use, but since the Roles are stateless, this storage should not be used for anything more than temporary data. I will not include local storage in this discussion.

Architecting a Data Application

With those basics in mind, you can begin to design your application. I break data applications into in-line systems and data-movement systems.

In in-line data systems, the data is not moved from a source to a destination. You simply create the data, and then access it later from one or more systems. This type of application is what you might see in a typical On-line Transaction Processing (OLTP) system. This might suggest that there isn’t a lot of room here for a distributed system or analysis, but Complex Event Processing (CEP) systems like StreamInsight can actually read the data at its source and act on it without storing it again.

In data-movement systems, the data is transferred from a source system to another system for analysis. This is typical of an On-line Analytical Processing (OLAP) system. During, or perhaps after, the data movement, the original data is often changed, aggregated, cleansed and so on.

Windows Azure handles both of these designs. You can certainly architect systems that accept data from an input (human or machine) and store the data in various configurations of storage. And you can also take the data from one storage location and place it into another for analysis. CEP options are also available, as are Reporting Services and Hadoop on Azure for processing the data. As of this writing, these latter two offerings are in Community Technology Preview (CTP).

In any case, my approach with almost any application is to start with the data. My personal belief is that all computing is simply the re-arranging of data, so it follows that you need a process to figure out how that data will be stored and processed, among other attributes. In some cases a developer will simply use the storage he or she has available – a particular Relational Database Management System (RDBMS), a file store, or a “NoSQL” data platform – regardless of the data they are working with. I think this is a mistake: the data needs to be examined on its own first, and then the proper storage and processing mechanism should be applied to it. To that end, I use the following process to create the application.

Classify the Data

The first step should be to classify the data the application works with. This might be data that currently resides in another system or data that the system will create and manipulate.

Data classification is similar to designing an RDBMS data model. I normally start with the purpose of that datum – for instance, if I want to collect the user’s name, that decision drives many of the other choices, such as data type and even security. Even the collection method matters at this stage. If the name is passed to the application from a validated source, then enforcing a particular format or data type isn’t as important, and I might be able to choose Programmatic Referential Integrity (PRI) over Declarative Referential Integrity (DRI), and so on.
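As a small sketch of the programmatic approach, the fragment below validates a value against a reference set in application code before storing it, rather than relying on a declarative FOREIGN KEY constraint; the category values are hypothetical.

```python
# Sketch of Programmatic Referential Integrity: the application checks the
# value against a reference set before it is stored. Categories are made up.
VALID_CATEGORIES = {"Customer", "Business Owner", "Inventory Item"}

def validate_record(record):
    """Raise if the record's category is not in the reference set."""
    if record.get("Category") not in VALID_CATEGORIES:
        raise ValueError("Unknown category: %r" % record.get("Category"))
    return record

validate_record({"Name": "John Smith", "Category": "Customer"})
```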

Classifying data should always include setting the data type. Once again the meaning of the data is important, especially for numeric or date and time data. This is not only a data integrity issue, but a computation issue for the system or systems that will process it later.

The data at this point may or may not group into a relational model. It might simply be a “bag of data”, including fully de-normalized tuples. Classifying the data allows me to create a CRUD matrix (Create, Read, Update and Delete), a Business Continuity model, a security and privacy profile and so on.
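One way to capture that classification exercise, sketched in Python below, is a simple list of records noting each datum’s type, consistency requirement, security class and CRUD needs; the fields, values and the routing rule at the end are purely illustrative, not a prescribed schema.

```python
# Illustrative classification matrix: each datum gets a type, a consistency
# requirement, a security class and the CRUD operations it must support.
classification = [
    {"datum": "CustomerName",  "type": "string(100)",   "consistency": "immediate",
     "security": "PII",        "crud": "CRU"},
    {"datum": "OrderTotal",    "type": "decimal(10,2)", "consistency": "immediate",
     "security": "internal",   "crud": "CR"},
    {"datum": "PageViewEvent", "type": "json blob",     "consistency": "eventual",
     "security": "public",     "crud": "CR"},
]

for row in classification:
    store = "SQL Azure" if row["consistency"] == "immediate" else "Table/Blob storage"
    print("%-14s -> candidate store: %s" % (row["datum"], store))
```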

Choose a Consistency Model

This stage begins to determine the storage options for the application, and will help in the choices for Windows Azure storage as well. Here you define the type of consistency you need. While there are several nuanced variations, there are two basic options: immediately consistent or eventually consistent. To oversimplify a bit, immediately consistent data is more transactionally safe, does not permit “dirty reads” where a datum might have been written since the last time it was read, and does not scale infinitely. Eventually consistent data, on the other hand, biases towards more available reads, and might allow “dirty read” operations. There are ways to blend the two consistency models, of course, but these definitions work for now. In many cases, OLTP systems require immediately consistent models, and analytical workloads can tolerate eventually consistent models.

Why is this important? There is a cost to immediately consistent data, both in speed and scale. When it’s absolutely required you should use it, but only for the data that needs it. This means that a developer might think more about breaking apart a given tuple than just using the entire row in a single type of storage, perhaps in an RDBMS.
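A rough sketch of that idea: split one logical tuple into the fields that must be immediately consistent (bound for the relational store) and those that can be eventually consistent. The field names and the routing rule are illustrative only.

```python
# Split one logical record by consistency requirement. The field names and
# the IMMEDIATE_FIELDS rule are hypothetical examples.
order = {
    "OrderID": 1001, "CustomerID": 42, "Total": 129.95,      # transactional
    "BrowserAgent": "Mozilla/5.0", "Referrer": "bing.com",   # analytical
}

IMMEDIATE_FIELDS = {"OrderID", "CustomerID", "Total"}

immediate = {k: v for k, v in order.items() if k in IMMEDIATE_FIELDS}
eventual  = {k: v for k, v in order.items() if k not in IMMEDIATE_FIELDS}

print("to SQL Azure:", immediate)
print("to Table storage:", eventual)
```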

Determine the Data Path

The next step is to determine where the data originates, and where it is eventually used. This is important for Windows Azure, because there are use-cases where any “cloud” provider doesn’t fit. For instance, if you have a petabyte of data that needs to transfer to another system daily, no remote storage is really practical over current Internet connections.
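The arithmetic is sobering. A quick back-of-the-envelope calculation, assuming a dedicated 1 Gbps link running flat out, puts a single petabyte transfer at roughly three months:

```python
# Back-of-the-envelope: one petabyte over a dedicated 1 Gbps link.
petabyte_bits = 8 * 10**15    # 1 PB expressed in bits
link_bps      = 10**9         # 1 Gbps
seconds = petabyte_bits / link_bps
print("%.0f days" % (seconds / 86400))   # roughly 93 days
```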

The point here is to lay out where the data starts and its path to the end use. If the data is “born” in Windows Azure, movement is quite simple among the various data storage options. If the data exists on-site and needs to be moved one time, Windows Azure storage still fits. If the data needs to be moved once and then updated periodically with small changes, that is still an option.

The data path also needs to include the systems that need to interact with the data, since those systems are impacted by data types and consistency models. One decision affects another, hence the need for designing the data first.

Define the Data Security

Last, but certainly not least, is the question of the security of a given datum. I say datum because it’s common to speak of an entire data set as “secure” or as “Personally Identifiable Information” (PII). But in the military I learned that classifying a given document or set of documents as “Secret” carried a huge cost for storing, protecting and disposing of the data. We broke documents apart all the time into the parts that needed that level of security and the parts that didn’t.

Many developers qualify an entire data set as PII simply because a name and a government ID are associated with it. Pulling those two items out of the data renders the bulk of it non-sensitive. The data can be joined back up with the two PII elements later when needed – which makes the whole process much more secure, and could even benefit current on-premises systems today.
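Here is a hedged sketch of that separation in Python: the name and government ID go into a small, tightly secured lookup keyed by a surrogate ID, and the rest of the record is stored separately and joined back up on demand. The dictionaries simply stand in for whatever secured and general-purpose stores you actually use.

```python
# Sketch of PII separation: sensitive fields live in a locked-down store,
# keyed by a surrogate ID; everything else stays non-sensitive.
import uuid

pii_store  = {}   # stands in for tightly secured storage (name + government ID)
fact_store = {}   # stands in for general-purpose storage

def store_record(name, gov_id, other_fields):
    surrogate = str(uuid.uuid4())
    pii_store[surrogate]  = {"Name": name, "GovID": gov_id}
    fact_store[surrogate] = dict(other_fields)
    return surrogate

def rejoin(surrogate):
    """Join the sensitive and non-sensitive halves back together when needed."""
    return dict(fact_store[surrogate], **pii_store[surrogate])

key = store_record("John Smith", "123-45-6789",
                   {"City": "Covington", "Category": "Customer"})
print(rejoin(key))
```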

The point here is to ensure that you know the security profile for each datum, and apply the appropriate access control for each of those. In some cases that may mean the data is split apart, stored and secured differently.

Windows Azure Storage Options

With the data classified and qualified, you can now make intelligent decisions about which storage options you want to use. You have several to choose from based on the information you created in that process. You may find that even within a given process you store the same data more than once – storage isn’t that expensive, and if it makes the system more scalable and responsive it’s worth the cost to write it more than once. The key is the design, which you can create intelligently because you took the time up front to classify the data.

SQL Azure

SQL Azure is a form of SQL Server offered as a service. This means that you don’t start at the traditional RDBMS installation level; you work only with the artifacts that it provides. After you have created your service account, you can create one or more traditional SQL Server databases to work with. You are billed by the size of the database per month, plus the data egress.

Within a SQL Azure database, you have all the familiar objects to work with – schemas, tables and so on. You’re also able to use multiple indexes – in fact, it’s a requirement that you have a clustered index on every table you create. This makes SQL Azure a natural choice for any data that has a high degree of consistency, strongly-typed data, and declarative referential integrity.

Where SQL Azure may not fit as well is for large binary data, since (as of this writing) the database size is limited to 150 gigabytes. However, you’re able to circumvent these limits by using sharding logic in your code or using the built-in Federations to scale the data horizontally.
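To make the code-level sharding option concrete, here is a minimal sketch of hash-based shard routing; the shard database names are hypothetical, and Federations would move this kind of routing into the database layer instead.

```python
# Sketch of simple hash-based sharding in application code: the customer ID
# determines which database holds the row. Database names are hypothetical.
import zlib

SHARDS = ["customers_shard0", "customers_shard1", "customers_shard2"]

def shard_for(customer_id):
    """Deterministically map a customer ID to one of the shard databases."""
    return SHARDS[zlib.crc32(str(customer_id).encode("utf-8")) % len(SHARDS)]

print(shard_for(42))   # always routes customer 42 to the same database
```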

Since SQL Azure uses encrypted communications, works across a Control Fabric, copies your data three times in a consistent fashion, and is multi-tenant, you should not expect the same speeds as an on-premises SQL Server. If sub-second latencies are required for the data transfer, or the application is on-premises and requires a high level of speed, SQL Azure may not be a fit.

However, if the application needs to store data with ACID properties, is located in Windows Azure, and requires multiple indexes, then SQL Azure is a choice for the application’s data. You can access the data in SQL Azure using ODBC and ADO.NET classes, and manage it with traditional SQL Server tools. You can also use SQL Server Integration Services, the bcp program from SQL Server, or Data Sync to move data in and out of SQL Azure.
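As one example of ODBC access, here is a short Python sketch using pyodbc; the server, database, login and driver names are placeholders, and SQL Azure expects the login in the user@server form over an encrypted connection.

```python
# Sketch of reaching a SQL Azure database over ODBC from Python. Server,
# database, login and driver name below are placeholders to replace.
import pyodbc

conn_str = (
    "Driver={SQL Server Native Client 10.0};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydatabase;"
    "Uid=mylogin@myserver;"
    "Pwd=<password>;"
    "Encrypt=yes;"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.tables")
for row in cursor.fetchall():
    print(row.name)
conn.close()
```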

Blobs

Windows Azure Blob storage comes in two types: Block and Page. A Block Blob is optimized for streaming data, and can send data in 4 megabyte chunks. You’re able to upload the blocks of a Block Blob before the blob is committed, which can factor into an architecture’s speed versus consistency needs. A single Block Blob can hold up to 200 gigabytes. Block Blobs are best used for sequential access.
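In that spirit, here is a small Python sketch that reads a large blob in 4 megabyte pieces using standard HTTP Range requests; the URL is a hypothetical blob in a container assumed to allow public read, and private blobs would need a signed request or a Shared Access Signature.

```python
# Sketch: download a (hypothetical, publicly readable) blob in 4 MB pieces
# using HTTP Range headers.
import requests

url = "https://mystorageaccount.blob.core.windows.net/media/training-video.wmv"
CHUNK = 4 * 1024 * 1024   # 4 MB

head = requests.head(url)
head.raise_for_status()
total = int(head.headers["Content-Length"])

with open("training-video.wmv", "wb") as out:
    for start in range(0, total, CHUNK):
        end = min(start + CHUNK, total) - 1
        part = requests.get(url, headers={"Range": "bytes=%d-%d" % (start, end)})
        part.raise_for_status()
        out.write(part.content)
```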

A Page Blob can be up to 1 terabyte in size, and is better suited for random access. You can carve the pages down to smaller than 1 terabyte, of course, and have multiple pages. Both Page and Block Blobs are secured using your access key, and you can make the objects public or private. Normally the program calls out to the Blob data on behalf of a qualified user request, although videos, pictures and so on can simply be made public for access by a user or program.   

Windows Azure Blob storage can also make use of a Content Delivery Network (CDN). A CDN is a series of storage locations in datacenters closer to the users, which acts as a cache. You can simply mark a particular Blob as “CDN Enabled” and when the user requests the data, it will be transferred to the user from the main datacenter, and then “edged out” to a location closer to them. Any subsequent calls for that data will come from the edge location. A timeout value you set controls how long the data stays in the edge location before it ages out.

Windows Azure Blobs can store more than just binary data. To be clear, data in computer systems is based on a binary system, so technically all computation and storage is binary; what I mean here is information that is not a text or ASCII representation of text. That being said, there is no reason you can’t store ASCII data in a Blob – in fact, this is a common practice for large XML files and other textual information.

Tables

Table storage within Windows Azure is not the same as a relational Table object in SQL Azure. In SQL Azure, Tables are made up of columns that are the same in every row. In Windows Azure Tables, the structure is closer to a “NoSQL” offering, or a C#/Java hashtable. It’s a key-value pair system, where you identify a “row” by its key. The key is also used in Windows Azure Tables to physically partition the data in the datacenter.

An example is useful to make this type of storage more understandable. Even if you are not familiar with an RDBMS table, you can think of a spreadsheet example of a record of people (this would not be a heavily normalized example):

Name          Street            City, State, Zip       Category
John Smith    123 Here Street   Covington, WA 98042    Customer
Jane Doe      234 There Ave     Redmond, WA 98052      Customer
Jim Johnson   345 Over Blvd     Covington, WA 98042    Business Owner

The same data represented in a key-value pair system might look like this:

Key           Values
John Smith    Street: 123 Here Street
              CityStateZip: Covington, WA 98042
              Category: Customer
Jane Doe      Street: 234 There Ave
              CityStateZip: Redmond, WA 98052
              Category: Customer
Jim Johnson   Street: 345 Over Blvd
              CityStateZip: Covington, WA 98042
              Category: Business Owner

The advantage is that the key-value pair system can also look like this – something not structured the same way in an RDBMS:

Key           Values
John Smith    Street: 123 Here Street
              CityStateZip: Covington, WA 98042
              Category: Customer
Wellingtons   Company Code: 123456
              Category: Customer
              Location: West Region
              Standing: Paid up
              EmployeeCount: 2350
Wetsuit       ProductID: 34256789E
              Category: Inventory Item

Table storage in Windows Azure is immediately consistent, although transactions are not directly supported outside of a single Table call. There are ways to ensure a higher level of consistency, but the best use-case for Table storage is the sheer scale of the data.

You can access Table storage using REST APIs or the APIs in the Windows Azure SDK, and you can also use LINQ. Table data is also often queried using JSON.
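To tie this back to the example above, here is a sketch of how those heterogeneous entities might be shaped for Table storage: every entity carries a PartitionKey and a RowKey, but beyond that each one can have its own properties. The key choices shown are just one possibility.

```python
# Sketch: the heterogeneous example entities expressed as key-value records.
# Every entity has PartitionKey and RowKey; the remaining properties vary.
entities = [
    {"PartitionKey": "Customer",       "RowKey": "John Smith",
     "Street": "123 Here Street", "CityStateZip": "Covington, WA 98042"},
    {"PartitionKey": "Customer",       "RowKey": "Wellingtons",
     "CompanyCode": "123456", "Location": "West Region",
     "Standing": "Paid up", "EmployeeCount": 2350},
    {"PartitionKey": "Inventory Item", "RowKey": "Wetsuit",
     "ProductID": "34256789E"},
]

# The PartitionKey also drives physical partitioning, so entities that are
# queried together should normally share a partition.
for e in entities:
    print(e["PartitionKey"], "/", e["RowKey"])
```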

Queues

Windows Azure Queues are a type of Table storage – but can grow to 100 terabytes in size. They are used for passing messages in a distributed system, and although they share some similarities with Microsoft Message Queuing (MSMQ), they are not quite the same. For instance, a message might be read more than once, so your application logic needs to handle that gracefully. Like any message-passing system, they are not intended for permanent storage, and users do not normally access Queues, only programs.
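A minimal sketch of handling that at-least-once behavior: the worker records which message IDs it has already processed and treats a repeat delivery as a no-op. The message shape and the in-memory set are illustrative; a real system would persist that state and delete the message from the Queue only after the work succeeds.

```python
# Sketch of idempotent message handling for at-least-once delivery.
processed_ids = set()   # illustrative; a real worker would persist this

def handle_message(message):
    if message["id"] in processed_ids:
        return                     # duplicate delivery; safe to ignore
    # ... do the actual work here ...
    processed_ids.add(message["id"])
    # only after the work succeeds would you delete the message from the queue

handle_message({"id": "msg-001", "body": "resize image 42"})
handle_message({"id": "msg-001", "body": "resize image 42"})   # ignored
```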

Although Queues are not a permanent method of storing data, they are useful for data that has a short lifespan.

As you can see, you have several choices for setting up your data layout using Windows and SQL Azure. The key is to approach the design of the application from the classification of the data. Then pick the technology that meets the requirements you have.

Buck Woody

Author profile:

Buck Woody has been working with Information Technology since 1981. He has worked for the U.S. Air Force, at an IBM reseller as technical support, and for NASA as well as U.S. Space Command as an IT contractor. He has worked in almost all IT positions, from computer repair technician to system and database administrator, and from network technician to IT Manager, and with multiple platforms as a Data Professional. He has been a DBA and Database Developer on everything from Oracle systems running on a VAX to SQL Server and DB2 installations. He has been a Simple-Talk DBA of the Day.


Subject: analytics / querying requirements
Posted by: Chris Adkin (view profile)
Posted on: Wednesday, March 28, 2012 at 7:00 AM
Message: Hi Buck,

This is a very good article. Regarding the choice of storage, are the analytics / querying requirements for the data something that should be seriously considered? For example, if you have complex reporting requirements, running such reports against the table storage would require lots and lots of hand-rolled code, which could possibly be avoided if the data was stored under the relational engine.

Regards,

Chris

Subject: Object Replication
Posted by: Anonymous (not signed in)
Posted on: Friday, April 06, 2012 at 10:31 AM
Message: You reference the fact in the article that the object is replicated 3 times in the same datacenter and these 3 replications are then again replicated to a geographically separate datacenter. From the reading and research I have done, I understand the replication in the same datacenter occurs as part of the purchase, however the replication to another datacenter is an additional option that one must pay for. Is this true or has Microsoft changed this recently?

 
