- SQL Data Warehouse – Part 1: Introduction
- SQL Data Warehouse – Part 2: Implementation
- SQL Data Warehouse – Part 3: Connecting
- SQL Data Warehouse – Part 4: Creating and Querying
- SQL Data Warehouse – Part 5: Loading Data
Once Microsoft decided to take the cloud seriously, it wasted no time turning the Azure platform into a multi-faceted service factory in which analytics and data management play a central role. One service that Microsoft has been pushing with particular vigor is Azure SQL Data Warehouse, a distributed database management system that Microsoft bills as an elastic data warehouse with enterprise-class features.
SQL Data Warehouse is one of the newer offerings to hit the Azure stage. Microsoft first released the service for a limited preview in July 2015, and then followed up in September with a full public preview. Since then, Microsoft has released several updates. The service is still in preview, but grows more solid every day. In fact, you already have to pay for the privilege of using it, though at a reduced rate.
According to Microsoft, SQL Data Warehouse can support petabytes of data and be scaled up and down in a matter of minutes, or even seconds, with compute and storage resources separated to better control operations and costs. You pay only for what you need, when you need it, as long as you don't leave things running after you meant to turn them off.
SQL Data Warehouse is a fully managed cloud-based service, just like the rest of the Azure offerings, which means Microsoft takes care of all the patching, upgrades, maintenance, and backups, while providing fault tolerance and self-service restore. Microsoft has made it as painless as possible to implement a data warehouse, especially when compared to deploying a solution on-premises.
SQL Data Warehouse is also integrated with several other Azure offerings, such as Machine Learning and Data Factory, as well as with various SQL Server tools and Microsoft products. In this way, SQL Data Warehouse can serve as an integral component in a comprehensive analytics and business intelligence (BI) solution that retrieves, processes, stores, analyzes, and presents a wide range of data. More importantly, it serves Microsoft’s larger strategy of providing an end-to-end, cloud-based, enterprise solution that can meet all of an organization’s big data needs and in so doing, allow Microsoft to control the big data universe.
For now, we’ll focus on SQL Data Warehouse.
SQL Data Warehouse architecture
At its most basic, a SQL Data Warehouse implementation consists of a control node, multiple compute nodes, and large-scale storage.
The control node provides the interface through which you connect to the data warehouse. It is a managed, cloud-based database, powered by the Azure SQL Database service. You can define schema, load data, and run queries using a variety of tools and technologies.
Beneath the surface, however, you get more than just SQL Database. The control node manages the computation and data movement operations distributed across the system. At the heart of this effort is the massively parallel processing (MPP) engine, which breaks down complex queries into parallel processes and coordinates that processing across multiple compute nodes.
The compute nodes, also powered by SQL Database, do all the heavy lifting. When you load data to your warehouse, the service distributes it across the compute nodes, which handle the processing necessary to add the data to storage. When you query data, MPP breaks the queries into parallel processes and spreads them across the compute nodes. Each node then goes to work with its own chunk of the processing. After the compute nodes finish processing the query, they return their results to the control node, which then aggregates the data.
The control node and compute nodes also rely on Microsoft's Data Movement Service (DMS) to facilitate communication and data movement between the control node and the compute nodes, as well as among the compute nodes themselves, making it possible to perform joins and aggregations across multiple nodes.
The final piece of the SQL Data Warehouse puzzle is the storage component, in this case, Azure Blob Storage. The compute nodes write directly to and read directly from that storage. The data is safely persisted without being impacted by such operations as pausing the compute services, backing up or restoring data to the warehouse, or scaling the warehouse compute or storage services.
According to Microsoft, SQL Data Warehouse also incorporates sophisticated algorithms and techniques to assist DMS in moving data more efficiently. In addition, the service uses an advanced query optimizer and set of complex statistics to create query plans and optimize data distribution.
SQL Data Warehouse elasticity
One of the most talked about features in SQL Data Warehouse is its scalability. Not only can you scale up and down with relative ease and in quick order, but you can also control the compute and storage resources separately from each other, helping to ensure you use only what you need when you need it.
The compute side of the equation is based on the data warehouse unit (DWU), a measure of the computational resources consumed (CPU, memory, and storage I/O), aggregated across the participating nodes. You can scale the number of DWUs assigned to your data warehouse up and down simply by moving a slider in the Azure Classic Portal.
The more DWUs you assign, the better the performance and the higher the cost. The preview rate for the compute resources currently starts at 70 cents per hour for 100 DWUs, which comes to about US $521 per month. One of the most interesting and useful features in SQL Data Warehouse is that you can pause the compute services when they're not in use, which cancels any running queries and releases the DWUs assigned to the warehouse. In this way, you avoid compute fees for services you don't need, yet still preserve all your data.
The flexibility of the service’s compute resources can be particularly useful in a data warehouse scenario. For example, you might need mega processing power only certain days of the week or only certain times of the day. The rest of the time, you can pause the service or lower the DWUs to accommodate smaller operations. Then, when you need the hard-core power, you can bump up the DWUs to a more appropriate level.
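Scaling does not have to go through the portal slider; it can also be scripted in T-SQL against the logical server's master database. As a hedged sketch, with SampleDW standing in as a placeholder warehouse name:

```sql
-- Run against the master database of the logical server.
-- SampleDW is a hypothetical warehouse name.

-- Scale up to 200 DWUs ahead of a heavy processing window...
ALTER DATABASE SampleDW MODIFY (SERVICE_OBJECTIVE = 'DW200');

-- ...then back down to 100 DWUs afterward to reduce compute costs.
ALTER DATABASE SampleDW MODIFY (SERVICE_OBJECTIVE = 'DW100');
```

The scale operation is asynchronous, so the warehouse may briefly report a scaling state before the new DWU level takes effect.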
Then there's the Blob Storage side of SQL Data Warehouse, which you treat as a separate entity. Changing the DWU assignment or pausing the compute services has no impact on the storage. You pay for the amount of data stored, no matter what's going on with the compute resources or the type of warehouse operations you perform. The fees are based on the read-access geo-redundant storage (RA-GRS) rate structure, which is backed by a service-level agreement (SLA) of 99.9% write availability and 99.99% read availability. Currently, those rates start at 6.1 cents per gigabyte.
The SQL Data Warehouse database
When we get into the database itself, we find many familiar features. For example, SQL Data Warehouse supports much of the same T-SQL syntax you find in SQL Server. You can create tables, views, stored procedures, and even temporary tables. You can also declare variables, create while loops, implement dynamic SQL, define transactions, and carry out a number of other operations. Although the T-SQL in SQL Data Warehouse is not quite as robust as you’ll find with SQL Server, it still supports an extensive number of data definition language (DDL) statements and data manipulation language (DML) statements, along with console commands and security statements.
Another important feature in SQL Data Warehouse is the clustered columnstore index, which provides an efficient mechanism for storing and querying the types of large fact tables you often find in a data warehouse. According to Microsoft, the columnstore index can provide up to 5 times the compression and up to 10 times the query performance of row-oriented storage. In fact, Microsoft recently announced that the clustered columnstore index is now the table default in SQL Data Warehouse.
Microsoft’s PolyBase technology has also been integrated into the data warehouse service. PolyBase was first introduced in SQL Server Parallel Data Warehouse (PDW) as a way to allow T-SQL queries to access non-relational Hadoop data. PolyBase was optimized for data warehousing workloads and intended for analytical query scenarios. Support for PolyBase continued when PDW morphed into the Analytics Platform System (APS). At some point, Microsoft also extended PolyBase to include Blob Storage as a data source and has integrated the technology into the upcoming SQL Server 2016.
SQL Data Warehouse also provides a number of options for managing warehouse databases. You can use the Azure Classic Portal, SQL Server Data Tools (SSDT) in Visual Studio, the sqlcmd command-line utility, or PowerShell, via the Azure PowerShell module and its Azure Resource Manager (ARM) cmdlets. The service also provides a number of dynamic management views for monitoring warehouse workloads and query executions.
An important component of any data warehouse solution is the ability to efficiently load data. In this area as well, SQL Data Warehouse provides a number of options. For example, you can use the bcp command-line utility to move your SQL Server data into flat files and then move them into Blob Storage. Or you can use the Azure Import/Export service to get the data into storage. In addition, you can use SQL Server Integration Services (SSIS) or Azure Data Factory to load data.
Chances are, anyone considering a move to SQL Data Warehouse will also be wondering about security. Microsoft has put into place several mechanisms for protecting data. For example, you can help secure your connections to the database by configuring firewall rules and encrypting your connections. You can also create logins and user accounts to provide a foundation for controlling authentication and then authorize those accounts through role memberships and permissions.
SQL Data Warehouse reliability
In the current preview of SQL Data Warehouse, Microsoft reports a reliability rate of about 98% for a typical workload. In other words, you may see an average of two failures for every 100 queries. The probability of failure increases with execution time: a query that takes longer than a couple of hours is more likely to fail than one that takes only a couple of minutes. You might also run into query failures during the preview period when critical fixes are applied to the service, which can happen up to five times per month.
Even if a query fails, the data itself is well protected. Microsoft stores the data using geo-redundant blobs, maintaining three synchronous copies in the local Azure region and three asynchronous copies in a remote Azure region. In this way, you are protected against both localized and regional failures, providing a foundation for full disaster recovery.
SQL Data Warehouse backs up all data every four hours using Azure Storage Snapshots. The service maintains those snapshots for seven days, giving you 42 snapshots at any given time, should you run into issues. You can use PowerShell or the provided REST APIs to restore data from a snapshot. Microsoft does not charge for the backup service and includes it in your subscription fees.
The integrated world of SQL Data Warehouse
In Microsoft’s vision of data analytics and management, SQL Data Warehouse plays a pivotal role, but it is by no means a solo act. Microsoft has taken great pains to point out the service’s integrated nature. You can use SQL Data Warehouse in conjunction with your SQL Server tools: SSDT, SSIS, SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS). Also, according to Microsoft, a number of third-party providers have updated their tools to integrate with SQL Data Warehouse.
But such integration is only one aspect of the Microsoft vision. As part of the Azure platform, SQL Data Warehouse can also leverage Azure Data Factory, Azure Machine Learning, and Azure Stream Analytics. Data Factory offers a platform for creating complex data extract and load pipelines. Machine Learning provides a set of tools for building and deploying predictive analytic solutions. Stream Analytics is a platform for processing and consuming event data coming out of Azure Event Hub.
Then there's Power BI, which provides the visual interface for all that warehouse data. With Power BI, you can create reports and visualizations that leverage the compute power of SQL Data Warehouse. You can also send queries directly to SQL Data Warehouse in real time, allowing you to create dynamic reports based on terabytes of data.
In addition, Microsoft includes SQL Data Warehouse in the Cortana Analytics Suite, an integrated set of tools for building analytic solutions. The suite is made up mostly of Azure services, but also includes Power BI and the Cortana personal assistant.
Given Microsoft's emphasis on SQL Data Warehouse integration, we can only assume that the company will do all it can to make everything work together as painlessly as possible, but we'll need to get past the preview and see more large-scale production deployments to know how it will all pan out. On its own, SQL Data Warehouse might not seem that big a deal, though Microsoft has clearly put a lot of thought into the service. It is SQL Data Warehouse considered alongside all the other Microsoft services and tools that will likely make the difference between its success and failure.