- SQL Data Warehouse – Part 1: Introduction
- SQL Data Warehouse – Part 2: Implementation
- SQL Data Warehouse – Part 3: Connecting
- SQL Data Warehouse – Part 4: Creating and Querying
- SQL Data Warehouse – Part 5: Loading Data
Chances are, if you’re moving to SQL Data Warehouse, or even just trying it out, you’ll want to load data from SQL Server to SQL Data Warehouse, even if it’s just test or sample data. Not surprisingly, Microsoft provides several options for carrying this out. The easiest of these, in my opinion, is to create a SQL Server Integration Services (SSIS) package in SSDT, using basic SSIS components to extract and load the data.
In this article, I demonstrate how to use SSIS to copy data directly from a SQL Server database to a SQL Data Warehouse database, providing specific details about the components I use along the way. If you want to try out these examples, you must have an active Azure subscription and have set up a resource group, V12 logical SQL server, and SQL Data Warehouse database.
Once you have the SQL Data Warehouse database set up, you can use SSIS to load data into the database, as we’ll do in this article. But before we start, let me give you my standard warning: Adding a database to SQL Data Warehouse will incur Azure subscription fees, unless you’re running the service under Azure’s 30-day free trial or you have extra MSDN credits. Be certain you understand your subscription type and the fee structure before you start. Also be sure to pause the database when you’re not using it to avoid unexpected or unnecessary charges.
Loading data into SQL Data Warehouse
Although this article focuses on using the basic SSIS components to load SQL Server data into SQL Data Warehouse, you should be aware that Microsoft offers several other options for copying your data over.
For example, you can use the Azure Blob Upload task in SSIS to facilitate the load process. The task is part of the SQL Server 2016 Integration Services Feature Pack for Azure, which is currently in preview. According to Microsoft, this is the fastest way to load SQL Server data into SQL Data Warehouse. However, you must use SSIS 2016, and you must have Azure Blob Storage set up.
From what I can tell, this is how the process works. You dump your SQL Server data to local files, use the Azure Blob Upload task to upload those files to Azure Blob Storage, and then run a PolyBase script that loads the data into SQL Data Warehouse. According to Microsoft, PolyBase can use the massively parallel processing (MPP) architecture in SQL Data Warehouse to load data in parallel from Azure blob storage, which SSIS alone cannot do.
I have not see any metrics that show how much faster it is to stage the data in Azure Blob Storage than it is to load the data directly from SQL Server. Undoubtedly, there is overhead involved when dumping the data to files, copying the files to Azure, and then loading the data into SQL Data Warehouse.
It is also possible that running parallel SSIS packages or Data Flow tasks against SQL Data Warehouse can help offset some of the limitations of using the basic SSIS components to load data. However, I have not see any metrics on this approach either.
At some point, it would be worth testing both scenarios and comparing results, using enough data to provide a meaningful comparison. In the meantime, your decision on which approach to take will likely depend on how much data you have, how much time you have, and whether you want to add the additional layers of complexity that come with using Azure Blob Storage, given you can achieve the same results using the basic components within SSIS.
If the PolyBase model sounds good to you, you can skip SSIS altogether and instead use AZCopy, as long as you’re talking under 10 TB of data. AZCopy is a command-line utility that you can download from Microsoft and use to copy your local data files to Azure storage. From there, you can then run a PolyBase script to load the data into SQL Data Warehouse. You still have to contend with exporting your data from SQL Server into flat files and then removing the files from Azure storage after they’ve been loaded into SQL Data Warehouse, but for smaller one-off operations, this might be an okay way to go.
If you have over 10 TB of data, you might consider Microsoft’s Import/Export Service. In this scenario, you again dump the SQL Server data to flat files, only this time, you copy the files to transferrable disks and ship those disks to Microsoft. The techies there will then load the data into SQL Data Warehouse for you.
On the other hand, if you’re at the other end of the spectrum and loading only a small amount of SQL Server data, such as lookup or sample data, you might consider the bcp utility. According to Microsoft, you should use the utility to first export the data out of SQL Server into flat files and then use the utility to import the data from the flat files into SQL Data Warehouse.
Depending on the scenario, any of these options might be a useful approach. However, the only approach that appears to let you move data directly and easily from a SQL Server database to a SQL Data Warehouse database is SSIS and the basic tools already built into SSDT. That means, if you have SQL Server up and running and SSDT set up in Visual Studio, you’re ready to load data into SQL Data Warehouse, without having to install any special components or create any new Azure accounts.
Preparing the data warehouse
Once you have an active SQL Data Warehouse database and have confirmed your ability to connect to the database, you should be ready to go, at least in terms of trying out the examples in this article. The examples are based on the AdventureWorks2014 database, which I have installed on a local instance of SQL Server 2014.
If you want to try out these examples exactly as presented here, you need to have access to a copy of the database on a SQL Server instance, as well as have SSDT set up for running T-SQL statements against SQL Data Warehouse and for building the SSIS package. In addition, you need the information necessary to connect to your SQL Data Warehouse database. In my case, I used the following information:
- Database name: s dwdb2
- Server name: sdwsrv1.database.windows.net
- Login account: sdwadmin
With these considerations in mind, let’s get started. The first step is to create a couple tables on the SQL Data Warehouse database. Both tables are fact tables that are nearly identical, except for one column. The first table is FactInternetSales, as shown in the following CREATE TABLE statement:
CREATE TABLE dbo.FactInternetSales ( ProductKey INT NOT NULL, CustomerKey INT NOT NULL, OrderDateKey INT NOT NULL, OrderNumber INT NOT NULL, OrderQty SMALLINT NOT NULL, SalesAmt MONEY NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN );
The second table is Fact Reseller Sales. This only differences between this table and the preceding one is that Fact Reseller Sales includes the SalesPersonKey column, as shown in the following CREATE TABLE statement:
CREATE TABLE dbo.FactResellerSales ( ProductKey INT NOT NULL, CustomerKey INT NOT NULL, OrderDateKey INT NOT NULL, OrderNumber INT NOT NULL, SalesPersonKey INT NULL, OrderQty SMALLINT NOT NULL, SalesAmt MONEY NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN );
Both tables are created as columnstore indexes and use the round-robin distribution method. To create the tables, I ran the two scripts against the SQL Data Warehouse database from within SSDT. Before I ran the scripts, I used SQL Server Object Explorer in SSDT to connect to the SQL Data Warehouse database and launch a new query. Again, if you have any questions about connecting to a SQL Data Warehouse database or creating tables within the database, refer to my previous articles, particularly the last two.
That’s all you need to do on the SQL Data Warehouse side to work through the examples in this article. From here, we’ll create an SSIS package with a single data flow that retrieves data from the AdventureWorks2014 database and loads the data into the two target tables. Normally, a data warehouse would include a lot more tables, and we would likely load data into the dimension tables before the fact tables, but for the examples in this article, these fact tables are enough to demonstrate loading the data.
Setting up your connections
This article assumes you know how to create an SSIS package in SSDT and add a Data Flow task, which you should do before going any further. If you’re not sure how to complete these tasks, refer to the following MSDN articles (or any number of other articles) to get a basic idea of how to proceed:
After you’ve created your SSIS package and added the Data Flow task, you should add two OLE DB connection managers: one for retrieving the source data and one for loading the data into the target database. On my system, I created the first OLE DB connection manager to point to AdventureWorks2014 database on my local instance of SQL Server 2014. The following figure shows how I configured the connection manager properties:
This should all be very straightforward. You specify the instance where the database resides, select the authentication type (providing a user name and password if using SQL Server authentication), and then select the AdventureWorks2014 database.
After I created the connection manager, I renamed it AdventureWorks2014 for the sake of brevity. I then added a second OLE DB connection manager for my SQL Data Warehouse database. We can, in theory, use an ADO.NET connection manager, but I have had no luck with this in the past and, according to Microsoft, an OLE DB destination “may provide slightly better performance,” so I stuck with the OLE DB connection.
Configuring the connection to the SQL Data Warehouse database is similar to configuring a connection to any instance of SQL Server, with a couple important differences, as shown in the following figure.
For the server name, you must specify the complete four-part name, as in sdwsrv1.database.windows.net. You must also use SQL Server authentication and provide the necessary user name and password, and then select the target SQL Data Warehouse database.
Be certain to test your connection. If everything is working properly, you should receive a message indicating that you successfully connected to the database. If the message states that the connection failed, be sure to confirm that you are using the correct credentials and that the correct client IP address has been permitted access to the Azure server.
Retrieving the data
The next step is to add an OLE DB source component to the data flow that retrieves data from the AdventureWorks2014 database. The source component should specify the following SELECT statement, which joins data from the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables:
SELECT sd.ProductID, sh.SalesOrderID, sh.CustomerID, CONVERT(INT, CONVERT(VARCHAR(8), sh.OrderDate, 112)) OrderDateKey, sh.SalesPersonID, sd.OrderQty, sd.LineTotal, sh.OnlineOrderFlag FROM Sales.SalesOrderHeader sh JOIN Sales.SalesOrderDetail sd ON sh.SalesOrderID = sd.SalesOrderID where sh.OnlineOrderFlag = 0 ORDER BY ProductID;
Notice that the statement converts the OrderDate column to an integer that can be used as the date key, giving us data such as 20160514. Doing the conversion within our T-SQL statement is much simpler than doing a lookup within the SSIS data flow or trying to convert the data there. SSIS makes this process far more complex that the SQL Server database engine. Ultimately, decisions such as this will depend on where you’re running SSIS and where you want your processing to occur.
After you add the OLE DB source component, open the component’s editor and select the connection manager you created for the SQL Server instance. Next, select the SQL command option and add the preceding SELECT statement, as shown in the following figure.
At this point, it’s a good idea to preview the data and view the Columns page to make certain you’re getting the results you expect. Given how easy SSIS makes all this, I suspect you’ll have little problem.
In a normal scenario, we would likely configure an error output to handle any anomalies that might come along, but for the purposes of this article, we’ll keep things simple. Now onto transforming the day.
Transforming the data
We’re not going to do any heavy-lifting, transformation-wise, but know that you can transform the data in any way you normally would when loading it into a data warehouse, using the full complement of SSIS tools for handling both fact tables and dimensions.
The first transformation I included is a Data Conversion component. I did this to emphasize the fact that you need to be aware of your data types when loading data into SQL Data Warehouse. The service does not support all the types that are supported in SQL Server. For example, SQL Data Warehouse does not support the geometry or geography types. If a SQL Server column is configured with one of these types, you’ll likely want to convert the column to a varb inary type.
You might also need to convert data if the source type is in a different type family from the target type, such as converting a numeric type before inserting it into a character column.
Although SQL Data Warehouse will often implicitly convert data for you, I tend to think it’s a good idea to do explicit conversions where practical (with the appropriate error handling) so data doesn’t get rounded or truncated in unexpected ways. To demonstrate how to explicitly convert data in SSIS, I added a Data Conversion transformation to convert the LineTotal column to a currency type.
In the AdventureWorks2014 database, the LineTotal column is configured with the numeric data type. When the column is extracted from SQL Server, SSIS changes it to the DT_NUMERIC type. However, the data from the LineTotal column is targeted for the SalesAmt column in each of our target tables. The column is configured with the MONEY data type, thus the explicit conversion. The following figure shows how I configured the Data Conversion component to handle the data.
The transformation converts the LineTotal column to the DT_CY data type and outputs the column as SalesAmt. Although this is only a simple example, it demonstrates an important concept: You must ensure all your data types are compatible with the SQL Data Warehouse types, and you must be aware if any data is being implicitly converted during the load process.
One other point worth noting. Some Microsoft documentation suggests that the numeric data type is not supported in SQL Data Warehouse. I was able to create a table using this type and I find evidence in much of the documentation to support its use. Keep in mind that SQL Data Warehouse is still very much a work in progress, so changes are happening by the minute, and some of the documentation might be slow to catch up.
Another transformation component I added to the data flow is Conditional Split. In this case, I want to send rows with an OnlineOrderFlag value of 1 to the FactInternetSales table and rows with an OnlineOrderFlag value of 0 to the FactResellerSales table. When configuring the component, I defined two expressions, one for each scenario, as shown in the following figure.
Notice that the expressions must use double equal signs ( ==) and use the values TRUE for 1 and FALSE for 0. Also note that the first expression is assigned to the output name Internet sales and the second expression is assigned to the output name Reseller sales. We can then use these outputs to create two data paths, one to each target table.
For this article, that’s all the transforming we’ll do. If you’re at all familiar with SSIS, you know that there are many other options for fixing and enhancing and massaging the data. But you get the point. You want to fully prepare your data for SQL Data Warehouse so when you go to load it, you can slip it right into your target tables.
Loading the data
Once you have your data the way you want it, you’re ready to load it into the SQL Data Warehouse database. The next step, then, is to add an OLE DB destination component to the data flow and connect a data path from the Conditional Split component to the destination component. When you connect the data path to the destination, you will be prompted to select the output from the Conditional Split transformation, as shown in the following figure.
In this case, we’ll use the Internet sales output because our first OLE DB destination will target the FactInternetSales table. After you’ve selected the output, you can configure the destination by selecting the connection manager and specifying the target table, as shown in the following figure.
Be sure to select the connection manager that points to the SQL Data Warehouse database. Also ensure that the Table or view – fast load option is selected to bulk-load the data into the target table. (This should be selected by default.)
Next, go to the Mappings page to map the source columns to the target columns. The mappings should be self-evident, but you can use the following figure as a reference to ensure that you get it right.
That’s all there is to configuring the destination for the FactInternetSales table. Now repeat the same process for the FactResellerSales table by adding a second OLE DB destination component. For the second destination, use the Reseller sales output and map the the SalesPerson ID column to the the SalesPersonKey column. The following figure shows what I came up with for my data flow.
As you can see, the data flow retrieves the data from the source database, converts the LineTotal values, splits the data flow based on the OnlineOrderFlag values, and loads the data in the two target tables. You might choose to create separate data flows or SSIS packages for each table, depending on the amount of data you’re moving, but those types of decision you can make on a case-by-case basis.
Running the SSIS package
The last step, of course, is to run the SSIS package. If all goes well, you should see a nice little series of green circles with white checkmarks, along with row counts showing the amount of data to have passed through each phase of the data flow. The following figure shows the execution results I received on my system.
If you run into any glitches, go back and check your connections, configurations, and column mappings. A package this basic should make it relatively easy to track down any issues. If you run into problems, refer to the Execution Results tab, as shown in the following figure.
In this case, everything ran smoothly, so there are no warning or error messages. But if you run into any, use them to track down any issues. It’s always a good idea to check the execution results, even if you think everything ran fine.
Using SSIS to load SQL Data Warehouse data
Although the SSIS package we’ve created here is as about as basic as them come, it demonstrates how easy it is to use SSIS to load data into SQL Data Warehouse, without having to install special components or create Azure accounts. Everything we did here used out-of-the-box functionality built into SSIS and SSDT. Chances are, you’re already set up to load data into a SQL Data Warehouse database, except perhaps, having created the database itself.
Despite the complexities your SSIS packages or how they might vary from one to the other, the basics are the same when it comes to adding data to a SQL Data Warehouse database. You extract the data, transform the data, and load the data, just like you would do with an on-premises data warehouse.
That said, you are sending data across the ethers, so you’ll be impacted by issues such as bandwidth and network availability, as well as by the amount of data and the processing power available to SSIS and SQL Data Warehouse. No doubt you’ll want to do some experimenting, and in some cases, look to other solutions for loading data, but with SSIS, you have a good place to start, and it’s certainly worth trying out before setting off in other directions.