16 August 2013

Getting Data between Excel and SQL Server using ODBC

With ODBC, you can summarise and select just the data you need from an Excel workbook before importing it into SQL Server. You can join data from different areas or worksheets. You can even get data from the result of a SQL Server SELECT statement into an Excel spreadsheet. Phil Factor shows how, and warns of some of the pitfalls.

Why Use ODBC?

It is reasonably easy to insert data from Excel into SQL Server, or the reverse, or between any other pair of ODBC-compliant databases, using PowerShell. The most important direction is from Excel to SQL Server, of course. It is quicker than automating Excel, and you can do it without requiring a copy of Excel. It is neater than SSIS, too, and more versatile. The most important thing, though, is that you can aggregate before you send the data. It is possible to do a lot of filtering and aggregation of data before it ever gets to SQL Server, since you can turn an existing Excel workbook into a poor-man’s relational database, or even create one. This article aims to show how this is done.

I always feel slightly awkward in talking about ODBC. It is a Once and Future technology, developed before its time, but now showing its value for processing large volumes of data, despite its quirks, poor documentation and lackluster support. If you use the ODBC driver, then your Excel workbook becomes a little SQL-based relational database. Worksheets, or areas within worksheets, become tables. There are some features missing, of course, but you can do joins between tables, filter rows to taste, do aggregations and some string manipulations. This means that you need to pull far less data into SQL Server, because you can do a lot of selection and pre-processing before the data gets anywhere near SQL Server. If, for example, you only need the total, count, and variance of a day’s readings, then why on earth would you want to import more than those aggregated figures? Even if you do, these aggregations, performed on the original data, can be used as a ‘reconciliation’ check that you’ve gulped all the data into its final destination without error.

I also prefer to use ODBC and the sequential data reader to read data from Excel, or any other ODBC source, because it is fast; and I like to use the bulk-copy library to insert ODBC ‘reader’ data into a SQL Server table because it is extremely fast, so we’ll use both. When you have a large number of big spreadsheets to insert as a chore, speed matters.

The ODBC Excel driver (ACE)

ODBC was conceived as a way of making it as easy as possible to connect to a particular datasource, such as a relational database, text file, data document (e.g. XML), web-based data or spreadsheet.

Currently, the state of the art in ODBC for Access and Excel is the Microsoft Access Database Engine 2010 Redistributable, which can be downloaded here. This includes the more popular OLEDB drivers, which run well in PowerShell too. These drivers enable you to access a range of data files via SQL as if they were a relational database. Formats include Access, CSV, delimited text, dBASE and Excel.

For developing on a general-purpose 64-bit desktop computer, you’re likely to hit a very silly Microsoft muddle. Microsoft recommends that you install the 32-bit version of Office 2010, even on 64-bit machines, since many of the common Office add-ins do not run in the 64-bit Office environment. This advice has become baked-in ‘best practice’. If you are using 64-bit PowerShell, as most of us are, then you need the 64-bit version of the drivers. If you have only the 32-bit Office on your machine, then it will already have the 32-bit drivers, which won’t be visible to 64-bit PowerShell, and won’t work. You can’t install the 64-bit drivers when you already have the 32-bit drivers, and I don’t think you can get anything good to happen by uninstalling the 32-bit drivers. Nope: all three (or four, if you include Visual Studio) must be 64-bit. I gather that one way out of this Catch-22 is to install the 64-bit Office 2010 ODBC/OLEDB drivers first and the (32-bit) Office afterwards, but there is a better fix that involves tweaking the registry. See this for the full frustrating story.

The ODBC Excel driver in ACE works with the latest Excel spreadsheet format up to 2010 as well as the old ones. I suspect that the latest version will work with Office 2013, though I haven’t yet tried it.

This driver is valuable because of the flexibility it gives. It actually executes ODBC SQL, which is a subset of SQL-92, so you can put in column aliases, change the column order, and filter out rows that you don’t want to import. In effect, it presents you with SQL tables, which can be named ranges if it is an existing worksheet to which you’ve added named ranges.

More commonly, you specify a table with a delimited worksheet name followed by a range, the range being a specification of the area of the worksheet just sufficient to enable the driver to find the data you want. If you leave out the range spec entirely, the entire worksheet becomes the table.

If, for example, you wanted the data in the range from C3 to L8, you’d use the statement
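A query of that shape, assuming a worksheet called Sheet1, would look like this:

```sql
SELECT * FROM [Sheet1$C3:L8]
```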

In ODBC, if you specified, say, row 8 as the end of the table, you can only select rows up to row 8, even if you have since inserted more rows beyond that limit, as ODBC allows. With some flavours, such as the old MDAC ‘JET’ database engine, you cannot add new rows beyond the defined limits of a range at all; if you try, you will get the exception "Cannot expand named range".

If you wanted to define your table as being between the columns C and L, starting at row 3 you’d use
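Again assuming a worksheet called Sheet1, that open-ended range looks like this:

```sql
SELECT * FROM [Sheet1$C3:L]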

If you do this, then there is no limit to the length of the table, so you can insert as many rows as you like. The ODBC provider adds new rows to the existing rows in the defined area as space allows.

The dreaded connection string

Now, before we start doing interesting things with the ACE drivers, I ought to explain a bit about their connection strings. These contain the specification of the ODBC driver you wish to use, and the settings that you wish to transmit to the driver.

Ignoring, for the time being, the extended property settings: for Microsoft Office Access data, set the connection string to
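Something along these lines (the path is a placeholder for your own .accdb or .mdb file):

```
Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\Data\MyDatabase.accdb;
```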

For Excel data, use
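For example (path illustrative):

```
Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\Data\MyWorkbook.xlsx;
```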

For dBASE data, use
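For dBASE, the DBQ points at the directory that holds the .dbf files (path illustrative):

```
Driver={Microsoft Access dBASE Driver (*.dbf, *.ndx, *.mdx)};DBQ=C:\Data\;
```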

For text data, use
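Here too, the DBQ is the directory that holds the text files (path illustrative):

```
Driver={Microsoft Access Text Driver (*.txt, *.csv)};DBQ=C:\Data\;
```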

But you’re likely to want some extended properties in the settings, to add a few details about the way the ODBC provider should tackle this particular connection. Because the defaults can be changed globally in the registry, it is rather better to specify these extended properties than to rely on the defaults.

These extended properties are only relevant for the driver that you’re using. They are not always reliable and are poorly documented by Microsoft. I’ll only mention the essentials.

The driver needs to know whether the first row of the table holds the names of the columns. "HDR=Yes;" indicates that the first row contains column names, not data; the driver will actually use just the first 64 characters of each header. "HDR=No;" treats the first row as data, but then the columns are named F1 onwards and you’d want to alias them in your SQL statements to give them meaningful column names.

The Excel ODBC driver doesn’t keep a detailed schema definition of the tables (the Text and Access drivers, by contrast, do). The ODBC Excel driver will try to make sense of the data it finds by testing it to see what datatype it can use for the result. It does so by testing a number of rows before doing the import, and you can specify how many rows it tests before deciding the data type of the column by using MaxScanRows in the extended properties. By default the value is 8. You can specify any value from 1 to 16. You can also set the value to 0 so that it scans all existing rows before deciding the data type, but this slows things down.

This is fine for a relational table, but Excel often has mixed types in a column. The ODBC provider will try to return the data of the majority type, and returns NULL values for the rest that won’t convert. If the two types are equally mixed in the column, the provider chooses numeric over text, and you lose all the text. Also, it will judge the length of the character datatype in the column from the first rows: if the first rows are less than 255 characters long, it will truncate all the subsequent data to 255 characters, even if cell values further down are longer.

By setting the import mode (IMEX=1), you can force mixed data to be converted to text, though only when it finds mixed values in the rows that it checks.

You can also open the Excel workbook in read-only mode by specifying ReadOnly=true. By default, the ReadOnly attribute is false, so you can modify data within your workbook; however, this will lock the entire workbook from access until you close the connection.
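Putting these extended properties together, a typical Excel connection string (the path and the particular settings are illustrative) might look like this:

```
Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\Data\MyWorkbook.xlsx;Extended Properties="HDR=Yes;IMEX=1;MaxScanRows=0;ReadOnly=true;"
```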

Let’s try it out.

Just so you can prove all this to yourself, I’ve supplied an Excel workbook that represents the old PUBS database that used to be distributed with SQL Server and Sybase. This means that you can use SQL from old examples that use PUBS and see what works. All you need to do is to convert the SQL Server version slightly, altering the names of the tables to tell the driver that you want the entire worksheet of that name (the $ is the separator between the worksheet name and the range specification).

So let’s pop together a very simple test-rig to try things out in PowerShell. Be warned, I’ve set this up in read-write mode so it will update your spreadsheet in some circumstances (CUD). To play along, you’ll need to download my Excel version of the PUBS database and  alter the path to the excel file.
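A minimal version of such a test-rig might look like this — a sketch only, in which the file path is a placeholder for wherever you saved the Excel PUBS workbook:

```powershell
# Open the PUBS workbook via ODBC and run a query against the authors worksheet
$FileName = 'C:\Data\PubsData.xls'   # alter this to the path of your Excel PUBS file
$Query = 'SELECT au_fname, au_lname, state FROM [authors$] WHERE state = ''CA'''

$Connection = New-Object System.Data.Odbc.OdbcConnection
$Connection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=' +
    $FileName + ';Extended Properties="HDR=Yes;ReadOnly=false;"'   # read-write, as warned
$Connection.Open()
$Command = New-Object System.Data.Odbc.OdbcCommand($Query, $Connection)
$Reader = $Command.ExecuteReader()
while ($Reader.Read())           # step through the result, row by row
{
    "$($Reader.GetValue(0)) $($Reader.GetValue(1)), $($Reader.GetValue(2))"
}
$Reader.Close()
$Connection.Close()
```

Swap any of the SQL statements that follow into $Query to try them out.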

All these work

Inner joins

Left or right outer joins

Expressions using columns

Simple GROUP BY expression

More complex aggregation with ORDER BY clause and a WHERE clause

String functions

UNION and UNION ALL
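By way of illustration, here are sketches of the kinds of queries listed above, run against the worksheets of the Excel PUBS workbook (I’ve assumed the sheet and column names match the original PUBS tables; note the Access-style nested parentheses that ACE wants around multiple joins):

```sql
-- inner joins
SELECT a.au_lname, t.title
FROM (([authors$] a
INNER JOIN [titleauthor$] ta ON a.au_id = ta.au_id)
INNER JOIN [titles$] t ON ta.title_id = t.title_id)

-- a left outer join
SELECT p.pub_name, t.title
FROM [publishers$] p LEFT JOIN [titles$] t ON p.pub_id = t.pub_id

-- aggregation with a WHERE clause and an ORDER BY clause
SELECT type, COUNT(*) AS titles, SUM(ytd_sales) AS sales
FROM [titles$]
WHERE price > 10
GROUP BY type
ORDER BY type

-- UNION
SELECT city FROM [authors$]
UNION
SELECT city FROM [publishers$]
```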

One could go on and on; even subqueries work, but I think I’ve made the point that there is far more power in this ODBC Excel driver than just the facility for pulling out raw data. The same is true of the TEXT driver for OLEDB. It will do all this as well. To conform with the minimum syntax for ODBC, a driver must be able to execute CREATE TABLE, DELETE FROM (searched), DROP TABLE, INSERT INTO, SELECT, SELECT DISTINCT, and UPDATE (searched). SELECT statements can have WHERE and ORDER BY clauses. ACE does a bit better than this, since even the text driver allows SELECT INTO, and SELECT statements allow GROUP BY and HAVING. 

Creating a spreadsheet

You can, of course, use the ODBC driver to create an Excel spreadsheet and write data into it. Here is the simplest working demo I can write without blushing. Be careful to ensure that the spreadsheet doesn’t already exist, as the whole point of the demo is to prove to you that it can create an entire spreadsheet workbook with several worksheets.
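In outline, the demo does something like this (a sketch: the path and table name are illustrative, and the file must not already exist):

```powershell
# Create a brand-new workbook, and a worksheet within it, via the ODBC driver
$FileName = 'C:\Data\NewSpreadsheet.xls'
$Connection = New-Object System.Data.Odbc.OdbcConnection
$Connection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=' +
    $FileName + ';Extended Properties="ReadOnly=false;"'
$Connection.Open()                                   # creates the workbook if it doesn't exist
$Command = $Connection.CreateCommand()
$Command.CommandText = 'CREATE TABLE EventLog (EventDate DATETIME, Message VARCHAR)'
[void]$Command.ExecuteNonQuery()                     # one statement per command...
$Command.CommandText = 'INSERT INTO [EventLog$] (EventDate, Message) VALUES ({ts ''2013-08-16 09:00:00''}, ''Workbook created'')'
[void]$Command.ExecuteNonQuery()                     # ...so the INSERT must be a second batch
$Connection.Close()
```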

Notice that I can’t create the table and do the insert in one batch as a command. Only one statement can be used in the CommandText.

Exploring your Excel metadata

You can find out what datatypes are available for any ODBC source, by using the OdbcConnection.GetSchema(string) method.
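For example, something like this (the path is a placeholder; the metadata collection names are those exposed by OdbcConnection.GetSchema):

```powershell
$Connection = New-Object System.Data.Odbc.OdbcConnection
$Connection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\Data\PubsData.xls;'
$Connection.Open()
$Connection.GetSchema('DataTypes') | Select-Object TypeName      # the datatypes the driver supports
$Connection.GetSchema('Tables')    | Select-Object TABLE_NAME    # the worksheets and named ranges
$Connection.Close()
```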

This, with my connection, gives only the LOGICAL, CURRENCY, NUMBER, VARCHAR and DATETIME datatypes. More useful is…

… which gives you a list of the available worksheets. The complete list, if you wish to peep at them, is

Hmm. This is beginning to look a bit more like a database. With the Columns MetadataCollection, you can find out as much as you’d ever want to know about the data that is available in the spreadsheet so if you want to read all the worksheets straight into SQL Server, this is a wide-open goal.

Creating Worksheets

Going back to the PUBS Excel database, let’s create a peoples table and populate it with both authors and salespeople. This has to be done in three gulps, since the driver seems to dislike the idea of doing a batch, and it balks when I try to UNION the two results.
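The three gulps look something like this — a sketch only: I’ve assumed the salespeople live in the employee worksheet of the PUBS workbook, and the column names of the new table are illustrative:

```sql
-- 1: create the new worksheet
CREATE TABLE peoples (FirstName VARCHAR, LastName VARCHAR)

-- 2: add the authors
INSERT INTO [peoples$] (FirstName, LastName)
SELECT au_fname, au_lname FROM [authors$]

-- 3: add the salespeople
INSERT INTO [peoples$] (FirstName, LastName)
SELECT fname, lname FROM [employee$]
```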

You’ll find you can UPDATE, INSERT and DELETE data perfectly happily this way.  If you connect up a spreadsheet to a SQL Server database, then you can have a lot of fun copying entire databases into spreadsheets, and back again. Robyn and I show how to do this here.

The problem is in the workbook you create. Whether you name it XLS or XLSX, it produces an XLSX spreadsheet, in the latest zipped Office Open XML form. The trouble is that, with my version of the driver, I can only get Excel to read it with the XLS filetype, since Excel reports an error if you try to open it as an .XLSX file. I suspect that the ODBC driver hasn’t been that well tested by Microsoft.

Getting data into SQL Server from Excel using PowerShell

Now, what about using PowerShell and ODBC to copy the data, maybe filtered, sorted and aggregated, into SQL Server? In this direction we can save a lot of time by using the BCP library. We’ll now describe the routine.

We’ll keep this unpacked, as a script rather than a function, since this is designed to illustrate the process.

We’ll start by defining our credentials, preferences, sources and destinations. We’ll read in the data from an Excel spreadsheet and then spit it out into SQL Server, creating a table if necessary. To create the destination table (some of these spreadsheets are rather wide and therefore easier to import automatically), we’ll need to examine the metadata and interpret it into the SQL Server equivalent, so we’ll do that. When using the BCP library, it is good to have an indication of progress, so I’ll show how you do that.
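The heart of the routine can be sketched like this. It is condensed, and the server, database, paths, sheet name and staging-table name are all placeholders; the staging table is assumed to exist already (the full script creates it from the metadata first):

```powershell
# Read from Excel via an ODBC data reader, write to SQL Server via SqlBulkCopy
$ExcelFile = 'C:\Data\ClimateData.xls'
$SqlConnectionString = 'Server=MyServer;Database=MyDatabase;Trusted_Connection=True;'
$DestinationTable = 'dbo.ClimateStaging'

$Connection = New-Object System.Data.Odbc.OdbcConnection
$Connection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=' +
    $ExcelFile + ';Extended Properties="HDR=Yes;IMEX=1;"'
$Connection.Open()
$Command = New-Object System.Data.Odbc.OdbcCommand('SELECT * FROM [Sheet1$]', $Connection)
$Reader = $Command.ExecuteReader()                            # the fast sequential reader

$BulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy($SqlConnectionString)
$BulkCopy.DestinationTableName = $DestinationTable
$BulkCopy.BatchSize = 5000
$BulkCopy.NotifyAfter = 5000                                  # report progress every 5000 rows
$BulkCopy.Add_SqlRowsCopied({ Write-Host "$($args[1].RowsCopied) rows copied" })
$BulkCopy.WriteToServer($Reader)                              # the fast bit

$Reader.Close()
$Connection.Close()
```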

I’ve provided the sample data so that you don’t have to scramble around to find something suitable. This is some climate data, which is handy for checking things like date conversion.

You will notice that although you can render numbers in a variety of ways, there is only one way of storing numbers in Excel: the NUMBER datatype (the other datatypes in Excel are LOGICAL, CURRENCY, VARCHAR and DATETIME). I’ve therefore had to specify the precision of numeric data, which is tough if you have some columns with integers and others with real decimal data with digits after the decimal point (scale). Remember that this routine is just creating a staging table, not the final destination. All you need to do is to add your own statements to transfer the data to its final table with a CAST to the correct internal data type!

OK, but does it work with real data? Off to the Health and Social Care Information Centre for some realistic data in spreadsheet form. I’ve included some data just so you don’t have to go to the site to play along, but it is far better to use the latest version of this data from the site. I’m sure I don’t have to tell you how easy this is to do in a script via PowerShell.

Also

 …and

 Try it. Whoosh. In it goes. If you were doing this as a routine, you’d be wanting to wrap this script into a function with parameters by now, but you know how to do this already, I’m sure. I’m trying to give you the ‘workbench’ narrative here.

Writing to Excel from SQL Server.

The process of going from SQL Server to Excel via ODBC is, I think, needlessly complicated, especially if you use parameterised queries (excellent for SQL Server, but they add very little for writing to Excel). In this example, I’ll take the old and horrible approach of using INSERT statements. There are other ways, including even using a dataset, but this is the most obvious.
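In outline, the INSERT-statement approach looks like this — a sketch in which the server, the query and the path are placeholders:

```powershell
# Read rows from SQL Server and write them into a new worksheet, one INSERT at a time
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection('Server=MyServer;Database=MyDatabase;Trusted_Connection=True;')
$SqlConnection.Open()
$SqlCommand = New-Object System.Data.SqlClient.SqlCommand('SELECT name FROM sys.tables', $SqlConnection)
$SqlReader = $SqlCommand.ExecuteReader()

$ExcelConnection = New-Object System.Data.Odbc.OdbcConnection
$ExcelConnection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};' +
    'DBQ=C:\Data\MyExcelFile.xls;Extended Properties="ReadOnly=false;"'
$ExcelConnection.Open()
$ExcelCommand = $ExcelConnection.CreateCommand()
$ExcelCommand.CommandText = 'CREATE TABLE TableList (name VARCHAR)'
[void]$ExcelCommand.ExecuteNonQuery()

while ($SqlReader.Read())                                    # one INSERT per row: slow but simple
{
    $Value = $SqlReader.GetString(0).Replace("'", "''")      # escape embedded quotes
    $ExcelCommand.CommandText = "INSERT INTO [TableList$] (name) VALUES ('$Value')"
    [void]$ExcelCommand.ExecuteNonQuery()
}
$SqlReader.Close(); $SqlConnection.Close(); $ExcelConnection.Close()
```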

I’m not particularly happy with this sample because Excel whines a bit when it opens it, saying that it is in the wrong format (which it is, but you try naming it XLSX), but it deigns to open it.

“The file you are trying to open, ‘MyExcelFile.xls’, is in a different format than specified by the file extension. Verify that the file is not corrupted and is from a trusted source before opening the file. Do you want to open the file now?”

More seriously, it complains that the numbers in the columns are ‘formatted as text’. It turns out that the data is saved in the correct format, but the next time the file is opened, all columns revert to varchar. Seasoned users of ODBC get used to the bugs, but if anyone knows of a workaround to this, I’d be grateful.

CSV and Delimited ODBC Sources: Text AdventureWorks.

Although the ACE drivers are used more by people reading Excel files, I must emphasize that there are drivers for a number of other formats. It is pretty easy, for example, to turn a bunch of CSV files into a relational database. Just to prove it, I’ve created a CSV/Text version of AdventureWorks, together with its schema.ini. This was originally created in this article The TSQL of CSV: Comma-Delimited of Errors. With this text-based database, you can do a lot of the sample AdventureWorks SQL examples with only a minor modification.

Once you’ve installed the ACE drivers, you can use a modified version of the routine I showed you for exploring the PUBS Excel database to play along.

All you have to do is to unzip Text Adventureworks into a new directory with the name of your database (AdventureWorks) and point your connection string at the directory by giving it the full path to the directory. I just altered two lines
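The two altered lines were of this general shape (the path is a placeholder for wherever you unzipped the files):

```powershell
$FileName = 'C:\Data\AdventureWorks'   # the directory holding the CSV files and schema.ini
$Connection.ConnectionString = 'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DBQ=' + $FileName + ';'
```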

… and

Now you should be ready with your text-based relational database.

You can, of course, create tables and write to them using the INSERT statement.

…and do INSERT statements into it. You can SELECT INTO as well, which is new to me; I didn’t notice this in previous incarnations of this driver.

With CREATE statements, you can use BIT, BYTE, LONGCHAR, CURRENCY, INTEGER, SMALLINT, REAL, FLOAT, CHAR or DATETIME.
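So a CREATE TABLE for the text driver might look like this sketch (table and column names are illustrative):

```sql
CREATE TABLE NewLog
   (LogID INTEGER, LogName CHAR, Amount CURRENCY, WhenRecorded DATETIME, IsDeleted BIT)
```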

(Out of curiosity, the OLEDB driver allows Long, Single, Double, Currency, DateTime, Bit, Byte, GUID, BigBinary, LongBinary, VarBinary, LongText, VarChar, Char and Decimal.)

And the schema

 Here are a few of the SQL Statements that work
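For example, queries of this general shape run against the text version — I’ve assumed here that each table was saved as a Schema.Table.csv file, so check the names of your unzipped files before trying them:

```sql
SELECT COUNT(*) FROM [Sales.Currency.csv]

SELECT c.CurrencyCode, cr.AverageRate
FROM [Sales.Currency.csv] c
INNER JOIN [Sales.CurrencyRate.csv] cr
   ON c.CurrencyCode = cr.ToCurrencyCode
```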

 

Conclusions

If only Microsoft put some energy into their whole range of ODBC drivers, covering all the possible datastores that can be mapped to relational databases, they’d be the obvious way of transferring data, and would put Microsoft in great shape for providing ‘big data’ solutions. As it is, they are extraordinarily useful, but marred by quirks and oddities.

For me, ODBC is the obvious way to script data from Excel or Access into SQL Server for data imports.


Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career. See also :

Follow on

View all articles by Phil Factor

  • paschott

    Excellent
    I really appreciate this. We get a bunch of ad-hoc Excel sheets in our workplace that we need to convert into SQL Server. The mixed-data columns issue has hit us several times and caused a lot of grief along the way. I wish MS would address this and offer us a way to specify the data type of a column when pulling data from Excel.

    One question on the IMEX setting, my understanding was that it forced everything to be of type nvarchar(255). That means that any Memo types would be truncated. Was that changed recently? If so, that makes the IMEX setting a lot more attractive, if painful for building the ETL processes.

  • Phil Factor

    Re: IMEX
    Thanks for the appreciation.
    Yes, I’ve just run a test on it and it is definitely reading in past the 255-character-length mark, using IMEX. Like you, I don’t much like IMEX, but the setting can cut out some of the grief from mixed columns, but not all.
    Could you be falling foul of the problem I mentioned where it will judge the length of the character datatype in the column from the first rows and if the first rows are less than 255 characters long it will truncate all the subsequent data to 255 characters even if cell values below are longer.
    Try putting in a long line in one of the first rows and see if that cures the problem.

  • Phil Factor

    Re: Mixed data columns
    By using SQL expressions, you can tidy up mixed datatype columns in data that you are importing. Precisely how you frame these queries depends on the exact ‘donkey’s breakfast’ in the table you are dealing with. SQL statements consisting of a number of SELECTs with WHERE clauses and UNIONs can usually produce a clean result, which can then be squirted into SQL Server via BCP. The method I show can be infinitely refined.

  • Stephanie Locke

    OLEDB – being dropped?
    http://blogs.msdn.com/b/sqlnativeclient/archive/2011/08/29/microsoft-is-aligning-with-odbc-for-native-relational-data-access.aspx

    I recall seeing this a few years back and have yet to peruse the SQL Server 2014 CTP – have they moved ahead with this? I haven’t heard anything to that effect, since this seems to be a database oriented release only, and the amendments needed for SSIS to utilise ODBC instead of OLEDB for a number of components would have made a splash.

    It would be great if they do actually have the cojones to move to ODBC properly, and iron out the interoperability issues with x86/x64 processors.

  • Phil Factor

    OLEDB
    I happily use OLEDB when I can’t use ODBC, but its use is deprecated in the long-term because Microsoft is adopting ODBC as the de-facto standard for native access to SQL Server and Windows Azure SQL Database from Linux (Red Hat or SUSE) and Windows. In fact, this has been a gradual process since SQL Server Native Client, BCP and SQLCMD already all use ODBC.
    In January, a new ODBC driver for SQL Server was released which will access SQL Server 2005, 2008, 2008 R2, SQL Server 2012 and Windows Azure SQL Database. This is now the recommended way to take advantage of new SQL Server 2012 features and accessing Azure.
    See Introducing the new Microsoft ODBC Drivers for SQL Server
    It makes sense to use it, particularly for Azure, since it supports Asynchronous operation, handles connection pooling better, and performs with more resilience with internet connections that are more likely to break intermittently.

  • Stephanie Locke

    OLEDB
    Excellent. Cheers Phil for the link.

  • ElijahGagne

    32-bit 64-bit issue
    I’ve fought with that 32-bit Office, 64-bit PowerShell issue before. I’m currently on Windows 8 x64 with Office 2013 x86. I was happy to find out that I had no issues installing AccessDatabaseEngine_x64.exe.

  • Anonymous

    Typo
    Great article. Oh by the way, there’s a typo error in one of the paragraph titles…

    "Getting data into SQL Server from Excel using Powershall"

    Should read as, "Getting data into SQL Server from Excel using PowerShell"

    😉

  • Phil Factor

    Typo
    fixed!

  • Anonymous

    Another typo…
    to open the destination workbook or create it if not exiast

  • Rupertsland

    SQL Server and Excel – our story
    We use the data connection feature in Excel 2010 to connect to Views in SQL Server 2008. No programming required, other than preparing the t-sql for the views in advance. We showed staff how to connect to the available Views. This simplified access to the data. Of course, staff sometimes have to come to us for specialized queries.

    Getting data from Excel into SQL Server was a different matter. We used the import wizard in SSMS to read the worksheets from Excel data template files.

    I will have to try Phil’s approach.


  • Phil Factor

    Re: SQL Server and Excel – our story
    You can automate the creation of spreadsheets that collect data from SQL Server that you describe. I contributed a script to Brad’s article How to Document and Configure SQL Server Instance Settings that does just that. You can also automate the creation of the DSNs on the users’ machines.

  • Jeff Moden

    Why Powershell?
    I apologize for sounding the part of a Luddite but, if the Ace drivers can do everything you say they can, why is there any need at all for Powershell in any of this? It truly seems to be an unnecessary complication.

  • Phil Factor

    Re: Why PowerShell
    Jeff,
    PowerShell is by no means essential for this task. It can be done in other ways.
    The purpose of this article is really to illustrate the way you can use the intrinsic SQL language within ODBC to do much of the preparation, aggregation, filtering and checking of the data before you send it to the database. You are using a subset of SQL92 which is a Lingua Franca to any Database developer.
    If you only use a daily total, why import the entire day’s data into the database? From this perspective, PowerShell is irrelevant, you can do it a number of ways. Excel, Access or text data can even be accessed on the same server as an instance of SQL Server directly via OLEDB or ODBC as a linked server or by TSQL. No PowerShell required here. I’ve written articles on how to do it. I suspect that it is possible to access this file-based data as a linked server across a network via ODBC, though I haven’t tried it and don’t want to. If, as is more likely, you are transferring data regularly from one or a number of remote machines, then SSIS works ok for this sort of task, but by using PowerShell, or any good .NET language, you can do more to fine-tune your ODBC SQL and deal with some quite esoteric data transfer problems..

  • mlarionov

    Automatic job on the server
    Thank you for this great article!
    My question: is it safe and supported to use the Office drivers on the servers for automated jobs, e.g. loading data from Excel files into SQL and vice versa? I know that Microsoft does not recommend Office automation on the server, but I also found a bunch of KB articles about using Office drivers for SSIS jobs. So… confused… Thanks again!

  • awheeler_01

    Office 2013
    First, as always, an excellent article.
    I am having a problem. I have Windows 7 with Office 2013. The Excel file is created as an .xlsx file, but when I use Office to open it I am told it is corrupted.
    When I change the extension to .xls, the file opens in Excel with a warning.
    As I typed, I realized my problem: I have both Office 2010 and Office 2013 installed. This is most likely the cause of my error.
    Again, thanks for the article. One complaint: what is with the word ‘whilst’? From your picture I am assuming you are at least a baby boomer, but it seems a little archaic to me. Oh well.

  • awheeler_01

    Question. Missing first row
    I am using this article’s contents to move the data from many spreadsheets to SQL Server, both BulkCopy and row by row. I am not sure why, but when I use the following code I never get the value for the first row in the spreadsheet. None of the columns have data over 255 characters.

    $Connection = New-Object system.data.odbc.odbcconnection
    $Connection.ConnectionString = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ='+$fileToProcess+'; Extended Properties="Mode=ReadWrite;ReadOnly=true; HDR=NO"'
    $Header = $True
    $connection.open()
    $worksheets = $connection.GetSchema("TABLES")
    $sheet = $worksheets.Rows[0].Table_Name
    $Sheet = $sheet -replace "'", ""
    $Query.Connection = $connection
    $Query.CommandText = 'select * from ['+$sheet+']'

    The first value I get below is from row #2. I have set $Header = $False with the same results.

    while ($Reader.Read())
    {
    $Reader.GetValue(0).ToString() ….

    I have a work around but am wondering if I am overlooking something.