31 January 2011

Bulk Inserts via TSQL in SQL Server

The trouble with using BCP for reading data into SQL Server is that BCP is a command-line program. Fortunately, the same input functionality is available from TSQL within SQL Server. In fact there are two ways, and it is not always obvious which to use.

In a past article on the bcp command-line utility, I demonstrated how to use the utility to bulk load external data into a SQL Server database. The advantage of using bcp is that you can quickly and easily import data from a Windows command-line process without having to open an Integrated Development Environment (IDE) such as SQL Server Management Studio, or use SQLCMD, to issue a Transact-SQL statement. However, there might be times when you want to use Transact-SQL to bulk load data, in which case you can use one of two statements supported by SQL Server:

  • BULK INSERT, available in SQL Server 2005 and 2008
  • INSERT...SELECT * FROM OPENROWSET(BULK...), available in SQL Server 2005 and 2008

In this article, I show you how to use both statements to bulk import external data, and then I discuss some of the performance considerations to take into account when using these methods. I’ve included a number of examples, which I created on a local instance of SQL Server 2008 (using the AdventureWorks2008 sample database). To develop these examples, I first used the following Transact-SQL to create and populate the dbo.Employees table:
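
What follows is only a minimal sketch of such a script, since the exact listing depends on your environment: the cut-down column list (a subset of the HumanResources.vEmployee view, with BusinessEntityID renamed to EmployeeID) and the PK_Employees constraint name are assumptions, and the later sketches in this article reuse them. The population step is left out here.

-- Assumed target table; adjust the columns to match your source data.
USE AdventureWorks2008;
GO
IF OBJECT_ID('dbo.Employees', 'U') IS NOT NULL
    DROP TABLE dbo.Employees;
GO
CREATE TABLE dbo.Employees
(
    EmployeeID INT           NOT NULL,
    FirstName  NVARCHAR(50)  NOT NULL,
    LastName   NVARCHAR(50)  NOT NULL,
    JobTitle   NVARCHAR(50)  NOT NULL,
    City       NVARCHAR(30)  NOT NULL,
    CONSTRAINT PK_Employees PRIMARY KEY CLUSTERED (EmployeeID)
);
GO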

In addition to creating the Employees table, I used the bcp utility to export data from the vEmployee view in the AdventureWorks2008 database. I saved the data to a folder on my local drive (C:\Data\). I then created format files based on the newly created Employees table.

The first bcp command I ran exports the data as character data to a file named EmployeeData_c.dat:
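
Something along these lines would do it (a sketch, assuming a trusted connection to a local default instance and the cut-down column list above; -c selects character format and -t, sets a comma field terminator, with \r\n as the default row terminator):

bcp "SELECT BusinessEntityID, FirstName, LastName, JobTitle, City FROM AdventureWorks2008.HumanResources.vEmployee" queryout C:\Data\EmployeeData_c.dat -c -t, -T -S localhost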

(For these commands, you’ll need to change the location of the file and the name of the server. You may also need to change 'AdventureWorks2008' to the current name of the database. The BusinessEntityID column, introduced for Entity Framework, has since been changed back to EmployeeID, so you might need to rename that too.)

The second bcp command creates a format file named EmployeeFormat_c.fmt, based on character data:
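
A sketch of such a command; bcp's format option with nul as the data-file argument writes a format file describing the dbo.Employees columns:

bcp AdventureWorks2008.dbo.Employees format nul -c -t, -f C:\Data\EmployeeFormat_c.fmt -T -S localhost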

Next, I exported the data in the wide native format and saved it to the EmployeeData_n.dat file:
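
Again as a sketch, the -N switch selects wide native format:

bcp "SELECT BusinessEntityID, FirstName, LastName, JobTitle, City FROM AdventureWorks2008.HumanResources.vEmployee" queryout C:\Data\EmployeeData_n.dat -N -T -S localhost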

Finally, I created a format file based on the wide native format and named the file EmployeeFormat_n.fmt:
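
And a matching format-file sketch, again using -N:

bcp AdventureWorks2008.dbo.Employees format nul -N -f C:\Data\EmployeeFormat_n.fmt -T -S localhost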

You now have all the setup information you need to try out the examples in this article. Even if you don’t run the examples, the Transact-SQL and bcp commands will at least help you better understand how the external data was prepared in order to import that data back into the database.

NOTE: For most of the examples, you’ll need to truncate the Employees table after each bulk load to prevent primary key errors when loading the data. This is because you’re loading the same source data into the same target table for each example.

BULK INSERT

The first statement we’ll look at is BULK INSERT, which lets you import data from a data file into a table or view. You can specify the format of the imported data, based on how that data is stored in the file. In the following example, I import the data from the EmployeeData_c.dat file into the Employees table:
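
A minimal sketch of such a statement, using the character file exported earlier:

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH
(
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'  -- for bulk import, \n is read as the \r\n pair
);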

Notice that I begin the statement by first specifying the BULK INSERT keywords, followed by the name of the target table, that is, the table in which the data will be loaded. In the next line, I define a FROM clause that includes the full path to the source data file.

The last part of the statement is the optional WITH clause, which is where you define any additional arguments, such as those related to formatting. In this case, I specify that the data file type is char data, which means that the data has been saved in the character format. Next, I specify that each field is terminated with a comma and each line terminated with a newline (\r\n) character.

That’s all there is to using the BULK INSERT statement to import data from a file. In some cases, however, you might want to use a format file when importing your data. In such situations, rather than specifying the formatting details in the WITH clause, you specify the name of the format file, as shown in the following example:
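
For example (a sketch, assuming the character format file created earlier):

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH (FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt');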

As you can see, when you use a format file to import data, you don’t need to know the format of the source data; you need only to know the location of the format file.

In the previous examples, I imported data that had been saved in the character format. However, when possible, you should import your data in native format to help make your bulk load more efficient. In the following example, I specify the data file that’s been saved in the wide native format:
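
A sketch using the wide native file, with widenative as the data file type:

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_n.dat'
WITH (DATAFILETYPE = 'widenative');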

As you can see, I specified only one argument in the WITH clause and used the widenative option to indicate that the source data is stored with native data types, except char, varchar, and text columns, which are stored as Unicode.

I can also use a format file to import data saved in the wide native format, as shown in the following example:
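
For instance (a sketch, assuming the wide native format file created earlier):

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_n.dat'
WITH (FORMATFILE = 'C:\Data\EmployeeFormat_n.fmt');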

In this case, my only argument in the WITH clause is the location of the format file, just like you saw when I used a format file to import character data.

That’s all there is to using the BULK INSERT statement to bulk load data into a SQL Server database. There are, of course, additional arguments you can specify in the WITH clause, some of which I cover later in the article. However, for a description of all arguments available to the WITH clause, check out the topic “BULK INSERT (Transact-SQL)” in SQL Server Books Online.

INSERT…SELECT

Starting with SQL Server 2005, the OPENROWSET function supports bulk operations through the use of the BULK provider. You can use the function and provider within an INSERT…SELECT statement to bulk load data from a data file into a table or view, just like you saw with the BULK INSERT statement. In the following INSERT…SELECT statement, I call the OPENROWSET function and BULK provider in order to import data from the EmployeeData_c.dat file:
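
A sketch of such a statement, assuming the character data file and format file created earlier:

INSERT INTO AdventureWorks2008.dbo.Employees
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_c.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt'
) AS e;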

I start the statement as I would any other INSERT…SELECT statement: I first specify the INSERT INTO keywords, followed by the name of the target table. I then define the SELECT statement.

Within the SELECT statement, I use the OPENROWSET function in the FROM clause. Notice that in the function’s first argument I invoke the BULK provider, followed by the name of the source file. For the second argument, I specify the path to the format file. When you use OPENROWSET to bulk load data, you must specify a format file as the second argument. Unlike the BULK INSERT statement, you cannot simply specify formatting information.

Another important point to note about the example above is that I provide an alias (e) for the output of the OPENROWSET function. The alias is required in order to use the function in this way.

In the example above, I import data that’s been saved in the character format. However, I can also use INSERT…SELECT to import data saved in the native format, as shown in the following example:
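
The same sketch, pointed at the wide native data file and its format file:

INSERT INTO AdventureWorks2008.dbo.Employees
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_n.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_n.fmt'
) AS e;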

As you can see, there’s little difference between this statement and the preceding one. In the case of the latter, I simply specify the source data file that was saved in the native format and the format file that works with the native data. For more details about using the OPENROWSET function in your statements, see the topic “OPENROWSET (Transact-SQL)” in SQL Server Books Online.

Performance

When bulk importing data into a SQL Server database, there are several steps you can take to maximize performance. One of these steps I mentioned earlier: when possible, work with data in SQL Server native format. Native format retains SQL Server’s native data types. Even if you use wide native format (which preserves Unicode character data), as I did in several of the examples above, you’re still taking advantage of the non-character native data types.

Of course, you can’t always control whether the source file is in native or wide native format, but there are other steps you can take to maximize performance. In this section, we’ll look at several of those steps and review examples that demonstrate how to implement them.

NOTE: Microsoft documentation on bulk loading operations can often be confusing, especially when trying to understand the BULK INSERT and INSERT…SELECT statements and the differences between them. In researching these statements, you might find contradictory or inaccurate information, or information that simply doesn’t make sense. Whenever you’re developing a bulk load solution, your best bet is to test different scenarios and see for yourself what works best.

Table Locking

One way you can help to improve performance in your bulk load operations is to specify table locking for the duration of the data load. You can do this by including the TABLOCK argument with your statement. The argument issues a bulk update (BU) lock on the table. In the following BULK INSERT statement, I’ve included the TABLOCK argument in the WITH clause:
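
For example (a sketch, reusing the character file and format file from above):

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH
(
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt',
    TABLOCK  -- request a bulk update (BU) lock for the duration of the load
);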

Notice that I simply specify the TABLOCK keyword in the WITH clause. However, to use the TABLOCK option in an INSERT…SELECT statement, you must include it as part of the INSERT clause, as shown in the following example:
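
A sketch of the INSERT…SELECT equivalent, with the hint placed in the INSERT clause:

INSERT INTO AdventureWorks2008.dbo.Employees WITH (TABLOCK)
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_c.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt'
) AS e;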

As you can see, I’ve simply added a WITH subclause, with the TABLOCK hint enclosed in parentheses. Now when I run either of these statements, a BU lock will be issued on the table.

The TABLOCK argument can be particularly useful if you want to perform parallel data loads into a single table. However, you should use TABLOCK for parallel operations only if the table is unindexed (is a heap). For nonparallel operations, TABLOCK is useful whether or not the table is indexed. The lock prevents operations that are not related to bulk loading data from accessing the table.

Minimal Logging

Another method you can use to maximize performance is to minimally log the operation when bulk loading data. Minimal logging can make bulk load operations more efficient and minimize the risk that the transaction log will fill up. To minimally log a bulk load operation, the TABLOCK option must be specified and the table must not be involved in replication. In addition, the operation can be minimally logged only under one of the following conditions:

  • If the table has no indexes, the data pages can be minimally logged.
  • If the table has no clustered indexes, has nonclustered indexes, and is empty, data pages and index pages can be minimally logged.
  • If the table has no clustered indexes, has nonclustered indexes, and has data, data pages can be minimally logged but index pages cannot.
  • If the table has a clustered index but is empty, data pages and index pages can be minimally logged. (Both types of pages are fully logged whenever the table contains data.)

To support minimal logging, you must change the database’s recovery model to BULK_LOGGED. You should then change the recovery model back to its original setting as soon as the bulk load operation is over. In the following example, I set the recovery model before and after I issue a BULK INSERT statement:
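
A sketch of that pattern, assuming the database normally uses the FULL recovery model:

ALTER DATABASE AdventureWorks2008 SET RECOVERY BULK_LOGGED;
GO

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH
(
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt',
    TABLOCK  -- required for minimal logging
);
GO

ALTER DATABASE AdventureWorks2008 SET RECOVERY FULL;
GO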

Notice that I issue an ALTER DATABASE statement before and after my BULK INSERT statement. The ALTER DATABASE statement includes the SET RECOVERY clause, which first sets the recovery model to BULK_LOGGED and then sets it back to FULL.

You can do the same thing when you issue an INSERT…SELECT statement, as shown in the following example:
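
The same sketch wrapped around an INSERT…SELECT statement:

ALTER DATABASE AdventureWorks2008 SET RECOVERY BULK_LOGGED;
GO

INSERT INTO AdventureWorks2008.dbo.Employees WITH (TABLOCK)
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_c.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt'
) AS e;
GO

ALTER DATABASE AdventureWorks2008 SET RECOVERY FULL;
GO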

Note that, if you bulk load data in batches, the table is considered empty only at the start of the first batch, which means that minimal logging will apply to only the first batch. For this reason, it is better to bulk load your data in a single batch, when possible. If it’s not possible, then you should maintain a close watch on your transaction logs. Again, you want to be sure to test different scenarios when preparing to bulk load data.

Dropping Indexes

Bulk load operations can be affected by the indexes defined on a table, not only with regard to minimally logging your operations, but also when you’re importing a large amount of data compared to the amount already in the table. In such cases, it is sometimes useful to drop the indexes before you load the data, and then re-create those indexes after the operation has completed.

Dropping indexes can significantly improve the performance of a bulk load operation. However, it’s not always a clear-cut decision when to drop them. If you’re importing a small amount of data compared to the amount of data in the table, then you probably shouldn’t drop the indexes; otherwise, you’ll spend more time rebuilding the indexes than it takes to load the data. However, when the situation is reversed, you’ll often benefit from dropping the indexes.

In the following example, I first include an ALTER TABLE statement that drops the primary key constraint (and its associated clustered index) and then include a second ALTER TABLE statement that re-creates the constraint and its index:
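
A sketch of that sequence, using the assumed PK_Employees constraint name from the setup above:

ALTER TABLE AdventureWorks2008.dbo.Employees
DROP CONSTRAINT PK_Employees;
GO

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH (FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt');
GO

ALTER TABLE AdventureWorks2008.dbo.Employees
ADD CONSTRAINT PK_Employees PRIMARY KEY CLUSTERED (EmployeeID);
GO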

As you can see, I drop the clustered index, run the BULK INSERT statement, and then re-create the clustered index. I take the same approach in the next example, except that I issue an INSERT…SELECT statement:
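
And the INSERT…SELECT version of the same sketch:

ALTER TABLE AdventureWorks2008.dbo.Employees
DROP CONSTRAINT PK_Employees;
GO

INSERT INTO AdventureWorks2008.dbo.Employees
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_c.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt'
) AS e;
GO

ALTER TABLE AdventureWorks2008.dbo.Employees
ADD CONSTRAINT PK_Employees PRIMARY KEY CLUSTERED (EmployeeID);
GO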

If a table has an index other than one associated with a primary key or unique constraint, you should use the DROP INDEX statement to remove the index and then use the CREATE INDEX statement to re-create the index.

Ordering

By default, SQL Server assumes that the source data is unordered when you perform a bulk load operation. However, in some cases, you can improve the performance of that operation by specifying the sort order of the imported data. But you can do this only if the target table has a clustered index and the source data file is sorted in the same way as defined by the index. If these conditions are met, you can add an ORDER option to your bulk load statement to help increase the speed of the operation. The ORDER option specifies the sort order of the incoming data. The columns listed along with that option should match the columns as defined in the clustered index and be in the same order.

In the following BULK INSERT statement, I’ve added an ORDER option to the WITH clause and specified the EmployeeID column as the sort column:
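
For example (a sketch that assumes the wide native file was exported in EmployeeID order):

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_n.dat'
WITH
(
    FORMATFILE = 'C:\Data\EmployeeFormat_n.fmt',
    ORDER (EmployeeID ASC)
);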

The EmployeeID column, as it’s listed with the ORDER option, matches how the clustered index is defined on the Employees table. Because the data in the source file matches the clustered index, I’m able to use the sort order to improve the performance of the bulk load operation, without having to drop the index.

I can do the same thing with the INSERT…SELECT statement, as shown in the following example:
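
A sketch of the equivalent statement, with ORDER listed among the OPENROWSET bulk options:

INSERT INTO AdventureWorks2008.dbo.Employees
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_n.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_n.fmt',
    ORDER (EmployeeID ASC)
) AS e;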

Notice that I’ve included the ORDER option in the FROM clause and, as with the BULK INSERT statement, specified the EmployeeID as the sort column.

There’s one other action you can take with an INSERT…SELECT statement when bulk loading data into a table with a clustered index. You can turn on trace flag 610, which controls minimal logging of a bulk load operation. Under certain circumstances, you can achieve minimal logging when inserting data into an indexed table. For instance, when a new page is allocated, all the rows added sequentially to that page are minimally logged, which can lead to improved performance.

In the following example, I turn on trace flag 610, issue the INSERT…SELECT statement, and then turn off the trace flag:
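
A sketch of that pattern; DBCC TRACEON(610) here enables the flag for the current session only:

DBCC TRACEON(610);
GO

INSERT INTO AdventureWorks2008.dbo.Employees
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_n.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_n.fmt'
) AS e;
GO

DBCC TRACEOFF(610);
GO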

As you can see, I simply issued a DBCC statement and specified the TRACEON option to turn on the trace flag, and then issued a DBCC statement and specified the TRACEOFF option to turn off the trace flag.

Constraints and Triggers

When running BULK INSERT and INSERT…SELECT statements, you can control whether check constraints and triggers are enforced during the bulk load operation. By disabling the constraints and triggers, you can see an improvement in the bulk load operations. However, if you do, you must be sure to follow up on any critical operations the constraints and triggers perform, after you’ve completed your bulk load operation.

When it comes to check constraints and triggers, the default behavior of the BULK INSERT and INSERT…SELECT statements differs:

  • By default, the BULK INSERT statement ignores constraints and disables triggers.
  • By default, the INSERT…SELECT statement executes check constraints and triggers.

For both statements, you can override the default behavior. For instance, in the following BULK INSERT statement, I override the default behavior in order to run the check constraints and triggers:
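
For example (a sketch, with both arguments added to the WITH clause):

BULK INSERT AdventureWorks2008.dbo.Employees
FROM 'C:\Data\EmployeeData_c.dat'
WITH
(
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt',
    CHECK_CONSTRAINTS,  -- enforce check constraints during the load
    FIRE_TRIGGERS       -- fire insert triggers during the load
);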

Notice that I’ve added the CHECK_CONSTRAINTS and FIRE_TRIGGERS arguments to the WITH clause. Now when I use the BULK INSERT statement to load data, the check constraints and triggers will run, which can negatively affect the performance of the bulk load operation, but it does ensure that the constraint and trigger operations are performed.

I take an opposite approach in the following INSERT…SELECT statement by specifying that constraints and triggers be ignored:
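
A sketch of such a statement; these two hints are valid only when the BULK rowset provider is used:

INSERT INTO AdventureWorks2008.dbo.Employees WITH (IGNORE_CONSTRAINTS, IGNORE_TRIGGERS)
SELECT e.*
FROM OPENROWSET
(
    BULK 'C:\Data\EmployeeData_c.dat',
    FORMATFILE = 'C:\Data\EmployeeFormat_c.fmt'
) AS e;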

Notice that I’ve added the IGNORE_CONSTRAINTS and IGNORE_TRIGGERS arguments to the WITH subclause of the INSERT clause. Now check constraints and triggers will be ignored. However, whenever ignoring constraints and triggers, be sure that the time and effort it takes to revalidate the data and take any other necessary actions doesn’t outweigh the gains you made during the bulk load operation.

Conclusion

As you can see, trying to determine the best way to bulk load data can be a complicated process. It often takes testing various scenarios to come up with the best approach. In addition, it’s not always clear, based on Microsoft documentation, whether to use BULK INSERT or INSERT…SELECT * FROM OPENROWSET(BULK…) when importing external data. Some information indicates that INSERT…SELECT will not support concurrent bulk load operations into a single table. (I haven’t tested this scenario.) It also appears that INSERT…SELECT doesn’t let you specify multiple batches in a single operation (no BATCHSIZE option). Even so, for most bulk load operations, you’ll probably do fine with either statement, as long as you take other steps to improve performance. You’ll have to take each operation on a case-by-case basis and determine what works best for your specific circumstances.

 Appendix

Here are the railroad diagrams for Bulk Insert and OpenRowSet. The PDF versions are attached as files to this article. For completeness, the OpenRowSet diagram shows all the available syntax as well as the BULK functionality.

BULK INSERT

(Railroad diagram: BULK INSERT syntax.)

OpenRowSet

(Railroad diagram: OPENROWSET syntax.)


Robert Sheldon


  • Paul Hunter

    Where’s the BULK EXPORT?
    I keep BCP around (and therefore xp_cmdshell enabled) because there is no simple T-SQL export mechanism. SSIS can be used, but if all you’re trying to do is export the contents of a table/view then SSIS is a bit like using a sledge hammer to kill a fly. If Microsoft can create a routine with access to the file system (BULK INSERT) then why not use a similar metaphor for a BULK EXPORT?

  • Anonymous SQL Guy

    PDF versions
    Both attachments are BULK INSERT, please post OPENROWSET

  • Andrew Clarke

    re: PDF version
    Now fixed!  Heaven only knows how that happened!

  • Jeanine

    Great!
    One of the best learning aids I’ve seen yet on bcp!

  • Andre Guerreiro

    Thank you!
    I will bookmark this article and read it as often as I can to assimilate its content. Bravo. Very detailed explanations.

    One question: on the first code of the article, why do you need to check if the “Employees” table already exists even after you have dropped the database to which it belongs in the previous commands?

    Thanks a lot for such a fine contribution. Now I’m going back to study. 🙂

    Regards,

    Andre Guerreiro Neto
    MCP-MCTS SQL Server 2008 Database Development
    http://www.novosis.com.br



  • Andrew Freeman

    Railroad Diagrams
    Great Article!
    I find the railroad diagrams you use to illustrate your articles very easy to read:
    Do you generate them yourself? If you do, do you generate them by hand or do you use a tool. I’d love to have a complete set for all SQL server commands, is this available anywhere?

    Regards
    Andrew


  • Sreyas MN

    Very Good Article..
    Hi

    It is an amazing article…Great.I am waiting more and more article like this

    Thanks and Regards
    Sreyas MN

  • Hank Freeman – Senior SQL DBA/Architect

    Excellent Article
    Robert,

    You have taken the time to work with this arduous subject matter that will befuddle less senior technicians. Meaning your article is really helpful for most and hopefully the comment of others and mine below can bring in another real-world set of constraints that you did not discuss for it may have been out of scope. The following is my concise synopsis of the problem I had to solve, for which your instructions in the article are absolute SPOT ON.

    In my specific case, I had 146 million and 196 million rows of data to load into a table for subsequent updating of a master table in a SEPARATE database, for separate sequential load and update requirements. The word SEPARATE is key here, for I did this so that I could do the work in the separate database, set the recovery model to SIMPLE, and keep it that way without ever resetting it to FULL. Meaning, I knew what I was doing and I did not need to create log files that would be 1.4 terabytes in size for just the BULK INSERT alone, not even taking into consideration the updates.

    The whole concept of truncating the target table and then loading it was complicated by so many foreign key constraints on the table. Additionally, the whole process was incredibly CPU (93% utilization) and disk I/O (95%) bound for over 30 minutes. So to solve the problem I had to perform the BULK INSERT on an ancillary server (a test server) so as to never interfere with the production server’s performance.

    In short I used BULK INSERT as you have discussed above with my separate database on a separate server to nullify the performance issue. The update was complicated as well, for I had to efficiently copy the information from the production server (141 million rows), update the copy on the separate server and then reverse direction and update the production server… Needless to say there was a whole bunch of table locking going on at the right time.

    Additionally, I had to use dreaded Linked Server to get the data and update it… This required some simple select into statements to get the table from the production server and then another from the production server to put the update table back as a separate table so that an efficient update between the Updated table could be performed on the production table at the production server. For those reading these post, NEVER EVER perform an update on a large table using LINKED Servers. You may lose your job because you locked out the production or target system during the operation for an absurd amount of time.

    Bottom line: I had to test every part of this operation and tweak it for the best performance, and then I isolated the BULK INSERT operations to avoid the assault on the production server. In the end, the final update of millions of rows of data was performed on the production table with UPDLOCK in less than one minute.
    I hope this helps those who read the article and these subsequent posts.
    Hank Freeman – Metro Atlanta, GA

    Hfreeman@msn.com
    End

  • Anon

    Where’s the BULK EXPORT?
    I can support Paul Hunter… how about a TSQL bulk export.

    I wrote one ages ago using DMO, specifically SQLDMO.BulkCopy, which would be SMO today. I’m sure that I saw a discussion of using OPENROWSET for exporting tables without the need for either xp_cmdshell or for sp_OACreate.

    Is the use of OPENROWSET for exporting data possible? Can anyone demonstrate it?

    Thanks in advance.

  • Phil Factor

    Export yes, but no BULK export.
    It is certainly possible to export data from a table with OPENROWSET, but not using the BULK feature (AFAIK).
    Here is an example from an old article that Robyn and I wrote four years ago!
    SQL Server Excel Workbench

    UPDATE OPENROWSET('Microsoft.Jet.OLEDB.4.0',
      'Excel 8.0;DATABASE=c:\CambridgePubs.xls',
      'Select * from CambridgePubs')
    SET Address = '34 Glemsford Road'
    WHERE Address = '65 Cavendish Road'

    INSERT INTO OPENROWSET('Microsoft.Jet.OLEDB.4.0',
      'Excel 8.0;DATABASE=c:\CambridgePubs.xls',
      'Select * from CambridgePubs')
      (Pubname, Address, Postcode)
    SELECT 'The Bull', 'Antioch Road', 'CB2 5TY'

    for a few other techniques for exporting stuff, try
    The TSQL of Text files

  • gooram

    INSERT INTO SELECT … with parallel updates.
    According to Microsoft (and as you say their documentation can be less than succinct), using the WITH (TABLOCK) hint in an INSERT INTO SELECT … statement, issues an exclusive lock (X) on the table and not a (BU) lock.

    See this statement from http://msdn.microsoft.com/en-us/library/ms174335.aspx:

    “Specifying the TABLOCK hint on a table that is the target of an INSERT statement has the same effect as specifying the TABLOCKX hint. An exclusive lock is taken on the table.”

    My testing does seem to produce different results when trying to do bulk inserts into the same table using multiple concurrent threads, but it’s inconclusive.

    Does anyone have any experience here they could share? Thanks
