05 November 2012

Database Deployment: The Bits – Getting Data In

Quite often, the database developer or tester is faced with having to load data into a newly created database. What could be simpler? Quite a lot of things, it seems.


This article is about how to stock an empty database with data from files.

Why?

There are a number of reasons for wanting to do this, so I’ll explain why I do it. When I’m developing databases, source control contains everything in the way of scripts that I need to build them, and I like to build them regularly, from those components, using an automated process. I do this to make sure that nothing has broken the build, that what’s in source control represents the true state of the database, and to do integration testing. Sure, I develop on a full system with SSMS, rather than go anywhere near the ‘disconnected model’ of SQL Server Data Tools (SSDT), so this regular build isn’t a vital process, because the use of a shared development server avoids many integration problems; but I might want to build from a number of versions, and that needs the ‘test’ data.

Although I say that source control contains everything I need, I generally store the test data separately, because it is big, it is binary, and it changes less frequently than the database versions do. When the data does change, I archive off the old set and replace it (I’ll describe this in another article). That is no big deal compared with the main task of a refactoring, which is having to change the integration tests. I like to have at least one current set of the data for all the tables saved as BCP native-format files on disk, in a directory for each server, with a subdirectory for each database. As well as building the integration databases, this has a range of other uses on the long road through the database deployment process.

For a wide-scale test, you would build your database on a virtual server that you can then spin up to run each of your tests on. This means that, once you have all your integration databases built from the current version in source control, you have no further need for messy data imports: you just run your tests against the virtual test environment to do a full integration test.

For much testing, though, a virtual environment is possibly overkill. Normally, you would want to run your integration tests overnight as a daily process, and the tests can each build the database as part of their set-up. AdventureWorks, for example, takes less than a minute to build entirely from scripts to a known data state. All third-party interfaces can be ‘mocked’, and the databases can be torn down very quickly.

To create, copy, modify and otherwise manipulate a database, there are several activities that need to be automated. Here, we’ll deal with one of them: getting data in. I’ll be talking about the other ‘bits’ in other articles.

I’ll be providing sample scripts. They are intended to illustrate, and they work, but they aren’t industrial-strength. For effective automation, we’ll tend to use PowerShell with SQL, but I’ll provide alternatives where possible if you are restricted to SQL. If you want a good pile of data to use, then go to the companion article ‘Database Deployment: The Bits – Copying Data Out’ to find out how to get it. To try everything out, I use a build script for AdventureWorks and a directory containing all of its data.

Deleting the existing data

If you are running a series of tests and are not using virtualization, you will want to get the database to a known state of data that is consistent with the tests. You build the database just the once and then re-import the data for each test, rather than rebuilding everything from scratch. This means that, unless you do it as part of the teardown process of the previous test, you will want to delete the existing data. Deleting all the data from a database should be easy, but referential integrity constraints will stand in your way if you simply iterate through a list of tables, deleting the data from each one. To delete all the existing data, you need to temporarily disable the referential integrity constraints, and any triggers that are there to prevent you doing precisely this, and then execute the following code. Before you even attempt this, make sure your data has all been successfully read out and that you have a backup; otherwise this will end in tears. Make sure you know what you’re doing, and that you’re in the right database on the right server, before you run it.
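
A minimal version of that code, using the undocumented sp_MSforeachtable procedure, looks something like this sketch (the database name is just a placeholder):

```sql
USE MyTestDatabase;  -- a placeholder: triple-check you are in the right database!
GO
-- disable every constraint and trigger on every user table
EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';
EXEC sp_MSforeachtable 'ALTER TABLE ? DISABLE TRIGGER ALL';
-- delete the data from every table (the order no longer matters)
EXEC sp_MSforeachtable 'DELETE FROM ?';
-- re-enable the triggers, and re-enable the constraints with a check
-- so that they end up trusted (trivial, since the tables are now empty)
EXEC sp_MSforeachtable 'ALTER TABLE ? ENABLE TRIGGER ALL';
EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';
```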

So what are we doing here? We first disable all the triggers and constraints in the database, then delete the data from every table, and finally re-enable both the triggers and the constraints. You can’t use TRUNCATE this way to speed things up, because you’d need to actually drop the constraints first: see TRUNCATE TABLE (Transact-SQL). There are other ways of doing this that are more complicated but don’t involve disabling constraints, which is a rather blunt knife in the database surgeon’s set of instruments. I’ll show, later on in this article, how you can usually delete data without touching the constraints at all.

Importing the data

You have a wide choice of techniques for getting the data in but, if you have taken the wise decision to store the data in native format, then BCP, or one of the other techniques that use the bulk copy library, is best. BCP copes with a number of formats, but native format is the fastest.

It is possible to do a bulk import into SQL Server by a variety of methods that all use the same underlying bulk copy library. There is BCP, BULK INSERT, and INSERT ... SELECT * FROM OPENROWSET(BULK...). The .NET System.Data.SqlClient namespace also provides a SqlBulkCopy class to do it. It is reasonably easy to write a PowerShell script that uses this library, or you can simply execute the BCP utility to load the data into the remote server.

Tweaking the performance of the import

These approaches are all fast. The BCP command line is a good alternative, but it runs out-of-process; this makes it slightly less secure, but it may give you better performance if the server is short of CPU grunt.

When you’re using T-SQL, the BULK INSERT statement is the obvious way to use the bulk copy functionality, and there are a number of knobs you can twiddle to improve bulk import performance.
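
As an illustration, something along these lines loads a single native-format file with a table lock, batching, identity preservation and constraint checking (the table and file path are placeholders):

```sql
BULK INSERT Person.Address                                     -- placeholder target table
   FROM 'C:\Data\MyServer\AdventureWorks\Person_Address.bcp'   -- placeholder path, local to the server
   WITH (
         DATAFILETYPE = 'native',  -- native-format BCP file
         TABLOCK,                  -- table lock: a prerequisite for minimal logging
         BATCHSIZE = 50000,        -- commit in batches rather than one huge transaction
         KEEPIDENTITY,             -- keep the identity values that are in the file
         CHECK_CONSTRAINTS         -- check foreign key and check constraints as the rows go in
        );
```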

For performance reasons, the best place to have your data is on a local drive of the integration database server, though it is perfectly possible for the BCP process to run on a different server from the database, or from the data file, if you don’t mind waiting. You can use BCP across a network to a remote SQL Server, with the files being either local, specified with a pathname, or remote, specified with a UNC name. However, don’t do it across a network unless you have to: it will usually be faster if you first copy the files so that they reside on the same server as both the BCP application and the destination database (the integration server, in our case). Ideally, the source data should be on a different physical drive from the destination database.

Both the BCP command and the BULK INSERT statement disable triggers by default. The INSERT ... SELECT * FROM OPENROWSET(BULK...) statement will, by default, execute triggers, and this can cause problems.

As well as disabling constraints or triggers, and dropping indexes, you can get an edge in the import process if you use native format, and you can also speed things up by ensuring that you get minimal logging. If your database is using the simple recovery model, then you’re there already; otherwise you’ll need to switch the database to the bulk-logged recovery model just for the import session, remembering to switch it back to the full recovery model after you’ve finished (see the prerequisites for minimal logging in Books Online, too). You can import into a table from more than one client process in parallel, you can use batches, you can disable triggers, and you can order the data in the file to match the order imposed by the table’s clustered index. You can also control the locking behaviour.
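
Switching the recovery model is just a pair of ALTER DATABASE statements wrapped around the import (AdventureWorks is the example target here):

```sql
ALTER DATABASE AdventureWorks SET RECOVERY BULK_LOGGED;
-- ... do the bulk import here ...
ALTER DATABASE AdventureWorks SET RECOVERY FULL;
-- take a log backup afterwards: the bulk-logged interval
-- cannot be restored to a point in time
```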

The problem of constraints

One thing that you may discover, when you try to insert bulk data into a table that has foreign key constraints, is that you get errors if the relevant rows in the referenced table aren’t there.

You’ll have noticed that I specified CHECK_CONSTRAINTS: that covers both referential and check constraints, and it is a great way of ensuring that bad data never reaches the database.

Importing by disabling constraints

By default, both the BCP command and the BULK INSERT statement temporarily disable constraints, whereas INSERT ... SELECT * FROM OPENROWSET(BULK...) doesn’t. If you let the bulk copy process disable check constraints, as happens if you run BCP or BULK INSERT with their default settings, then it will re-enable the constraints so that they apply to subsequent insertions or updates, but it won’t check the rows that have already been imported. You’ll be left to check the data and set the constraints to being trusted afterwards (see Controlling Constraint Checking by Bulk Import Operations). If you neglect to re-enable the constraints with a retrospective check, you will suffer a subsequent performance loss in queries, since constraints can’t be used in query plans until they are trusted. In short, if you populate a table without a check, by temporarily disabling a constraint, then that constraint is marked as ‘untrusted’ until some process checks it; this applies to both referential and check constraints.

After you’ve done an import, you can check to make sure everything is trusted by running this query.
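
Something like this, looking for untrusted entries in sys.foreign_keys and sys.check_constraints, will do it:

```sql
SELECT name, type_desc, is_disabled, is_not_trusted
  FROM sys.foreign_keys
 WHERE is_not_trusted = 1
UNION ALL
SELECT name, type_desc, is_disabled, is_not_trusted
  FROM sys.check_constraints
 WHERE is_not_trusted = 1;
```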

…or on old versions of SQL Server…
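
…something along these lines, using OBJECTPROPERTY against sysobjects, is a reasonable sketch:

```sql
SELECT name
  FROM sysobjects
 WHERE xtype IN ('C', 'F')                          -- check and foreign key constraints
   AND OBJECTPROPERTY(id, 'CnstIsNotTrusted') = 1;  -- 1 = untrusted
```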

Please see sys.check_constraints (Transact-SQL) for more details.

Before setting the constraints to being trusted, you might want to make sure that you’re not going to hit an error. You can check that there are no constraint violations, before you set them to trusted, by executing DBCC CHECKCONSTRAINTS.

You can check all your tables in one go by executing this:
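
One DBCC command covers the whole of the current database, including any disabled constraints:

```sql
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
```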

You can ensure that a single constraint is checked and, if successful, trusted by doing this:
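
Here, a foreign key on AdventureWorks’ Sales.SalesOrderDetail table serves as the example:

```sql
ALTER TABLE Sales.SalesOrderDetail
  WITH CHECK CHECK CONSTRAINT FK_SalesOrderDetail_SalesOrderHeader_SalesOrderID;
```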

To do all the constraints for a table, you do this:
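
Again with Sales.SalesOrderDetail as the example table:

```sql
ALTER TABLE Sales.SalesOrderDetail WITH CHECK CHECK CONSTRAINT ALL;
```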

To switch them all back on, for every table in the database run:
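
The undocumented sp_MSforeachtable procedure makes that a one-liner:

```sql
EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';
```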

So let’s put this together into an import routine.
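
As a sketch, assuming one native-format file per table, named <schema>_<table>.bcp, in the directory layout described earlier (the path is a placeholder), the routine disables every constraint, builds and runs a BULK INSERT for each user table, and then re-enables the constraints with a full check so that they end up trusted:

```sql
DECLARE @DataDirectory NVARCHAR(260) = N'C:\Data\MyServer\AdventureWorks\';  -- placeholder
DECLARE @SQL NVARCHAR(MAX) = N'';

-- disable every foreign key and check constraint
EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';

-- build a BULK INSERT statement for every user table
SELECT @SQL = @SQL + N'
BULK INSERT ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name) + N'
   FROM ''' + @DataDirectory + s.name + N'_' + t.name + N'.bcp''
   WITH (DATAFILETYPE = ''native'', TABLOCK, KEEPIDENTITY);'
  FROM sys.tables t
  JOIN sys.schemas s ON s.schema_id = t.schema_id
 WHERE t.is_ms_shipped = 0;

EXECUTE (@SQL);

-- re-enable the constraints, checking the imported data so that they end up trusted
EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';
```

Because the constraints are disabled while the data goes in, the order in which the tables are loaded doesn’t matter.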

This runs in 40 seconds for AdventureWorks on most of my database servers.

If you want to run this in PowerShell, using BCP, something like this will do the trick (your Windows account must have the required permissions on all the tables to use it).
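
A sketch, assuming the same one-file-per-table layout; the server, database and path are placeholders:

```powershell
$Server = 'MyServer'                                # placeholder
$Database = 'AdventureWorks'                        # placeholder
$DataDirectory = 'C:\Data\MyServer\AdventureWorks'  # placeholder

# each native-format file is assumed to be named <schema>_<table>.bcp
Get-ChildItem -Path $DataDirectory -Filter '*.bcp' | ForEach-Object {
    $Schema, $Table = $_.BaseName -split '_', 2
    # -n = native format, -E = keep identity values, -T = Windows authentication
    & bcp "$Database.$Schema.$Table" in $_.FullName -n -E -T -S $Server -b 50000
    if ($LASTEXITCODE -ne 0) { Write-Warning "bcp failed for $($_.Name)" }
}
```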

This seems good and simple, doesn’t it? It is why BCP and BULK INSERT default to doing it this way. So why not just always opt to ignore constraints? The answer is that, in general, it isn’t a good idea to modify the metadata unless it is necessary. When constraints are disabled, a schema-modify lock might be taken to update the metadata, and this can interfere with other commands, such as an online index build, or affect transactions. In an active OLTP system you will need to avoid doing this but, for our operation, it is safe.

There are alternatives, of course.

Kill ‘n Fill

Some articles advise ‘kill and fill’, a technique whereby all indexes and constraints are deleted before the BCP operation and then reinstated afterwards. This makes the whole operation a lot easier and faster but, even if we aren’t going to suffer a schema-modify lock, I dislike the idea because it alters the metadata, which isn’t good practice for a test that relies on the metadata being in a known state. Nor do you avoid the errors: if the imported data violates the DRI, you’ll simply hit those errors when you try to reinstate the indexes and constraints.

What could possibly go wrong to cause a constraint violation? There is always the risk that data is accidentally inserted twice into a table, or that incorrect data gets in. And if you originally exported the data from a database that had active connections modifying the data, you can get referential integrity errors.

Filling tables in the right order

Filling the tables in the right order is the most trouble-free approach, since you import them in an order that avoids any conflict with a foreign key constraint, without disabling anything. You just do a topological sort on the tables. This code will need the temporary stored procedure that I provide in the subsequent listing.

So how do we do a topological sort? All we need is a list of tables in order of their dependencies. At the top of the list are the tables that have no foreign key references of their own: those with no foreign key relationships at all, and those that are referenced by other tables but do not themselves refer to any. These are followed by tables that refer only to tables higher in the list, and so on. Easy, really, if one works it out with little bits of paper before trying any coding.
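
In my scripts this lives in a temporary stored procedure, but the underlying idea can be sketched as a single query: assign every table a ‘level’ from its foreign key references, so that importing in ascending level order never hits a missing parent row. Self-references are ignored, and genuinely circular references will defeat it.

```sql
WITH TableReferences AS (
    -- every foreign key reference between two different tables
    SELECT fk.parent_object_id AS referencing_id,
           fk.referenced_object_id AS referenced_id
      FROM sys.foreign_keys fk
     WHERE fk.parent_object_id <> fk.referenced_object_id
), TableLevels AS (
    -- level 0: tables that refer to no other table
    SELECT t.object_id, 0 AS level
      FROM sys.tables t
     WHERE t.is_ms_shipped = 0
       AND NOT EXISTS (SELECT * FROM TableReferences r
                        WHERE r.referencing_id = t.object_id)
    UNION ALL
    -- a referencing table sits one level below each table it refers to
    SELECT r.referencing_id, tl.level + 1
      FROM TableReferences r
      JOIN TableLevels tl ON tl.object_id = r.referenced_id
)
SELECT OBJECT_SCHEMA_NAME(object_id) AS [schema],
       OBJECT_NAME(object_id) AS [table],
       MAX(level) AS import_order
  FROM TableLevels
 GROUP BY object_id
 ORDER BY import_order;
```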

Here, we use the topological sort from within PowerShell to import the database’s data without having to upset any constraints. All these variants run at around the same speed.
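
A sketch of that variant, assuming the ordering query has been saved to a file (TableImportOrder.sql is a made-up name) and the same file layout as before:

```powershell
$Server = 'MyServer'                                # placeholder
$Database = 'AdventureWorks'                        # placeholder
$DataDirectory = 'C:\Data\MyServer\AdventureWorks'  # placeholder

# run the dependency-ordering query and read back schema/table pairs in import order
$Connection = New-Object System.Data.SqlClient.SqlConnection("Server=$Server;Database=$Database;Integrated Security=SSPI")
$Command = $Connection.CreateCommand()
$Command.CommandText = [IO.File]::ReadAllText("$DataDirectory\TableImportOrder.sql")
$Connection.Open()
$Reader = $Command.ExecuteReader()
while ($Reader.Read()) {
    $Schema = $Reader['schema']
    $Table = $Reader['table']
    # -h CHECK_CONSTRAINTS keeps the constraints checked during the load
    & bcp "$Database.$Schema.$Table" in "$DataDirectory\${Schema}_$Table.bcp" -n -E -T -S $Server -h 'CHECK_CONSTRAINTS'
}
$Reader.Close()
$Connection.Close()
```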

Now we have a topological sort, we can also use it to perform surgery on a database. I started this article by showing you how to clear out all the data in a database in order to re-fill it with a different set of data. We just build a batch of deletes in the right order and execute it. (Don’t try this unless you have a good backup and you’ve double-checked that you are in the right database!)
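
As a sketch, assuming the ordering query has been saved as a view (TableImportOrder is a made-up name) returning [schema], [table] and import_order, you can generate the batch of DELETEs in reverse order, children before parents:

```sql
DECLARE @SQL NVARCHAR(MAX);
SELECT @SQL = (
    SELECT N'DELETE FROM ' + QUOTENAME([schema]) + N'.' + QUOTENAME([table]) + N';' + NCHAR(10)
      FROM TableImportOrder
     ORDER BY import_order DESC      -- reverse of the import order: children first
       FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)');

PRINT @SQL;       -- inspect the batch before you run it...
EXECUTE (@SQL);   -- ...and only then let it loose (on the right database!)
```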

What this does is to start with all those tables that aren’t referenced by any table other than themselves, and then chip away at the others, picking those that are only referenced by tables that have already had all their data deleted. On my test server, MyDatabase’s data is all deleted within 25 seconds. On some databases this won’t work, because of horrible circular or mutual references; in those cases, you can fall back on the blunter instrument of disabling constraints.

Wrapup.

From my experience, the four different methods of reading data into a database take roughly the same amount of time. The bulk load process is so efficient, when you give it a chance, that the overall time is governed more by the speed of reading the data from file, and by the network speed. I prefer a topological sort, even if the code looks a bit intimidating, since it is less intrusive and no checks are ever disabled. However, I can appreciate that the simpler system makes a lot of sense if the user is aware of the need to mop up afterwards.

When I started writing this article, I felt it would be easy since we all know how to import data into a database. As I was intending on also writing about getting data out, scripting out the metadata and copying databases, I decided to do the entire process of copying a database, with extracts from my own scripts, using a range of databases and testing before-and-after with data and schema comparison tools. I also tried to double-check everything I wrote, even when I was sure of my facts. The exercise convinced me that I had started out knowing far less than I should have done about the process. I hope that the lessons I learned will be of benefit to others!

Further Reading

This article is part of our Database Lifecycle Management patterns & practices.

Return to the library to find more articles on DLM, or visit red-gate.com/dlm for more information about SQL Server products for database lifecycle management.


Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.
