Phil Factor

Database Deployment: The Bits - Getting Data In

05 November 2012

Quite often, the database developer or tester is faced with having to load data into a newly created database. What could be simpler? Quite a lot of things, it seems.

This article is about how to stock an empty database with data from files.

Why?

There are a number of reasons for wanting to do this. I’ll explain why I do it. When I’m developing databases, source control will contain everything in the way of scripts that I need to build the databases, but I like to build them regularly, using an automated process, from those components. I do this to make sure that nothing has broken the build, that what’s in source control represents the true state of the database, and to do integration testing. Sure, I develop on a full system with SSMS rather than go anywhere near the ‘disconnected model’ of SQL Server Data Tools (SSDT), so this regular build isn’t a vital process, because the use of a shared development server avoids many integration problems; but I might want to build from a number of versions, and that needs the ‘test’ data.

Although I say that source control contains everything I need, I generally store the test data separately, because it’s big, it is binary, and it changes less frequently than the database versions. When the data does change, I archive it off and replace it (I’ll describe this in another article). This is no big deal compared with the main task of a refactoring, which is having to change the integration tests. I like to have at least one current set of the data of all the tables saved as BCP native-format files on disk, in a directory for each server, with a subdirectory for each database. As well as building the integration databases, this has a range of other uses along the long road of the database deployment process.

For a wide-scale test, you would build your database on a virtual server that you can then spin up to run each of your tests. This means that, once you have all your integration databases built from the current version in source control, you have no further need for messy data imports: you just run your tests against the virtual test environment to do a full integration test.

For much testing, a virtual environment is possibly overkill. Normally, you would want to run your integration tests overnight as a daily process. The tests can each build the database as part of the build-up process. AdventureWorks, for example, takes less than a minute to build entirely from scripts to a known data state. All third-party interfaces can be ‘mocked’, and the databases can be torn down very quickly.

To create, copy, modify and otherwise manipulate a database, there are several activities that need to be automated. Here, we’ll deal with just one of these: getting data in. I’ll be talking about the other ‘bits’ in other articles.

I’ll be providing sample scripts. They are intended to illustrate, and they work, but they aren’t industrial-strength. For effective automation, we’ll tend to use PowerShell with SQL, but I’ll provide alternatives where possible if you are restricted to SQL. If you want a good pile of data to use, then go to the companion article 'Database Deployment: The Bits - Copying Data Out' to find out how to export it. I use a build script for AdventureWorks and a directory with all its data in order to try things out.

Deleting the existing data

If you are running a series of tests, and are not using virtualization, you will want to get the database to a known state of data consistent with the tests. You will build the database just the once and then import the data afresh for each test. This means that, unless you do it as part of the teardown process of the previous test, you will want to delete the existing data. Deleting all the data from a database should be easy, but referential integrity constraints will prevent you from simply iterating through a list of tables and deleting the data from each one. To delete all the existing data in a database, you need to temporarily disable the referential integrity constraints, and any triggers that are there to prevent you doing precisely this, and then execute code like the following. Before you even attempt this, make sure your data has all been successfully read out and that you have a backup; otherwise this will end in tears. Make sure you know what you’re doing, and that you’re in the right database and on the right server, before you run this.

USE MyDatabase

--disable all constraints and triggers
EXECUTE sp_msforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL'
EXECUTE sp_msforeachtable 'ALTER TABLE ? DISABLE TRIGGER all'

--now create and run a batch that deletes the data from every table
DECLARE @Command NVARCHAR(MAX)
SELECT @Command=
    (SELECT 'DELETE FROM ' + QUOTENAME(Table_Catalog) + '.'
        + QUOTENAME(Table_Schema) + '.' + QUOTENAME(Table_Name) + '
'
     FROM information_schema.Tables WHERE Table_Type='BASE TABLE'
     FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
EXECUTE (@Command)

--re-enable triggers and constraints (WITH CHECK so that the constraints end up trusted)
EXECUTE sp_msforeachtable 'ALTER TABLE ? ENABLE TRIGGER all'
EXECUTE sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'

So what are we doing here? We are first disabling all the triggers and constraints in the database, then we are deleting all the data from the tables, and finally we are re-enabling both triggers and constraints, using WITH CHECK so that the constraints end up trusted. You can’t use TRUNCATE this way to speed things up, because you’d need to actually drop the constraints first: see TRUNCATE TABLE (Transact-SQL). There are other, more complicated, ways of doing this that don’t involve disabling constraints, which is a rather blunt knife in the database surgeon’s set of instruments. I will show, later on in this article, how you can usually delete data without touching constraints.

Importing the data

You have a wide choice of techniques for getting the data in but, if you have made the wise choice of storing the data in native format, then BCP, or one of the other techniques that use the bulk copy library, is the way to go. BCP handles a number of formats, but native format is the fastest.

It is possible to do a bulk import in SQL Server by a variety of methods that all use the same underlying bulk copy library. There are BCP, BULK INSERT and INSERT…SELECT * FROM OPENROWSET(BULK…). The .NET System.Data.SqlClient namespace also provides a SqlBulkCopy class to do it. It is also reasonably easy to write a PowerShell script that uses this library, or you can merely execute the BCP utility to load the data into the remote server.
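To give an idea of the shape of the two T-SQL routes, BULK INSERT and OPENROWSET(BULK…), here is a minimal sketch. The database, table and file paths are hypothetical, and the OPENROWSET form assumes that you have a format file describing the data file.

-- BULK INSERT from a native-format file (hypothetical names and paths)
BULK INSERT MyDatabase.MySchema.MyTable
  FROM 'C:\BCPData\MyServer\MyDatabase\MySchema_MyTable.bcp'
  WITH (DATAFILETYPE = 'native', KEEPIDENTITY)

-- The OPENROWSET(BULK...) form needs a format file and, unlike BCP and
-- BULK INSERT, checks constraints and fires triggers by default
INSERT INTO MyDatabase.MySchema.MyTable
  SELECT *
  FROM OPENROWSET(BULK 'C:\BCPData\MyServer\MyDatabase\MySchema_MyTable.bcp',
         FORMATFILE = 'C:\BCPData\MyServer\MyDatabase\MySchema_MyTable.fmt') AS DataSource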

Tweaking the performance of the import

These approaches are all fast. The BCP command line is a good alternative, but it runs out-of-process. This is slightly less secure, but it may give you better performance if you are short of CPU grunt.

When you’re using T-SQL, the BULK INSERT statement is the obvious way to use the bulk copy functionality. There are a number of knobs you can twiddle to improve bulk import performance.

For performance reasons, the best place to have your data is on a local drive of the integration database server, though it is perfectly possible to have the BCP process on a different server from the database, or from the data file, if you don’t mind waiting. You can use BCP across a network to a remote SQL Server, with the files being either local, specified with a pathname, or remote, specified with a UNC name. However, don’t do it across a network unless you have to: it will usually be faster if you first copy the files so that they reside on the same server as the BCP application and the destination database (the integration server in our case). Ideally, the source data should be on a different physical drive from the destination database.

Both the BCP command and the BULK INSERT statement disable triggers by default. The INSERT ... SELECT * FROM OPENROWSET(BULK...) statement will, by default, execute triggers, and this can cause problems.

As well as disabling constraints or triggers, and killing indexes, you can get an edge in the import process if you use native format, and you can also speed the process by ensuring that you use minimal logging. If your database is using the simple recovery model, then you’re there already; otherwise you’ll need to switch the database to the bulk-logged recovery model just for the import session, remembering to switch it back to the full recovery model after you’ve finished. You can import into a table from more than one client process in parallel, you can use batches, you can order the data in the file to match the order imposed by the table’s clustered index, and you can control the locking behaviour.
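As a rough sketch of how those knobs fit together (the table name, file path and ORDER column here are hypothetical, and you would check that the ORDER clause really does match your clustered index):

--switch to bulk-logged recovery just for the import session
ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED

BULK INSERT MyDatabase.MySchema.MyTable
  FROM 'D:\BCPData\MyServer\MyDatabase\MySchema_MyTable.bcp'
  WITH (DATAFILETYPE = 'native',
        KEEPIDENTITY,
        TABLOCK,               --take a table lock, needed for minimal logging
        BATCHSIZE = 50000,     --commit in batches rather than one huge transaction
        ORDER (MyTableID ASC)) --the file is already in clustered-index order

--...and remember to switch back afterwards
ALTER DATABASE MyDatabase SET RECOVERY FULL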

The problem of constraints

One thing that you may discover, when you try to insert bulk data into a table that has foreign key constraints, is that you can get errors if the relevant rows in the other table aren’t there.

BULK INSERT [MyDatabase].[MySchema].[MyTable] FROM 'MyDirectory\MySchema_MyTable.bcp'
  WITH (KEEPIDENTITY, DATAFILETYPE = 'native', CHECK_CONSTRAINTS)
/*
Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY constraint "FK_MyTable_MyColumn_MyColumnID". The conflict occurred in database "MyDatabase", table "MySchema.MyOtherTable", column 'MyColumnID'.
The statement has been terminated.
*/

You’ll have noticed that I specified CHECK_CONSTRAINTS: this means both referential and check constraints, and it is a great way of ensuring that bad data never reaches the database.

Importing by disabling constraints

By default, both the BCP command and the BULK INSERT statement temporarily disable constraints, whereas INSERT ... SELECT * FROM OPENROWSET(BULK...) doesn’t. If you let the bulk copy process disable check constraints, as happens when you run BCP or BULK INSERT with their default settings, the constraints will be re-enabled so that they work for subsequent insertions or updates, but the rows that have just been imported won’t be checked. You’ll be left to check the data and set the constraints to being trusted afterwards (see Controlling Constraint Checking by Bulk Import Operations). If you neglect to re-enable the constraints with a retrospective check, you will suffer a subsequent performance loss in queries, since constraints can’t be used in query plans until they are set to ‘trusted’. In short, if you populate a table without a check, by temporarily disabling the constraints, then those constraints, both referential and check, are ‘untrusted’ until some process checks them.

After you’ve done an import, you can check to make sure everything is trusted by running this query.

SELECT sys.objects.name AS [Table Name],
  sys.check_constraints.name AS [Constraint],
  is_not_trusted,
  is_disabled
FROM sys.objects
  INNER JOIN sys.check_constraints
    ON sys.check_constraints.parent_object_ID = sys.objects.object_ID
WHERE sys.check_constraints.is_not_trusted = 1
  OR sys.check_constraints.is_disabled = 1
ORDER BY sys.objects.name, sys.check_constraints.name

…or on old versions of SQL Server…

SELECT name,
  OBJECTPROPERTY(id,'CnstIsNotTrusted') AS is_not_trusted
FROM sysobjects
WHERE COALESCE(OBJECTPROPERTY(id,'CnstIsNotTrusted'),0)<>0

 

Please see sys.check_constraints (Transact-SQL) for more details.

Before setting the constraints to being trusted, you might want to make sure that you’re not going to hit an error. You can check that there are no constraint violations by executing DBCC CHECKCONSTRAINTS.
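For a single table, or a single named constraint, something like this will do (using the hypothetical names from the earlier examples):

DBCC CHECKCONSTRAINTS ('MySchema.MyTable')                --check one table
DBCC CHECKCONSTRAINTS ('FK_MyTable_MyColumn_MyColumnID')  --check one named constraint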

You can check all your tables in one go by executing this

EXECUTE sp_msforeachtable 'DBCC CHECKCONSTRAINTS("?")'

You can ensure that a single constraint is checked and, if successful, trusted by doing this…

ALTER TABLE [MyDatabaseCopy].[MySchema].[MyTable]
  WITH CHECK CHECK CONSTRAINT FK_MyTable_Contact_ContactID

To do all the constraints for the table, you do this

ALTER TABLE [MyDatabaseCopy].[MySchema].[MyTable]
  WITH CHECK CHECK CONSTRAINT ALL

To switch them all back on, for every table in the database run:

-- enable all constraints
EXECUTE sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'

So let’s put this together into an import routine.

USE MyDatabase

--the path to your root directory where you store the files
DECLARE @PathToBCPFileDirectory VARCHAR(100)
SELECT @PathToBCPFileDirectory= 'MyRootDirectory\'

--the filetype you use (there is no standard so I don't bother)
DECLARE @Filetype VARCHAR(10)
SELECT @FileType= ''

--your database (leave as a null to get this automatically set to the database name)
DECLARE @DatabaseSourceDirectory VARCHAR(100)
SELECT @DatabaseSourceDirectory= 'MyDatabase'
--set this to null if you use the current database name

DECLARE @Directory VARCHAR(255)
SELECT @Directory=@PathToBCPFileDirectory --construct a valid path
  +REPLACE(REPLACE(@@ServerName,'\','-'),'/','-')
  +'\'+REPLACE(REPLACE(COALESCE(@DatabaseSourceDirectory,DB_NAME()),'\','-'),'/','-')+'\'

DECLARE @ImportCommand NVARCHAR(MAX)
SELECT @ImportCommand=
    (SELECT 'BULK INSERT ' + QUOTENAME(DB_NAME()) + '.'
        +QUOTENAME(TABLE_SCHEMA)+'.' + QUOTENAME(TABLE_NAME) + ' from ''' +@Directory
        +REPLACE(REPLACE(TABLE_SCHEMA+'_'+ TABLE_NAME,'.','-'),'\',':') + @FileType + '''
WITH (KEEPIDENTITY, DATAFILETYPE = ''native'')
' FROM information_schema.Tables WHERE Table_Type='BASE TABLE'
  FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
EXECUTE (@ImportCommand)
EXECUTE sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'

This runs in 40 seconds for AdventureWorks on most of my database servers.

If you want to run this in PowerShell, using BCP, something like this will do the trick (your Windows account must have the required permissions on all the tables to use this):

 # you'd need to fill these in with the details
 # this routine assumes that the BCP files already exist
$PathToBin='' #put this in only if you hit trouble locating BCP.
 # often 'C:\Program Files\Microsoft SQL Server\100\Tools\Binn\'
$directory='MyRootDirectory' # the directory where you want to store them
$SourceServer = ("MyServerInstance") # used to find the subdirectory where the files are
$SourceDatabase='MyDatabase' #where we take the data and build script from
$DestinationServer = ("MyServerInstance") #the destination instance
$DestinationDatabase='MyDatabase' #the destination database

 #Load SMO assemblies
$MS='Microsoft.SQLServer'

#now load in the SMO DLLs
@('.SMO','.SmoExtended') |
foreach-object {
  if ([System.Reflection.Assembly]::LoadWithPartialName("$MS$_") -eq $null)
    {"missing SMO component $MS$_"}
  }
set-psdebug -strict
$ErrorActionPreference = "stop" #

 # now log into the server and get the server object
$My="$MS.Management.Smo" #
$s = new-object ("$My.Server") $DestinationServer
if ($s.Version -eq $null ){Throw "Can't find the instance $DestinationServer"}
$SQL=@'
SET NOCOUNT ON;
select '"' +TABLE_CATALOG + '"."'+Table_Schema+'"."' + Table_Name+'"' QualifiedTableName,
Table_Schema+'_' + Table_Name FileTableName
FROM information_schema.Tables WHERE table_Type='BASE TABLE'

'@

 #now get the destination database object
$Destination = New-Object ("$My.Database") ($s, "$DestinationDatabase")
if ($Destination.name -ne $DestinationDatabase) #on failure
   {Throw "Can't find the database '$DestinationDatabase' in $DestinationServer"};
 #now get the message handler to get SQL messages
$handler = [System.Data.SqlClient.SqlInfoMessageEventHandler] {param($sender, $event) Write-Host $event.Message};
 #and assign it to the message pipeline. We just choose to show it in the output
$s.ConnectionContext.add_InfoMessage($handler);
 #execute the SQL to get the tables into a dataset
$result=$Destination.ExecuteWithResults("$SQL") #execute the SQL
 #now, for each table in the pipeline
$result.Tables[0] |
foreach {
   $filename = "$($_.FileTableName)" -replace '[\\\/\:\.]','-' #work out the name of the filename
    #and now we get the whole path to the file we want to import
   $TableSource = "$directory\$($SourceServer -replace '[\\\/\:\.]','-')\$($SourceDatabase -replace '[\\\/\:\.]','-' )"
    #and we execute the BCP command
   $WhatHappened=&"$($pathToBin)BCP.exe" "$($_.QualifiedTableName)" in "$TableSource\$filename" -q -n -T -E "-S$($DestinationServer)"
   if ($WhatHappened -like '*Error *') {throw ($WhatHappened)}
   }
 #all is now in place. We just have to set the constraints to trusted
$Destination.ExecuteNonQuery(@"
EXECUTE sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'
"@ ) #flag constraints as trusted
$s.ConnectionContext.remove_InfoMessage($handler);

This seems good and simple, doesn’t it? This is why BCP and BULK INSERT default to doing it this way. So why not always opt to ignore constraints? The answer is that, in general, it isn’t a good idea to modify the metadata unless it is necessary. When constraints are disabled, a schema modify lock might be taken to update the metadata, and this can interfere with other commands, such as an online index build, or affect transactions. In an active OLTP system you would need to avoid doing this, but for our operation it is safe.

There are alternatives, of course.

Kill ‘n Fill

Some articles advise ‘kill and fill’, a technique whereby all indexes and constraints are deleted before the BCP operation and then reinstated. This makes the whole operation a lot easier and faster but, even if we aren’t going to suffer a schema modify lock, I dislike the idea, because it alters the metadata, which isn’t good practice for a test that relies on the metadata being in a known state. Nor do you avoid the errors: if the data is bad, you will hit them when you try to reinstate the indexes and constraints.

What could possibly go wrong that would cause a constraint violation? There is always the risk that data is accidentally inserted twice into a table, or that incorrect data gets in. If you originally exported the data from a database that had active connections modifying the data, then you can get referential integrity errors.

Filling Tables in the right order

Copying the tables in the right order is the most trouble-free approach, since you import the tables in an order that avoids any conflict with a foreign key constraint, without disabling anything. You just do a topological sort on the tables. This code will need the temporary stored procedure that I provide in the subsequent listing.

DECLARE @Filetype VARCHAR(10)
DECLARE @DatabaseSourceDirectory VARCHAR(100)
DECLARE @Directory VARCHAR(255)
DECLARE @PathToBCPFileDirectory VARCHAR(100)
DECLARE @Command NVARCHAR(MAX)
SELECT @FileType= '' --the filetype you use e.g. .DAT (there is no standard so I don't bother)
SELECT @PathToBCPFileDirectory= 'MyRootDirectory\'
SELECT @DatabaseSourceDirectory= 'MyDatabase'
--set this to null if you use the current database name

DECLARE @tables TABLE (TheObject_ID INT NOT NULL,
  TheName SYSNAME NOT NULL,
  TheSchema SYSNAME NOT NULL,
  HasIdentityColumn INT NOT NULL,
  TheOrder INT NOT NULL)
INSERT INTO @tables
  EXECUTE #AllTablesInDependencyOrder

SELECT @Directory=@PathToBCPFileDirectory --construct a valid path
  +REPLACE(REPLACE(@@ServerName,'\','-'),'/','-')
  +'\'+REPLACE(REPLACE(COALESCE(@DatabaseSourceDirectory,DB_NAME()),'\','-'),'/','-')+'\'

SELECT @Command=
    (SELECT 'BULK INSERT ' + QUOTENAME(DB_NAME()) + '.'
        +QUOTENAME(TheSchema)+'.' + QUOTENAME(TheName) + ' from ''' +@Directory
        +REPLACE(REPLACE(TheSchema+'_'+ TheName,'.','-'),'\',':') + @FileType+ '''
WITH ('+ CASE WHEN HasIdentityColumn<>0 THEN 'KEEPIDENTITY, ' ELSE '' END
  +'DATAFILETYPE = ''native'', CHECK_CONSTRAINTS)
'
  FROM @tables ORDER BY TheOrder
  FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
IF @@error=0 EXECUTE sp_ExecuteSQL @Command

 

So how do we do a topological sort? Well, all we need to do is to get a list of tables in order of their dependencies. At the top of the list are tables that have no foreign key relationships, followed by those that are only referenced by other tables but do not themselves refer to any tables. These are then followed by tables that only refer to tables higher up the list. Easy, really, if one does it with little bits of paper before trying any coding.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID (N'TempDB..#AllTablesInDependencyOrder') IS NOT NULL
  DROP PROCEDURE #AllTablesInDependencyOrder
GO

CREATE PROCEDURE #AllTablesInDependencyOrder
/**
summary:  @'
  This routine returns a table containing all the tables in the current
  database, their schema, object_ID, whether they have an identity
  column in them, and their dependency level. You would use this for
  deleting the data from tables or BCPing in the data.
'@
Author: Phil Factor
Revision: 1.1 dealt properly with heaps
Created: 25th November 2011
example:
    ~ @'
      Declare @tables Table( TheObject_ID INT NOT null,
        TheName SYSNAME NOT null, TheSchema SYSNAME NOT null,
        HasIdentityColumn INT NOT null, TheOrder INT NOT null)
      insert into @tables
        Execute #AllTablesInDependencyOrder
      Select * from @Tables
'@
returns:  @'
        TheObject_ID INT, --the table's object ID
        TheName SYSNAME, --the name of the table
        TheSchema SYSNAME, --the schema where it lives
        HasIdentityColumn INT, --1 if it has an identity column
        TheOrder INT) --order by this column
'@ **/
AS
SET NOCOUNT ON;
DECLARE @Rowcount INT, @ii INT
CREATE TABLE #tables (
  TheObject_ID INT, --the table's object ID
  TheName SYSNAME, --the name of the table
  TheSchema SYSNAME, --the schema where it lives
  HasIdentityColumn INT, --whether it has an identity column
  TheOrder INT DEFAULT 0) --we update this later to impose an order

 

--let's do a Topological sort
--firstly we read in all the tables from the database.
INSERT INTO #tables (Theobject_ID, TheName, TheSchema, HasIdentityColumn)
    SELECT TheTable.OBJECT_ID, TheTable.NAME, TheSchema.NAME,
        CASE WHEN identityColumns.Object_id IS NULL THEN 0
             ELSE 1
        END
    FROM sys.tables TheTable
        INNER JOIN sys.schemas TheSchema
          ON TheSchema.SCHEMA_ID = TheTable.schema_ID
        LEFT JOIN (SELECT DISTINCT Object_id
                   FROM sys.columns
                   WHERE is_identity = 1) identityColumns
          ON TheTable.object_id = identityColumns.object_id

 

/* We'll use a SQL 'set-based' form of the topological sort.
First, find a list of "start nodes" which have no incoming edges
and insert them into a set S; at least one such node must exist
in an acyclic graph */
--flag all the immediately safe tables to insert data in
UPDATE #tables SET TheOrder = 1
FROM #tables parent --tables that do not reference any other table and aren't referenced
    LEFT OUTER JOIN sys.foreign_Keys referenced
      ON referenced.referenced_Object_ID = parent.Theobject_ID
    LEFT OUTER JOIN sys.foreign_Keys referencing
      ON referencing.parent_Object_ID = parent.Theobject_ID
WHERE referenced.parent_object_ID IS NULL AND referencing.parent_object_ID IS NULL

 

--do not reference tables but might be referenced by others
UPDATE #tables SET TheOrder = 2
FROM #tables parent
    LEFT OUTER JOIN sys.foreign_Keys referencing
      ON referencing.parent_Object_ID = parent.Theobject_ID
      AND referencing.referenced_Object_ID <> parent.Theobject_ID
WHERE referencing.parent_object_ID IS NULL
  AND TheOrder = 0 --i.e. it hasn't been ordered yet

 

SELECT @Rowcount=100, @ii=3
--and then do tables successively as they become 'safe'
WHILE @Rowcount > 0
  BEGIN
    UPDATE #tables
    SET TheOrder = @ii
    WHERE #tables.TheObject_ID IN (
        SELECT parent.TheObject_ID
        FROM #tables parent
            INNER JOIN sys.foreign_Keys
               ON sys.foreign_Keys.parent_Object_ID = parent.Theobject_ID
            INNER JOIN #tables referenced
               ON sys.foreign_Keys.referenced_Object_ID = referenced.Theobject_ID
              AND sys.foreign_Keys.referenced_Object_ID <> parent.Theobject_ID
        WHERE parent.TheOrder = 0 --i.e. it hasn't been ordered yet
        GROUP BY parent.TheObject_ID
        HAVING SUM(CASE WHEN referenced.TheOrder = 0 THEN -20000
                        ELSE referenced.TheOrder
                   END) > 0 --where all its referenced tables have been ordered
    )
    SET @Rowcount = @@Rowcount
    SET @ii = @ii + 1
    IF @ii > 100
      BREAK
  END
SELECT TheObject_ID, TheName, TheSchema, HasIdentityColumn, TheOrder
FROM #tables ORDER BY TheOrder
IF @ii > 100 --not a directed acyclic graph (DAG)
  RAISERROR ('Cannot load in tables with mutual references in foreign keys',16,1)
IF EXISTS ( SELECT * FROM #tables WHERE TheOrder = 0 )
  RAISERROR ('Could not do the topological sort',16,1)
GO

Here, we use the topological sort from within PowerShell to import the database’s data without having to upset any constraints. All these variants run at around the same speed.

 # you'd need to fill these in with the details
 # this routine assumes that the BCP files already exist
$PathToBin='' #put this in only if you hit trouble locating BCP.
 # often 'C:\Program Files\Microsoft SQL Server\100\Tools\Binn\'
$directory='MyRootDirectory' # the directory where you want to store them
$SourceServer = ("MyServerInstance") # used to find the subdirectory where the files are
$SourceDatabase='MyDatabase' #where we take the data and build script from
$DestinationServer = ("MyServerInstance") #the destination instance
$DestinationDatabase='MyDatabase' #the destination database

 #Load SMO assemblies
$MS='Microsoft.SQLServer'

 #now load in the SMO DLLs
@('.SMO','.SmoExtended') |
foreach-object {
  if ([System.Reflection.Assembly]::LoadWithPartialName("$MS$_") -eq $null)
    {"missing SMO component $MS$_"}
  }
set-psdebug -strict
$ErrorActionPreference = "stop" #

 # now log into the server and get the server object
$My="$MS.Management.Smo" #
$s = new-object ("$My.Server") $DestinationServer
if ($s.Version -eq $null ){Throw "Can't find the instance $DestinationServer"}
$SQL=@'
SET NOCOUNT ON;
DECLARE @Rowcount INT, @ii INT
CREATE TABLE #tables (
TheObject_ID INT,--the tables' object ID
TheName SYSNAME, --the name of the table
TheSchema SYSNAME, --the schema where it lives
HasIdentityColumn INT, --whether it has an identity column
TheOrder INT DEFAULT 0) --we update this later to impose an order

--let's do a Topological sort
--firstly we read in all the tables from the database.
INSERT INTO #tables (Theobject_ID, TheName, TheSchema, HasIdentityColumn)
SELECT TheTable.OBJECT_ID, TheTable.NAME, TheSchema.NAME,
CASE WHEN identityColumns.Object_id IS NULL THEN 0
ELSE 1
END
FROM sys.tables TheTable
INNER JOIN sys.schemas TheSchema
ON TheSchema.SCHEMA_ID = TheTable.schema_ID
LEFT JOIN (SELECT DISTINCT Object_id
FROM sys.columns
WHERE is_identity = 1) identityColumns
ON Thetable.object_id = identityColumns.object_id

/* We'll use a SQL 'set-based' form of the topological sort
First, find a list of "start nodes" which have no incoming edges
and insert them into a set S; at least one such node must exist
in an acyclic graph*/
--flag all the immediately safe tables to insert data in
UPDATE #tables SET TheOrder = 1
FROM #tables parent --do not reference any other table and aren't referenced by anything
LEFT OUTER JOIN sys.foreign_Keys referenced
ON referenced.referenced_Object_ID = parent.Theobject_ID
LEFT OUTER JOIN sys.foreign_Keys referencing
ON referencing.parent_Object_ID = parent.Theobject_ID
WHERE referenced.parent_object_ID IS NULL and referencing.parent_Object_ID is null

UPDATE #tables SET TheOrder = 2
FROM #tables parent --do not reference any other table but might be referenced
LEFT OUTER JOIN sys.foreign_Keys referencing
ON referencing.parent_Object_ID = parent.Theobject_ID
AND referencing.referenced_Object_ID <> parent.Theobject_ID
WHERE referencing.parent_Object_ID is null
AND TheOrder = 0--i.e. it hasn't been ordered yet
SELECT @Rowcount=100,@ii=3
--and then do tables successively as they become 'safe'
WHILE @Rowcount > 0
BEGIN
UPDATE #tables
SET TheOrder = @ii
WHERE #tables.TheObject_ID IN (
SELECT parent.TheObject_ID
FROM #tables parent
INNER JOIN sys.foreign_Keys
ON sys.foreign_Keys.parent_Object_ID = parent.Theobject_ID
INNER JOIN #tables referenced
ON sys.foreign_Keys.referenced_Object_ID = referenced.Theobject_ID
AND sys.foreign_Keys.referenced_Object_ID <> parent.Theobject_ID
WHERE parent.TheOrder = 0--i.e. it hasn't been ordered yet
GROUP BY parent.TheObject_ID
HAVING SUM(CASE WHEN referenced.TheOrder = 0 THEN -20000
ELSE referenced.TheOrder
END) > 0--where all its referenced tables have been ordered
)
SET @Rowcount = @@Rowcount
SET @ii = @ii + 1
IF @ii > 100
BREAK
END
SELECT TheObject_ID,TheName,TheSchema,HasIdentityColumn,TheOrder
FROM #tables order by TheOrder
IF @ii > 100 --not a directed acyclic graph (DAG).
RAISERROR ('Cannot load in tables with mutual references in foreign keys',16,1)
IF EXISTS ( SELECT * FROM #tables WHERE TheOrder = 0 )
RAISERROR ('could not do the topological sort',16,1)

'@

$Destination = New-Object ("$My.Database") ($s, "$DestinationDatabase")
if ($Destination.name -ne $DestinationDatabase){Throw "Can't find the database '$DestinationDatabase' in $DestinationServer"};
$handler = [System.Data.SqlClient.SqlInfoMessageEventHandler] {param($sender, $event) Write-Host $event.Message};
$s.ConnectionContext.add_InfoMessage($handler);
$result=$Destination.ExecuteWithResults("$SQL") #execute the SQL
$s.ConnectionContext.remove_InfoMessage($handler);
$result.Tables[0] |
foreach {
   $filename = "$($_.TheSchema)_$($_.TheName)" -replace '[\\\/\:\.]','-'
   $TableSource = "$directory\$($SourceServer -replace '[\\\/\:\.]','-')\$($SourceDatabase -replace '[\\\/\:\.]','-' )"
   $WhatHappened = &"$($pathToBin)BCP.exe" "`"$DestinationDatabase`".`"$($_.TheSchema)`".`"$($_.TheName)`"" in "$TableSource\$filename" -q -h "CHECK_CONSTRAINTS" -n -T -E "-S$($DestinationServer)"
   if ($WhatHappened -like '*Error*') {throw ($whatHappened)}
}
      

Now that we have a topological sort, we can also use it to perform surgery on a database. I started this article by showing you how to clear out all the data in a database in order to re-fill it with a different set of data. We just build a batch of deletes in the right order and execute it. (Don’t try this unless you have a good backup and you’ve double-checked that you are in the right database!)

DECLARE @tables TABLE (TheObject_ID INT NOT NULL,
  TheName SYSNAME NOT NULL, TheSchema SYSNAME NOT NULL,
  HasIdentityColumn INT NOT NULL, TheOrder INT NOT NULL)
INSERT INTO @tables
  EXECUTE #AllTablesInDependencyOrder

DECLARE @Command NVARCHAR(MAX)
SELECT @Command=
    (SELECT 'DELETE FROM ' + QUOTENAME(DB_NAME()) + '.'
        +QUOTENAME(TheSchema)+'.' + QUOTENAME(TheName) + '
'
     FROM @tables ORDER BY TheOrder DESC
     FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')

EXECUTE sp_msforeachtable 'ALTER TABLE ? DISABLE TRIGGER all'
EXECUTE sp_ExecuteSQL @Command
EXECUTE sp_msforeachtable 'ALTER TABLE ? ENABLE TRIGGER all'

What this does is to start with all those tables that aren’t referenced by any table other than themselves, and then chip away at the others, picking those that are only referenced by tables that have already had all their data deleted. On my test server, all of MyDatabase’s data is deleted within 25 seconds. On some databases, this won’t work because of horrible circular references, or mutual references. In these cases, you can use the blunter instrument of disabling constraints.
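If you want to check in advance whether a database suffers from that problem, a query along these lines will list the pairs of tables whose foreign keys reference each other directly. It is only a rough sketch: it won’t spot longer cycles that pass through three or more tables.

SELECT OBJECT_SCHEMA_NAME(FirstKey.parent_object_id) + '.'
         + OBJECT_NAME(FirstKey.parent_object_id) AS TableA,
       OBJECT_SCHEMA_NAME(SecondKey.parent_object_id) + '.'
         + OBJECT_NAME(SecondKey.parent_object_id) AS TableB
FROM sys.foreign_keys FirstKey --a foreign key from TableA to TableB
  INNER JOIN sys.foreign_keys SecondKey --and another from TableB back to TableA
    ON FirstKey.referenced_object_id = SecondKey.parent_object_id
   AND SecondKey.referenced_object_id = FirstKey.parent_object_id
WHERE FirstKey.parent_object_id <> FirstKey.referenced_object_id --ignore self-references
  AND FirstKey.parent_object_id < SecondKey.parent_object_id     --list each pair only once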

Wrapup

From my experience, the four different methods of reading data into a database take roughly the same amount of time. The bulk load process is so efficient, when you give it a chance, that the operation is governed more by the speed of reading the data from file, and by the network speed. I prefer a topological sort, even if the code looks a bit intimidating, since it is less intrusive and no checks are ever disabled. However, I can appreciate that the simpler system makes a lot of sense if the user is aware of the need to mop up afterwards.

When I started writing this article, I felt it would be easy, since we all know how to import data into a database. As I was intending also to write about getting data out, scripting out the metadata and copying databases, I decided to work through the entire process of copying a database, with extracts from my own scripts, using a range of databases and testing before-and-after with data and schema comparison tools. I also tried to double-check everything I wrote, even when I was sure of my facts. The exercise convinced me that I had started out knowing far less than I should have done about the process. I hope that the lessons I learned will be of benefit to others!



