When it comes to the subject of putting databases into source control, Troy Hunt is no shrinking violet: in his words, “Database source control is no longer negotiable”. In this article, originally published on his blog, Troy explains how to do Continuous Integration from a database that is in a Version Control System (VCS).
Note, this article was updated in December, 2015.
Almost every development team uses Source Control Management for application development. There are several excellent Version Control Systems (VCS) which make the task a breeze. Likewise, Continuous Integration is also increasing in importance. By contrast, the adoption of both practices with database development has been slow.
Of course there are special difficulties that account for this. One problem is that database objects don’t exist as simple files that can be versioned; you can't just pick them up and place them in a target location when you want to deploy them. You’ve got to consider the very nature of databases: You’ve got real live data to deal with, and the ramifications of screwing up a deployment are pretty severe.
Some special tooling is in order and fortunately the planets are starting to align in such a fashion that some of my favourite products work very nicely together to serve just the purpose we’re after. Last year I wrote about Redgate’s SQL Source Control as a very excellent way of versioning databases and followed up later in the year about automating deployments with TeamCity.
Let’s take these tools – plus a few more from the Redgate product suite – and, at last, make one-click database deployment a reality.
Get your database under source control
First and foremost, none of this is going to make any sense whatsoever unless you get your database under source control. As I wrote a while back in The unnecessary evil of shared database development,
Databases are an essential component of many of the applications we build and so it is crazy talk to argue that they don't need to be in a VCS. Without the DB in source control we end up with a fragmented, partially complete picture of what an application is. We lose the ability to say “Here – this is the state of code over time”, as only part of the picture exists. We also lose all the productivity advantages; not just of being able to roll back when required, but of being able to integrate with the work of the rest of the team. Sometimes we can get around the integration problem by using a shared development database but, well, go back and read the article I've just mentioned for everything that’s wrong with that.
For us, VCS is the “source of truth” for automated deployments from a continuous integration environment. It’s where application life begins and it’s where TeamCity is going to turn to when it publishes both the web application and the database.
Pretty much everything in the database needs to be in the VCS (in this example the VCS is Subversion, but SQL Source Control supports all version control systems). This includes database objects such as stored procedures, views, triggers and of course, tables. Because SQL Source Control neatly files each object type away into its own folder, all we need to do is take a quick look in the Subversion repository for the project and we’ll see just what sort of stuff is going to go in there:
In order to publish the changes, we need a way to
- pull the scripts for each database object from source control
- create a script that will make the changes to the target database
- execute the script to make the required changes against the target database.
Depending on the changes, we could be dropping objects, altering permissions, changing data lengths or potentially any other conceivable database change. The main thing is that VCS is our source of truth and we need to ensure the target environment is changed to match this exactly.
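The three steps above boil down to comparing the desired state (what is in VCS) with the actual state (the target database) and scripting the difference. As a rough illustration only, here is a minimal Python sketch of that idea. It assumes each schema is described as a plain dictionary of table names to column definitions, and `schema_change_script` is a hypothetical helper, not part of any Redgate tool. It handles only added or dropped tables and added, changed or dropped columns; a real comparison tool also has to deal with constraints, dependencies and data retention.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table name -> {column name: SQL type}


def schema_change_script(source: Schema, target: Schema) -> List[str]:
    """Return T-SQL-style statements that would make `target` match `source`.

    Illustrative only: it ignores constraints, indexes, data retention and
    dependency ordering, all of which a real comparison tool has to handle.
    """
    statements: List[str] = []
    for table, columns in source.items():
        if table not in target:
            cols = ", ".join(f"[{name}] {sql_type}" for name, sql_type in columns.items())
            statements.append(f"CREATE TABLE [{table}] ({cols});")
            continue
        existing = target[table]
        for name, sql_type in columns.items():
            if name not in existing:
                statements.append(f"ALTER TABLE [{table}] ADD [{name}] {sql_type};")
            elif existing[name] != sql_type:
                statements.append(f"ALTER TABLE [{table}] ALTER COLUMN [{name}] {sql_type};")
        for name in existing:
            if name not in columns:
                statements.append(f"ALTER TABLE [{table}] DROP COLUMN [{name}];")
    for table in target:
        if table not in source:
            statements.append(f"DROP TABLE [{table}];")
    return statements


if __name__ == "__main__":
    # Made-up schemas for illustration: VCS is the source of truth.
    in_vcs = {"Country": {"CountryId": "int", "Name": "nvarchar(150)"}}
    deployed = {"Country": {"CountryId": "int", "Name": "varchar(100)"}}
    for statement in schema_change_script(in_vcs, deployed):
        print(statement)
```

The point of the sketch is the direction of travel: the script is always derived from the difference between VCS and the target, never written by hand against one particular environment.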
This is where SQL Compare comes into play. I’ve long been a user of this on the desktop as a means of automating releases and ensuring environments were absolutely identical. But there are two features the current version has which make it perfect for our purposes here: unattended command line execution and syncing from a VCS source, the latter of which necessitates the Professional version.
For the purpose of this post, I’m going to run the process against a data model which looks like this:
In theory, the “Customer” and “Order” tables would accumulate transactional data and the “Country” table would contain a static set of reference data. The importance of this distinction is about to become clear but for now, let’s just work on the assumption that everything you see in the image above is happily tucked away in VCS via SQL Source Control.
Data ain’t data
Here’s a contentious issue for you; versioning of data. But there’s data, and then there’s data. What should never go into source control is transactional data accrued during the normal function of the application. It shouldn’t go in as SQL scripts, it shouldn’t go in outputted to CSV and it definitely shouldn’t go in as a database backup.
But there is a class of data that should go into source control and that’s any data required for the application to actually function, such as reference data. For example, if a registration form contains a mandatory drop down list of countries driven from a “Country” table, we have to get that data in there in order for things to work.
In principle, we want VCS to contain enough information about the application to allow us to deploy directly from there to a target environment and everything just works. To do this, we need a means of generating the data in a scriptable format, storing it in source control then pulling it back out and synchronising it with the target environment.
But before we do that we need to populate my local development database (because a shared one would be evil), with a list of countries (thank you Text Fixer). Once this is done we can move on to actually versioning the data.
Before version 2 of SQL Source Control came out, SQL Data Compare could be used to version the reference data. It involved checking out the database schema to a folder path and generating the reference data into that. It was a cumbersome task to maintain another working copy of the DB outside of the one SQL Source Control manages, having different processes (and subsequently different revisions) to get everything into VCS and perhaps most frustratingly, requiring another product just to keep data versioned.
SQL Source Control v2 (and later) recognises this problem and in the usual Redgate way, makes it extremely simple to version the data. You just right-click on the table and “hey presto” instant data versioning:
With this option we can now explicitly identify that the “Country” table contains our reference – or “static” in SQL Compare nomenclature – data:
This will ensure that once we go back to the familiar old “Commit Changes” tab, we’ve got a rich set of static data ready for versioning. Let’s take a look at what we’re committing:
We can now go ahead and just commit that directly into source control. The beauty of versioning data this way is that the one transaction can also contain changes to the schema. To give you an idea of just how much easier this makes it than the original SQL Data Compare approach I took, the few paragraphs above replaced over 1,100 words and half a dozen illustrations. Very nice!
As with the schema, we ultimately need a means of automating the deployment of this data to a target environment. It’s going to mean not only inserting a bunch of records but later on, potentially updating or deleting them as well. We need a synchronisation mechanism and this is where we come back to SQL Data Compare Professional, but it will now do all its work from the build server.
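To make the idea of a synchronisation mechanism concrete, here is a minimal Python sketch of what syncing a reference table involves: rows only in the target are deleted, rows that differ are updated and rows only in source control are inserted. It is an illustration under simplifying assumptions (a two-column “Country” table keyed on `CountryId`, rows held in dictionaries) and `country_sync_script` is a hypothetical name; it is not how SQL Data Compare is implemented.

```python
from typing import Dict, List, Tuple

Rows = Dict[int, str]  # primary key (CountryId) -> country name


def country_sync_script(source: Rows, target: Rows) -> List[Tuple[str, tuple]]:
    """Return (sql, parameters) pairs that would make `target` match `source`.

    The "?" placeholders follow the qmark style used by drivers such as
    sqlite3 and pyodbc; values are passed separately rather than inlined.
    """
    script: List[Tuple[str, tuple]] = []
    # Rows that exist only in the target are removed.
    for key in sorted(target.keys() - source.keys()):
        script.append(("DELETE FROM Country WHERE CountryId = ?", (key,)))
    # Rows in both places are updated only when the value differs.
    for key in sorted(source.keys() & target.keys()):
        if source[key] != target[key]:
            script.append(
                ("UPDATE Country SET Name = ? WHERE CountryId = ?", (source[key], key))
            )
    # Rows that exist only in source control are added.
    for key in sorted(source.keys() - target.keys()):
        script.append(
            ("INSERT INTO Country (CountryId, Name) VALUES (?, ?)", (key, source[key]))
        )
    return script


if __name__ == "__main__":
    # Made-up ids and names, purely for illustration.
    in_vcs = {1: "Australia", 2: "Canada (renamed)", 4: "Denmark"}
    deployed = {1: "Australia", 2: "Canada", 3: "Brazil"}
    for sql, params in country_sync_script(in_vcs, deployed):
        print(sql, params)
```

Running the same function against a target that already matches source control returns an empty list, which is the behaviour we want from a repeatable deployment.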
Let’s get started with the nuts and bolts of how the deployment will work. Firstly, I’m going to be using the TeamCity project which served me so well during the whole You’re deploying it wrong series. However, I’ve upgraded TeamCity so although things look pretty familiar, we’ll have some new functionality available that will come in very useful a little later on.
I’m also going to use Redgate’s DLM Automation, which includes all the command line tools that developers normally license on their workstations (such as SQL Compare, SQL Data Compare and so on), but is designed specifically for build servers. DLM Automation also includes the sqlCI command line tool, for automating database CI workflows, and the TeamCity plug-in for setting up database builds in TeamCity, both of which I'll use in this article.
Let’s start with the general settings of the build:
As I’ve said in previous posts, I always like to keep the VCS revision in the build number as it can make life a whole lot easier later on. The “Report” path in our artifacts will hold the output of the deployment so we can inspect it after the build.
Moving on to the VCS (Subversion) configuration, I’ve left all the main settings in their default positions and only customised a VCS checkout rule:
One thing I’ve changed about the original solution is the structure in VCS. Rather than sitting the ASP.NET app directly in the trunk, I’ve moved it into a folder called “Application” and placed a “Database” folder right next to it (you can see this in the screen-grab in the “Get your database under source control” section above). This meant I needed to go back and change the VCS checkout rule in the existing builds to pull from one level deeper than the trunk.
Moving on to the build steps, we’re going to need to tackle this in two parts; build then sync. This is the reason why the newer TeamCity makes things a bit easier because it now provides for multiple steps in a single build.
The first step will be to create a release artifact. The idea is that this artifact can be moved around and deployed multiple times but it will always result in the same schema and static data being deployed. If there is a problem, we can throw the package away and build a new one but we never edit the package in flight. This way we know that when we come to our production deployment this package has been deployed in exactly the same way throughout the testing and deployment pipeline, reducing the chances of a failed deployment on production. Let’s get started:
In order to do this, we’ll need to download and install DLM Automation then go to the Add-ons page to download the TeamCity plugin file: sqlci-teamcity.zip.
Copy sqlci-teamcity.zip into the plugins folder of the TeamCity data directory. By default it will be located at either:
... or ...
Now re-start the “TeamCity Server” service so that TeamCity can start to use the plug-in.
We are now ready to configure the build step. Select the “Redgate SQL CI Build” build step.
This is going to check-out our database source code and use it to create a NuGet package containing all the information we need to create the database from scratch, including our static data. It will name the package based on whatever we enter in the ‘Package ID’ field appended with the TeamCity build number. Every commit to the VCS generates a new package, versioned sensibly. This means that we can link each release artifact all the way back to source control. Auditors love this stuff!
NuGet, by the way, is a technology adopted by Microsoft specifically with deployment in mind. If you've ever downloaded a Visual Studio extension, you've downloaded a NuGet package that understood its dependencies as well as how to unpack itself – pretty handy.
The reason we need database credentials is that TeamCity is now going to use SQL Compare technology to create a temporary database based on that package. If we have any invalid syntax or references in our source code this will error out and alert us to the problem (more on that later).
Next, we want to deploy this package to our AutoDeploy environment. For this we are going to use the “Redgate SQL CI Sync” step:
All I need to do is tell TeamCity which package to use (it will pick the package with the corresponding name and which has the build number that matches the current build – in other words, it will use the package we just created) and it will use SQL Compare and SQL Data Compare technology to deploy schema and data changes to the AutoDeploy database on MyServer. Naturally, we need to provide TeamCity with enough credentials to make schema and data modifications to the database.
Making the magic happen (aka running the build)
Now that we have all the schema under source control, all the reference data we’ll want to refer to and a build with deployment steps for both, let’s run it. Just so you know, my AutoDeploy database is already in sync with source control.
Green is good, but let’s look at what has actually happened here. Firstly, it only took 13 seconds. Yes, the changes are very small but they’re also running inside a VM on my PC including the whole gamut of TeamCity, IIS, SVN, SQL Server and of course, the Redgate tools.
Let’s take a look inside the build log. There’s a lot of info in there, but here’s the essential bit in the first build step to build the scratch database:
Object type Name DB1 DB2
Table [dbo].[Country] >>
Table [dbo].[Customer] >>
Table [dbo].[Order] >>
Table data [dbo].[Country] 244 >>
Here we see the tables listed and the double right-angle brackets pointing from DB1 (which is our VCS checkout) to DB2 (which is the scratch database). The angle brackets mean the structure on the left will be replicated to the structure on the right. We also see the data being pushed across – all 244 records.
And now let’s look a little further down the logs – what happened when we tried to sync the data?
Object type Name DB1 DB2
Table [dbo].[Country] == ==
Table [dbo].[Customer] == ==
Table [dbo].[Order] == ==
Table data [dbo].[Country] 244 == ==
This time TeamCity tells us that the tables and data were already in sync.
So that’s it – fully automated deployment of schema and objects with just the click of a button. But of course that was easy, the databases were already in sync. Let’s make it a bit more challenging.
How the build handles changes
We want to see a more lifelike example so let’s make some changes:
- Add a “PostCode” field to the “Customer” table
- Change the “Name” column in the “Country” table from a varchar(100) to an nvarchar(150)
- Delete the “New Zealand” record from the “Country” table (obviously they’ll simply become another state in Australia)
- Remove the word “Democratic” from “Korea, Democratic People's Republic of”
The great thing about SQL Source Control 2 is that all the above can go in as one VCS transaction. This is perfect for keeping the database atomically correct – now if only the application layer could go in at the same time!
Let’s run it all up again and take a look at the build log. Firstly, what it finds with the schema:
Object type Name DB1 DB2
Table [dbo].[Country] <> <>
Table [dbo].[Customer] <> <>
The two opposing angle brackets indicate the objects aren’t the same in source and target which is true for both tables. Browsing over to the artifacts we can see an ALTER COLUMN statement on the “Country” table followed by an ADD statement on the “Customer” table. Cool.
Now let’s look at the data:
Table [dbo].[Country] 242 == ==
Table [dbo].[Country] 1 <> <>
Table [dbo].[Country] 1 <<
So 242 records sync up perfectly, one is different in both sources and one only exists in the target. Correct, correct and correct. Back over to the artifacts and the data sync script shows a DELETE statement for New Zealand (sorry guys), and an UPDATE statement for North Korea. Perfect!
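The counts in that log can be reproduced with a few lines of Python. This is only a sketch of the comparison notation as described above (`==` identical, `<>` different, `>>` only in the source, `<<` only in the target); the row contents are placeholders and `compare_rows` is a hypothetical helper, not a Redgate API.

```python
from collections import Counter
from typing import Dict


def compare_rows(source: Dict[int, str], target: Dict[int, str]) -> Counter:
    """Count rows per comparison symbol, matching rows on primary key."""
    summary: Counter = Counter()
    for key in source.keys() | target.keys():
        if key not in target:
            summary[">>"] += 1  # only in the source (VCS)
        elif key not in source:
            summary["<<"] += 1  # only in the target database
        elif source[key] == target[key]:
            summary["=="] += 1  # identical on both sides
        else:
            summary["<>"] += 1  # present on both sides but different
    return summary


if __name__ == "__main__":
    # Placeholder data: 244 rows deployed, one renamed and one removed in VCS.
    deployed = {i: f"Country {i}" for i in range(1, 245)}
    in_vcs = {i: f"Country {i}" for i in range(1, 244)}
    in_vcs[1] = "Country 1 (renamed)"
    for symbol, count in compare_rows(in_vcs, deployed).items():
        print(symbol, count)
```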
It’s the change process which really illustrates the power of this mechanism and in particular, the Redgate tools. You can let environments go through all sorts of schema and data changes then just simply let the build sync it all up. Redgate takes care of the usual issues such as referential integrity and data retention throughout the change so for the people responsible for the environment, it really is as simple as clicking the run button and waiting about a dozen seconds.
This is a fantastic example of continuous integration and the ability for new work to head off in a different direction without fear of being unable to integrate it all back together again. The great thing about this is that it also makes reverting changes a breeze; it’s easy to run a build against an earlier revision because you know Redgate can simply pick up the differences and script them appropriately.
Breaking the build
So far all this has been smooth sailing but what happens when things go wrong? What would happen, for example, if I deleted a country from the static reference data which was being used by a customer in the target environment? Let’s find out:
And if we now drill down into the build log:
The DELETE statement conflicted with the REFERENCE constraint
"FK_Customer_Country". The conflict occurred in database "Autodeploy", table
"dbo.Customer", column 'CountryId'.
This is precisely what we’d expect to see happen and it raises an interesting point; there are going to be times where 100% automation may not be achievable. If, for example, New Zealand really did become an Australian state (quit mocking the suggestion – I didn’t come up with it), and we wanted to change all our customer records accordingly before deleting the country, there’s going to potentially be a several stage process involved.
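The failure and the staged fix can be reproduced in miniature. The Python sketch below uses an in-memory SQLite database purely as a stand-in for SQL Server (the error text and locking behaviour will differ), with an assumed cut-down version of the “Country” and “Customer” tables. The function names are hypothetical. It shows the naive delete being rejected by the foreign key, then the several stage alternative: repoint the customers first, then delete the country, all in one transaction.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in database with one customer in New Zealand."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
    conn.executescript("""
        CREATE TABLE Country (CountryId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE Customer (
            CustomerId INTEGER PRIMARY KEY,
            Name TEXT NOT NULL,
            CountryId INTEGER NOT NULL REFERENCES Country (CountryId)
        );
        INSERT INTO Country VALUES (1, 'Australia'), (2, 'New Zealand');
        INSERT INTO Customer VALUES (1, 'Example customer', 2);
    """)
    return conn


def delete_country(conn: sqlite3.Connection, country_id: int) -> bool:
    """Try the naive delete; return False if a foreign key blocks it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM Country WHERE CountryId = ?", (country_id,))
        return True
    except sqlite3.IntegrityError:
        return False


def merge_country(conn: sqlite3.Connection, old_id: int, new_id: int) -> None:
    """Staged change: repoint customers first, then delete the country."""
    with conn:
        conn.execute(
            "UPDATE Customer SET CountryId = ? WHERE CountryId = ?", (new_id, old_id)
        )
        conn.execute("DELETE FROM Country WHERE CountryId = ?", (old_id,))


if __name__ == "__main__":
    db = build_demo_db()
    print("Naive delete succeeded:", delete_country(db, 2))
    merge_country(db, 2, 1)
    print("Countries left:", db.execute("SELECT COUNT(*) FROM Country").fetchone()[0])
```

The ordering inside `merge_country` is the part a pure state-based comparison cannot infer on its own, because it depends on a business decision about where the orphaned customers should go.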
Sure, we could start scripting the whole thing out and execute just that process in a build via sqlcmd, and there may be no other way in some environments, but it’s starting to become a bit of an agility anti-pattern. It would arguably become a simpler exercise to skip the build process in favour of working with whoever is playing the DBA role. After all, the whole point of this process is to make publishing easier, not to constrain people by rigid processes.
I’m nervous; can I see the change before it runs?
Yes, and it’s supremely simple to do; just drop the /Sync parameter. You see, doing this will give us every piece of information we previously got in the artifacts – namely proposed change scripts and the report for SQL Compare – but it won’t actually run them.
The really easy thing about this approach is that TeamCity gives you a “copy” feature to simply replicate an existing build configuration, so we can take the one we set up earlier then just remove the /Sync parameter from each build step. Once this is done, you can run the new build then just sit back and soak in the results without committing to anything. Easy!
The tricky trigger problem – solved!
In the You’re deploying it wrong series, I had the build sequence triggered by a VCS change. As I said at the time, you might not always want to do this but it was a pretty neat party trick and it works well in certain circumstances (i.e. small team, deployment to an integration environment).
It was about here I’d previously lamented the fact that it took two commits to get both the schema and data into VCS and caution needed to be exercised to do this in the right order lest you try and insert data into a table which hadn’t yet been created. But SQL Source Control 2 solves this problem as everything goes in as a single transaction. So trigger away!
Single mother-of-all-builds versus linked builds
Earlier on we constructed a build configuration with two steps so that both the schema and data could be deployed. But of course it makes sense to tie this into the application layer as well which would mean associating it with the web deploy build. It’s just a question of how explicit this is.
In one corner we could keep entirely autonomous build configurations and simply trigger one off the other with a snapshot dependency to ensure they pull from the same revision. In the other, the multiple build steps support of TeamCity 6 (or the classic MSBuild with multiple tasks pattern) could roll this all into the one build.
Personally, the linked builds model sits a bit better with me. To begin with, you’re pulling from different VCS roots so if it all goes into the one build you have to pull from a level higher then ensure all your references link to the appropriate “Application” or “Database” path.
Then you’ve got the fidelity of build failures; just seeing red in the mother-of-all-builds doesn’t tell you a lot about what went wrong without delving into it. And after that, what if you just want to run the DB deployment again and not necessarily the app build?
Of course your build log also becomes a bit of an epic and more thought needs to go into the artifact management. Furthermore, you can’t distribute web app builds and DB builds to different agents which could be useful in the future.
The only gotcha with linked builds is that if they’re triggered by a VCS commit and there’s a revision snapshot dependency you’re going to have trouble when you need to make both web app and DB commits. Actually, you'll end up running both builds twice as the second one requires the first to build against the same revision. In this situation I’d probably just decouple them and let changes to the web app root trigger web deploy and changes to the database root trigger the Redgate tools.
Working with test data versus a clean slate
The model described above is great for syncing the schema and the data in the reference table. But what about the other tables? What do we want to do with our customers and orders?
There are a few different use cases to consider depending on the target environment and how it’s being used. Production, for example, is easy – this mechanism works perfectly as it retains all the transactional data the system has accrued.
A test environment midway through user acceptance testing is another place this rule applies. If you absolutely have to push out a change in the middle of a test cycle the last thing you want happening is for the data state of the transactional tables to suddenly and unexpectedly change.
Likewise in the development environment if you’re working in a team and collectively evolving the schema and reference data. You don’t want to be happily working away on your own tasks then take a change from a peer and suddenly you’ve lost a bunch of dummy customer data.
But there are times where the clean slate approach is desirable. Between test cycles, for example, being able to clear the transactional data means testing can begin from a known data state. And if you’re running integration tests anywhere with a data dependency, it’s absolutely essential to run them against a predictable set of data.
But this also doesn’t mean you want the transactional tables to be 100% clean – you may need some initialised data such as user accounts in order to even be able to log on. What you really want is the ability to create an initialisation script to prepare the database to your own personal tastes.
SQL Server actually makes this very easy by providing the sqlcmd utility which comes with the management studio. This allows us to remotely execute a script that can easily be pulled direct from VCS. As an example, I’ve put the following script into an “InitialiseData” folder at the trunk of the repository:
I then created a new build configuration in TeamCity calling the command “C:\Program Files\Microsoft SQL Server\100\Tools\Binn\SQLCMD.exe” and passing the following parameters:
As with the previous builds, the build agent in this case has the necessary rights over the SQL Server so everything just works. It obviously also requires the sqlcmd utility to be installed. Check out the command syntax in the earlier link if you actually need to pass credentials over.
Obviously this is very much an “only on demand” build but it now means we can automate a totally clean state of the database each time I want to reset the target environment. Or alternatively, I could trigger it off a successful deployment to an integration environment before running automated tests. Very cool.
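For a feel of what such a reset involves, here is a small Python sketch of the clean slate idea. It is not the script from the repository: it runs against an in-memory SQLite database as a stand-in for SQL Server, the table columns are assumptions, and `reset_transactional_data` and `make_demo_db` are hypothetical names. The shape is what matters: clear the transactional tables child-first so foreign keys are respected, leave the reference data alone, then add back a known set of seed rows.

```python
import sqlite3

# Assumed, cut-down versions of the three tables in the data model.
SCHEMA = """
CREATE TABLE Country (CountryId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Customer (
    CustomerId INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    CountryId INTEGER NOT NULL REFERENCES Country (CountryId)
);
CREATE TABLE "Order" (
    OrderId INTEGER PRIMARY KEY,
    CustomerId INTEGER NOT NULL REFERENCES Customer (CustomerId)
);
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a database holding reference data plus some accrued test data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Country VALUES (1, 'Australia')")
    conn.executemany(
        "INSERT INTO Customer VALUES (?, ?, 1)", [(10, "Tester A"), (11, "Tester B")]
    )
    conn.execute('INSERT INTO "Order" VALUES (100, 10)')
    conn.commit()
    return conn


def reset_transactional_data(conn: sqlite3.Connection) -> None:
    """Clear transactional tables, keep reference data, add known seed rows."""
    with conn:  # one transaction: the reset either fully applies or not at all
        conn.execute('DELETE FROM "Order"')   # child table first
        conn.execute("DELETE FROM Customer")  # then its parent
        conn.execute(
            "INSERT INTO Customer (CustomerId, Name, CountryId) "
            "VALUES (1, 'Seed customer', 1)"
        )


if __name__ == "__main__":
    db = make_demo_db()
    reset_transactional_data(db)
    for table in ("Country", "Customer", '"Order"'):
        print(table, db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0])
```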
Testing against large data volumes
A couple of weeks back I was involved in a project where a release was made to the production environment which resulted in a sudden and unexpected spate of SQL timeouts. After looking at the offending stored procedure, it didn’t take long to realise it was never going to play nice with large volumes of data. But how could this be? It worked fine on the developer’s machine and there were no findings during user acceptance testing. What was going on?
The simple difference between these environments was the volume of data. Even though the test environment had previously been seeded with a large number of records, the production environment exceeded this by an order of magnitude. And just before we go any further, it’s never ok to pull production data down into any other environment for the purpose of testing performance. There has to be a better way…
And here it is; Redgate’s SQL Data Generator. This puppy can turn around and seed your enterprise grade database with 10 million realistic records in the blink of an eye. It’s absolutely perfect for initialising a DB with enough data to demonstrate performance well into the future growth cycle of the application.
How does it do this? There are some great demos on the Redgate site which go into detail but just briefly, it looks at the column names and data types and then uses a combination of regexes and static data for things like cities and uses these to create a realistic set of data. You can also provide your own of each of these if you want to get a bit more specific (cities in China only, perhaps), and the whole thing is conscious of referential integrity constraints so that “CountryId” foreign key will always get a valid value based on the “Country” table records.
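The general technique is easy to sketch, even if the real product is far more capable. The Python below is a toy illustration only, not SQL Data Generator's implementation: `generate_customers`, the name lists and the four-digit post code pattern are all assumptions made for the example. It picks values from static lists, builds pattern-based strings and, importantly, only ever chooses a “CountryId” from the ids that actually exist.

```python
import random
from typing import Dict, List


def generate_customers(
    count: int, country_ids: List[int], seed: int = 0
) -> List[Dict[str, object]]:
    """Generate `count` plausible customer rows with valid CountryId values."""
    rng = random.Random(seed)  # fixed seed gives repeatable test data
    first_names = ["Alice", "Bob", "Chen", "Dita"]       # static sample data
    last_names = ["Nguyen", "Smith", "Okafor", "Rossi"]
    customers = []
    for customer_id in range(1, count + 1):
        customers.append({
            "CustomerId": customer_id,
            "Name": f"{rng.choice(first_names)} {rng.choice(last_names)}",
            "PostCode": f"{rng.randint(0, 9999):04d}",  # pattern-based value
            "CountryId": rng.choice(country_ids),       # always a valid foreign key
        })
    return customers


if __name__ == "__main__":
    for row in generate_customers(3, country_ids=[1, 2, 3]):
        print(row)
```

Seeding the random generator is a deliberate choice here: it means two runs produce the same rows, which matters if tests are going to depend on the generated data.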
As with SQL Compare and Data Compare, we’re going to be executing this on the build server. However, we can’t just feed in a command line with some parameters; this one takes some configuring directly within the Data Generator app. It makes sense though; you could simply never pass the sort of configuration the tool needs to run successfully via the command line.
So it’s into the TeamCity server and we’ll fire up Data Generator. The project configuration is very simple and we’ll steer away from the other tabs for the moment:
Here’s where the magic happens; configuring the data generator:
As you can see from the “Preview of data to be generated”, this is pretty realistic looking data. You’ll also see I’m going to generate one million records – I’m not one to do things by halves and to that effect I’m also going to generate two and a half million orders. Very cool.
Once this has been configured I’m going to save the solution file into a new “TestData” folder in the trunk of the repository and call it “Generate.sqlgen”. I’ll just check that folder out to a location on the build server and save the configuration there so it’s easy to come back to and reconfigure later on. The most important thing is that we need to get it committed back to VCS as this has now become an essential component of the project.
This activity doesn't have to be done on the build server; the project could just as easily be created on a PC. But of course this necessitates an additional license so assuming you’re happy to provide remote access to the build server – and this often won’t be the case – this makes things a little more cost effective.
I won’t go into the detail of every screen in the build (you’ve seen enough of that earlier on), but essentially this works in a near identical fashion to the earlier ones. Call it “Generate test data”, instruct the VCS checkout rule to grab the file from the “TestData” folder in the trunk and save it in a local path of the same name, then configure the build to run the “C:\Program Files\Red Gate\SQL Data Generator 1\SQLDataGenerator.exe” command.
The parameters are pretty basic for this one:
There are no artifacts to be generated and nothing else worth configuring for this build. All that’s left to do is to run it. Everything goes green and we get the following in the build log:
End time: Tuesday, 8 February 2011 06:53:16 AM
Rows inserted: 1000000
Rows inserted: 2500000
Have a good look at this – we just generated three and a half million records of test data in two and a half minutes. In a virtual machine running on a PC. Whoa. Is this for real?
And we see:
That should last us for a while. So this is real-ish test data, generated on demand against a clean deployment by a single click of the “Run” button from a remote machine. Test data doesn’t get any easier than that.
Achieving a reliable, repeatable continuous integration build for the database was never going to be straightforward without investing in some tools. The fact that purely on its own merits, SQL Source Control stood out as the best way of getting the DB schema into Subversion – even before they added the static data capabilities – made the selection of the other Redgate tools a very easy decision.
It’s a similar story with SQL Compare and Data Compare; I’ve used these for years (since 2004, by my records), and they’re quite simply the best damn thing you can do with your money if you’re regularly involved in deployments. The ability for these tools to integrate with SQL Source Control and run autonomously on the build server makes them an easy choice.
Purchased separately, the build server dependencies – SQL Compare, Data Compare and Data Generator – cost $1.5k. Spend that instead on the SQL Developer Bundle and you get a spare license of SQL Source Control (normally $300) and a few other bits as well (SQL Prompt is especially nice – decent intellisense for SSMS), that you can give to your favourite developer rather than buying a standalone license.
To be honest, the price is a bit inconsequential in the context of any rational look at what the tools are doing. Manual database migrations can be quite laborious, very error prone and extremely bad news when they go wrong. The ability to do all this automatically in a fraction of the time and in unison with web app deployment just by a single button click is absolute gold and in my humble opinion, worth every cent of the asking price.