First published in July 2010
Database Source Control- The Crib Sheet
For things you need to know rather than the things you want to know
As part of our long-running Cribsheet series, we asked William to come up with a brief summary of what was involved in bringing database development work under source control. What are the advantages it brings, and are there disadvantages?
Why have source control in a database?
Source Control is a requirement for any team development projects. In fact, it is essential for any development work. When an application depends on the coordination of a development that involves both database and application development, the scripts must be treated together, with the same version numbers applied, so that any working build can be reproduced. To do this, it is important to be able to tie database changes to the code changes that they relate to, by checking them into source-control as a part of the same changeset transaction. This makes it far easier to make consistent builds and deployments.
Databases are unusual in that they often serve several applications and are involved in other data processing and services. Where this is the case, it is customary to have a defined interface for each one, implemented by views, procedures and functions, so as to decouple the development of the database from the individual applications. If this is the case, the code for the interface needs to be held with the application it serves. Although the use of a predefined interface between the application and the database will help to keep conflicts to a minimum, it does not obviate the need for source control, but just minimises the pain: It is likely that this interface will need to change when the requirements of the applications change. A good source control system makes it possible to change the interface definitions as requirements change. Without it, change becomes so difficult that stick rigidly to an interface definition that is too archaic and inflexible to meet the needs of the application. The same is true of an Object-relational mapper (ORM). The conflicts just become more subtle and hard to eradicate if source control is not used.
It is relatively easy to deploy a new database application if all the code modules and resources are held in source control. The complications come with deployment of new versions from source control to existing live applications. Where one database serves several applications, then a great deal of teamwork is required.
With most developments, there are several points in the process where a consistent build that works should be available. For databases, there must be a correct version of the build script, not just for the database structure and routines, but also for at least the key ‘static’ data, and configuration information. Everything that goes towards the build must be in source control, including configuration files, and it should be possible to build the entire working application from these scripts. Where a deployment needs to be made to a live database, there must sometimes be additional 'data-migration scripts' or 'change scripts' to enable existing data to fit in any changed schema. Any deployment must have the means to easily roll back the changes to the previous version if the deployment fails. If data-migration scripts are part of the deployment, there must be backward-migration scripts so as to preserve any transactions made in the meantime..
Any build of a database from source control is primarily an exercise in creating an effective database script from component 'object' scripts. Normally, Source Control systems will save database scripts at the 'object' level. Each file then corresponds to a table or routine. These will have to be collated with scripts for such DDL as permissions, database properties, extended events, synonyms, or user-defined statistics, and executed in the correct dependency order. Subsequently, any static data must be loaded in such a way as to avoid referential constraints being triggered. Additionally, any server properties that would affect the way that a database will run will need to be checked.
Source Control allows developers more freedom to create branches of the codebase for testing or for releases, thereby making the role of the testers, configuration managers and releasers much easier. The alterations in the branches can then be merged back into the main trunk of development. This makes it easier for the team to work concurrently.
Traditional source control systems are not always an easy fit with databases, because of the diffuse nature of the code and the close interdependency of DDL and data. Problems have emerged when teams have assumed that database source code consists merely of a number of ‘objects’. A glance at the diagram of the interrelationships of the SQL Server 2008 system views will tell you why. The dependencies within a working system are complex: for example, one needs only to change one item in the database configuration (e.g. the collation) to stop the database working.
However, source control brings with it obvious benefits
Protecting production systems from ‘uncontrolled’ changes and fraud.
Because of the ease with which alterations can be made to the code, structure, or configuration of a production database, there must be checks against ‘uncontrolled’ changes. The code that has been tested must be compared against the code that is released to make sure it is identical, and that the team knows about the release, can see the changes and have agreed to the release. It must be possible to check that no unplanned and untested fixes or functionality are made directly on the live system. If the code on the production system is identical to what is in source control, then it is easy to see who did what, and when it was done. Hopefully, also, it will tell you why it was done.
Coordinating and monitoring the development work
Everything that is to be released should be kept in source control so that there is a complete record of all changes that have been made to the live system. It must be possible to see which developer is working on which particular module, to see which changes have been applied and when they were applied, which modules are available for release, the current state of the production system and the current state of any test systems
Maintaining a history of changes made to the source
At all times, the scripts within the Source Control system represent the master version of the source, and provide a historical archive of all the changes so that
- The database structure can be rolled back to a previous version
- Older versions can be created to find when a subtle bug was introduced.
- A code review can check coding standards and conventions
- The reason for an obscure change done in the past can never be lost.
- A series of changes done to a module for a temporary problem can be reversed out when no longer needed.
Source control is the most effective way of ensuring that each developer is always working on the latest version of a script for a given object (procedure, table etc)
Making more effective use of time and code
Where developments result in branches and merges, the complexity of database code quickly precludes the use of informal approaches to archiving source. Most commonly, branches are used for a release that requires a code-freeze so that development can continue. Merges are often required if a change is added to the release fork, (normally as a result of a bugfix) which weren’t also added to the development fork. It may also be necessary to develop two versions of the software at the same time. This could be where one version, the branch, has bugs fixed, but no new features, while the other version, the trunk is being worked on to apply new features. A Database is likely to have several versions being worked on simultaneously. Any 'emergency' fixes may be applied to the production system so the source changes must be committed to that version. This must a separate branch to the development branch. If the test requires test points or a test harness, then this would require a test branch,
Source control also makes it easier to release code to the various stages of testing. For example, the continuous-integration server must be able to build and update its own copy of the database, so that it can run automated tests of code and scripts that are checked in at the same time.
What Goes into Source Control
You need to store in source control everything that is needed in order to build, test or deploy. For a database, this would include the Server configuration script, the database creation and configuration script, the requirements document, the individual scripts for tables and routines, the static data, test scripts, network configuration scripts, the technical documentation, libraries, the full scripts for database creation, the upgrade of the production system, data-migration scripts, and rollback scripts. Normally, the build scripts, and upgrade scripts, would be generated from the integration build after it has passed its integration tests, so as to ensure a clean deployment
Building Databases from Source Control
The Disconnected model
In the disconnected model, the developers do not built against a live database. Instead, each developer works with their version of an XML ‘model’ of the database schema that describes all the objects inside the database schema. The model is built based on a source-code representation of the database in T-SQL DDL statements in source control rather than from the database itself. The SQL from one developer can define just a part of the database, and is allowed to be in an inconsistent or incomplete state. Whereas, it makes a database build seem more like a C# build, it is alien to a database developer. The work of each developer becomes part of the source control for the application. The TSQL DDL code has to be parsed before build so as to check that the syntax is valid. All the object relationships must then be added to the database model. However, the model is not yet in a valid and consistent state. A ‘Semantic Validation’ process then has to check rules and errors; It needs to check for all the internal rules that are normally enforced when DDL batches are executed (as when an INSTEAD OF trigger is made part of a transactional replication publication). Softer ‘coding policies’ can be then enforced over the model through the use of T-SQL Static Code Analysis.
The model that was used is then saved on disk as an XML serialized representation of the database model. The model can then be used as a 'blueprint ' for the database schema when performing builds of the database. The model describes what is required inside the database, in much the way that DDL build scripts do. This Model is then compared with the target database’s model, which has been generated from the database that needs to be changed to the required version. As with any database deployment, some of those changes that are due to refactoring the schema, such as splitting or renaming a column, will lead to problems as to where the existing data should be rehoused. This information is taken from the REFACTORINGLOG file and the deployment settings. The deployment engine then identifies all the necessary changes to the target database, fold them together where possible, separate them if needed, and sequence them in the correct dependency order. It then generates the DDL SQL Statements to make the changes. Finally, a command-line-based deployment engine deploys the results of a build. There are advantages to this system, particularly for inexperienced database developers, and the project managers will be able to track jobs easily, but it isn’t all positive. Instead of integration issues being flagged up the moment the database developer makes the crucial mistake, the discovery is postponed. Any performance issues with the code are likely to be left undiscovered until testing or staging.
The Connected Model
Almost all database development is done by connecting to a database and making changes to it. This brings immediacy to integration issues. The development databases must always reflect the revision being worked on. Whereas new database builds can be made directly from build scripts that are generated from the source for the revision being built, this is not true of existing databases. If you are modifying an existing production database to bring it to a particular revision level, development effort must go into writing, maintaining, and testing database deployment scripts that modify the database schema with DDL scripts to bring a database to the required version level. You can’t escape using scripts; there is no other viable way to alter a database than via DDL Scripts. Other methods, such as those of SMO, or an ORM, use DDL scripts under the covers. Where possible, it makes sense to use schema comparison tools to generate these scripts, even if you need to subsequently modify the script to handle the preservation of data around DDL refactoring.
Schema comparison tools all work by comparing the schemas of two databases, by looking at the live database, a backup, or a set of scripts, working out the difference, and then generating a script that changes the ‘version’ of the target database to be the same as the source. This isn’t difficult with an empty database with the database objects such as tables and routines, but without data. The simplest technique is to drop everything that exists, perhaps after archiving what is there as a build script as a precaution, and to replace it with the version you require.
If you already have data, you have a limited means to make alterations to a set of related tables whilst preserving the data. If it can’t be done, then you can copy out the data to temporary tables, recreate the tables in their new form, and copy the data back in, into its new intended places. Your script should be able to avoid common mistakes such as executing the script more than once, or executing the script on a database of the wrong version. Scripts can have little in the way of ‘dynamic’ intelligence, but if scripts are completed by hand, you can pack quite a lot of information into the extended properties of databases, or maintaining a ‘version table’ to prevent some of the worst accidents such as running the script on the wrong version of the database or re-running the script on what is now a different version of the database.
Ideally, Databases should be able to roll forward or backwards whilst maintaining its data. This will allow production systems to retreat gracefully from a catastrophic bug that somehow wasn’t detected before deployment to production. Where previous versions did not hold particular attributes or entities, this will be difficult without introducing data structures that were not in the original build.
This way of developing databases requires each developer to work on a separate database. There is only one database developer for each development database. The work of each developer is kept in source control. The contributions from each developer have to be integrated to create a database for integration testing. This is done by a separate integration process that works from the code in source control. Normally, the development database cohabits on the workstation, but for any measure of testing with realistic sets of data this will need to be a dedicated server. For a team of database developers, there will be considerable scripting to apply all the changes to the separate development databases to keep them up to date
The commonest model for database development involves the creation of a development database which all the developers access. This does not preclude a separate development database for each developer to use as a sandbox. Since an RDBMS is designed from the ground-up to be a multi-user system, able to accommodate all manner of development activities, this is a more natural way for a Database Developer to work. It avoids the problem of the hardware requirements of allowing each developer to have their own development server become immense if the developer is working on several databases simultaneously or if large databases with a great deal of test data are being worked on. As SQL Server is inherently multi-user, it is natural to work in a team whilst developing, and it requires no separate integration process. There are obvious dangers where two developers work on the same database object, but this is avoided by the use of source control. However, because of the complex interdependencies that build up in a database, these problems can be more subtle, and involve working on related objects, and creating difficulties by doing alterations before considering dependencies or other repercussions. Developers can also create problems for each other with long-running transactions. SQL Server can assist this model of working by the use of ‘schemas’ that partition large databases into fairly discrete units. Also, functionally can be moved into separate databases that cohabit the same instance of SQL Server. Even so, It pays for the developers to have IM, or be in shouting-distance.
Source control clients
SSMS, the most commonly-used IDE for developing SQL Server databases, allows integrated source control within the Query Window. The client plugin varies according to the source control provider. It is not supplied with SSMS but purchased separately. After the Source Control components are installed, they can be configured via the tools menu (Options ->Source Control-> Plug-in selection) which allows you to select a source control product. The problem with this approach is that it does not enforce any restriction on the database, and so does not prevent alterations from being made outside source control. SQL Source Control The SSMS plug-in from Red Gate, SQL Source Control, does not use this architecture as it aims for a much closer-integrated source control with SSMS. The tool allows you to connect SQL Server to your existing source control system – SVN, TFS, Git, Mercurial, Vault, Perforce and all others with a command line are supported
Types of Source Control Providers
Although the most attention is generally given to the client that interacts with the Source control system in use, such as VisualSVN, TortoiseSVN , SQL Connect or SQL Source Control, the most important part is the actual Source control system , such as SubVersion ,Vault, GIT, SourceSafe or TFS. These can use either the centralized or distributed model
Centralised Source Control Model
A large number of revision-control systems exist and are used for Development source Control. The majority are centralized, and based on the ‘library’ or ‘repository’ model, and based on CVS. These use the concept of check-out and check-in and rely on file or module locking to prevent clashes.
‘Check-out’ refers to the process of getting exclusive write-access to that module. Others can then read it but not alter it. ‘Check-in’ tells the source control system that you are relinquishing write access and are updating the source so as to make your changes available to all co-workers. This process is usually ‘atomic’ in that it either succeeds in its entirety or is rolled-back. It is easy to subvert this method of revision control.
A more subtle approach uses version-merging, which attempts to deal with simultaneous edits by merging the results of the edits of different developers. This works well, and creates less distraction for developers, until several people work on the same code, at which point it can easily result in inconsistent code that does not compile or run.
Distributed Source control
A distributed model no longer has the concept of a single ‘master’ source, and is used for large projects such as Linux and Mozilla where a number of developers need to co-work, sometimes in cases where network-contact isn’t permanent. As all co-workers keep a full copy of the project, it is very resilient. It also allows developers a great deal of freedom to safely ‘sandbox’ their work. Most of these systems allow a centralised control of the ‘release version’ of a project if necessary.
Distributed source control systems give each developer a local copy of the entire development history, and changes are copied from one such repository to another. These changes are imported as additional development branches, and can be merged in the same way as a locally developed branch. Merging in many of these systems can be extremely sophisticated, based on the histories of the common ancestors, or merit of previous changes, and users are informed if a merge is impossible.
Team Working Database Applications with source control
Application developers are used to agreeing up-front on standards so that they can work together on code effectively, without friction or misunderstandings. These standards usually include
- Styles and structure for commenting and documenting code
- check in/check out from source control
- Coding best-practices and the definition of ‘code smells’.
- Sharing a consensus Object model
- development tools and software, and when they are used
- Naming conventions
- Procedures for defining the features and fixes that will be included in each build of the application
Development teams will have processes for developing against several versions of the code-base, testing builds, packaging builds for deployment to test and production, and for rolling back a build.
Experience has shown that attempts to couple the development of the database too closely to this paradigm become fraught. It certainly can lead to application developers criticising, or interfering with, base tables that aren’t even accessible to the application, and with only a partial understanding of the performance issues. It can lead to bizarre naming conventions, and tortuous attempts to bend the relational model to the object-oriented world-view.
Where tables, views or routines (TVFs or stored procedures) make up the application-interface, they not only need to be in source control but should be commented and documented to project standards. Although a mechanism exists to attach comments to tables and other database objects for which no script is stored within the database, (extended properties) it is rare to find it used properly. The trickiest problem has always been to make this documentation available for the application developers in visual studio who are using ASP.NET, in the form of intellisense, though Entity framework and Linq have made inroads on this problem.
Database Development Methodologies.
One True Build Script
The most draconian approach to source control involves the use of a single build-script for the entire database or schema, and to use it for all DDL and DCL as well as DML. This means that there is only one source control ‘object’ to check in or out per schema. (a database can have as many schemas as you wish) This allows permanent comments and comment-blocks anywhere, including tables, columns, and parameters. This ensures that build scripts will compile as the order is predetermined in the script. This approach has serious disadvantages, though: It allows only one person to work on a schema at a time, and it has to include the BCP routines for the insertion of test data since it will destroy all existing versions of tables as part of the script. This approach fits comfortably with the idea of the schema in SQL Server.
A more generally satisfactory approach for the larger team is to work at ‘object’ level, where scripts for individual objects can be checked in and out independently. This allows work at a more granular level, but it is easily possible to make database changes via the object browser or in script that affect one or more ‘objects’. The use of the term ‘object’ has a different connotation to the actual way that database entities are related. Columns, indexes and parameters are not ‘objects’ within the system views, whereas constraints are. Build scripts can easily have different parts of a ‘table object’ built at different parts of the script. If database ‘objects’ are stored independently of a complete build script, then these will need to be combined into a build script. Unfortunately, the order of scripting is important, and so there is an added risk at the point of creating a build. This has led to the practice of maintaining a complete build script in addition to the object scripts.
Many developments use a shared development server. This is often used where the test database has to run against large test data, sometimes even an obfuscated copy of the actual data. This can be made to work but the chances are not always good. It relies on structured headers and information in extended properties to keep a tally of the work in progress and the author of an alteration. The entire database script is then checked into source control at regular intervals, as well as all occasions when the team are likely to require a consistent build. The code must be put in source control as a series of table-scripts and routine-scripts, as a complete build script, or both. Unless the developers are particularly orderly in their approach to Source Control, it is usually better to use the hybrid approach as described in the next section.
Database developers need to have a ‘private’ version of the current database for doing the bulk of their development work, but they will need to be able to access a shared database that is built from the definitive source in source control for testing with realistic data sets and creating test data, and for making make ‘uncontrolled’ experimental changes that are overwritten from the master source in source control. It should be impossible to check in source directly from this shared database
Each database developer should either have a local version of SQL Server on their own workstations, or have access to a ‘sandbox’ server with instances for each developer. This allows them to test out routines, database configurations, and the effects of different database/server properties on performance, even to the extent that it crashes a server, or locks up the database without affecting others. Additionally, there ought to be a central database development cell accessible to all, which allows test data to be created and updated with releases. This environment is 'uncontrolled' in that developers can change it at will - but agreement is needed before making changes that affect structure and common routines. This should be refreshed from time to time to get rid of test data.
Updating the database from source control
The Integration Stage
Ideally, it should be possible to build the application, both the client-side and database, from the scripts that are in source control. This is essential if you are using continuous integration with your development, because automated builds, with unit testing, will be either triggered on every check-in of source code or at a set time. With a database, there are problems with automating this approach. For a start, databases are not normally part of a unified Check-in, and build, process. More importantly, change or deployment scripts (also known as Migration scripts, or rollout scripts) will also be required to supplement the code that synchronizes the live database with what is in source control. If the database has been subject to any refactoring between the two versions, it is impossible to modify a database entirely automatically without scripting some of the intermediate stages. If tables have been changed and data has been rearranged, or modified, one cannot merely change the schema. The data must be changed as well, as it may have been modified as apert of the refactoring, or be in different tables under different constraints. The problems of migrating the data to fit the schema will depend on constraints, referential integrity and a number of other factors. Where, for example, data has been de-duplicated at the time of the imposition of a unique constraint, then that de-duplication script must be included. If a VARCHAR column has been reduced in size, then some strings may need to be truncated. A rollback may require a restore from backup, or for the tables to be filled with data from a previously made BCP native output file, since migration scripts aren't normally reversible.
Any Database Source Control system must therefore include the special Change or deployment scripts to ensure that the current data in the database is preserved and does not cause constraints to fire. Not only must these extra scripts be included but they must be executed at the correct point in the build.
There is a limit to what can be automated for continuous integration. There is, for example, no way to automate exploratory testing or approvals for regulatory or compliance purposes.
The lack of suitable and simple tools for doing database Source control in the past has led to a number of problems with team working on database applications. Source Control must be perceived by the developers as a way of saving on administrative work and speeding development. It should never get in the way.
Now that suitable tools exist for all the IDEs, such as VS and SSMS, that are used for developing SQL Server databases, and the quality of source control systems has improved greatly, we’re reaching the point where Source Control for database developments can be subject to a list of simple best-practices rather than being an endless source of friction within development teams, and pain for those people tasked with providing consistent builds for applications.
If you'd like to learn more about database source control, Red Gate has put together a free whitepaper "5 Common Barriers to Database Source Control, and How You Can Get Around Them".
Download the whitepaper