Database Source Control- The Crib Sheet
For things you need to know rather than the things you want to know
As part of our long-running Cribsheet series, we asked William to come up with a brief summary of what was involved in bringing database development work under source control. What are the advantages it brings, and are there disadvantages?
Why have source control in a database?
Source Control is a requirement for any team development projects. In fact, it is essential for any development work. When an application depends on the coordination of a development that involves both database and application development, the scripts must be treated together, with the same version numbers applied, so that any working build can be reproduced. To do this, it is important to be able to tie database changes to the code changes that they relate to, by checking them into source-control as a part of the same changeset transaction. This makes it far easier to make consistent builds and deployments. The complications with integrating the database development into application development come with deployment of new versions from source control to existing live applications
With most developments, there are several points in the process where a consistent build that works should be available. For databases, there must be a correct version of the build script, not just for the database structure and routines, but also for at least the key ‘static’ data, and configuration information. Everything that goes towards the build must be in source control, including configuration files, and it should be possible to build the entire application from these scripts. Where a deployment needs to be made to a live database, there must sometimes be additional 'migration scripts' or 'change scripts' to enable existing data to fit in any changed schema.
Although the use of a predefined interface between the application and the database will help to keep conflicts to a minimum, it does not obviate the need for Source control, but just minimises the pain: It is likely that this interface will need to change when the requirements of the applications change. It is much better to use source control than to stick rigidly to an interface definition that is too archaic and inflexible to meet the needs of the application. The same is true of an Object-relational mapper (ORM). The conflicts just become more subtle and hard to eradicate if source control is not used.
Source Control allows developers more freedom to create branches of the codebase for testing or for releases, thereby making the role of the testers, configuration managers and releasers much easier. The alterations in the branches can then be merged back into the main trunk of development. This makes it easier for the team to work concurrently.
Traditional source control systems are not always an easy fit with databases, because of the diffuse nature of the code and the close interdependency of DDL and data. Problems have emerged when teams have assumed that database source code consists merely of a number of ‘objects’. A glance at the diagram of the interrelationships of the SQL Server 2008 system views will tell you why. The dependencies within a working system are complex: for example, one needs only to change one item in the database configuration (e.g. the collation) to stop the database working.
However, source control brings with it obvious benefits
Protecting production systems from ‘uncontrolled’ changes and fraud.
Because of the ease with which alterations can be made to the code, structure, or configuration of a production database, there must be checks against ‘uncontrolled’ changes. The code that has been tested must be compared against the code that is released to make sure it is identical, and that the team knows about the release, can see the changes and have agreed to the release. It must be possible to check that no unplanned and untested fixes or functionality are made directly on the live system. If the code on the production system is identical to what is in source control, then it is easy to see who did what, and when it was done. Hopefully, also, it will tell you why it was done.
Coordinating and monitoring the development work
Everything that is to be released should be kept in source control so that there is a complete record of all changes that have been made to the live system. It must be possible to see which developer is working on which particular module, to see which changes have been applied and when they were applied, which modules are available for release, the current state of the production system and the current state of any test systems
Maintaining a history of changes made to the source
At all times, the scripts within the Source Control system represent the master version of the source, and provide a historical archive of all the changes so that
- The database structure can be rolled back to a previous version
- Older versions can be created to find when a subtle bug was introduced.
- A code review can check coding standards and conventions
- The reason for an obscure change done in the past can never be lost.
- A series of changes done to a module for a temporary problem can be reversed out when no longer needed.
Supporting team-working
Source control is the most effective way of ensuring that each developer is always working on the latest version of a script for a given object (procedure, table etc)
Making more effective use of time and code
Where developments result in branches and merges, the complexity of database code quickly precludes the use of informal approaches to archiving source. Most commonly, branches are used for a release that requires a code-freeze so that development can continue. Merges are often required if a change is added to the release fork, (normally as a result of a bugfix) which weren’t also added to the development fork. It may also be necessary to develop two versions of the software at the same time. This could be where one version, the branch, has bugs fixed, but no new features, while the other version, the trunk is being worked on to apply new features.
Facilitating testing
Source control also makes it easier to release code to the various stages of testing. For example, the continuous-integration server must be able to build and update its own copy of the database, so that it can run automated tests of code and scripts that are checked in at the same time.
Source control clients
SSMS, the most commonly-used IDE for developing SQL Server databases, allows integrated source control within the Query Window. The client plugin varies according to the source control provider. It is not supplied with SSMS but purchased separately. After the Source Control components are installed, they can be configured via the tools menu (Options ->Source Control-> Plug-in selection) which allows you to select a source control product. The problem with this approach is that it does not enforce any restriction on the database, and so does not prevent alterations from being made outside source control. SQL Source Control does not use this architecture as it aims for a much closer-integrated source control with SSMS
Types of Source Control Providers
Although the most attention is generally given to the client that interacts with the Source control system in use, such as VisualSVN, TortoiseSVN or SQL Source Control, the most important part is the actual Source control system , such as SubVersion ,Vault, GIT, SourceSafe or TFS. These can use either the centralized or distributed model
Centralised Source Control Model
A large number of revision-control systems exist and are used for Development source Control. The majority are centralized, and based on the ‘library’ or ‘repository’ model, and based on CVS. These use the concept of check-out and check-in and rely on file or module locking to prevent clashes.
‘Check-out’ refers to the process of getting exclusive write-access to that module. Others can then read it but not alter it. ‘Check-in’ tells the source control system that you are relinquishing write access and are updating the source so as to make your changes available to all co-workers. This process is usually ‘atomic’ in that it either succeeds in its entirety or is rolled-back. It is easy to subvert this method of revision control.
A more subtle approach uses version-merging, which attempts to deal with simultaneous edits by merging the results of the edits of different developers. This works well, and creates less distraction for developers, until several people work on the same code, at which point it can easily result in inconsistent code that does not compile or run.
Distributed Source control
A distributed model no longer has the concept of a single ‘master’ source, and is used for large projects such as Linux and Mozilla where a number of developers need to co-work, sometimes in cases where network-contact isn’t permanent. As all co-workers keep a full copy of the project, it is very resilient. It also allows developers a great deal of freedom to safely ‘sandbox’ their work. Most of these systems allow a centralised control of the ‘release version’ of a project if necessary.
Distributed source control systems give each developer a local copy of the entire development history, and changes are copied from one such repository to another. These changes are imported as additional development branches, and can be merged in the same way as a locally developed branch. Merging in many of these systems can be extremely sophisticated, based on the histories of the common ancestors, or merit of previous changes, and users are informed if a merge is impossible.
Team Working Database Applications with source control
Application developers are used to agreeing up-front on standards so that they can work together on code effectively, without friction or misunderstandings. These standards usually include
- Styles and structure for commenting and documenting code
- check in/check out from source control
- Coding best-practices and the definition of ‘code smells’.
- Sharing a consensus Object model
- development tools and software, and when they are used
- Naming conventions
- Procedures for defining the features and fixes that will be included in each build of the application
Development teams will have processes for developing against several versions of the code-base, testing builds, packaging builds for deployment to test and production, and for rolling back a build.
Experience has shown that attempts to couple the development of the database too closely to this paradigm become fraught. It certainly can lead to application developers criticising, or interfering with, base tables that aren’t even accessible to the application, and with only a partial understanding of the performance issues. It can lead to bizarre naming conventions, and tortuous attempts to bend the relational model to the object-oriented world-view.
Where tables, views or routines (TVFs or stored procedures) make up the application-interface, they not only need to be in source control but should be commented and documented to project standards. Although a mechanism exists to attach comments to tables and other database objects for which no script is stored within the database, (extended properties) it is rare to find it used properly. The trickiest problem has always been to make this documentation available for the application developers in visual studio who are using ASP.NET, in the form of intellisense, though Entity framework and Linq have made inroads on this problem.
Database Development Methodologies.
The most draconian approach to source control involves the use of a single build-script for the entire database or schema, and to use it for all DDL as well as DML. This means that there is only one source control ‘object’ to check in or out per schema. (a database can have as many schemas as you wish) This allows permanent comments and comment-blocks anywhere, including tables, columns, and parameters. This ensures that build scripts will compile as the order is predetermined in the script. This approach has serious disadvantages, though: It allows only one person to work on a schema at a time, and it has to include the BCP routines for the insertion of test data since it will destroy all existing versions of tables as part of the script. This approach fits comfortably with the idea of the schema in SQL Server.
A more generally satisfactory approach for the larger team is to work at ‘object’ level, where scripts for individual objects can be checked in and out independently. This allows work at a more granular level, but it is easily possible to make database changes via the object browser or in script that affect one or more ‘objects’. The use of the term ‘object’ has a different connotation to the actual way that database entities are related. Columns, indexes and parameters are not ‘objects’ within the system views, whereas constraints are. Build scripts can easily have different parts of a ‘table object’ built at different parts of the script. If database ‘objects’ are stored independently of a complete build script, then these will need to be combined into a build script. Unfortunately, the order of scripting is important, and so there is an added risk at the point of creating a build. This has led to the practice of maintaining a complete build script in addition to the object scripts.
Many developments use a shared development server. This is often used where the test database has to run against large test data, sometimes even an obfuscated copy of the actual data. This can be made to work but the chances are not always good. It relies on structured headers and information in extended properties to keep a tally of the work in progress and the author of an alteration. The entire database script is then checked into source control at regular intervals, as well as all occasions when the team are likely to require a consistent build. The code must be put in source control as a series of table-scripts and routine-scripts, as a complete build script, or both. Unless the developers are particularly orderly in their approach to Source Control, it is usually better to use the hybrid approach as described in the next section.
Development databases.
Database developers need to have a ‘private’ version of the current database for doing the bulk of their development work, but they will need to be able to access a shared database that is built from the definitive source in source control for testing with realistic data sets and creating test data, and for making make ‘uncontrolled’ experimental changes that are overwritten from the master source in source control. It should be impossible to check in source directly from this shared database
Each database developer should either have a local version of SQL Server on their own workstations, or have access to a ‘sandbox’ server with instances for each developer. This allows them to test out routines, database configurations, and the effects of different database/server properties on performance, even to the extent that it crashes a server, or locks up the database without affecting others. Additionally, there ought to be a central database development cell accessible to all, which allows test data to be created and updated with releases. This environment is 'uncontrolled' in that developers can change it at will - but agreement is needed before making changes that affect structure and common routines. This should be refreshed from time to time to get rid of test data.
Updating the database from source control
Ideally, it should be possible to build the application, both the client-side and database, from the scripts that are in source control. This is essential if you are using continuous integration with your development, because automated builds, with unit testing, will be either triggered on every check-in of source code or at a set time. With a database, there are problems with automating this approach. For a start, databases are not normally part of a unified Check-in, and build, process. More importantly, change or deployment scripts (also known as Migration scripts, or rollout scripts) will also be required to supplement the code that synchronizes the live database with what is in source control. If the database has been subject to any refactoring between the two versions, it is impossible to modify a database entirely automatically without scripting some of the intermediate stages. If tables have been changed and data has been rearranged, or modified, one cannot merely change the schema. The data must be changed as well, as it may have been modified as apert of the refactoring, or be in different tables under different constraints. The problems of migrating the data to fit the schema will depend on constraints, referential integrity and a number of other factors. Where, for example, data has been de-duplicated at the time of the imposition of a unique constraint, then that de-duplication script must be included. If a VARCHAR column has been reduced in size, then some strings may need to be truncated. A rollback may require a restore from backup, or for the tables to be filled with data from a previously made BCP native output file, since migration scripts aren't normally reversible.
Any Database Source Control system must therefore include the special Change or deployment scripts to ensure that the current data in the database is preserved and does not cause constraints to fire. Not only must these extra scripts be included but they must be executed at the correct point in the build.
Conclusion
The lack of suitable and simple tools for doing database Source control in the past has led to a number of problems with team working on database applications. Source Control must be perceived by the developers as a way of saving on administrative work and speeding development. It should never get in the way.
Now that suitable tools exist for all the IDEs that are used for developing SQL Server databases , and the quality of source control systems has improved greatly, we’re reaching the point where Source Control for database developments can be subject to a list of simple best-practices rather than being an endless source of friction within development teams, and pain for those people tasked with providing consistent builds for applications.