Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of software delivery and deployment by making build, integration and delivery into a routine. The way that databases fit into the Continuous Delivery story has been less-well defined. Phil Factor explains why he's an enthusiast for databases being full participants, and suggests practical ways of doing so.
Software Delivery means different things depending on the type of development being done, but the common thread is the completion of software to the point where it is releasable and making it generally accessible within the organization. There is a world of difference between a simple game or utility and a complex manufacturing or financial application, or an embedded control system on which lives depend. This means that there will be a huge diversity in the range of processes that need to be a part of the chain between development and delivery. However, there are some principles in common.
Essentially, 'Continuous Delivery’ is a term that describes a means of supporting rapid development, focusing on delivery throughout development. The term 'Delivery' will, in the very early stages, simply imply that the initial aim is to automatically unit-test, and build a 'hello world' application automatically from, and only from, the code in the source control repository. It will apply all integration, load tests, and routine user-acceptance tests automatically before deploying the application to the test VMs. Of course, many of these will be stubs, and one wouldn’t argue that a ‘hello world’ is feature-complete in the terms of the users’ requirements, but a successful delivery will be one that is at least robust in displaying ‘hello world’.
As the application grows, the team can extend Continuous Delivery beyond the testers within the development team, to production-like environments, such as Staging, as well as to the 'alpha' users, and any broader 'beta' program. In a major application, the Continuous Delivery process will involve a wide range of specialized roles such as business analysts, operators, support people, DBAs, application programmers, database developers, compliance experts, change control managers, account managers, librarians, testers, system programmers, PMO, webmasters, and technical architects.
'Continuous Delivery' is a different mindset towards developing applications. It is not dependent on a prevailing Agile methodology and nor is it reliant on a "DevOps culture. It doesn’t require special tooling. All these can help of course, but the methods that underpin Continuous Delivery are long-established and well-supported, and we can use them to develop applications rapidly whatever the methodology of the organization in which it is practiced. For example, the routine of a nightly builds from source control, and deployment, is relevant to any methodology, and is the best way of nipping deployment problems in the bud.
This article will focus on Continuous Delivery and the database, describing the various issues that database developers will face, and the solutions they can apply, in order to participate in Continuous Delivery.
Why is Continuous Delivery necessary?
It was generally recognized some time ago that delivery and deployment of any application represents a roadblock. For commercial applications, games and software tools, the process of delivering a finished package uncovers many aspects that slipped through the planning. There will be problems with the installer, unforeseen difficulties with certain combinations of operating system or hardware, or even problems with functionality that the automated tests failed to cover. For a large enterprise-scale application, the problems are greater, rising exponentially with scale.
With an enterprise application, a release of a new version coincides with a handover to production and support staff, who will cast a critical eye over its production-readiness in its many aspects before accepting the hand-over. Because of the difference in cultures between development and production, the developers are often only dimly aware of the essential requirements of documentation, logging, security, compliance, reporting and application-instrumentation. Being focused on a particular application, they will, typically, have neglected the broader requirements for data-management and reporting across the enterprise. They will, perhaps, have made hardware requirements that cannot be met within the budget. It is rare for the two groups to agree on what constitutes a releasable software product. The developers will often, early on in development, take architectural and platform decisions without considering wider issues such as the use of application data for reporting or analysis, or the cost of support. By the time the first delivery comes to the attention of system architects or production managers, these decisions have become irreversible. There is no obvious way out of this deadlock, however cooperative all the parties are.
Added to this, the many steps and components that are necessary for a release make for a complex delivery process. The development team never arrive at a reliable and repeatable form of their release process, simply because the process is so infrequent. This makes deployment and release stressful and prone to postponement. During the delay the team inevitably add more functionality, broaden the scope of the application still further, which in turn lengthens and complicates the release process. The net result of this friction, and memories of stress, is that a no-man's land develops between development and production. Releases turn into extended struggles and the gap between releases becomes ridiculous. When releases finally happen, and the applications features and architecture are revealed, they often contain unpleasant surprises.
The Agile philosophy of promoting rapid releases is appropriate as a way of solving many of the problems of delivery, but it is not a universal panacea. They may be rapidly-releasing entirely inappropriate software that is impossible to support, or even contravenes security or usability regulations to the extent that it is impossible to require staff to use it. In enterprise IT, the idea of rapid releases is less appealing than elsewhere. Each release of a corporate application comes with a training cost that can match or even dwarf the development cost. The repercussions of a deployment can cause knock-on problems to server-storage, network, support, disaster-recovery planning and so on. Enterprises need predictable deployments with no surprises, but they don’t necessarily want rapid deployments. They must be able to release at the time they choose.
The idea of ‘Continuous Delivery’, where development is geared around the idea of an automated delivery process, evolved from the idea of ‘Continuous Deployment’ to meet the needs of rapid software-release from parts of the industry, and for predictable deployment from other parts. It is the practice of bringing in the techniques and culture of testing and release right through development and application-design. It takes the sting from delivery because it will have been practiced from the start. If a component such as a server component, third-party library, existing platform or database is introduced that proves to cause difficulty to the build, then the problem is tackled before subsequent work reduces the options down to sweat and tears. If coding practices, for example, make unit testing difficult, then the problem has to be faced and solved before the problem piles up into a considerable refactoring exercise.
'Continuous Delivery' cannot solve every development problem, but it prevents the accumulation of deployment issues that typically lie concealed until the release process has to be done in earnest. It reduces nasty surprises. It also makes a number of issues more visible and increases the urgency of fixing them rather than ploughing on regardless.
Continuous Delivery does not demand that software can be released into production at any time, since this makes work on new features, more difficult. It should be delivered, but not necessarily released. The relative appeal of Continuous Delivery depends on scale. Continuous Delivery makes a distinction between being able to deliver, and actually doing so. With larger applications, there are so many repercussions to a deployment that need to be considered, such as the customer, channel, process, organization and Location (Business Architecture) and Data, Applications and Technology (Systems Architecture). To take a simple example, the release of a new version of an application that is used by many workers in an organization will require training, might require a third-party audit for its compliance, clearance with unions for usability, work on facilities such as printers or lighting and ISO 9000 SQA certification. Although Continuous Delivery still means that a new release can be predictable and rapid, and that all the contributors will be up-to-date with the application, there will, inevitably, be processes that require judgment and expertise that are therefore unlikely candidates for automation, but they can still be participants in the workflow processes of Continuous Delivery..
What's Required for Continuous Delivery
A scaffolding of Automation
A development team will need to automate the build, integration and test processes before it can participate in Continuous Delivery. It is like putting up the basic scaffolding before working on the house. From then on, the developers will be responsible for ensuring that they develop in a way that facilitates automation of all the processes that contribute to a successful delivery of a working application. They will be constantly reminded of how the decisions they make have repercussions down the line.
It is a very liberating for the programmer to automate the worst development chores, from source control, unit test, build, integration tests, through to deployment. However, this isn't easy to do retrospectively. To develop easily-tested code, for example, the code must be designed with the needs of testing in mind. There must, for example, be separation of concerns, it must be possible to test each component in isolation, and there should be obvious assertions that can be tested. The code should be well-instrumented.
The ‘DevOps’ culture of delivery
We also need a different type of development culture that is aware of the needs for software delivery at all points in the development process. We must 'think delivery' at the point at which we design and develop applications.
Whatever the development methodology, there should be a 'culture of delivery', right from the planning stage. If the system architect, project manager, and developers are mindful of the means of build, integration and deployment, then there are likely to be fewer roadblocks when the developers are ready to release an application.
If developers understand the implications for the delivery process of a particular development choice, such as choosing to partition the logic across several databases rather than using schema, very soon after making the choice, then they can avoid many of the 'unforeseen issues'. Once this happens, a script to deploy this arrangement easily is likely to follow. If it proves difficult to deploy, then there is time to retreat from that particular decision.
The Culture of Resilience
Even more important is the 'culture of resilience'. Builds cause problems because of the time it takes to work out the reason for failure. It is not enough to detect where the build has failed, but also why. When things go wrong with the process, as is inevitable, then the event should be carefully logged in such a way that a member of the team that isn't 'in the know' can fix it. SQL source code that won't execute should be logged and recorded with an explanation, for example.
If the development team remains conscious from the start about what is needed for the delivery of the full system, then when the application is deployed or shipped, there is less likely to be a roadblock. There is nothing more exasperating for a production DBA than to come across an aspect of development that shows little or no thought for how one could ever go about deploying it. There are so many common practices that fit this category, from embedded connection strings to undocumented calls, to linked databases.
I suspect that the best approach is the imposition, right from the start of development, of an automated system for nightly delivery. This would include build, followed by deployment to all the Test VMs, and a run of integration testing, before deployment to an environment that is set up as if it were production. With a nightly build, it is far more likely that anything that could cause difficulties to the deployment operation, such as a server-based background scheduled task, will get quickly scripted and put in source control. This means fewer nasty surprises later on. The nightly build will highlight problems whilst the full details of the offending component is fresh in the developer's mind and, of course, there is a certain stigma attached to the 'dev who broke the build'.
If the problem is a deployment one ('it works on my machine!'), then it can be solved by scripting. Rehearsing this extra work before a deployment is strictly necessary may smack of wasted effort but it means that the requirements in terms of scripts to support deployment will be known up-front, and are likely to be in a good enough state for the first release. These scripts should be all in source control, ready for use by the automated deployment system.
From the developer's perspective, an automated build system may be a pain to set up, but it is always a pleasure to use. It is a great feeling when red turns to green; all the unit tests run overnight, and in the morning it is just green ticks. Loads of test databases automatically created, or spun up, fixed and randomized data inserted and all those tests run after the lights at work were switched off. Then, because the alterations look good, they are included in the automated build and subjected to the suite of integration and scalability tests, again all run overnight and done by the morning. As one idly sips coffee and scans the paper, the last tests complete and we have a reliable version of the software.
The Benefits of Continuous Delivery
During the early stages, it can be hard for a team to appreciate the benefits a development practice that imposes certain restrictions on them. It means they have to spend considerable time and effort in practicing build and deployment, rather than working on features. Early feedback from production and support staff may cause them to rethink certain development decisions, slowing initial progress still further.
It can be tempting to just "push on" with the features, but any time saved early in development will disappear very quickly when it comes time to deploy. Do not underestimate how often the unpredictability of the deployment process causes major delays in the completion of an IT project. The team that delays embarking on a deployment process will find errors springing up in all parts of it, from test, UX signoff, security audit, resilience testing, misunderstandings about data feeds, code reviews, misjudged hardware scaling, and user-acceptance testing. When these problems arise, it is tempting to blame 'politics' for the inevitable postponement of the software release, which is tantamount to describing a headache as a cause rather than a symptom.
If teams start deploying early, and continue to refine their build, test and deploy processes, the most obvious benefit becomes predictability. If the whole build and deployment process is timed, and done regularly, the manager will be able to give a good estimate of how long it will take. Most, but not all, of the likely issues will be revealed much earlier on in the development process. Once they are known about, then the project managers can predict timescales better and act to resolve conflicts easier.
The idea of designing and developing software for the purpose of deployment is no more alien than the idea that Nature designed baby mammals to pass through the birth canal.
If a development team is able to do rapidly build, test, deployment and release, then it is better able to support whatever development methodology is in place. It is wrong, as some have claimed, that the ability to do rapid deployment is somehow linked with Agile methodologies. The problems are not so much with the developmental methodologies in place, as with the technical barriers; it isn't easy to automate all the steps and some, like exploratory software testing, are logically impossible to automate.
The Challenges of Continuous Delivery
Testing, change control, release, and deployment should be as slick, automated, and painless as possible, but no more so than that. Continuous Delivery does not imply reckless release of software, but is geared more to the idea that all the areas of expertise within IT have the current version deployed and available to them to spot issues earlier.
Whereas the controls within the deployment and release process are minimal for a startup or small IT shop, they become important for the corporate, government, healthcare or large manufacturing setting. Controls are there for a reason.
The need for Multi-skilled staff
To move closer toward the reality of Continuous Delivery, a broader degree of skill will be required of IT staff. For example, operations people such as network admins and DBAs will need to be familiar with skills such as scripting with PowerShell. Developers will require more awareness of other parts of the build, integration, test, UX testing, deployment, user-acceptance and release process, and appreciate the constraints of the wider context of the software that they create. There is, however, a danger of automating processes that are no longer required, irrelevant or seem to be there simply to give work to the otherwise unemployable. Even those processes that are really important, such as bug-tracking or quality-checking may be inefficient and inappropriate for rapid delivery. There is no sense in automating a process that is deficient. These will need attention first, and this inevitably requires broad knowledge and experience in a number of different aspects of development.
The problem of scale to the corporate IT setting
It is difficult to scale the practice of running the entire deployment process regularly, and automating as much as possible of it. The process must be able to work in a traditional moderate-to-large scale legacy-IT setting. Where a range of different expertise is required, such as change-management or compliance, the task of coordinating these processes with the deployment process becomes more complicated. However, if other experts are able to become familiar with the product as it evolves and are alerted to potential issues earlier in the development process, there are less likely to be major issues to resolve when it is released.
User Documentation and Training materials
There is still a lot of work to do even after the final build of the software. User documentation, and the training materials are obvious examples. Although it is reasonable to keep documentation and training in step with the software if releases are sporadic, continuous changes to the functionality of a software application are a different matter altogether. Where a training manual or help text makes use of screenshots, these all have to be redone on every change: training courses have to be checked on every release to make sure that they are up-to-date: It can become a daunting task.
Documentation requirements for production
Any production or operations department will have an 'operational readiness checklist ' requirement for support of the application, including any necessary 'instrumentation' for monitoring the application in production to alert for potential problems. All maintenance and support procedures have to be described in step-by-step terms in case of emergency. There must be sufficient materials provided for operational training as well
Database Challenges for Continuous Delivery
There is an entrenched myth that a database is "just code" like any other application code and therefore the team can treat it in the same way, evolving the structure of the database rapidly, in step with the team's application, and as their understanding of the problem domain, evolves.
In fact, the database poses very particular challenges for our Continuous Delivery processes, making it even more critical that teams practice their database build and deployment process from very early on in the development cycle.
The Myth that a database is 'just code'
One can see how the idea started that database deployment is no different from application deployment, since a small database with no data, or external dependencies looks much like any application code. All you need to do is to execute the code and the database is set up. There is a danger in generalizing from this experience. As databases grow, so an apparently simple task gets complicated. What if you are updating a database to a new version, rather than just creating it, for example? What if the database has scheduled tasks? What if the database has ETL routines or if there is a separate database doing reporting. Having coped with these cases, what if the database is part of a complex replication system, or if it accesses linked databases across servers?
A database of any complexity is more than just a collection of tables, views and routines. Aside from database objects, we need to consider the deployment of a number of server objects such as linked databases, endpoints, linked servers and triggers that will need to be scripted as part of the deployment. There will also be SQL Server Agent objects such as Agent alerts, Agent jobs, job alerts, and operators. You will probably need different security objects, such as those associated with your application. There are ETL processes, driven by PowerShell or SSIS, and other processes that defy generalization. All this needs to be part of the build.
There are many ways of distributing the processing in a database of any size. Sometimes, several databases share a SQL Server instance. At the other extreme, we might spread a single logical database across several linked instances in a number of physical databases. Co-hosted databases all share the same system databases, so the job and alerts for all databases on the instance are stored in the same
msdb database. Likewise, SQL Server defines linked-server mappings at the server instance level, so they are not stored with the database either. If these mappings are missing when you deploy a database then your links won't work. Instead, you'll get an error.
More subtle problems come from the common arrangement in the corporate IT environment where several applications will share one or more databases. From the database perspective, it is serving several applications and database feeds. The one-application-one-database model doesn't apply. In this case, the database will have a version, and each API to each application has a version. This requires rather different handling.
A method of deployment that works fine with a small database may not scale up to handle industrial-scale databases. Here, the idea of building the database in any reasonable time-frame is absurd. Even the DDL (tables, procedures, constraints, permissions and so on) takes a surprising amount of time when there are several thousand tables or procedures. On top of that, there is the task of copying in the data. On a system like this, even the trivial act of adding a column to a table can lead to the database being inoperable for maybe an hour. Who knows? It could be a lot longer.
All these difficulties have their solutions, of course, but it requires expertise, and good judgment. I would also argue that as much as possible must be automated, and preferably done when the system was first developed.
Source Control Issues
Developers who are unfamiliar with Database Development are sometimes horrified by the fact that a significant minority of Database Developers don’t use source control. It is certainly bad practice, but not for the reasons usually given. If a SQL Server instance is being used for a shared development, and is subject to a good backup regime, it will not easily suffer loss of code. It is, also, unlikely to suffer integration or build problems. Both the default trace and a server trigger can tell you when an object was altered and by whom. The problems come from the database’s application-interface being out of sync with the version number of the application, and from the source not being held together. Normally, where database developers are using their own system, they also use source control to make it simpler for team-working, to add documentation that cannot easily be attached to individual database objects, for belt-and-braces backup, and to make sure that the source code, as a company asset, is stored in easily-accessible text format. There is another subtle problem that is due to the fact that some of the code, and rather more documentation and other information, from the typical hand-cut Database build script isn’t retained in any way in the database. This includes comments and comment blocks outside module definitions, static data insertion code, and in-line assertion and integration tests. The only way of team-working with hand-cut build scripts, without fear of losing information or ‘merge-hell’ is to use a source-control with sophisticated merge capabilities.
A database is unlikely to be operable without initializing it with data. The data that is required for the database to function should be part of the standard build, and the DML SQL code to make this happen must be held in source control.
Database teams will manage the complexity with large or complex databases by partitioning the work into Schemas. This requires that the source code also be logically split into schemas. The tooling that is currently available gives little support to this sort of working. The SSMS object browser, for example, does not allow the listing of objects hierarchically within schemas. It is also likely that source control will need to reflect the schema structure in the same way that C# code can be saved in namespaces.
For the database, as well as a nightly clean build from the SQL Data Definition Language (DDL), I like to do an upgrade from the previous release to the current build that preserves existing data. This may not always be possible if there has been significant refactoring, but SQL Compare can sort out most of these problems. If it is impossible to determine how existing data is preserved when these DDL changes are made, then SQL Compare might need help with a hand-written rather than automated migration script.
Basic integration and object ‘assertion’ tests.
A problem with putting automatically-generated scripts into source control is that documentation is lost. The traditional scripts used by database developers include comment blocks outside script blocks, stored at the end of column definitions, or within table definitions. These are all inevitably lost if the build scripts in source control are generated from the code in the database or from a comparison tool such as SQL Compare. Additionally, all the ‘assertion’ scripts that check that every function or procedure works properly with at least a test case, or which ensures that a function is first compiled with characteristic parameters, are all lost. Many Database Developers also add basic integration tests into the end of a build script to check that a series of processes give the predicted results, based on a standard immutable data set.
All these ‘extras’ that are not stored in the database, must be preserved in source control, and executed in the right order. Any errors or warnings must also be detected and logged properly.
Curing the Database-Application mismatch
Earlier, I mentioned the particular problem where several applications will share one or more databases. This problem is usually solved by creating an interface, consisting of views, procedures and functions. This represents a consumer-driven ‘contract’ between database and application that is changed only after mutual agreement and is ‘versioned’ in source control. The use of an interface between database and application in order to provide an abstraction layer is probably the most important way of solving the mismatch, but is often rejected in favour of unfettered access by the application to the base tables, but this creates a large number of dependencies that are difficult to control and lead in turn to problems of release and version tracking that have to be dealt with at the time of deployment, which is a bad idea. When a single database supports more than one application, then the problem just gets more interesting.
Direct database access and impedance mismatch
If an application requires access to a database, then we have to be able to deploy all the objects that are accessed so as to be version-compatible with the database, in phase. If we can deploy both together, then the application and database, or at least the database relevant schemae, must normally be deployed at the same version in which they, together, passed integration and functional testing.
To be more precise, it is the application-interface definition of the database that needs to be in a compatible 'version'. You wouldn’t need to be concerned with the TSQL source code of the interface as this will change as the database evolves: it is the unchanging name of the views, procedures and functions and, where relevant, the names, datatypes and meaning of the parameters passed to them that are important. Unfortunately, most databases that get into production have no separate application interface; in other words they are 'close-coupled'. For this vast majority, the whole database is the application interface, and applications are free to wander through the bowels of the database scot-free.
As well as fixing the mismatch between application and database, an interface makes tests for the application a great deal easier. Instead of hand-crafted’ stubs, the application can be ‘pointed’ (via connection-string or whatever) to a stripped down version of the database that can be installed on a developer’s machine, with sufficient data to allow unit tests.
If you've spurned the perceived wisdom of application architects to have a defined application interface within the database, based on views and stored procedures, then any version-mismatch will be as sensitive as a kitten. If a small change is made withing the internals of the database, such as when a column in a base table in the database is renamed or has a change in datatype, then this will break the application. A team that creates an application that makes direct access to base tables in a database will have to put a lot of energy into keeping database and application in sync, to say nothing of having to tackle issues such as security and audit. It is not the obvious route to development nirvana. An abstraction layer can easily prevent this.
I've been in countless, tense meetings with application developers who bridle instinctively at the apparent restrictions of being 'banned' from the base tables or routines of a database. There is no good technical reason for needing that sort of access that I've ever come across. Everything that the application wants can be delivered via a set of views and procedures, and with far less pain for all concerned: This is the application interface. If more than zero developers are creating a database-driven application, then the project will benefit from the loose-coupling that an application interface brings. The SOLID principles of the 'gang of four', for which any object-oriented programmer will fight tooth and mail, within application code, apply equally to the database, particularly the dependency inversion and interface segregation principles.
Once we allow an intimate relationship between application and database, we are into the realms of impedance mismatch. Part of this impedance problem is a difference in development practices. Whereas the application has to be regularly built and integrated, this isn't necessarily the case with the database. An RDBMS is inherently multi-user and self-integrating in that it isn’t rebuilt from code on every change. If the database-developers work together on the database, shared on a server, then a subsequent integration of the database on a staging server doesn't often bring nasty surprises. If a database developer attempts to alter code with results that ‘break the build’ it is made immediately obvious with unit tests, or will be prevented by existing constraints.
A separate database-integration process is only needed if the database is deliberately built in a way that mimics the application development process, but which hampers the normal database-development techniques. By using this technique, each developer can, separately, alter a database, and satisfactorily unit-test the objects they create, only to find that the database will not build because another developer has altered an object referenced by an object they’ve altered. This can’t happen if they are sharing a development server. This process is like demanding that an official walk with a red flag in front of a motor car. In order to closely coordinate databases with applications, entire databases have to be 'versioned', so that an application version can be matched with a database version to produce a working build without errors. There is no natural process to 'version' databases. Each development project will have to define a system for maintaining the version level.
A curious paradox occurs in development when there is no formal application interface. When the strains and cracks happen, the extra meetings, bureaucracy, and activity required to maintain accurate deployments looks to IT management like work. They see activity, and it looks good. Work means progress. Management then smile on the design choices made. In IT, good design work doesn't necessarily look good, and conversely bad design work doesn't necessarily look bad.
The importance of the interface
The idea of an application interface with a database is as old as I can remember. The big corporate or government databases generally supported several applications, and there was little option. When a new application wanted access to an existing corporate database, the developers, and myself as technical architect, would have to meet with hatchet-faced DBAs and production staff to work out an interface. Sure, they would talk up the effort involved for budgetary reasons, but it was routine work, because it decoupled the database from its supporting applications. We'd be given our own stored procedures. One of them, I still remember, had ninety-two parameters. All database access was encapsulated in one application module.
The basic idea is that each application has a stable, well-defined application interface with the database (one for each application, usually). Essentially, the application development team owns the application interface definition, which should be within the application source. It is the responsibility of the database developers to implement the interface and maintain it, in conformance with the ‘customer-driven’ contract.
The team should keep the external definitions of the components of this interface in version control, in the same repository as the application source, along with the tests that verify that the interface is working correctly. Internally, we can make all sorts of changes and refactoring to the database, as long as the definitions of the components are maintained in source control . The database developers will, of course, maintain their own source control for the entire database, including the source of all database objects, and will be likely to maintain versions for all major releases. However, this will not need to be shared with the associated applications. What is important here is that the database development role is separated from the application development role, even if it is the same developer performing both roles.
Database developers and application developers must carefully track and negotiate any changes to the interface. Changes in it have to be subject to change-control procedures, as they will require a chain of tests. We design a comprehensive set of daily, automated integration and functional tests for the application interface, for each version of the application. If it passes all tests, nothing is broken. The performance tests can 'hang' on the same interface, since databases are judged on the performance of the application, not an 'internal' database process. In this setting, the deployment can proceed if the more stable application interface, rather than the continuously-changing database, passes all tests, for the version of the application.
Normally, if all goes well, a database with a well-designed application interface can evolve gracefully without changing the external appearance of the interface, and this is confirmed by integration tests that check the interface, and which hopefully don't need to be altered very often. If the application is rapidly changing its 'domain model' in the light of an increased understanding of the application domain, then it can change the interface definitions and the database developers need only implement the interface rather than refactor the underlying database. The test team will also have to redo the functional and integration tests which are, of course 'written to' the interface definition. The database developers will find it easier if these tests are done before their re-wiring job to implement the new interface.
If, at the other extreme, an application receives no further development work but survives unchanged, the database can continue to change and develop to keep pace with the requirements of the other applications it supports, and needs only to take care that the application interface is never broken. Testing is easy since your automated scripts to test the interface do not need to change.
The Difficulties of Refactoring
It is a cause of irritation to the developers that the DBAs discourage refactoring. Whole books have been written with the shaky assumption that one can evolve databases as one's understanding of the business domain evolves. If it weren't for the inconvenient presence of data, this would be a sustainable approach. Actually, any grey-muzzled database developer will try to avoid refactoring tables that contain many millions of rows, and are subject to high levels of access by database users. The good developer will avoid these problems by getting the design right up-front, taking pains to ensure that the application domain is properly understood.
Whereas there is little problem in changing procedures or other routines, or even adding indexes and constraints, things can get tricky with big tables. If, for example, an upgrade script requires an
ALTER TABLE command that will result in changes to data. When the command is executed, the data altering starts immediately, requiring a schema-modify lock on the table. SQL Server fully logs all actions that form part of the table modification, in order to ensure it can roll back the operation.
If, for example, we wish to add a column to a large and important table, then every other process that needs to access that table will be blocked for the duration of the modification. The modification will be logged, of course and if something goes wrong, as when a constraint is violated, or the person doing the update loses their nerve and cancels the operation, then the rollback can seem interminable.
It isn't just changes to columns that will bring a production database to its knees; many other
ALTER operations on large tables, such as changing the clustered index, can take a long time.
Some alterations can't be performed at all using
ALTER TABLE, so we have to fall back on the even longer process of creating the altered version of the table as a new table with a different name, copying the data into it, and then finally renaming both tables. If, for example, a column participates in a schema-bound view, we cannot change that column.
There are all sorts of restrictions on how we can or can't alter a column, depending on its data type. For example, with SQL Server:
- We can't alter a column if it has a
timestampdata type, is a computed column or is used in a computed column, is the
ROWGUIDCOLfor the table, or is used in a
- We can only increase, not decrease, the length of a column used in an index,
CHECKconstraint, or in custom statistics if the data type is
- Although it is certainly possible to alter the length of a VARxxx (
)column if it is involved in an index, or explicit statistics, we can't change its type. If the column is part of a
PRIMARY KEYconstraint then only a VARxxx column can be altered in size.
- Columns associated with a
DEFAULTdefinition can only have their length, precision, or scale changed.
- The data type of a column in a partitioned table can't be altered.
- If we change the length, precision, or scale of a column with the
ALTER COLUMNstatement, and data already exists in the column, the new size must be sufficient to hold the largest data item in the column.
Of course, SQL Compare will do all your thinking for you, when working out how to perform a change in a table. It knows all the complex restrictons and will work out whether an ALTER TABLE statement can be performed or if the table must be renamed, rebuilt and restocked; but it may have to opt for a strategy that will prevent a production system from accessing the table involved because of the necessary table-locks. If it is an important table, or one with a lot of dependencies on it, this is quite likely to bring trading to a halt.
At the heart of integration and deployment is the database synchronization process.
The first stage of synchronization is to ensure that the two databases are identical in their metadata, tables, and routines. This isn't an all-or-none action, since you can exclude certain classes of database objects from the process. You are unlikely to want to remove replication objects from a production server, for example. You may also need to synchronize certain objects that are outside the database, and based on the server, such as events and scheduled tasks.
The next stage of synchronization involves the data. If synchronization of metadata has already been achieved, then all is reasonably easy, with the right tool. The arm-wrestling with the data will already have been achieved on the integration server. Arm-wrestling? I mean that if you have made substantial changes to the schema of the databases, then you have the problem of preserving the existing data. Where does it go?
On the integration server, we create a latest version of the database from source control. However, we need to put the data into the database. Although a data synchronization tool will work fine when there are identical schemas, no tool could be expected do the DML to stuff the data into a newly refactored set of tables, since there is no way that any tool will be able to work out where the data should go. If there are substantial differences to the table structure, we would have to do this migration by hand-coded scripts, and these would need to go into source control. Migration scripts describe how to preserve the data when the schema changes between X and Y, we need to execute them when we upgrade a database from version X to Y. If we wish to use an automated process to deploy most of our database changes, then we need a way to over-ride the automated script when the system detects that it entails an upgrade between version X and Y.
Whenever there is a metadata change to the tables of the database, we will need a script that over-rides the data synchronization process between the two versions. The script needs to have checks to make sure that the data migration has not already been done.
There is a problem with separating a table refactoring from the data migration. The process should be atomic. Either the whole lot is done, or it is not done. This sort of deployment can't remain stuck in the tubes. This means that it should, ideally, be done all in one transaction, so it should be one script. This becomes impractical with very large amounts of data purely because of the time taken to do the data migration, whilst all the time holding a schema-modify lock on the table. Also, depending on the way you do your deployment, you may not want to migrate the data at that stage. There is no advice I can give that will fit every circumstance, but it is always essential to ensure that there is no possibility of the operation ending up half-done.
Another problem with scripting a refactoring change to a database is that it has to work both ways. If we deploy a new system into production and then have to roll back the changes afterwards, we could, unless we are careful, lose all of the data that has been added or changed before the rollback decision was taken. Yes, if you are subject to a stringent service-level agreement (SLA) then you have to script any rollback! These have to be mercilessly tested, and put in source control just in we need to rollback a deployment after it has been in place for any length of time. You will be faced with the dispiriting task of carefully writing scripts which, hopefully, will never be used, but at least you will sleep well!
Although it is possible to automate the deployment of databases, it is a mistake to underestimate the difference between a database and an application. The sort of difficulties one can come across range from the subtle to the scary, and it is hardly surprising that operations and productions specialists in IT can be apprehensive about a database deployment. In general, by far the best approach is to adopt the discipline of a daily automated build from source control, followed by an integration test run so as to deal with all problems as they happen. When it comes to deployment, these scripts can be integrated into the pre and post deployment scripts for the package, greatly increasing the likelihood of a stress-free delivery of the application.