One of the greatest pain-points in developing a database-driven application happens when the application is in source control,but the database isn't. When the development database is shared, the pain increases, and it is not alleviated by source control alone. Troy Hunt spells out why each database developer must have their own version of the database.
I have memories from the late 1990s of building classic ASP apps in VB script, using Dreamweaver, side by side with my fellow developers, working on the same set of files using the same mapped path. You had to talk a lot; “Have you closed the CSS file? I need to add a class”. And I remember the painful ‘find and replace’ process, safe to execute only once every developer had saved their work and closed all their files. We huddled around shared drives mapped to the same UNC path and recklessly worked on the same set of files before firing them up in the browser right off the same server.
Back then, there was source control of a kind, by which I mean Visual Source Safe, CVS or even Rational Clear Case: But normally it would be the classic pattern of selecting the root folder of the app, and then going “CTRL-C –> CTRL-V –> Rename” with the date appended at key points in the project.
In hindsight, there were probably better ways of doing it a dozen years back; but the practices were reasonably common. Today, no one in their right mind would consider building apps this way. So why are so many people still using these same methods to build the databases behind their web applications? Is the data layer really that special that it needs to be approached entirely differently? Don’t those hard-learned lessons from the last century apply to database development?
What are we talking about here?
Just so there’s no confusion, let’s try to be clear what we’re talking about. When using the shared database development model, the developers build the web app layer in the usual fashion (Visual Studio running locally, files stored on the developer’s PC) yet all connecting to a remote database server and working directly on the same DB. It looks something like this:
Usually they’d also be working with SQL Management Studio running locally and connected to the remote database server. When the app actually runs locally, all data connections are going out to the shared, central development database. There is no local instance of SQL Server on the developers’ machines.
The alternative, of course, is dedicated development databases. Things now look a little bit different:
Obviously each developer has their own version of the database but the biggest difference to the earlier model is the presence of a version control system. Why?
The problems of the Shared Development database
The “last writer wins” problem
The obvious problem with collectively working on a shared database is that, when it comes to the problem of multiple developers working on the same object at the same time, it’s a case of “last writer wins”. No merge, no management of editing-conflicts, just the last guy getting his way.
In order to mitigate the risk of this happening, you have to implement social mechanisms to work around the problem. So you’re back to developers communicating backwards and forwards in an attempt to keep their fellow coders out of their work in progress. It’s clumsy, it’s labour intensive and it will fail. It’s just a matter of time.
The “experimentation” problem
An essential part of software development is experimentation. Unless you’re a true believer in the waterfall mantra of “design then build” and two shall only ever progress in that sequence, you’re inevitably going to build some code that sucks and consequently change direction. This is healthy, it’s part of the continuous improvement cycle we all implicitly exercise every time we build software.
What’s not healthy is to unintentionally impede the work of the other developers in the team by exposing the consequences of your experimentation to them. It could happen in all manner of ways: It could, for instance, be that your work breaks their code by removing dependencies, or by unexpectedly changing the result of their ORM layer generation by adding objects. Experimentation should be about you trying different approaches and not forcing it upon your team. In short, you need to play in your own sandbox; a sandbox that spans each layer of the application.
Martin Fowler has a nice summary in his article about Evolutionary Database Design
Evolutionary design recognizes that people learn by trying things out. In programming terms, developers experiment with how to implement a certain feature and may make a few attempts before settling down to a preferred alternative. Database design can be like that too. As a result, it's important for each developer to have their own sandbox where they can experiment, and thereby prevent their changes from affecting anyone else.
To develop code, you need to be able to experiment alone, so you must insist selfishly on your own play-space.
The “unpredictable data state” problem
Whilst the problems of “last writer wins” can be largely ameliorated by such strategies as segmenting the tasks or announcing changes before making them, the unpredictable state of the data is a whole new level of trouble.
The problem is simply that databases tend to change as the client applications are used. I don’t so much mean the changes at the object level as much as the data within the tables. It’s the whole reason we’ve got the things in the first place; so that we may manipulate and persist the state of the data.
Let’s take a typical example. We’ve got an app with an authentication module and administration layer and each of the developers have created accounts which they can test with. But the guy building the administration layer wants to test how the app behaves with no accounts in it. Now we’ve got a problem because deleting the existing accounts is likely to hinder the other team members.
As things get more complex, the problems worsen. Applications that are highly dependent on the state of data, for example, become a nightmare when you simply can’t control how it’s being changed. Unpredictability is not your friend when it comes to building software.
The “unstable integration tests” problem
If you have complex data dependencies when the time comes to do integration tests, it is essential to have a predictable, stable set of data to run against. Otherwise, with a constantly changing set of data, you can abandon all hope of a function returning a predictable set of data. You’ll never achieve this if the rest of the team is continuously evolving both the schema and the data.
The “objects missing in VCS” problem
You will never get continuous codebase integration with the work of your team if every change is made centrally without source control, and is immediately available to everyone. What’s the motivation to commit your code?
Of course what’s really happening here is that the shared model is simply allowing a bad practice to creep in without any repercussions. If changes have to be committed, then the benefits of Version Control will become apparent. Dedicated development databases nurture good VCS practice.
The “disconnected commuter” problem
By using a shared development database, you are forced to be connected to the network at all times. If you want to take work with you and slip some development-time into the train journey, or if you want to work through the weekend in the comfort of your own home, you’re only going to have half the solution available. Yes, there are often VPN solutions or other means of remote access but you’re starting to increase the friction of working productively.
The same problem extends to working effectively with those outside the network segment which contains the shared database. Want to send work out to a partner for a few weeks? Sure, you can do that by backing up the DB and sending it out but the chances are you’re going to have issues integrating back into the trunk of development if the project is really not geared for distributed database development.
The project has been tightly coupled to a single internal implementation of a database server and this is always going to result in more difficulties further down the line.
The “sysadmin” problem
One of the problems with development in any server-based environment, including a database server, is that there are times when elevated privileges are required. When this environment is shared, there are potential consequences well beyond the scope of a single application.
Here’s a case in point; I recently suggested to a developer that their performance tuning could benefit from some SQL Server Profiler analysis to take a closer look at things like reads and writes. This particular case involved a shared database so the next thing that happens is I get an email back with an image like this:
Frankly, I don’t want the guy to be sysadmin on a box that may contain totally unrelated databases to which he probably shouldn’t have access . I could give him ALTER TRACE permissions (and ultimately, I did), but of course this has to be set at the master database level so now he has the right to inspect every query across every database.
This discussion would never have even taken place in the dedicated local database scenario. He would have simply already had the rights and it would have been dealt with locally. There are plenty of similar occasions where the rights a developer needs to do their job exceed what should be granted in a shared environment.
The “unexpected performance profiling” problem
Continuing from the previous point, performance-profiling in a shared environment where you have no control over the other processes running on the machine is an absolute nightmare. That query which takes 20 seconds to run in one test-run can easily blow out to 50 seconds a few moments later. Why? You have no idea.
Whilst it’s always a bit tricky getting consistent results from any sort of performance profiling, the worst thing that can happen in the midst of this is other processes getting in your way. When you’re doing this locally, you have both visibility and control over these processes.
Of course there are cases, such as where huge volumes of data which would normally be queried on serious servers, where performance profiling on a PC is not going to yield constructive results. However, for the majority of performance-tuning tasks, a developer needs a predictable environment more than anything else.
The “non-representational latency” problem
So let’s say you’re working on a shared database which is almost inevitably located on a server in a data centre. Where exactly is that? How many milliseconds of latency are being added to each round trip?
The problem is that in a world of data centre consolidation, you’re quite possibly going to be loading up a whole heap of additional latency to each ADO.NET connection which isn’t going to represent the target live environment. I’m guessing you don’t have the same sort of gigabit-Ethernet connectivity from your PC as the production web application server will have, and that creates a little bit of a problem.
It’s a problem in that the application performance in the development environment is going to be comparatively slow. The degree of sluggishness will depend on the latency and the amount of activity over the wire but, for example, 150ms to a remote SQL server coupled with a chatty application is not going to make for a very accurate representation of real world app performance.
The “my development machine is slow” problem
Of course the “problem” with developing databases locally is that it is necessary to run SQL Server on the PC. I say “problem” in quotes because the issue is not so much that SQL Server is asking too much of the machine, it’s that developers are all too frequently given slow, underspecified, PCs.
If developers are given machines which struggle to concurrently run Visual Studio, SQL Server and the usual “business as usual” tools (Outlook, Word, etc.), there is a bigger underlying problem: Developers are not cheap. In Australia, your average developer is costing about $90k a year. There are then a whole bunch of other costs to contend with such as floor space, equipment (other than the PC), operating expenses (such as payroll) and on and on and on. Conservatively call it $100k annually or around $420 for each day they work. On the other hand, fast PCs are cheap. I recently replaced an aging laptop and the difference in price between a run of the mill machine designed for the desk jockey who lives in the Office productivity suite and an 8GB, SSD, i7 machine was $0.60 a day over a three year lifespan. Put it this way – if you fit the $90k/y bill and you’ve read this far (say 10 minutes), you’ve just consumed three weeks’ worth of super-fast machine upgrade, based on your hourly rate and the time cost of reading this post. Enough said.
And yes, yes, I know developers and costs are a lot cheaper in other countries. So let’s assume only $25k annually; you’re still looking at over $100 a day for these guys and a $0.60 cost to fundamentally improve their productivity. If you need to debate the mathematics of this with anyone, it’s probably time to have a good hard look at how conducive the environment is to having a productive, fulfilling role; for some essential further reading, check out Jeff Atwood’s Programmer's Bill of Rights.
Getting to the solution of dedicated development databases
The “SQL Server Developer Edition” solution
Microsoft provides the perfect means of developing DBs locally in the SQL Server Developer Edition. This is effectively a full blown Enterprise edition licensed for non-production use. Chances are you already have a license if you have an MSDN subscription but even if you don’t, it’s dirt cheap.
Installed locally, it can easily be configured so the service doesn’t start automatically if you’re really worried about it dragging down the performance of your PC when you’re not even using it:
But having said that, the resource usage is actually pretty small unless you’re seriously pounding it. Mine is sitting there consuming only 340MB of memory (about 4% of what’s on the machine) and 0.4% of CPU. So unless you’re running under-specced hardware (again, this is reflective of a deeper problem), the performance impact shouldn’t even be noticeable.
The “script it all” solution
One great thing about decentralising your development database is that it forces you to script a “ready state” of data. Versioning of database objects is one thing, and it’s obviously essential, but we all know that most applications won’t play nice if they start out with zero records in them. All sorts of reference data is usually required to initialise a database so the problem now becomes how you get it in there.
If you need to initialise the state of the database via SQL scripts, you are forced to think clearly about what your application needs to function. You need to think through the purpose of each table and what it needs to contain in order to achieve that “ready state”, rather than just organically growing reference records as required.
The other big bonus is that this script then goes into source control. It gets versioned along with the DB objects and persists in VCS for perpetuity.
Finally, scripts are fantastic for automation. It means that at any time you can pull a revision from VCS and have a clean, repeatable installation of the application. Tie that into a continuous integration environment and you now have one click deployment of the entire app.
The “this will damn well force you to do it right” solution
By working on dedicated local databases, the developer is forced into a number of good practices which could otherwise be circumvented in the shared world. The obvious one is source control; if you’re not versioning your database objects and reference data, you’re doing it wrong. You simply won’t be able to get away with it any more if the only way your colleagues can get the changes is via VCS.
So as to work effectively and not break builds, work needs to modularised and committed atomically. You can no longer get away with randomly changing unrelated parts of the application; otherwise you begin “breaking the build” for others, something which is generally not received very positively by your peers. This is a good thing; it forces more thoughtful design and conscious completion of tasks.
And of course you can’t get away with running SQL Server on that tired, cheap, PC with a single GB of RAM and an old 5,000 RPM disk. You actually have to get a half decent machine – I mean one that is actually suitable for building software on!
So you see the whole shared development database model can disguise the use of those practices you might not be doing properly to begin with. Working autonomously on a local DB becomes a self-perpetuating cycle of practice improvement as it simply won't let you get away with taking nasty shortcuts.
If you’re using a shared development database, the chances are that you’ve simply inherited the practice. Take a good look around; are you really working this way because it’s the most effective possible way of building software? In times gone by, it wasn’t easy to ‘version-control’ databases, but we’ve now got tools at our disposal to do it.
In terms of .NET, there’s obviously the official Microsoft Team Foundation Route but there are also offerings from third parties such as Red Gate’s SQL Source Control. Around the middle of last year I wrote about Rocking your SQL Source Control world with Red Gate and then Foolproof Atomic Versioning of Applications a little after that, both of which go into detail about the importance and value of versioning your databases. So I won’t repeat the message here. Just make sure you’re doing it, ok?
Developing locally on dedicated databases is not only better for the process of database development, it’s better for configuration, which means better for deployment. It’s also better for development processes in general, such as experimentation, modularisation of work. It solves all sorts of other problems which are engendered by the communal DB model.
So really, what’s stopping you?