About two years ago, my team deployed a production change to our loan origination system at Farm Credit Services of America. It wasn’t a significant new version; just one minor new feature and a couple of bug fixes. Shortly after our operations team pushed the changes live, our logs started filling up with deadlock errors, and the performance of this and other applications degraded rapidly. Within 30 minutes, the whole system was offline. In short, one change to one line of code brought down a production-level SQL Server and stopped 90% of our users from working for two hours.
During the aftermath, my team of developers worked closely with operations, rolling back to a stable state, diagnosing the cause of the problem, working on a solution, and discussing ways the whole situation could have been avoided. As painful as it was, it occurred to me at some point that this was the first time I’d had a chance to collaborate at length with our operations team, and that I was enjoying it.
I didn’t know it then, but it was the start of my journey into DevOps.
Deployment Time Bombs
This event and its aftermath forced me to reflect on how the various teams hadn’t really been working together as well as they could have up to that point. Like most of the development shops I’ve worked for, at the time there was not a whole lot of collaboration between my team of developers and the operations team (DBAs, web admins, sys admins). Developers would write the code and “throw it over the wall” to the operations team, whose responsibility it was to deploy it and keep it running smoothly. By the time the operations team had a chance to look at the changes, when we pushed them to pre-production, it was far too late in the development lifecycle; those changes needed to go to production.
It turned out that this “simple” code change from a Timeout to an Interval increased the throughput to our application by 1800%. SQL Server did its best to keep up, but the table this particular query was hitting had over 10 million rows and, because it lacked a proper index, SQL Server was scanning every row in that table every time the routine fired. At that time, we had several production databases on the same SQL Server instance, and all the resources were now being consumed by repeated execution of that one query.
To stop the hemorrhaging, we rolled back to a working state, and then embarked on a ‘triage’ process with the operations team. The DBAs retrieved the query plans and suggested some indexes to stop those nasty table scans. They also ran a trace and found 80% of all queries coming in had the same WHERE clause. With that in hand, they asked what data could be cached on the server. The web admins helped us learn to navigate New Relic and track down important troubleshooting information, such as the most expensive queries and how often they were being called.
We learned a lot during this process. Most of us knew how to run traces on a SQL Server, but the DBAs showed us how to refine them to a point where we could focus on the important queries and ignore the rest. We also learned about the tooling available to the operations team, such as New Relic, which can help find problems while they are happening. On the flip side, our operations staff learned more about the loan origination system, specifically about the search functionality. When we explained that the 10 million row table was intentionally de-normalized for fast queries and high throughput, they could adjust and help support that decision better.
Sadly, after the crisis was over, we all dispersed and went back to our normal jobs. This type of knowledge sharing was very helpful to both sides, but we needed a way to extend this collaboration so that it became a standard feature of the development cycle, instead of happening only in response to a crisis.
Defining what DevOps means to you
DevOps is about adopting the tools and practices that will enable collaboration between developers and operations, with the goal of improving overall quality of the applications. This can include working on tools to improve communication and automation of manual steps.
I knew we needed to adopt some DevOps practices. We needed to find ways to “collaborate more”, but that phrase is very abstract. It is akin to “I need to lose weight”; it’s a goal or desire rather than a strategy. Likewise, I was struggling to define what “we need to do DevOps” really meant to our organization.
The turning point came when I overheard someone describe DevOps as the answer to the question:
“What do we have to change in our processes to be able to deploy to production 10 times a day?”
The implication, of course, is that when the collaboration is done right, it is possible to deploy to production 10 times a day.
Excited, I posed the question to a few people, only to be greeted with a lot of worried looks. Firstly, we had to get approval to go to production, and that was a very manual process. Surely, this was a blocker? True, and my immediate response was to propose that, in that case, we needed to remove that manual approval step!
Cue more worried looks. It turns out that as soon as you talk about removing manual checks and making statements to the business such as, “We want to be able to go to production every 45 minutes”, then the business people tend to worry, a lot, and with good reason. After all, we were just starting on this journey and really had no idea about “best practices”. As Indiana Jones says in Raiders of the Lost Ark, “we are making this up as we go along”. In other words, we were going to make mistakes.
The truth was that our deployment process, at that point, was not well-automated. Each deployment was a “big event” and took time, and so the team tended to delay doing them. This meant that changes arrived in pre-production too late in the process, and the operations team often needed to do a lot of last-minute tweaking to get the new version running smoothly. This was the real, and perfectly valid, reason why the business felt the need for the manual approval step to get to Production.
Surely, then, the deployment process was the first “weak link” we needed to tackle. Maybe we couldn’t deploy to production ten times a day, but what if we could automate and streamline the process of deploying to pre-production to the point where it was easy?
The pre-production environment was as close to production as possible. Same hardware, same setup, basically the same everything, but only the IT staff used it. If it went down or we messed up, we were not stopping our users from doing their jobs. It was also the first environment where the servers were locked down just as they would be in Production (in the development and testing environments, all the developers are sys admins). In other words, pre-production was a perfect test bed for our applications. If we could make deploying to pre-production a “non-event”, I saw a couple of immediate benefits:
- Proactive instead of reactive troubleshooting. Let’s say we needed to troubleshoot a specific performance-critical feature. Instead of the development team making some changes, running a few tests, then throwing the new code over to Ops, we’d collaborate. We’d work with the web admins on determining what the current performance level is, using a tool such as New Relic. We’d work with a DBA to identify potential bottlenecks on the database. Then we’d make the changes, deploy to pre-production, and re-measure.
- Better Testing and Higher Code Quality. After each deployment to pre-production, we could run a set of smoke tests, or a load test could run at night. In addition to collaborating with the business and QA, by collaborating with DBAs and Web Admins we could create tests that would much more closely match the real world.
We refined our statement of DevOps intent to the following:
“What needs to change in our processes in order to be able to deploy to pre-production every 45 minutes?”
Why every 45 minutes? My theory was that most people can handle doing something once a day, even if it sucks. There is always some grumbling, but rarely enough to drive change. Inertia will kick in for enough people and no progress will be made. One example in our process is that every database deployment to pre-production has a delta script generated automatically, which is then manually approved by a DBA.
However, if you must do something that sucks, like manually approving a delta script, every 45 minutes, everybody will be screaming, “please make it stop, I’ll do anything to make this suck fest stop.”
The start of the DevOps Journey
With our goals in mind, I posed this question in our latest monthly meeting of the Octopus Deploy workgroup, which is made up of two DBAs, three web admins, three developers, two lead developers and an enterprise architect. Each job role in the room had its own ideas on how to make the 45-minute goal a reality. It quickly turned into a great brainstorming session.
The following sections summarize some of the ideas proposed by each role.
DBAs
The manual approval step is the biggest hurdle to overcome in going to pre-production. Going to the leaders and telling them we want to remove it, without having anything to replace it, will fail.
- Involve the DBAs earlier in the process. Pre-production is just too late; we need to start the collaboration a lot earlier. The manual approval step came about because they are not involved until pre-production. By involving them a lot sooner, we can eliminate this manual step.
- Is there an automated way to check the delta scripts for certain breaking changes? We were already manually checking the scripts for drop tables, drop columns, and so on, and there seemed no reason we couldn’t automate these checks.
- Start making use of a tool to verify database guidelines are being followed much earlier in the process, ideally during the build steps. One of the ideas proposed during the meeting involved writing tSQLt tests and making use of Redgate’s DLM Automation’s built-in support for running database unit tests.
Web Admins
The process for getting C# code up to the pre-production environment is fairly streamlined. The only time a manual approval is required is when a new project is about to be deployed or something in the deployment pipeline has changed. The expectation is the code will be pushed to pre-production by Octopus Deploy within a minute or two.
- Regularly scheduled realistic load testing. Review the logs to determine which endpoints are being hit the most and how often, and run load tests against them. Don’t have the load tests just push the system to the absolute extreme all the time; maybe first focus on being 20% higher than the heaviest load, or 50%.
- Research what it would take for blue/green deployments. What tooling can we use, how should we change Octopus Deploy, and so on.
- Use of New Relic to be proactive instead of reactive. If, after a deployment, the performance of the search function drops 5%, the development team should be notified before it degrades to a point where performance is unacceptable to our users.
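The load-testing and monitoring ideas above both come down to simple arithmetic on measured numbers. The following is a minimal sketch in Python, not the team's implementation: the function names, the default 20% headroom and 5% threshold, and the sample timings are all illustrative assumptions, and nothing here calls New Relic. The timings would have to be exported from whatever monitoring tool is in use.

```python
from statistics import median


def load_test_target(peak_requests_per_minute: float, headroom_percent: float = 20.0) -> float:
    """Request rate for a load test that runs a set percentage above the heaviest observed load."""
    return peak_requests_per_minute * (100.0 + headroom_percent) / 100.0


def slowdown_percent(before_ms: list[float], after_ms: list[float]) -> float:
    """Percentage change in median response time; a positive value means slower."""
    baseline = median(before_ms)
    if baseline <= 0:
        raise ValueError("baseline response time must be positive")
    return (median(after_ms) - baseline) * 100.0 / baseline


def should_notify(before_ms: list[float], after_ms: list[float],
                  threshold_percent: float = 5.0) -> bool:
    """True when the slowdown after a deployment reaches the alert threshold."""
    return slowdown_percent(before_ms, after_ms) >= threshold_percent


# Made-up sample timings (milliseconds) for a search endpoint.
before_deploy = [200, 210, 190, 205, 195]
after_deploy = [212, 220, 208, 215, 210]

print(f"load test target: {load_test_target(1000):.0f} requests/minute")
print(f"slowdown: {slowdown_percent(before_deploy, after_deploy):.1f}%")
print("notify developers:", should_notify(before_deploy, after_deploy))
```

The median is used here rather than the mean so that a single slow request does not trigger a notification; a real check would probably look at percentiles as well.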
Developers
There were several strategies that the developers wanted to start looking at to help get to pre-production every 45 minutes.
- Research additional analysis tools to help enforce C# guidelines. The loan origination system my team is responsible for has the guideline that any time a database is hit or a service is called, the code must make use of .NET’s async programming.
- Automated testing right after the deployment, not just unit testing during the build. Some examples proposed would be a suite of service tests using a tool like FitNesse or Postman to make sure a RESTful service deployment was successful. This would allow us to make use of a blue/green deployment strategy.
- Start considering breaking apart large monolithic applications into smaller components and only deploy those components that change. For example, if there is a Windows service that only changes once a quarter, then it doesn’t make sense to rebuild and redeploy it all the time. Only build and deploy when a change needs to happen.
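The post-deployment service tests mentioned above could start as something very small. The sketch below uses only Python's standard library rather than FitNesse or Postman, and it is an illustration under assumptions: the function names are invented for this example, and a real suite would check response bodies as well as status codes.

```python
from typing import Callable, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout_seconds: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrived."""
    try:
        with urlopen(url, timeout=timeout_seconds) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return 0


def failed_endpoints(urls: Iterable[str],
                     get_status: Callable[[str], int] = http_status) -> list[str]:
    """Return the URLs that did not answer with HTTP 200."""
    return [url for url in urls if get_status(url) != 200]
```

A deployment step could call `failed_endpoints` with the service's own health-check URLs and refuse to continue (for example, refuse to switch traffic in a blue/green setup) when the returned list is not empty. Passing `get_status` as a parameter keeps the check testable without a network connection.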
The Next Steps
After this meeting, my team refined our ideas still further, and proposed the first four specific steps that we would take, to help us get to pre-production every 45 minutes:
- Delta Script verification – we are using Redgate’s DLM Automation for our deployments, which produces a delta script of changes. Can we run some regular expressions using PowerShell to look for certain key phrases we know will cause issues when going to production? Our goal is to run this verification on every deployment to every environment, and to stop the deployment if a ‘bad change’ is about to happen.
- tSQLt Unit Tests – we are considering what kind of verification these tests can provide out of the box as well as any other tests we can add. These tests would run during the build and would help prevent any bad changes from reaching Octopus Deploy.
- Scheduled T-SQL Script Verification – the DBAs and enterprise architects have created a set of scripts that can tear through a database and check to see if guidelines are being followed. Where guidelines are not being followed, the scripts will spit out a SQL statement to update the database so it does follow guidelines. We are looking at ways we can schedule them and at where we can include them in the process (build or deployment).
- Smaller Builds – start looking into the large builds that take over 15 minutes to complete. Do all the components in the build really need to be built on every check-in to version control? What steps are redundant? Can it be separated into smaller builds? This was the topic of a recent article on my site.
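To make the delta script verification idea concrete, here is a rough sketch of the kind of check we have in mind. We expect to write ours in PowerShell; this version is in Python purely for illustration. The pattern list, function name, and sample script are hypothetical, and the real rules would come from the DBAs.

```python
import re

# Hypothetical rules: phrases that should stop a deployment when found in a delta script.
BREAKING_PATTERNS = {
    "drop table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "drop column": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
}


def find_breaking_changes(delta_script: str) -> list[tuple[int, str, str]]:
    """Return (line number, rule name, line text) for every suspicious line."""
    findings = []
    for line_number, line in enumerate(delta_script.splitlines(), start=1):
        # Skip single-line T-SQL comments so commented-out statements are not flagged.
        if line.strip().startswith("--"):
            continue
        for rule_name, pattern in BREAKING_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name, line.strip()))
    return findings


# Made-up delta script for demonstration.
sample_script = """\
ALTER TABLE dbo.Loan ADD ApprovedOn DATETIME NULL;
-- DROP TABLE dbo.Scratch;
ALTER TABLE dbo.Loan DROP COLUMN LegacyCode;
DROP TABLE dbo.OldSearchCache;
"""

for line_number, rule_name, text in find_breaking_changes(sample_script):
    print(f"line {line_number}: {rule_name}: {text}")
```

A line-by-line regular expression check like this is deliberately crude: it will miss a statement split across two lines and does not understand block comments. It is meant as a first safety net, not a replacement for a DBA's review.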
While working on these ideas, we are going to focus on getting more and more people on board with the idea of DevOps.
Early in 2017, at the first meeting of the year for the “Octopus Deploy Workgroup”, we plan to rebrand to the “DevOps Workgroup” and work together to come up with an “elevator pitch”, as well as goals for us to focus on and a strategy to get this implemented across all development teams. I’m really excited about the upcoming meeting. My goal was to get us started having these conversations, and it is really cool to get a group of people together to start putting an action plan together.
After that first meeting, where we discussed as a group what needed to change in order to be able to deploy to pre-production every 45 minutes, one of the DBAs stopped me in the hall and said he was all for DevOps, and that most of the DBAs were on board as well, but that we had to be careful not to ram this down their throats.
I couldn’t agree more. During the time I’d spent trying to “push through” these proposals, I’d learned a lot. Firstly, change must occur on all sides of the DevOps equation. It can’t just be the operations people being forced into change.
Secondly, in any enterprise, change takes time. If you rush in proposing to remove this step, and that manual check, in order to reach your goal, expect a lot of push back, often for very valid reasons. First, you need to prove that your processes can be trusted.
Finally, while most of the time people are “for” the change, most of the time they are also very busy. Despite itching to get started on my proposed process changes, for the past six months our teams’ focus had to be solely on delivering a major new project to the business. Any other changes being proposed had to be postponed. It is a similar story for all of the other development teams in the organization, and each team has their own set of priorities.
That all being said, I have been very encouraged by much of the feedback I have received so far. The people I have spoken to are very excited to get started on this collaboration, which is great and makes it feel much less of an uphill struggle.