
Taming the Wild Deploy

A role change taught me that even in the constantly changing world of web development, stability is king. I’d just joined a new team to provide technical guidance and oversight, and I didn’t realize what I had gotten myself into until I saw the steps to deploy to production. I saw our setup with a fresh set of eyes and spotted lots of room for improvement. I didn’t know where to start; all I could see was chaos.

The deploy process started with the lead developer reviewing a list of issues “scheduled for release” and merging them into master one by one. Next, he built the application and deployed it to our staging environment. He visited it in a separate browser window, confirming nothing seemed wrong except a few issues he explained were inherent to the staging environment. Ready to proceed, he went to a company chat room and told the usual release engineer he would like to deploy to production. That release engineer confirmed a few details he needed to pass to his deploy script. Twenty minutes later, the relevant caches had been cleared and the new version of the application was available. The deployment was complete.

I knew my boss wanted us to be more self-sufficient and able to run that release engineer’s script by ourselves, which I had previously been totally in favor of. But now my feelings were mixed. This sounded less like empowerment and more like a game of hot potato. I’d learned the pipeline had improved dramatically, but while it’s important to recognize accomplishments and improvements, I’m not used to looking at things in terms of how bad it was; what matters to me is how good it is. Things seemed far from good.

Imagine you’d been asked to provide recommendations. Go ahead and re-read that paragraph describing the deployment process, and imagine you’re responsible for what it yields — not just for one application, but many deploying every week, sometimes twice in a day. And if something goes wrong, your boss needs to know what happened because he’s going to have to explain it to his boss. How much ownership would you be ready to take?

Identifying the Issues

Looking back, my first step should have been consulting with the engineers who got us there to understand why and how things had become the way they were. I was preparing to ask my team’s engineers to use those systems, but all I saw was risk we were going to inherit. People who’d been around for longer knew it wasn’t perfect, but at least it worked. I had a different perspective: my new team had been stuck working with half-broken systems and without enough resources to make things right.

The first problem that jumped out at me was the lack of testing. We’d outsourced all QA work overseas to a team that manually tested on development environments running feature branches, with only rudimentary automated testing. As near as I could tell, integration testing was limited to a lead developer clicking around briefly on a staging environment before approving the build for production deployment.

The second problem was the lack of confidence in that staging environment. There were many reasons for this, but if I had to summarize it: hodge-podge architecture. Page content could come from a database coupled with the application server, or from template files as hard-coded strings, or from middleware with redundant caching, or from a service that behaved differently on different environments and applied poorly understood logic. The end result was that we didn’t seem to have a good place to perform integration testing before deploying to production, because our pre-production environments were not representative of production.

Some organizations have the same problem dressed a little differently: each environment is stable, but not accurately representative of any other. Engineers can’t solve problems in their local development environments because the problems don’t present themselves until they’re found on staging. This might seem better, but I’m not convinced it isn’t worse. At least when you dismiss an issue, you’ve seen it. Hiding it until it’s more expensive to fix is not an improvement.

A normal approach to solving this problem would be to identify all the issues and address them in priority order. But when you have real instability, that approach turns into a huge time burden: documenting issues that come and go without easy explanation. So here’s a big, easy shortcut: it doesn’t matter. If you don’t trust your staging environment, it doesn’t matter how good it “actually” is. When any issue more complicated than a typo can be dismissed without investigation, you’ve lost nearly all the value of testing. After chasing enough intermittent issues, it’s easy to get fatigued to the point where you stop even asking for a second opinion. You have to fix the confidence problem, and that’s incredibly difficult.

My confidence in our staging environment had gotten so low that at one point during a review I saw a critical part of our application was broken and approved production deployment anyway. Moments later, we saw the same unstyled Apache error page in production. We reverted immediately, and other engineers wondered how I could have possibly given the go-ahead to deploy. It’s simple: chip away at your trust in your staging environment until anything can be written off as an issue unique to staging. Without the experience that leads to knowing the underlying causes of various issues, it’s impossible to know what’s a real problem until you get to an environment you do trust. In our case, that was production.

The third problem was the lone engineer responsible for carefully performing the manual steps to deploy every application our teams handled. The term “single point of failure” isn’t entirely accurate because he never failed, and other engineers on his team had been trained to do that work. But somehow the task always landed at his feet and it became part of his de facto job. Between our many apps and our seemingly random deployment times, we found ourselves constantly asking him to drop everything to run the deployment script. It seemed like an easy way to burn an engineer out. Until we took responsibility for running that script, we had no one else to reasonably ask. And he had no reasonable way to say no.

Slowing Down to Go Faster

More than anything, we needed stability and consistency, so we started with what we could control on our own: deploying less often, on a regular, strict weekly schedule. This required buy-in from the whole team (not just engineering) because it affected all the product owners’ relationships with their stakeholders. Someone who’s used to saying “We’ll launch that as soon as possible” needs to learn how to transition to “That isn’t on track for the 10th, but let me confirm we can hit the 17th and get back to you.” But that simple change made things a lot more predictable.

During those regular deployments, I took notes on details that mattered for just a moment and were easy to forget until the next deployment: things like notifying the system operations team to clear the server cache. I turned those notes into a checklist with a dozen and a half line items explicitly describing each thing that needed to be done before we were ready to request a deployment, and who needed to do each of them. It felt onerous, but everything on the list was necessary. It seemed like a lot of work because engineering is a lot of work; the checklist just gave it visibility.
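
To give a sense of the shape of it, here’s a minimal sketch of how a checklist like that could be modeled and checked. The items, owners, and names below are illustrative, not our actual list:

```ts
// Hypothetical sketch: the items, owners, and type names are illustrative,
// not our actual checklist.
type ChecklistItem = {
  task: string;
  owner: "Lead Dev" | "QA" | "SysOps" | "Product";
  done: boolean;
};

const releaseChecklist: ChecklistItem[] = [
  { task: "Issues scheduled for release are merged to master", owner: "Lead Dev", done: false },
  { task: "Release candidate deployed to staging", owner: "Lead Dev", done: false },
  { task: "Integration tests passed against staging", owner: "QA", done: false },
  { task: "System operations notified to clear the server cache", owner: "SysOps", done: false },
  { task: "Stakeholders notified of the release window", owner: "Product", done: false },
];

// We only requested a production deployment once every item was checked off.
const readyToRequestDeploy = releaseChecklist.every((item) => item.done);
console.log(readyToRequestDeploy ? "Ready to request deployment" : "Not ready yet");
```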

It didn’t take long before we all began to enjoy the benefits of our new predictability. We could more easily “back into” engineering deadlines; if we were deploying on Monday morning and knew the QA team generally needed two business days to thoroughly test a release candidate, the whole team knew why upcoming features had to be finalized by Wednesday. We didn’t have to come up with a new schedule every time we wanted to deploy something. If the team saw we might miss a deadline, they could react accordingly. Instead of adding risk by “coding faster” or making the QA team “handle it” (in other words, “stay late”), the team could have a grown-up conversation about what to do next.

We also invested in hiring experienced local QA engineers, who evolved our testing culture immeasurably. Although they each worked on different applications, they formed a cross-team QA guild that met weekly to build on the shared testing framework they had inherited, maintained in its own repository with clear contribution guidelines they established. Week after week, they followed through and knocked their goals out of the park. After a few weeks, we went from manually testing everything to having a rigorous integration test suite, which freed them to focus on testing upcoming features and improving the testing infrastructure even more.
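
The details below are purely illustrative (I’m using Playwright here, with a made-up URL and made-up selectors, not the framework the guild actually inherited), but a browser-level integration test of the kind that replaced “clicking around on staging” looks roughly like this:

```ts
// Illustrative only: the URL, page names, and selectors are hypothetical.
import { test, expect } from "@playwright/test";

test("checkout page renders its critical content", async ({ page }) => {
  // Point the test at whichever environment we're about to promote from.
  await page.goto(process.env.BASE_URL ?? "https://staging.example.com/checkout");

  // An unstyled Apache error page fails this check immediately, instead of
  // relying on someone to notice it while clicking around before a release.
  await expect(page.getByRole("heading", { name: "Checkout" })).toBeVisible();
  await expect(page.getByRole("button", { name: "Place order" })).toBeEnabled();
});
```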

Earning confidence in our testing environments took time, and meant making each environment appropriately representative of the next. Every organization will have its own idea of “appropriately representative,” but does everyone who uses those environments agree? If you’re uncovering new errors when promoting changes, why? Did you see them earlier and not trust your environment, or are the environments that are supposed to be more stable actually less stable?
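
One habit that helps answer that question concretely is comparing what each environment actually exposes instead of arguing from memory. This is only a rough sketch with invented setting names, not anything we shipped, but the idea is simple:

```ts
// Hypothetical sketch: the setting names and values are invented for illustration.
type EnvConfig = Record<string, string | number | boolean>;

// Report every setting whose value differs between two environments,
// so "it only happens on staging" conversations can start from facts.
function configDrift(a: EnvConfig, b: EnvConfig): string[] {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter((key) => a[key] !== b[key]);
}

const staging: EnvConfig = { cdnEnabled: false, cacheTtlSeconds: 0, templateSource: "filesystem" };
const production: EnvConfig = { cdnEnabled: true, cacheTtlSeconds: 300, templateSource: "database" };

console.log("Settings that differ between staging and production:", configDrift(staging, production));
```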

The engineers who maintained our testing environments had to split their effort between supporting arcane legacy systems we depended on and migrating us over to modern cloud-based environments, but eventually we got testing environments we could trust. There wasn’t any shortcut or magic to it; it took a lot of writing tickets, escalating issues, and following up. But the payoff was immeasurable.

As we grew more confident in our testing and release schedule, we slowly transferred the responsibility of running the deployment script to our QA engineers and team leads. It fit naturally into our checklist of deployment tasks, making us more self-sufficient and predictable in our deployment process. Instead of asking that release engineer to drop everything to help us, we could do our own work. That gave us all a lot of freedom.

Finally Thriving

Writing code and learning about new technologies can be a lot of fun and very satisfying, especially when an organization gets value from it. It’s easy to forget that all that writing and learning is done to support business objectives; it isn’t the business itself. It doesn’t matter how fast the build times are or how elegant the architecture is if the business has no confidence in what it’s delivering. For us, once that confidence and stability were established, we could appreciate the tooling itself. But we had to focus on the human parts first.

At times, all the manual tasks and ceremonies we built up seemed like a series of giant steps backwards in the march towards continuous deployment. But honestly, they weren’t. Although being able to deploy “whenever” enabled some bad habits, becoming more predictable didn’t hurt our CI/CD systems. Instead, it made it dramatically clearer which manual steps could be automated to reduce friction. But what helped us the most was shifting focus away from the technical work and automation and toward the human aspects that led up to shipping. We didn’t need cooler tools; we needed environments we could trust, a mature testing process, and a way to be sure everyone had everything they needed.