Our journey to Continuous Deployment: Part 2
Adapting culture and process
Welcome back to the Crunch tech blog, where we have been discussing our journey to Continuous Deployment. In Part one, we gave an overview of what continuous deployment is and why we needed it, before exploring some of the technical challenges we had to solve in order to start doing it.
However, good tooling and technical solutions could only get us so far: there was a whole host of other human and cultural factors we had to pay attention to and optimise in order to reap the benefits of continuous deployment.
Business buy-in
Assigning a team of developers to work solely on this new infrastructure was a big long-term investment, and a potentially risky move. We knew we were unlikely to see any short-term benefit or return on investment, and it’s a difficult decision to divert resources away from customer-facing feature work. Having support and buy-in from the very top of the company was critical to the success of this new venture.
It wasn’t difficult to make a business case — it was clear that investing in improving our development workflow would significantly improve development velocity, allowing us to deploy faster and more frequently. An example of when this demonstrated value to the business was a significant improvement to our Year End document process we worked on recently. Whilst this was in beta, we were able to develop and deploy many more bug fixes, often on the same day. Previously these would each have required a lead time of several days at least!
Honesty and continuous improvement
Deployment delays and production outages
The journey so far hasn’t been entirely smooth sailing. Teams were regularly blocked by breakages on the continuous deployment platform. We automatically spin up test environments early in the morning so that they are functional by the time developers and testers arrive in the office. A failure to deploy these successfully means that teams could be blocked from working until lunchtime due to not being able to deploy and test their code.
We had production outages; sometimes these were caused by bugs in the deployment process, and other times by development teams misunderstanding some of the processes and principles of the platform. In these cases it is extremely important to be honest about the causes of the issues.
Postmortem culture: investigating and learning from incidents
For every critical incident we do a full investigation and postmortem meeting where we discuss what went wrong and the lessons we can learn. It is crucial that these meetings aren’t used to blame individuals for their errors or bad decisions. It is true that incidents can quite often be traced back to human error such as buggy code in a script or service, but it is unrealistic to expect humans to write perfect code and make perfect decisions all the time. Instead, we prefer to focus on improving our process of catching errors (which includes automated testing and peer review) before they reach production. If our tests failed to catch an issue, then it’s a gap in code coverage that we need to address! The most important aspect of blame-free incident analysis is that it means no-one is afraid to speak up and raise an important issue. This means our system and processes should trend towards being ever more robust, as we continuously find and fix the faults.
Department-wide culture of improvements
As more of the development teams have started using the platform for continuous deployment, issues and feature requests have led to an ever-growing backlog for our infrastructure team. They needed to know how to prioritise work, to both satisfy these feature requests and reduce their operational load. In addition, we wanted to ensure there was a clear forum where anyone could bring up a point for discussion, or an issue they had experienced.
We set up a fortnightly meeting called “Platform Touch Base”, where everyone in the engineering team is invited to come along for an honest discussion on how things are going with the platform: any issues faced, opinions on how we should solve a particular problem, areas of confusion about a particular process or part of the infrastructure. We’ve had many passionate debates along the way, but these have been extremely productive sessions that often clarify misunderstandings.
Key benefits of Touch Base meetings
Amongst other things, we’ve:
- Proposed and implemented improvements to our local development stack
- Incrementally improved the deployment process — the team decision to press forward with continuous deployment was made in this meeting!
- Discussed the best way to proceed with a complex database migration
- Refined our automated deployment notification bot, one of our key innovations
Tracking code changes all the way to production
Automation is fantastic at hiding away the nitty-gritty details. For example, when doing a zero downtime rolling upgrade, no one actually needs to be aware of when particular replicas of a service are destroyed or brought up. This is often a desired property, as it reduces complexity and noise, but it also has its drawbacks. When we first started doing automated deployments with continuous delivery, there wasn’t a large volume of changes going through the process, which made it quite easy to manually track something through the workflow. As development teams pushed more work through the pipelines and the number of deployments grew, we started getting feedback that it was becoming quite difficult for a team member to work out whether their changes had been deployed to production.
Linking JIRA tickets to deployments
Like many teams, we use JIRA to track our work and development lifecycle. GoCD, our deployment tool, gives us a very useful page to compare two different deployments to see the changes that have gone out, with a full list of commits in every project. However, it’s quite time-consuming to search through several deployments to work out whether a particular JIRA issue has been deployed (via the tag in the commit message).
We decided to automate this process. Every time a new deployment runs, a bot searches for JIRA issue tags (e.g. PROJ-1234) in this list of changes. It then searches JIRA for these issues and attaches a label uniquely identifying the deployment, which makes them searchable in JIRA. It also adds comments to the tickets to notify anyone watching them. Our product owners and scrum masters can see whether an issue has been deployed just by looking at the comments.
The bot then posts a list of relevant links to a channel on Mattermost (our chat tool).
This gives us great visibility into our process. In one click, anyone can get a changelist of commit messages for a particular deployment, or a list of affected JIRA issues.
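The tag-matching step described above can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the key pattern (an upper-case project prefix, a hyphen and a number), the function names and the label format are all assumptions, and the calls to JIRA and Mattermost are left out entirely.

```python
import re

# Assumed key format: upper-case project prefix, hyphen, number (e.g. PROJ-1234).
# A pattern this loose can also match strings such as "UTF-8", so a real bot
# would probably restrict it to known project prefixes.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issue_keys(commit_messages):
    """Return the unique issue keys found in the messages, in first-seen order."""
    seen = {}
    for message in commit_messages:
        for key in ISSUE_KEY.findall(message):
            seen.setdefault(key, None)
    return list(seen)


def deployment_label(pipeline, counter):
    """Build a hypothetical label that identifies one deployment run."""
    return f"deployed-{pipeline}-{counter}"


commits = [
    "PROJ-1234 Fix rounding in year end report",
    "Merge branch 'feature/PROJ-1250' (PROJ-1250, PROJ-1234)",
    "Bump dependency versions",
]
print(issue_keys(commits))
print(deployment_label("production", 42))
```

Each key found this way would then be looked up in JIRA, labelled and commented on. De-duplicating the keys first avoids commenting on the same ticket twice when it appears in several commits of one deployment.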
Why Tange?
Our bot is named after the infrastructure team mascot, a goat called Tange. His name originates from an early team meeting where someone misheard the phrase “tangible goals” as “tangible goats”.
Observability: instrumenting and monitoring the behaviour of the system in production
Hopefully it is clear by now that increased automation allows for a smaller team to manage a vastly more complex architecture than if they were doing it manually. One tradeoff of the additional complexity is that it is suddenly much harder to debug problems. For example, we might get an alert that a load balancer is returning a lot of HTTP 500 errors to clients. Where does the issue lie? In the load balancer, the web server backing it, the API gateway, the initial backend service, or any number of other backend services that process those requests?
Centralised logging
Firstly, it is crucial to have a centralised logging system, so that logs from multiple disparate systems can be viewed in the same place. At Crunch we use the ELK (Elasticsearch, Logstash & Kibana) stack for this. We also take this info and make dashboards which are visible to the whole company on large screens in the office.
Linking to system behaviour
It’s also very important to measure how different parts of the system behave. “Behaviour” can be defined at many levels: right from basic system resources (How much CPU, memory, I/O is this node using right now?) right up to business/application-level metrics (how many emails has the application sent in the last 15 minutes?). We do this by using general monitoring solutions such as the Prometheus node_exporter to measure CPU and memory usage, and by defining custom metrics within our code that count the number of emails sent.
This helps us to debug complex issues, as we can relate symptoms (such as HTTP 502 errors) to intermediate causes/symptoms (NGINX is getting a timeout from the EmailService and so returns an HTTP 502 to the user) to more low level symptoms/causes (EmailService is using 100% CPU).
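As a rough illustration of an application-level metric like the email count mentioned above, here is a minimal, standard-library-only sketch of a counter. The class, the metric name `emails_sent_total` and the `send_email` function are hypothetical, and a real service would more likely use an existing metrics client library than hand-roll one. The `render` method prints the value in the plain-text style that Prometheus scrapes.

```python
import threading


class Counter:
    """A thread-safe counter that can only increase."""

    def __init__(self, name, description):
        self.name = name
        self.description = description
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only go up")
        with self._lock:
            self._value += amount

    def value(self):
        with self._lock:
            return self._value

    def render(self):
        """Render the counter as Prometheus-style text exposition lines."""
        return (
            f"# HELP {self.name} {self.description}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value()}\n"
        )


# Hypothetical metric name, chosen for this sketch only.
EMAILS_SENT = Counter("emails_sent_total", "Number of emails sent by the application.")


def send_email(recipient):
    # Real delivery is omitted; the sketch only records that a send happened.
    EMAILS_SENT.inc()


for address in ["a@example.com", "b@example.com", "c@example.com"]:
    send_email(address)

print(EMAILS_SENT.render())
```

The counter only ever stores a running total. A question such as "how many emails in the last 15 minutes?" is answered by the monitoring system, which compares samples of that total taken at different times.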
Proactively resolving problems before they happen
Not only that, but we can proactively alert on abnormal conditions (5% HTTP errors in the last 5 minutes, or CPU usage over 90%), so that we can catch errors and fix them before our users become aware of them.
In the long term, development teams can also use this data to optimise their services, by focusing their efforts on the part of the stack which requires most attention.
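To make the "5% HTTP errors in the last 5 minutes" style of alert concrete, here is a small sketch of a sliding-window error-rate check. It is an assumption-laden illustration rather than our alerting setup: the class name is made up, only 5xx responses are counted as errors, and in practice a rule like this would normally live in the monitoring system instead of application code.

```python
from collections import deque


class ErrorRateWindow:
    """Track requests in a sliding time window and flag a high error ratio."""

    def __init__(self, window_seconds=300, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_error) pairs, oldest first
        self._errors = 0

    def record(self, timestamp, status_code):
        # Assumption: only 5xx responses count as errors.
        is_error = status_code >= 500
        self._events.append((timestamp, is_error))
        if is_error:
            self._errors += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, was_error = self._events.popleft()
            if was_error:
                self._errors -= 1

    def error_rate(self, now):
        self._expire(now)
        if not self._events:
            return 0.0
        return self._errors / len(self._events)

    def should_alert(self, now):
        return self.error_rate(now) >= self.threshold


window = ErrorRateWindow()
for second in range(100):
    # Every tenth request fails: 10 errors out of 100 requests.
    window.record(second, 500 if second % 10 == 0 else 200)

print(window.error_rate(now=100))
print(window.should_alert(now=100))
print(window.should_alert(now=1000))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the behaviour easy to test. By the final call, every recorded request has aged out of the five-minute window, so the rate falls back to zero.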
Conclusions
We’ve needed to become much more efficient at developing and delivering software in order to support new products and develop new features, as well as embrace a departmental culture of honesty and continuous improvement. We’ve achieved a lot by adopting cutting-edge industry practices to help us continuously deploy software, and by improving our logging, monitoring and alerting systems to allow us to avoid issues or debug them quickly when they arise. The ability to track code changes easily from commit to production has enabled our Engineering team at Crunch to speed up their deployments and deliver software faster. We hope that this post will help other small and medium-sized businesses who are looking to improve their own software delivery pipelines.
What’s next?
So, where to next? Are we done, now we’ve implemented continuous deployment? Absolutely not! There’s still a lot of work for us to do.
It still takes a minimum of about 30 minutes to get from a commit on master to a production deployment. This is phenomenal by the standards of most software development teams, but we think we can improve on it a lot more. Most of that time is spent building an AMI (see part one) and then launching a number of instances, both of which are slow due to the expensive and heavyweight nature of virtual machines. One way we are looking to speed this up is by migrating our services from Amazon EC2 to Kubernetes. By building and deploying Docker containers instead, we hope to hugely cut down the time from commit to production.
There are also many other areas of the development workflow we can speed up. One objective is to make it possible to create a new service from a template with one command. This would produce a fully functional blank Spring Boot service (Spring Boot being the default framework choice for Crunch developers), complete with continuous integration/deployment pipelines, and would avoid a lot of boilerplate and manual effort.