From baker to businessman aka Site Reliability Engineering at Crunch
When people talk about companies using cutting-edge technologies in Brighton, Crunch is usually one of the first names that comes up in conversation. That doesn’t mean we have no room to improve, but the amount of time we spend upgrading from one technology to the next is significant in its own right.
This kind of technological churn takes its toll if it isn’t handled with the proper focus, so the bigger challenge tech companies face often comes not from the technology itself, but from a lack of guiding principles.
In a world where the list of technologies is almost endless, the question shouldn’t be “which technology are we going to use to solve a problem?”, but rather, “how are we going to choose from the endless list?” or “what will drive our choice?”. Most companies know that one of the key factors should be efficiency, and tech companies in particular know that a big part of efficiency comes from resilience.
Before we dig deeper into the subject, I’d like you to imagine you are opening a bakery, because your nan made the best macaron of all time, and you want to share it with the world (and why not become successful at the same time?).
Before you set up all the infrastructure for a bakery, your focus should be on the product: the recipe, and how it turns into a delicious pastry which is crispy on the outside and creamy on the inside without being chewy. If you want to build a successful business, you should be driven by the quality of your product, because that’s what will get people talking. If the product is great, the word will spread and your customer base will grow organically.
Let’s say the business is booming, and you have so many customers a day that your shop can’t bake enough macarons. The smart choice is to open up another shop, so you can serve the growing need while automating the processes to ensure the same high quality is served to every customer, no matter where they shop.
As your business grows, new possibilities open up, and what you decide to spend your money and time on becomes crucial. Making a business bigger (or even just sustaining one!) requires more than your nan’s recipe. At some point, you’ll have to become a businessperson and make hard decisions. The question is going to be: “What are you going to base your decisions on, once the quality of your product is proven?”.
I guess in a real-life scenario, preparing your business for smaller disruptions or bigger disasters comes down to having proper insurance in place. However, recovering from disruptions in the world of IT doesn’t have to take days, or even hours. To support this with an example: an unhealthy microservice written in Go, sending VAT returns to HMRC, can be restarted and ready again in as little as 15 seconds.
While recovering from disaster might be quick in the digital world compared to a shop’s physical failures, a tech company has the disadvantage of facing unlimited competitors offering similar services online, so you might want to focus your attention on providing a high level of service.
Tech companies’ growing focus on resilience is not a come-and-go trend, nor an architectural style; it’s ultimately a business tool which recognises that the quality of the product alone will not be enough to grow your business. The product will also need proper reliability, or as Google likes to call it: Site Reliability.
How did it start at Crunch?
Back in the day, we had DevOps Engineers whose main focus was automating as many things in our digital realm as possible. Crunch has three separate working environments: SmokeTest (where developers can test their feature branches), PreProduction (where our automated tests run against our services’ master branches), and Production (which needs no introduction).
Our services all live in a Kubernetes cluster on AWS, so our Production cluster is up and running 24/7. We quickly discovered, though, that keeping the other two test environments running around the clock is anything but cost-efficient. That’s where CloudFormation comes into the picture, as Amazon offers the service itself completely free of charge.
You write a shopping list in JSON format (just as you would for an actual shopping trip) and put on it each AWS component you’d like to have, including the VPC, network settings, EC2 instances, EKS, RDS, Redis, ELB, and so on, with all the necessary configuration details.
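To make that concrete, here is a minimal sketch of what such a template fragment might look like. The resource names and property values are purely illustrative, not Crunch’s actual template:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "SmokeTestVPC": {
      "Type": "AWS::EC2::VPC",
      "Properties": { "CidrBlock": "10.0.0.0/16" }
    },
    "SmokeTestDatabase": {
      "Type": "AWS::RDS::DBInstance",
      "Properties": {
        "Engine": "postgres",
        "DBInstanceClass": "db.t3.medium",
        "AllocatedStorage": "20"
      }
    }
  }
}
```

Each entry under `Resources` is one item on the shopping list; CloudFormation works out the creation order and dependencies between them for you.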
We send our list to CloudFormation every weekday at 6:00 AM, and it makes sure that every required component is ready as soon as possible. At the end of our working day, our GoCD pipelines send a signal to CloudFormation confirming that we don’t need the resources anymore, so the environments can be torn down. The nice thing about this is that we only have to pay for the resources we actually use, nothing more.
One thing the Infrastructure team is really proud of is automation. If a developer at Crunch needs to create a completely new microservice, it’s just a matter of a few commands sent to our ChatOps bot on Mattermost (an open-source alternative to Slack). We refer to our ChatOps bot as Tange, as he is our team mascot. He’s a fluffy little goat.
Our Continuous Deployment process is quite straightforward: developers can test their feature branches against a unified local docker-compose stack. They’re also responsible for writing the automated tests for their services, which include unit tests, integration tests, and the API black-box tests we usually write in Python, as we found that pytest tests are really easy and quick to write on the go.
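To give a flavour of what such a black-box test can look like, here is a hedged pytest-style sketch. The endpoint shape, the `submissionId` field, and the helper function are hypothetical, not Crunch’s real API:

```python
# Hypothetical black-box check for a VAT-returns endpoint; the payload
# shape here is illustrative, not Crunch's actual API contract.
import json

def check_vat_return_response(status_code: int, body: str) -> bool:
    """Return True if the response looks healthy: a 200 status and a
    JSON payload containing a submission ID."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return "submissionId" in payload

# pytest discovers functions named test_*; run with `pytest this_file.py`
def test_healthy_response():
    assert check_vat_return_response(200, '{"submissionId": "abc-123"}')

def test_unhealthy_response():
    assert not check_vat_return_response(503, "Service Unavailable")
```

Because the check is a plain function, it can be unit-tested offline and then pointed at real HTTP responses in the pipeline.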
If they’re happy with their change locally, they can build that branch with a pipeline (by sending one command to Tange, our ChatOps bot). The generated Docker image is pushed to AWS ECR. We also generate a helm chart for every one of our builds; it doesn’t matter if a service is written in Python, Java, Kotlin, Go, or NodeJS, as everything lives in a Docker container at the end of the day.
When developers deploy their services into SmokeTest, they are essentially deploying a helm chart in the background, which fetches the relevant Docker image from ECR. Once a team gives at least two approvals on a Pull Request, the feature branch can be merged into master, and an automation picks up every merged PR to ensure the new version is deployed to PreProduction. This environment is where the Black Box Tests prove that the new version doesn’t break the existing behaviour of the API.
Once the PreProduction pipeline passes all the tests that run against that service, the Production pipeline automatically kicks in, deploying the new, tested version of the service into our Production Kubernetes cluster. That means a change can go from merge to Production in about 15 minutes without any human interaction, so we can release as many new versions in a single day as our developers can produce. We usually have about 16 releases a day on average.
This also means that it’s super easy to revert changes within the Kubernetes cluster in case something goes wrong on Production, as every helm chart has a version number. In the unlikely event that a bug makes its way to our Production cluster, we can simply request a revert to the previous version of the helm chart, making a quick and safe recovery from production issues. It takes five minutes for us to roll back, and it’s completely automated.
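As an illustration, the revert boils down to Helm’s built-in rollback command. This small sketch builds the invocation an automated revert might run; the release name and wrapper function are hypothetical, not Crunch’s actual tooling:

```python
# Hypothetical sketch: construct the `helm rollback RELEASE REVISION`
# command an automated revert would execute. `--wait` makes helm block
# until the rolled-back resources report healthy again.
from typing import List

def rollback_command(release: str, revision: int) -> List[str]:
    """Build the helm CLI invocation that reverts a release to an
    earlier chart revision."""
    if revision < 1:
        raise ValueError("helm chart revisions start at 1")
    return ["helm", "rollback", release, str(revision), "--wait"]

# In a real pipeline this list would be handed to the shell, e.g.:
# subprocess.run(rollback_command("vat-service", 41), check=True)
```

Because every deploy bumps the chart revision, “previous version” is always a well-defined target for the rollback.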
How can we be more resilient?
Not long ago, I heard a great SRE saying: “You can’t improve what you can’t measure”.
When our DevOps team was looking for ways to make our infrastructure more resilient, we quickly realised we wouldn’t be able to deliver the right level of resilience for the business unless we had visibility into how our systems were working, down to the smallest detail. At the very beginning of this realisation, we were interested in literally everything, including uptimes, response times, latency, and HTTP response codes.
Our initial approach was to create as many Service Level Indicators (SLIs) as possible. That included low-level statistics like our Kubernetes nodes’ performance, CPU and memory usage, and free disk space across our Linux VMs. It also included higher-level metrics like the response time of our marketing website.
As of today, we have over 100,000 metrics generated by our services and our infrastructure, and we gather all of these data points in Prometheus, using Thanos on top of it, to ensure we can store all our old data points in an S3 bucket.
Given the number of metrics we put in place, one might think that we overdid the job a little bit, but from an SRE perspective that couldn’t be further from the truth. Metrics are invaluable; the only question is what we do with them.
You can have all the metrics in the world, but without putting them into context and monitoring them, they don’t serve a purpose. Service Level Objectives (SLOs) are monitored by our SRE team to make sure that planned (and unplanned) maintenance and upgrades on our system don’t cause interruptions beyond our downtime budget.
One of the major challenges for an SRE is to recognise which SLIs are important to follow, and how to prioritise them. For many businesses, SLOs should be driven by the customers’ needs, which usually means uptime of the company’s services, response times to ensure quality of service, and HTTP response codes to monitor possible issues. That’s also true for Crunch, and these SLOs are heavily monitored to ensure that we’re giving the best customer experience possible.
But in all honesty, when you define your SLOs, you also have to stay realistic. For example, our marketing website’s uptime SLO is defined as 99.9%, which means we have to have the website up and running 99.9% of the time. If you work out the number of minutes in a month, you can easily see that this gives us a very strict rule: we cannot do upgrades or maintenance which would require more than roughly 43 minutes of downtime in a month altogether.
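The arithmetic behind that downtime budget is worth spelling out. This is the generic error-budget calculation for a 30-day month, not a Crunch-specific figure:

```python
# Error-budget maths: an availability SLO leaves (1 - SLO) of the
# period as the allowed downtime, a.k.a. the downtime budget.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per `days`-day period for a given SLO."""
    total_minutes = days * 24 * 60          # 43,200 minutes in 30 days
    return total_minutes * (1 - slo)

print(downtime_budget_minutes(0.999))   # ≈ 43.2 minutes per month
print(downtime_budget_minutes(0.9999))  # ≈ 4.3 minutes: one more nine is ten times stricter
```

The second call shows why each extra nine gets so expensive: the budget shrinks by a factor of ten every time.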
Bigger companies like Amazon or Google of course go even further, adding more nines to the end of their SLOs, but you’ll find that the higher you aim when defining an SLO, the more complex and more expensive the solution you have to implement to achieve it.
On top of our “customer-facing SLOs”, as an infrastructure team we also had to come up with “internal SLOs”, where the focus is on our engineering and product teams, ensuring no downtime interferes with their daily use of the infrastructure (UAT, build, release to production).
Cherry-picking the SLIs and defining high but reasonable SLOs is probably one of the hardest parts of the job for an SRE at Crunch, and it’s part of an ever-evolving journey on reaching perfect efficiency.
But what if the biowaste encounters the roto-impeller?
When things go south and our clients are affected in any way (even if it’s just growing latency across our services), we try to respond immediately. SREs are usually the first responders, but that doesn’t mean we’re responsible for coming up with the fix in all cases. That being said, if there is a workaround to put in place, we tend to be the ones to implement it, until the service owners (usually one of our Dev teams) step in to investigate the root cause further.
We document the original symptoms and the investigation, draw up the timeline, and try to facilitate communication across the engineering team when required. Occasionally we need to allocate Investigation Leaders, and even once the fix is deployed to production, we don’t consider it a done deal.
Usually, within a few days of the incident, we gather for a blameless Post-Mortem (some call it a lessons-learned meeting), where we invite all the stakeholders who were involved with the product, the incident, or its solution, and go through the timeline retrospectively, trying to identify how we can make sure the same issue doesn’t happen again in the future.
So are you DevOps engineers after all, or what?
SRE is definitely part of the DevOps world, and at Crunch I would argue that we still deal with tasks one would consider “classical DevOps” work. But with every passing day, we have more automation in place and less human interaction required, allowing the infrastructure team to focus on monitoring and observability.
We spend approximately 50% of our time on so-called “toil”: non-automated tasks requiring manual interaction. The rest is spent monitoring Grafana dashboards, handling incidents, communicating with stakeholders to set priorities, organising chaos engineering game days, and coming up with ideas for making our system more resilient.
To be effective, we work closely with developers. We keep organising Tech Talks and workshops where we teach them our methods and involve them, to ensure everyone has the operational knowledge of tracing, monitoring, and observability in general to keep track of business-related trends. That’s because we truly believe that reliability, just like security, doesn’t belong to one team: it belongs to everyone.
Chris Heisz — SRE at Crunch