Cost saving strategies with Cloud Hosting — Part 2

Crunch Tech
8 min read · Mar 13, 2019


In Part 1, I began to talk about some of the strategies we’ve implemented at Crunch to save money on our Cloud Hosting, whilst also enhancing our processes and culture within the Engineering team. This post will pick up where we left off with four more things we have implemented.

If you haven’t already, I would highly suggest giving Part 1 a read since it will give you a great introduction to the work that we’ve been doing.

So without further ado, let’s move on and continue to dive into this topic.

Smallest Deployable Unit

The number of pipeline runs: a pretty clear picture of how much velocity we gained when we started making changes.

We were spending lots of time and resources on the automated deployment processes themselves. Originally we had one pipeline that did everything, but this was proving unmanageable as the number of services increased.

So we broke it down into smaller pipelines in two ways. Initially, we split the underlying infrastructure and the services into separate pipelines. We then followed this up by splitting the services pipeline further, so that every service had its own individual deployment pipeline.

On top of this, we also parallelised the tasks within the pipelines where we could. Ultimately, this enabled us to deliver a lot more value in the same amount of time, whilst using a comparable amount of computing power.

This enables the team to deploy small improvements frequently, since each pipeline is now small, self-contained and short.
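
Our pipelines live in our CI tool's own configuration, but as a rough sketch of the shape of the change, here is the idea in Python: each service gets its own small pipeline, and the independent steps within it run in parallel. The service names and step functions below are made-up placeholders, not our real tooling.

```python
# Illustrative only: a per-service "pipeline" whose independent steps run in
# parallel. In reality each step would shell out to a build/test/deploy tool.
from concurrent.futures import ThreadPoolExecutor

def run_step(service, step):
    print(f"[{service}] running {step}")
    return True  # a real step would return whether it succeeded

def run_service_pipeline(service):
    # These steps don't depend on each other, so they can run at the same time.
    parallel_steps = ["lint", "unit-tests", "build-image"]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: run_step(service, s), parallel_steps))
    if all(results):
        run_step(service, "deploy")  # only deploy if every parallel step passed

if __name__ == "__main__":
    # Each service has its own small, self-contained pipeline.
    for service in ["invoicing", "banking", "notifications"]:
        run_service_pipeline(service)
```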

Challenges

The most significant challenge was that, with so many more pipelines, we were now receiving far more notifications, so we had to review our notification strategy.

We had to move from a world where we were told the status of every pipeline to one where we are only alerted when a pipeline fails. On top of this, we also had to narrow down the audience of each and every notification. We landed on a compromise where only the people who care about a change are notified about it.

We had to create a lot of new channels in Mattermost (our internal messaging system) and make sure people who needed to see the notifications didn't mute the channels straight away because they were too noisy. This proved to be quite a challenge.
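
Mattermost supports incoming webhooks, so a minimal sketch of the "only on failure, only to the people who care" rule might look like this. The webhook URL and the channel mapping are placeholders, not our real setup.

```python
# Sketch: post a pipeline notification to Mattermost only when it fails, and
# only to the channel owned by the team that cares about that service.
import requests

WEBHOOK_URL = "https://mattermost.example.com/hooks/xxx-generated-key-xxx"  # placeholder
CHANNEL_FOR_SERVICE = {
    "invoicing": "invoicing-deploys",
    "banking": "banking-deploys",
}

def notify(service, pipeline, status):
    if status != "failed":
        return  # success is the default; nobody needs a message for it
    requests.post(WEBHOOK_URL, json={
        "channel": CHANNEL_FOR_SERVICE.get(service, "platform-deploys"),
        "text": f":rotating_light: Pipeline `{pipeline}` for `{service}` failed.",
    }, timeout=5)

notify("invoicing", "invoicing-deploy-master", "failed")
```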

Outcomes

Since each pipeline now only covers a small area, a breakage no longer compounds multiple problems in one pipeline, and small problems no longer snowball into incidents.

We now also have an incentive to fine-tune the time it takes a service to start up, because that is now one of the most time-consuming parts of the deployment process.
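
A simple way to keep an eye on that number is to time how long a freshly deployed service takes to report healthy. This is a minimal sketch, assuming the service exposes an HTTP health endpoint; the URL below is a placeholder.

```python
# Sketch: time how long a freshly deployed service takes to report healthy.
import time
import requests

def time_to_ready(health_url, timeout=300):
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if requests.get(health_url, timeout=2).status_code == 200:
                return time.monotonic() - start
        except requests.RequestException:
            pass  # not up yet, keep polling
        time.sleep(1)
    raise TimeoutError(f"{health_url} never became healthy")

# Placeholder endpoint, not a real service of ours.
print(f"Service ready in {time_to_ready('http://localhost:8080/health'):.1f}s")
```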

Happy Path to Production

A few months on, we changed the way we count pipeline runs compared to the graph above (otherwise we would have to use a logarithmic scale).

The path to production used to be fairly long, both in the number of steps required and the time taken to get there. Having merged all the development environments into one, we now had an opportunity to redefine what CD really meant.

We now have a “feature branch” path where developers can commit and push to the single development environment. This is useful when individual features need to be reviewed in a production-like environment before merging to master.

We also have a “master branch” path: code committed to master is automatically built, deployed to PreProduction and tested there, and then pushed to Production if the tests pass.
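
Stripped of our actual pipeline configuration, the master-branch path boils down to an ordered set of steps with one gate in front of Production. The commands below are stand-ins for our real build, deploy and test tooling.

```python
# Rough shape of the "master branch" path. Each command is a stand-in; the
# point is the ordering and the gate before Production.
import subprocess
import sys

def run(step, command):
    print(f"==> {step}")
    return subprocess.run(command, shell=True).returncode == 0

def master_branch_path():
    if not run("build", "make build"):
        sys.exit("build failed")
    if not run("deploy to PreProduction", "make deploy ENV=preprod"):
        sys.exit("preprod deploy failed")
    if not run("automated tests against PreProduction", "make smoke-test ENV=preprod"):
        sys.exit("tests failed; Production is untouched")
    run("deploy to Production", "make deploy ENV=prod")

if __name__ == "__main__":
    master_branch_path()
```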

Challenges

Changing the way we deploy to Production wasn't much of a challenge, since we had already succeeded in merging the testing environments into one.

However, testers have been the most impacted by this change, as the quality of the tests written has become much more important. A flaky test is a bigger risk, since tests are now only run against PreProduction.

Outcomes

Since the path to Production now takes 15 minutes and a rollback takes five, there is no incentive to fix things by hand.

With a short path to Production we deploy a lot more per day, delivering the smallest possible increments, which helps us strive towards becoming a more Agile organisation. The rate of release is an order of magnitude higher, if not two, compared to before.

Investment

One of the paradigm shifts with cloud computing is that you pay per second/minute/hour. Whilst this is very good for ephemeral workloads, the bulk of the Production infrastructure is on 24/7.

All the previous efforts had focused on reducing the quantity and quality of the resources needed as part of our development practices; there isn't much of that kind that can be done for long-lived servers (databases and the like).

Autoscaling is great if you have huge seasonality in the amount of traffic to your products. However, it turns out we have no reason to run more than three of anything, yet for disaster recovery we also keep a minimum of three of everything. This means that autoscaling is not really something we can make a saving with.

What we did instead was make sure we run the right workload on the right instance type. Then it was just a matter of choosing a reservation agreement and getting the money from our finance team.
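
As a back-of-the-envelope illustration of that reservation decision (the prices below are made up, not our actual AWS rates), the comparison for an instance that is on 24/7 boils down to a few lines of arithmetic:

```python
# Back-of-the-envelope comparison of on-demand vs. reserved pricing for an
# instance running 24/7. All prices are made-up illustrations.
HOURS_PER_YEAR = 24 * 365

on_demand_hourly = 0.10           # $/hour, hypothetical
reserved_effective_hourly = 0.07  # $/hour equivalent once the upfront payment is spread out, hypothetical
utilisation = 0.99                # fraction of the reserved capacity actually used

# What the hours we actually used would have cost on demand.
on_demand_equivalent = on_demand_hourly * HOURS_PER_YEAR * utilisation
# The reservation is paid for whether we use it or not.
reserved_cost = reserved_effective_hourly * HOURS_PER_YEAR

saving = on_demand_equivalent - reserved_cost
print(f"Saving per instance per year: ${saving:,.0f} "
      f"({saving / on_demand_equivalent:.0%} of the on-demand equivalent)")
```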

Challenges

One of the biggest challenges was finding the right instance type/OS/workload match. Working out which workloads are a good fit for a “T2” instance type, where CPU usage needs to be low yet bursty, requires some testing. Upgrading to the right instance type and OS also takes time and planning.
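
One way to sanity-check whether a workload fits that burstable profile is to pull its CPU figures from CloudWatch, for example with boto3. The instance ID below is a placeholder.

```python
# Sketch: check whether an instance's CPU profile suits a burstable (T2) type,
# i.e. low average usage with occasional spikes.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

datapoints = stats["Datapoints"]
if not datapoints:
    raise SystemExit("No CloudWatch datapoints found for this instance")

avg = sum(d["Average"] for d in datapoints) / len(datapoints)
peak = max(d["Maximum"] for d in datapoints)
print(f"avg CPU {avg:.1f}%, peak {peak:.1f}% "
      "(a low average with occasional peaks suggests a T2 could fit)")
```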

Going to our finance controller and asking for money upfront with a credible promise of savings was fairly easy; it's just a matter of timing and cash flow.

Outcomes

It didn't go perfectly: we had reserved some AWS instances that we couldn't use for two months because of limitations with our vendors, which meant we didn't hit a perfect return on investment.

However, we still saved 30% by investing and paying money upfront. We reached that number because, on average, we were using 99% of what we reserved. In context, that's about two and a half months' worth of savings over the course of the year. That's a lot of money!

We currently have low coverage: about 44% of our infrastructure is reserved. The goal is to cover up to 60% in the next round of reservations, paying more upfront but saving more. At that point, we may look into spot instances for the remaining 40% of our infrastructure.

Vendor Management

We are big open-source users; we believe it makes for better software. However, some tools are easier to manage than others. Redis, RabbitMQ, the ELK stack, Linux, Prometheus and Kubernetes are all tools that we really enjoy using.

In the end, we found the ELK stack expensive to manage. Running our own was expensive and labour-intensive. With our initial vendor, it was still expensive but less labour-intensive. With our latest vendor, it's less expensive, involves no labour at our end, and they have built in a value-adding service to distinguish themselves from the market: some additional dashboards that we enjoy using.

We also migrated our Kubernetes provider. We could have opted for kops on EC2, but we ended up going for EKS. Our previous provider was quite expensive, so the move saved us a fair amount of money, and since AWS manages Kubernetes component upgrades, we also save ourselves a fair bit of time.

Challenges

Working with vendors is quite time-consuming. Not only do we need to negotiate the best deal, we also have to review legal terms and GDPR provisions, for example.

Since the Synergy team have a good track record, it wasn't too hard to onboard new vendors and sunset others. Had we started with vendor management to save cost, it would have been a much harder sell to senior management and the finance team.

Migrating all our workloads from one provider to another is fairly risky, so we had to communicate and plan the Kubernetes migration very carefully. We're lucky to have the trust and support of the Engineering department and the wider business to carry out such a big change.

Outcomes

By using vendors rather than managing these services internally, we've saved a lot of money, because their cost is less than what we used to pay in EC2 instances. The Synergy team also saves a lot of time since we no longer need to manage the clusters in-house. And we have fewer vendors in general, which reduces the time and effort needed to negotiate deals.

Conclusion

This is the story from Synergy’s point of view, but overall it has been a department-wide change. We use cost saving as a strategy to improve different feedback loops. Every part in which we regained control over our spending also served another purpose.

The biggest savings we have made have had to do with culture and development practices. Only once those changes are in place can you really start going on a penny-pinching adventure.

All of these changes have been possible because we reached out and took people's feedback on board with a clear objective in mind: gain control of our Cloud budget and improve our service whilst we're at it.

We made informed decisions backed by data and invested where it mattered. We focused our energy on making a big impact while keeping velocity. As part of that culture shift, we placed more trust in our developers to take their own changes to Production.

Challenging the status quo is hard because it challenges habits and culture. Done well, it is a liberating exercise.

What are you waiting for?

Written by Florentin Raud, IT and Support Manager at Crunch.

Find out more about the Technology team at Crunch and our current opportunities here.
