Cost saving strategies with Cloud Hosting — Part 1
At Crunch, as a business that develops and hosts its own software, we rely on cloud computing as a cornerstone of our business. Running infrastructure for over 10,000 clients doesn’t come cheap, so keeping our spending within budget is paramount.
Over the past year, the Synergy team, which is in charge of infrastructure here, embarked on a journey to bring our cloud spending back under control. We ended up achieving a lot more with a lot less.
TL;DR: Our team challenged assumptions until everyone assumed challenges.
Introduction
Before Synergy embarked on this journey, we needed two things:
- A clear mandate from the directors to rein in our cloud spending.
- A clear and accurate understanding of our current spending.
It’s important to have a really good understanding of what’s happening before you make changes. It gave us a baseline, so we could measure how much progress we were making. Improving the labelling of our resources made our data more accurate, and it let us be very precise with the business stakeholders, which is key to gaining trust.
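Better labelling is also easy to police automatically. As a minimal sketch (the required tag keys here are illustrative assumptions, not Crunch’s actual schema), a small check can flag resources whose spend would otherwise show up as unattributed:

```python
# Hypothetical cost-allocation tag keys every resource must carry.
REQUIRED_TAGS = {"team", "environment", "service"}

def untagged_resources(resources):
    """Return the IDs of resources missing any required cost-allocation tag.

    `resources` is a list of dicts like {"id": ..., "tags": {...}},
    e.g. as assembled from a cloud provider's inventory API.
    """
    missing = []
    for res in resources:
        tag_keys = set(res.get("tags", {}))
        if not REQUIRED_TAGS <= tag_keys:
            missing.append(res["id"])
    return missing
```

Running a report like this regularly keeps the spending data accurate enough to present to stakeholders with confidence.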
Getting backing from the top shouldn’t be a big deal, as directors are always keen to hear that we won’t go over budget. Having them proactively and regularly discuss this has been really helpful in enabling my team to work without too much resistance. In a way, it forced the engineering department to be more honest with each other and triggered some creativity where it mattered.
What did we do?
To achieve this goal, we did the following, mostly sequentially:
- Reduced the number of testing environments and streamlined them.
- Enabled increased developer ownership from code to production.
- Split every pipeline into the smallest possible unit.
- Reduced the process path from code to production.
- Started or stopped working with vendors where there was an opportunity to change.
- Reserved capacity for the resources we required.
- Managed our vendors more effectively.
By doing all of this, there have been some really clear outcomes:
- Our hosting costs have been halved.
- Our deployment rate has increased by a few orders of magnitude.
- The number of incidents has been drastically reduced.
- The culture in the department has improved.
In this and the following post, we’re going to go into more detail about how we have achieved all of this, starting with how we reduced the number of testing environments.
Testing environments
At Crunch, we had up to 10 environments in total. This included Production and PreProduction on identical specifications, running 24/7. On top of this, there were six production-like environments used for testing by the development teams and two environments for infrastructure testing. Most of these were switched on most of the time.
It was clear from our reporting that a big chunk of our expenses was from having all these environments.
The engineering department had a culture where every team had their own full-blown, production-like environment. Each one would run every service, including the ones that they never really interacted with.
Each environment was also slightly unique, so teams could not use them interchangeably. Some would have a different data retention policy, an external integration, different helper scripts or different hardware specifications. This made them costly in time and effort to maintain and update.
To counter this, we ended up merging five of our testing environments into one, called SmokeTest. The impact of this was clear, both on our availability and on our costs. When something went wrong, Synergy now had more time to focus on a single environment.
Before, the same problem would often manifest across all the testing environments, making our day an uphill battle from the start. This included having to re-prioritise work and manage our stakeholders’ expectations.
Challenges
Changes like this are inherently hard, since people are afraid of them. This is illustrated by the saying “Better the devil you know”. We had to convince four separate development teams, all with different worries and workflows.
The real challenge was finding a venue to communicate and create interest. We used every opportunity to talk about cost saving, including:
- In our digital standup where engineering meets with design, sales and marketing
- By changing the aim and format of our twice-monthly “Synergy touch base”
- With the Scrum Masters, helping us find common ground by trading some convenience for stability.
We made the Synergy touch base in particular a venue for feedback to us, rather than for us to broadcast to the wider team, giving us a platform to listen to the developers and find out what their actual problems were with the development environment. Any solution to cut costs would have to include solutions to those problems.
Outcomes
With only one testing environment spinning up each morning instead of several, our resolution time is now shorter. When there’s a deployment problem, we only have to solve it in one place, meaning that we can focus on it 100%.
Synergy have dramatically increased our capacity to make changes. There is less for us to make changes to, so Synergy provides return on investment faster than ever before.
Ultimately though, we have saved a lot of money by not spinning up hundreds of EC2 instances, removing ELBs, managing a sixth of the logs and ingesting a sixth of the monitoring data.
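The “spin up each morning” pattern boils down to a simple schedule check. As a sketch, with office hours and the always-on flag as illustrative assumptions rather than our exact configuration, the decision can look like this:

```python
from datetime import datetime

# Assumed office hours (24h clock) during which test environments run.
WORK_START, WORK_END = 8, 19

def should_run(now: datetime, always_on: bool = False) -> bool:
    """Decide whether an environment should be up at this moment.

    Production-like always-on environments stay up 24/7; everything
    else runs only on weekdays within office hours.
    """
    if always_on:
        return True
    if now.weekday() >= 5:  # Saturday (5) and Sunday (6): stay down
        return False
    return WORK_START <= now.hour < WORK_END
```

A scheduled job evaluating this per environment, then starting or stopping the underlying instances accordingly, is one common way to stop paying for idle test capacity overnight and at weekends.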
Developer ownership
In the not-so-distant past, developers would throw their code over the fence and leave it for Synergy to deal with. Synergy always had tools like Jenkins and GoCD to automate most of the code to production path, yet Synergy was still required at the beginning and end of the process.
We had to provide manual intervention in a few places. This included creating code repositories on behalf of developers. We’ve now given all senior developers permission to create new projects themselves.
We also had to create and amend pipelines. We now have a chatbot that automates the creation of new pipelines for Java services. Developers can create a path to production directly from Mattermost (our internal chat tool).
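To illustrate the idea (this is a sketch, not our actual bot; the command syntax, repository layout and stage names are all hypothetical), such a chat command handler essentially translates a message into a pipeline definition:

```python
def pipeline_config(command: str) -> dict:
    """Translate a hypothetical '/pipeline create <service>' chat
    command into a minimal Java-service pipeline definition."""
    parts = command.split()
    if parts[:2] != ["/pipeline", "create"] or len(parts) != 3:
        raise ValueError("usage: /pipeline create <service-name>")
    service = parts[2]
    return {
        "name": f"{service}-pipeline",
        # Assumed repository naming convention, for illustration only.
        "repo": f"git@example.com:crunch/{service}.git",
        "stages": ["build", "test", "deploy-smoketest", "deploy-production"],
    }
```

Because every Java service follows the same conventions, the bot only needs the service name; everything else is derived, so developers get a full path to production without raising a ticket.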
Pipelines also had to be paused to allow us to manually enter secrets. We’ve now moved all secrets into the developers’ remit and automated their injection, removing the need to enter passwords manually.
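One common pattern for this (a sketch under assumptions, not our exact implementation; the placeholder syntax is invented for illustration) is to resolve secret references from a store at deploy time, so no human ever types a password into a paused pipeline:

```python
import re

# Hypothetical placeholder syntax: {{secret:path/to/key}}
SECRET_REF = re.compile(r"\{\{secret:([\w/-]+)\}\}")

def resolve_secrets(template: str, store: dict) -> str:
    """Replace {{secret:...}} placeholders with values from a secret
    store, failing loudly if a referenced secret is missing."""
    def lookup(match):
        key = match.group(1)
        if key not in store:
            raise KeyError(f"missing secret: {key}")
        return store[key]
    return SECRET_REF.sub(lookup, template)
```

The pipeline renders its configuration through a step like this just before deployment, with the store backed by whatever secrets service is in use.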
While it’s a softer approach, more ownership empowered the development teams to think beyond their usual remit. Developers feel more responsible and enabled to unleash their creativity. It might not look like much, but the discussions that followed were no longer one-sided.
Challenges
The main obstacle in giving developers more ownership is to find developers that actually want it. Luckily for us, we have a fantastic team of developers here at Crunch, and the new starters have wanted to get involved as well. The teams are always keen to be able to do more.
Outcomes
There’s less toil which means work is less mundane for Synergy and helps us to focus on important things. Developers don’t have to ask for help as often, since they can start work on something new and drive it all the way to production with minimal external input. That’s a lot of “work in progress” that isn’t waiting to be actioned by a Synergy team member anymore. Two improvements to productivity!
The cost saving is mainly delivered in terms of the improvement in productivity. That said, we got better with password management, we removed Vault and we reduced the number of EC2 instances we run.
We also got the developers’ buy-in, and that’s a major help in the change process.
Conclusion
This has been a summary of two of the key things we’ve done to help reduce cost. However, we’ve done plenty more! Soon we’ll be publishing another post to go through more strategies we’ve undertaken to help with this goal.
If you’d like to read part two, you can find it here!
Written by Florentin Raud, IT and Support Manager at Crunch.
Find out more about the Technology team at Crunch and our current opportunities here.