All Hands on DevOps #10

ecs-admin 25th August 2017

Yesterday our 10th All Hands On DevOps Meetup, co-organised with Third Republic, was hosted at Shazam’s office.

Our first talk was given by Dawn James, Portal Architect at Kobalt Music.

Dawn went through the issues he faced when Kobalt ran a monolithic application. Monoliths tend to bring the same set of problems: regular downtime, difficulty delivering changes quickly, and so on.

Over the course of a year, he was able to move from on-premises hosting to AWS, from zip files to Docker, and from PHP endpoints to JSON APIs. They now use microservices, Terraform and Fabric.

The change was progressive, one piece of the monolith at a time, but the result is no regular downtime (and hopefully no irregular downtime either!), better performance (API speed tripled), quicker deployments, and more independent developers who can self-serve.

Dawn mentioned he is a big fan of HashiCorp, but said the biggest blocker to rolling out Terraform was how difficult developers found it to adopt the tool.

ECS Digital offers a range of HashiCorp training sessions covering Consul, Terraform and Vault. Register for our courses here.

Our second talk was split between Ben Belchak, Head of SRE, and Jesús Roncero, Site Reliability Engineer, both at Shazam.

Ben talked about Shazam’s journey to containers over the past three years. When Shazam started, 20-odd years ago, it was a monolithic application with unpatched OSes. Gradually, microservices crept in, in an ill-conceived way, as the team tried to meet business requirements and deadlines while putting out fires. A lack of good communication across offices spread around the globe also played a role in creating silos. At that point, there was a large amount of technical debt that needed to be addressed.

Enter Ben.

He started by defining targets: a happy team, a stable infrastructure, a good relationship between SREs and software engineers, and monitoring systems that could be trusted.

He took steps to move towards these targets. He demolished the silos that each SRE in the company had created, with the support of all levels of management, CTO and CEO included.

He started addressing each and every alert, deleting the useless ones, and properly annotating the useful ones.

He addressed recurring issues, and the effect snowballed: each fix freed up more time to fix further issues.

He then worked on automating deployments, so that a single deployment went from taking more than an hour of an SRE’s time to taking none at all beyond pressing a button.

He collected extensive metrics and incident tickets over the entire stack to understand exactly what was going on across the board, going as far as running pre-mortems to try to predict future issues.

This led to much more breathing room and a much more stable environment, with developers happy to focus on their work.

Later on, Ben walked us through how Shazam went from bare metal to running Kubernetes on Google Cloud.

Shazam’s servers were provisioned for large events, like the Super Bowl or the Grammys, which meant that most of the time a lot of hardware was sitting unused.

At the beginning of 2017, they started migrating to Google Cloud. They now have almost all their clusters on Google Cloud, with only a few services left on-premises.

The adoption of Kubernetes came afterwards, and it emerged from these wants: self-healing services, autoscaling based on metrics, self-sufficient developers, rolling deployments and rollbacks, dynamic monitoring based on SLOs, and the ability to create several environments from the same Docker image.
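
To give a feel for what "autoscaling based on metrics" means in practice, here is a minimal sketch of the scaling ratio documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the replica bounds and the numbers are hypothetical and were not part of the talk, and the real controller also applies a tolerance and stabilisation windows that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Suggest a replica count from the HPA-style scaling ratio.

    All parameter names and default bounds are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the current replica count by how far the metric is from its target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp the result to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas averaging 90% CPU against a 60% target would be scaled up to six, while the same four replicas at 30% would be scaled down to two.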

This has all been achieved using Kubernetes on Google Cloud, with the help of Helm (the Kubernetes package manager).

This gives Shazam the right amount of processing power at the right time, and the ability to deploy changes safely, quickly and reliably.

 

Thanks to everyone who came along. We love hearing people share their knowledge over a beer! We hope everyone had a great time and learned something new.

 

As always, we’d love to hear any ideas and suggestions you might have for our next event. 
