LISA18 has ended
Monday, October 29 • 11:00am - 11:45am
SLO Burn—Reducing Alert Fatigue and Maintenance Cost in Systems of Any Size

Based on a true story.

As systems grow, they gain more components and more ways to fail. Alerts designed for the previous version of the system can slowly "boil the frog", until the SRE team suddenly finds it has no time left to address scaling problems because it is constantly firefighting. Alert fatigue sets in and the team burns out.

Naturally, maintenance work will always increase as the system itself grows. To make alerting sustainable, page only on symptoms rather than on causes, and even then only after declaring what level of symptom is acceptable. That threshold is the SLO (and its complement, the error budget).
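As a rough illustration of how an SLO and its error budget relate, the following minimal sketch computes the budget for a request-based SLO and the fraction of it still unspent. The function names, the 99.9% target, and the request counts are hypothetical and are not taken from the talk.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window: the complement of the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget left; negative once the SLO is violated."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% availability SLO over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With these assumed numbers the budget is about a thousand failed requests, so 250 failures would leave roughly three quarters of it unspent.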

Even at Google scale, many teams have yet to change their monitoring to realise SLO-based alerts. But systems don't need to be the size of a planet to benefit from these patterns.

Whether you're oncall for 10 machines or 10 datacenters, in this talk a well-rested champion of work/life balance will show you how to select service objectives, then construct robust and low-maintenance alerting rules, using Prometheus for a live demonstration. We'll also discuss the tooling required to help such a system retain observability in the absence of noisy cause-based alerts, now that they are no longer telling you exactly which components are failing.
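One common way to turn an SLO into a paging decision is to compare the rate at which the error budget is being consumed, the burn rate, against a threshold over two time windows. The sketch below is an assumption about how such a check might look, written in Python rather than as Prometheus rules. It is not the rule set shown in the talk, and the function names and the default threshold of 14.4 are illustrative only.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows that the burn is significant; the short window
    shows that it is still happening, so a recovered incident stops paging.
    The default threshold is an illustrative assumption.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


if __name__ == "__main__":
    # Hypothetical error ratios measured over a long and a short window.
    print(should_page(0.02, 0.03, slo_target=0.999))    # both windows burning fast
    print(should_page(0.02, 0.0005, slo_target=0.999))  # short window has recovered
```

Because the check depends only on the observed error ratio and the declared objective, it does not need to change when components are added to the system, which is where the low maintenance cost comes from.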

Jamie Wilkinson

Jamie Wilkinson is a Site Reliability Engineer at Google. A contributing author to the "SRE Book", he has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SRECon. His interests began in monitoring and automation of...

Monday October 29, 2018 11:00am - 11:45am CDT
Legends Ballroom D