LISA18 has ended
Monday, October 29 • 11:45am - 12:30pm
Incident Management at Netflix Velocity

Netflix, as a service and a system, goes through an enormous amount of change all the time. Our engineering teams make thousands of changes a day while our customers stream hundreds of millions of hours of entertainment every day. At that velocity, an outage seconds or minutes long has a real and noticeable impact on our customers. Stir in some Chaos Engineering and things become even more unpredictable.


The talk begins with a story. Netflix had a healthy relationship with Chaos Monkey, our tool to ensure that instance loss didn’t affect a running service application. We’d had such good luck that we extended our plans from just Chaos Monkey to more Monkeys that would do nasty things to our environment. A new entry, Latency Monkey, would help us improve the health of the interactions between our microservices by injecting latency and errors at our common IPC layer. What we thought was a safe, little experiment went completely off the rails. The centralized SRE team, called CORE, realized that we’d have to think differently about outages and managing them if the company was going to be successful moving forward.


This is the story of how the centralized SRE team at Netflix changed and adapted to help the service and our engineering teams prepare for and handle problems—big and small—when they do occur.


Key Takeaways:

  • How Netflix prepares for failures
  • Incident Handling at velocity requires special expertise
  • Preparation and training of everyone that runs services is key for quick recovery
  • You should be spending more time after an incident learning than you do during an incident managing
  • Outages being unique is an excellent goal—it takes work to make it happen

Speakers
Dave Hahn

Netflix
Dave Hahn is a Senior SRE in the Cloud Operations & Reliability Engineering organization at Netflix. He has designed tools and systems used by many teams in the organization to support the Netflix service. He has decades of experience in systems operations, networks, reliability...


Monday October 29, 2018 11:45am - 12:30pm CDT
Legends Ballroom ABC