Netflix—as a service and a system—goes through an enormous amount of change all the time. Our engineering teams make thousands of changes a day while our customers stream hundreds of millions of hours of entertainment every day. At that velocity, an outage even seconds or minutes long has a real and noticeable impact on our customers. Stir in some Chaos Engineering and things become even more unpredictable.
The talk begins with a story. Netflix had a healthy relationship with Chaos Monkey—our tool for ensuring that instance loss didn't affect a running service. We'd had such good luck that we extended our plans beyond Chaos Monkey to more Monkeys that would do nasty things to our environment. A new entry, Latency Monkey, would help us improve the resilience of our microservices by injecting latency and errors at our common IPC layer. What we thought was a safe little experiment went completely off the rails. The centralized SRE team, called CORE, realized that we'd have to think differently about outages and managing them if the company was going to be successful moving forward.
This is the story of how the centralized SRE team at Netflix changed and adapted to help the service and our engineering teams prepare for and handle problems—big and small—when they do occur.
Key Takeaways:
- How Netflix prepares for failures
- Incident handling at velocity requires special expertise
- Preparation and training of everyone who runs services is key to quick recovery
- You should spend more time learning after an incident than you do managing during one
- Every outage being unique is an excellent goal, and it takes work to make it happen