Netflix—as a service and a system—goes through an enormous amount of change all the time. Our engineering teams make thousands of changes a day while our customers stream hundreds of millions of hours of entertainment every day. At that velocity, an outage even seconds or minutes long has a real and noticeable impact on our customers. Stir in some Chaos Engineering and things become even more unpredictable.
The talk begins with a story. Netflix had a healthy relationship with Chaos Monkey—our tool for ensuring that instance loss didn't affect a running service. We'd had such good luck that we extended our plans beyond Chaos Monkey to more Monkeys that would do nasty things to our environment. A new entry, Latency Monkey, would help us improve the resilience of our microservices by injecting latency and errors at our common IPC layer. What we thought was a safe little experiment went completely off the rails. The centralized SRE team, called CORE, realized that we'd have to think differently about outages and managing them if the company was going to be successful moving forward.
This is the story of how the centralized SRE team at Netflix changed and adapted to help the service and our engineering teams prepare for and handle problems—big and small—when they do occur.
Key Takeaways:
- How Netflix prepares for failures
- Incident handling at velocity requires special expertise
- Preparation and training of everyone who runs services is key to quick recovery
- You should spend more time learning after an incident than you do managing during one
- Every outage being unique is an excellent goal, and it takes work to make it happen