LISA18
Monday, October 29 • 2:00pm - 2:30pm
Designing for Failure: How to Manage Thousands of Hosts Through Automation

At Uber, we run thousands of services on top of many thousands of hosts using Apache Mesos with the Apache Aurora framework. This setup ensures that when a host breaks, its services are automatically rescheduled to another host, but what happens to the host itself? What happens when a host is still running services but is misconfigured or has a hardware fault that may be degrading the performance of those services? What about when you want to upgrade the kernel or other software across your fleet? At Uber, we created the Cluster Lifecycle Manager (CLM) to answer these questions in a safe and automated way. In this talk, we will go through the architecture that makes this possible and how we ensure our actions don't impact services.


Brandon Bercovich

I've been working in the industry for over 18 years as a Systems Administrator, a DBA, and now an SRE. At Uber, my team manages our Compute platform through automation.

Monday October 29, 2018 2:00pm - 2:30pm CDT
Legends Ballroom ABC