By definition, SREs are responsible for the reliability of sites, but what if they don’t own any sites themselves? Within ING, the largest bank of the Netherlands, BizDevOps teams are autonomous and responsible for the build and run of their services. In theory, that could make the existence of SRE obsolete, right? How can you improve availability for end customers in an environment of engineers with full service ownership? How to convince without the power of intervention? How to improve without being blameful?
We’ll explain how we, a team of 8 SREs among 1700 DevOps engineers, try to improve stability by focussing on software engineering. We created the Reliability Toolkit to help BizDevOps teams with their reliability challenges in the fields of white box monitoring and alerting while minimizing toil.
This talk will explain:
- Our SRE team purpose and why we think our approach with heavy focus on software engineering works for our organization
- The concept of the Reliability Toolkit and introduction of its components and their set up (Prometheus, Alertmanager, Grafana, NGINX Log Aggregator, SMS and ChatOps functionalities)
- How we provision Reliability Toolkit
- How we convince, onboard and educate BizDevOps teams to use the Reliability Toolkit
To conclude we will end our talk with a demo of our Reliability Toolkit.