Session Description
Imagine an on-call shift where you don’t start your day sifting through routine capacity alerts, nudging stuck rollouts, or closing noisy, low-impact tickets. Instead, you get to tackle the things that matter. This is the goal we’re chasing at Google.
We are developing a system in which software agents autonomously handle a significant share of operational toil. The key is to do this generically and horizontally, so that the solutions apply broadly and cross the line between developers and operations.
In this session, I’ll share our journey and lessons learned. We’ll cover the significant challenges, including evaluation, ensuring safe and secure operations, and codifying complex, sometimes opinionated, remediation steps. I’ll also outline the infrastructure we’ve put in place to address those challenges and requirements.
This talk aims to provide a practical perspective on leveraging automation and agents in a production environment. You’ll leave with critical questions to consider for your own agent that interacts with production.
Speaker
Google, Production & AI manager
Today: Managing a team of Site Reliability Engineers
Before: >10 years as an IC SRE @ Google
PhD in Information Retrieval, Master in AI





