SRE Wasn't Invited to the AI Party

The email comes down from leadership. "Everyone needs to leverage AI tools to increase productivity." The development teams nod knowingly. They've already got Copilot suggesting code completions, helping them write unit tests, and explaining complex functions.

Then there's you. Your team. The ones who keep everything running.

Your "code" is a Kubernetes manifest. A Terraform configuration. A bash script that glues three systems together because that's what the architecture demands. You open GitHub Copilot, and it suggests... what exactly? Another way to indent your YAML?

This is the infrastructure paradox. Leadership mandates AI adoption. Vendors showcase developer productivity gains. Meanwhile, SRE and infrastructure teams sit in the gap between the directive and the reality, unable to demonstrate value from tools that weren't built for how they work.

The Nature of Infrastructure Work

Here's what most people miss about infrastructure and operations work. Traditional software development follows patterns that AI tools learned from millions of GitHub repositories. Functions have similar structures. APIs follow conventions. Algorithms repeat across codebases.

Infrastructure work is fundamentally different.

You're not writing algorithms. You're declaring desired states. You're orchestrating systems that need to talk to each other. You're scripting one-off operations that handle the specific, often bizarre requirements of your particular environment. The patterns are there, but they're buried in an organizational context that no AI training set has ever seen.

Your team knows this intuitively. A developer might write a REST API that looks like thousands of other REST APIs. You're writing Terraform that provisions infrastructure in a way that satisfies your security team's requirements, works within your networking team's constraints, and accommodates that one legacy system that nobody fully understands anymore.

Where's the AI that helps with that?

The Mandate Without the Tools

The pressure is real. Leadership sees headlines about productivity gains. They read case studies showing developers shipping features faster. The directive comes down: demonstrate AI usage.

But demonstrate it how?

You can get AI to suggest a better way to structure a Helm chart. Maybe. If you prompt it carefully enough. But you already knew the better way. The constraint isn't knowledge. It's that five different teams have dependencies on the current structure, and changing it requires coordination you don't have time for.

You can ask AI to help debug a Kubernetes networking issue. It will suggest the same troubleshooting steps you already tried. Because the actual problem is that specific network policy your compliance team requires, interacting with that service mesh configuration put in place to solve a problem the team faced three reorganizations ago.

The traditional code companion tools optimize for a workflow that isn't yours. They assume you're building features. You're keeping infrastructure reliable. These are related but distinctly different challenges.

Where the Real Gaps Exist

The disconnect runs deeper than tools. It's about visibility.

Development teams can point to features shipped. "We used AI to accelerate this implementation." Infrastructure teams maintain systems that, when working correctly, are invisible. Your success is measured in what doesn't happen. Outages prevented. Issues caught before they cascade. Systems that scale smoothly when traffic spikes.

How do you demonstrate AI productivity gains when your job is to make sure nothing goes wrong?

Traditional metrics don't help. You can't show AI wrote 30% of your infrastructure code when most of your week was spent investigating why memory usage gradually increases on a specific node pool, then implementing a fix that amounts to a three-line configuration change after hours of analysis.

The knowledge work in infrastructure is different. It's systems thinking. Understanding blast radius. Predicting failure modes. These are areas where AI could potentially help, but not through code completion.

Where AI Could Actually Make a Difference

Think about what consumes most of your time. It's not writing manifests. It's troubleshooting. Incident response. That 2 AM page because something's wrong and you need to figure out what, fast.

You're correlating logs across multiple systems. Checking pod status. Examining metrics. Running diagnostic commands. Comparing current state against what should be normal. Most of this follows patterns, but they're patterns specific to your infrastructure, learned through painful experience.

This is where AI agents could actually help.

Not code completion. Not suggesting better YAML formatting. But agents that can parse logs, correlate data across systems, and help you find the needle in the haystack faster. Agents that reduce Mean Time To Resolution (MTTR) by doing the grunt work of data gathering and pattern recognition while you focus on analysis and decision-making.

Picture this: you get paged. Something's degraded. Instead of manually checking logs across services, querying metrics, checking pod status, and trying to piece together what's happening, you have an agent that's already doing that work.

It's parsing application logs and system logs simultaneously. It's pulling the current pod status and comparing it to the baseline. It's checking recent deployments and configuration changes. It's correlating error patterns with infrastructure events. And it's surfacing what matters, not dumping raw data at you.

The agent doesn't make the decision. You do. But it compresses the time between "something's wrong" and "I know what's wrong" from forty-five minutes of investigation to five minutes of analysis.

That's measurable. That's demonstrable AI value. And it's actually solving a real infrastructure problem.

Building Intelligence Into Operations

The key difference is context. Generic AI tools don't know your infrastructure. But an agent trained on your log patterns, your system topology, your typical failure modes? That becomes genuinely useful.

Start with the basics. Build an agent that can query multiple data sources simultaneously. Your logging system. Prometheus or whatever metrics platform you use. Kubernetes API for cluster state. Application health endpoints. Let it gather the information you'd normally gather manually.
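
Here's a minimal sketch of what that first step might look like, assuming a Prometheus endpoint and the Kubernetes Python client. The Prometheus address, namespace, PromQL query, and health endpoint below are placeholders for whatever your environment actually exposes.

```python
# Minimal sketch: gather state from several sources in parallel.
# PROM_URL, NAMESPACE, the PromQL query, and the health endpoint are assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address
NAMESPACE = "payments"                               # assumed namespace

def query_prometheus(promql: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]

def get_pod_states() -> list[dict]:
    """Pull current pod phases and restart counts from the Kubernetes API."""
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    pods = client.CoreV1Api().list_namespaced_pod(NAMESPACE)
    return [
        {
            "name": p.metadata.name,
            "phase": p.status.phase,
            "restarts": sum(cs.restart_count for cs in (p.status.container_statuses or [])),
        }
        for p in pods.items
    ]

def check_health(url: str) -> bool:
    """Hit an application health endpoint and report whether it answers 200."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

# Run the checks concurrently instead of one at a time.
with ThreadPoolExecutor() as pool:
    error_rate = pool.submit(
        query_prometheus, 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    )
    pod_states = pool.submit(get_pod_states)
    healthy = pool.submit(check_health, "http://payments.payments.svc/healthz")  # assumed

print({
    "error_rate": error_rate.result(),
    "pods": pod_states.result(),
    "healthz_ok": healthy.result(),
})
```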

Then add correlation. The agent learns what normal looks like in your environment. When it sees anomalies, it knows what related systems to check. That spike in error rates coincides with that deployment and that increase in pod restarts. The agent surfaces those connections.
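
As a rough illustration of that correlation step, here's a sketch that takes the moment an error spike started and pulls out the rollouts and restart bursts that landed in the same window. The event records use an assumed shape; in practice they'd come from the Kubernetes events API or your CD system's audit log.

```python
# Sketch of the correlation step: given when an error spike started, surface the
# rollouts and restart bursts that landed in the preceding window.
# The event dictionaries use an assumed shape.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def correlate(spike_start: datetime, events: list[dict]) -> list[dict]:
    """Return events that occurred within WINDOW before the spike began."""
    return [e for e in events if spike_start - WINDOW <= e["timestamp"] <= spike_start]

events = [
    {"timestamp": datetime(2024, 5, 1, 14, 2), "kind": "deployment", "detail": "payments-api rollout"},
    {"timestamp": datetime(2024, 5, 1, 13, 10), "kind": "restart", "detail": "cache-7f9 restarted 3x"},
    {"timestamp": datetime(2024, 5, 1, 14, 6), "kind": "restart", "detail": "payments-api-5c4 OOMKilled"},
]

spike = datetime(2024, 5, 1, 14, 8)
for e in correlate(spike, events):
    print(f"{e['timestamp']:%H:%M} {e['kind']}: {e['detail']}")
# Surfaces the 14:02 rollout and the 14:06 OOMKill; the 13:10 restart falls outside the window.
```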

The sophisticated part comes with pattern recognition. Over time, the agent learns from resolved incidents. It recognizes symptoms that previously indicated specific root causes. It doesn't diagnose definitively, but it suggests where to look first based on similar situations.
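
One lightweight way to sketch that, assuming you keep resolved incidents tagged with their symptoms, is a simple overlap score between the current symptoms and past ones. The incident records here are illustrative, not from any real system.

```python
# Sketch of the pattern-recognition step: score current symptoms against a small
# library of resolved incidents and suggest where to look first.
# The incident records and tags are illustrative only.

def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two symptom sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

past_incidents = [
    {"id": "INC-1042", "symptoms": {"5xx_spike", "pod_restarts", "recent_deploy"},
     "root_cause": "bad config value in rollout"},
    {"id": "INC-0977", "symptoms": {"latency_spike", "node_pressure", "oom_kills"},
     "root_cause": "memory leak on node pool"},
]

current = {"5xx_spike", "recent_deploy"}

for inc in sorted(past_incidents, key=lambda i: overlap(current, i["symptoms"]), reverse=True):
    score = overlap(current, inc["symptoms"])
    print(f"{inc['id']}  similarity={score:.2f}  previously: {inc['root_cause']}")
# The agent doesn't declare a diagnosis; it points at the closest precedent.
```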

This isn't hypothetical. Teams are building these agents now. They're using them to reduce MTTR from hours to minutes. They're demonstrating clear productivity gains that leadership actually understands. Because "we reduced incident response time by 60%" is a metric everyone recognizes.

The Practical Path Forward

You don't need to solve everything at once. Start where the pain is sharpest.

Pick your most common incident type. Maybe it's pod restarts. Build an agent that gathers all relevant data when pods restart unexpectedly. Logs from the pod before termination. Resource usage leading up to the restart. Recent config changes. Node health. Let the agent compile this automatically instead of you running six different commands.
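
A minimal sketch of that gatherer, using the Kubernetes Python client, might look like the following. The pod and namespace names are placeholders, and it assumes the restarted pod left a previous log stream to read.

```python
# Minimal sketch of the "pod restarted unexpectedly" gatherer.
# Pod and namespace names are placeholders.
from kubernetes import client, config

def gather_restart_context(pod_name: str, namespace: str) -> dict:
    config.load_kube_config()  # use load_incluster_config() when running as a job
    v1 = client.CoreV1Api()

    # Logs from the container instance that just died, not the fresh one.
    previous_logs = v1.read_namespaced_pod_log(
        pod_name, namespace, previous=True, tail_lines=200
    )

    # Kubernetes events scoped to this pod (OOMKilled, probe failures, evictions...).
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod_name}"
    )

    # Condition of the node the pod is scheduled on.
    pod = v1.read_namespaced_pod(pod_name, namespace)
    node = v1.read_node(pod.spec.node_name)

    return {
        "previous_logs": previous_logs,
        "events": [(e.reason, e.message) for e in events.items],
        "node": pod.spec.node_name,
        "node_conditions": {c.type: c.status for c in node.status.conditions},
    }

context = gather_restart_context("payments-api-5c4d9", "payments")  # placeholder names
print(context["events"])
```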

Test it during the next incident. Did it save time? Did it surface something you would have missed initially? Refine based on what you learn.

Then expand. Add more data sources. Improve correlation logic. Train it on your historical incidents so it recognizes patterns. Build in the tribal knowledge that currently lives in senior team members' heads.

The beautiful part? This is AI usage that leadership understands. You're not trying to explain how Copilot helped you write a Terraform module 15% faster. You're showing concrete improvements in system reliability and incident response. These are metrics that matter to the business.

What Good Looks Like

Imagine your operations workflow with intelligent agents embedded:

An alert fires. Before you even look at it, an agent has already gathered context. It's checked the logs across all affected services, compared the current system state against the baseline, and identified recent changes that might be relevant. It presents a summary: "Error rate increased 400% in service X, correlated with deployment Y, similar to incident Z from last month."

You're not starting from zero. You're starting from "here's what's different and here's what it might mean." Your experience and judgment determine the response, but you're not wasting time gathering basic information.

During remediation, the agent continues helping. You implement a fix. It monitors the impact across systems, confirming whether the change is having the desired effect or if something else needs adjustment.
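
A sketch of that monitoring loop, assuming a Prometheus endpoint and an error-rate query like the earlier one, could be as simple as a periodic check against your baseline.

```python
# Sketch of the "watch the fix" loop: after a change goes out, poll the error
# rate and report whether it is trending back toward the baseline.
# The PromQL expression, threshold, and Prometheus address are assumptions.
import time

import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
BASELINE = 0.5           # errors/sec considered normal in this example
CHECKS, INTERVAL = 10, 60

def current_error_rate() -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

for i in range(CHECKS):
    rate = current_error_rate()
    status = "recovering" if rate <= BASELINE else "still elevated"
    print(f"check {i + 1}/{CHECKS}: error rate {rate:.2f}/s ({status})")
    if rate <= BASELINE:
        break
    time.sleep(INTERVAL)
```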

After resolution, it helps with the post-mortem. Timeline of events already compiled. Related system changes documented. Similar past incidents linked. You write the analysis and learnings, but the data gathering is handled.
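
Even the timeline piece can start small: merge timestamped records from your alerting, deployment, and chat tooling into one ordered list. The sample records below are stand-ins for whatever your systems can export.

```python
# Sketch of the post-mortem helper: merge timestamped records from different
# sources into one ordered timeline. The sample records are stand-ins.
from datetime import datetime

alerts = [(datetime(2024, 5, 1, 14, 8), "alert", "ErrorRateHigh fired for payments-api")]
deploys = [(datetime(2024, 5, 1, 14, 2), "deploy", "payments-api v2.31 rolled out")]
actions = [(datetime(2024, 5, 1, 14, 31), "action", "rollback to v2.30 started")]

for ts, kind, detail in sorted(alerts + deploys + actions):  # tuples sort by timestamp first
    print(f"{ts:%H:%M}  [{kind}]  {detail}")
```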

This is AI that fits infrastructure work. It doesn't try to write your code. It helps you understand and respond to complex system behavior faster.

Moving Forward

The disconnect between AI mandates and infrastructure reality won't resolve itself. But you can address it by focusing on where AI actually provides value for your work.

Stop trying to force developer tools into infrastructure workflows. Acknowledge they're built for different problems. Instead, identify where AI agents could compress time-consuming investigation and correlation work.

Build internal capabilities. Your infrastructure is unique. Generic tools won't capture its complexity. Invest in building agents that understand your specific environment, even if they start simple.

Measure what matters. MTTR is a metric everyone understands. Incident volume reduction. Faster root cause identification. These demonstrate business value far better than "lines of manifest generated."

Share learnings across teams. Other infrastructure teams face the same challenges. Compare approaches to agent-based troubleshooting. Build collective knowledge about what works.

The mandate to "use AI" isn't going away. But you can shift the conversation from "how are you using Copilot" to "how are we using AI to improve system reliability and reduce incident response time."

That's a conversation worth having. Because it's about solving real problems, not checking compliance boxes.

The gap between AI hype and infrastructure reality is real. But there's a path through it. It just doesn't look like what vendors are selling to developers. It looks like agents that help you do what you already do, faster and more effectively.


That's where the value is. That's what's worth building.
