AI in DevOps: Fewer Fires, Better Sleep
Practical ways we can use AI without turning ops into a circus.
Where AI Actually Helps Our DevOps Work
We’ve all seen AI pitched as the magical intern who never sleeps. In DevOps, we don’t need magic; we need fewer incidents, faster recovery, and less time spelunking through logs at 2 a.m. The good news: AI can help—when we keep it on a short leash and give it clear jobs.
The most useful wins tend to be narrow and operational: summarising noisy alerts, clustering similar incidents, suggesting likely owners based on past tickets, and turning a pile of logs into a “probable story” we can validate. Think of it as a helpful analyst, not an autopilot. When AI is used to assist humans—triage, correlation, drafting runbooks, and creating decent first-pass postmortems—it can cut toil without increasing risk.
The trap is trying to make AI “run production” before we’ve mastered the basics: reliable telemetry, clean service ownership, consistent deployment metadata, and runbooks that don’t read like a treasure map. AI can’t infer what we haven’t instrumented.
So our approach is simple: pick one painful workflow (incident triage is a great starter), measure it (time-to-ack, time-to-mitigate, number of escalations), and add AI as a step that produces suggestions—not actions. If it’s helpful, we expand; if it’s noisy, we turn it off and go back to fixing fundamentals. DevOps is already exciting enough.
Telemetry First: Feed AI Something Worth Eating
AI projects fail in ops for the same reason diets fail: we “start Monday” with optimism and zero prep. For AI to help, our telemetry has to be consistent, queryable, and tagged with context—service name, environment, deployment version, region, and request identifiers. Without that, AI can’t connect symptoms to causes; it just regurgitates generic advice that sounds smart and solves nothing.
We should start by standardising a few things across services: structured logs (JSON), trace propagation, and metrics labels that don’t change every sprint. If we’re using OpenTelemetry, we can set baseline resource attributes so every signal carries the same identity. That’s not glamorous, but it’s the difference between “AI helped” and “AI hallucinated confidently.”
Here’s a minimal OpenTelemetry Collector snippet that enriches data and keeps it sane:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
      - key: service.namespace
        value: payments
        action: upsert
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com/v1/otlp
    headers:
      Authorization: "Bearer ${OTEL_TOKEN}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp]
If we’re unsure where to begin, the OpenTelemetry docs are a solid compass. Once signals are consistent, AI can do the fun stuff: correlate a spike in latency with a specific deployment and a noisy downstream dependency—without us playing detective across five dashboards.
Incident Triage With AI: Summaries, Clusters, and Suspects
Let’s be honest: most incidents start with confusion, not failure. The alerts fire, Slack lights up, and half the team is trying to remember what changed. AI can’t magically fix the broken thing, but it can compress the chaos into something we can act on.
A practical pattern: pipe alert context (firing alerts, recent deploys, error budgets, key logs) into a triage assistant that outputs (1) a short incident summary, (2) top suspected services, (3) similar past incidents, and (4) suggested next checks. Crucially, we keep it advisory. The human on call still decides.
We can also use clustering: grouping incoming alerts by common labels (service, region, error signature) and asking AI to label each cluster. That reduces the “50 alerts, 1 root cause” problem. If we’re already using an incident platform, we can integrate there; if not, even a lightweight script that posts into Slack can help.
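A minimal sketch of that grouping, with a normalisation step so near-identical messages share a signature (the alert fields here are illustrative, not any alerting system’s schema):

```python
import re
from collections import defaultdict

def signature(alert: dict) -> tuple:
    """Build a coarse grouping key from stable labels plus a normalised
    error message (digits and hex IDs replaced so similar errors match)."""
    msg = re.sub(r"0x[0-9a-f]+|\d+", "#", alert.get("message", "").lower())
    return (alert.get("service"), alert.get("region"), msg)

def cluster_alerts(alerts: list[dict]) -> dict:
    """Group alerts by signature; each cluster can then be labelled once."""
    clusters = defaultdict(list)
    for alert in alerts:
        clusters[signature(alert)].append(alert)
    return clusters
```

Fifty firing alerts typically collapse into a handful of clusters this way, and the model gets asked to label five things instead of fifty.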
Where we need discipline is prompt hygiene and boundaries. We don’t send secrets. We don’t ask it to “fix prod.” We ask it to “summarise evidence and propose hypotheses.” The best results come from giving it a structured template and forcing citations (links to dashboards, log queries, traces). If your tool supports it, retrieval-augmented generation (RAG) over internal runbooks makes answers less hand-wavy.
For inspiration on incident practices, Google’s SRE book remains a classic. Use AI to speed up the boring parts—context gathering and recall—so we can spend our brains on mitigation and learning, not on scrolling.
Runbooks That Don’t Rot: Generate Drafts, Then Review
Runbooks are like gym memberships: everyone loves the idea, few keep them current. AI can help by generating runbook drafts from the things we already have—Terraform, Kubernetes manifests, dashboards, and past incident notes—then we review and harden them. This flips the burden from “write from scratch” to “edit what’s mostly there,” which is a huge difference when we’re busy.
A good workflow: after each incident, we feed the timeline, the key graphs, and the final fix into a runbook generator. It creates or updates a markdown runbook with symptoms, immediate actions, verification checks, rollback steps, and escalation paths. Then we require human approval in the same PR that contains the actual code fix. No approval, no merge. This is how we avoid “tribal knowledge” living in one person’s head (or worse, in a Slack thread nobody can find).
Here’s a simple repository layout we’ve used to keep runbooks close to the services:
services/
  checkout/
    README.md
    runbooks/
      latency-spike.md
      5xx-errors.md
      dependency-timeouts.md
    dashboards/
      grafana.json
    alerts/
      prometheus-rules.yaml
And we can add a lightweight “runbook freshness” check in CI: if alerts changed but no runbook changed, flag it. Not block-by-default at first—just nudge.
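One way to sketch that nudge, assuming the repository layout above and a list of changed paths from the CI system (the `alerts`/`runbooks` directory names are ours):

```python
def stale_runbook_warnings(changed_files: list[str]) -> list[str]:
    """Warn when a service's alerts/ directory changed in a PR but its
    runbooks/ directory did not. Advisory: print, don't fail the build."""
    def service_of(path: str, marker: str):
        # "services/checkout/alerts/x.yaml" -> "services/checkout"
        parts = path.split("/")
        return "/".join(parts[:parts.index(marker)]) if marker in parts else None

    alert_services = {s for f in changed_files if (s := service_of(f, "alerts"))}
    runbook_services = {s for f in changed_files if (s := service_of(f, "runbooks"))}
    return [f"{s}: alerts changed but no runbook was updated"
            for s in sorted(alert_services - runbook_services)]
```

Wiring this into CI is a one-liner: feed it the PR’s changed files and post the warnings as a comment.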
This is also where external references help: the NIST AI Risk Management Framework is useful for thinking about governance without turning it into a paperwork festival. We’re not trying to produce perfect docs; we’re trying to make the next on-call shift less miserable.
Safer Deployments: AI-Assisted Reviews and Policy Checks
Code reviews and change approvals are ripe for AI assistance, mostly because humans are inconsistent when tired. We can use AI to spot obvious risks: changing a timeout without updating retries, introducing a breaking API change, widening IAM permissions, or deploying a schema change without a backfill plan. Again: suggestions, not automatic merges.
The key is to constrain scope. We don’t ask, “Is this good?” We ask: “List potential operational risks in this diff.” We also provide context: service tier, SLOs, dependencies, deployment strategy. The assistant can then propose a checklist tailored to the change, which the reviewer can validate.
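As a sketch of that constrained framing, assuming we assemble the context ourselves (the keys below are illustrative, not a standard):

```python
def build_review_prompt(diff: str, context: dict) -> str:
    """Assemble a narrowly scoped review prompt: risks only, with
    evidence, no approve/reject verdict."""
    return "\n".join([
        "You are reviewing a production change. Do NOT approve or reject.",
        "List potential operational risks in this diff, one per line,",
        "each with the evidence (file and hunk) that suggests it.",
        f"Service tier: {context.get('tier', 'unknown')}",
        f"SLO: {context.get('slo', 'unknown')}",
        f"Dependencies: {', '.join(context.get('dependencies', []))}",
        f"Deployment strategy: {context.get('strategy', 'unknown')}",
        "--- DIFF ---",
        diff,
    ])
```

The point of templating this is repeatability: every PR gets the same question, so the answers are comparable and the reviewer knows what to expect.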
We should also keep hard rules as code. AI is not a replacement for policy-as-code; it’s a helpful second set of eyes. Tools like Open Policy Agent can enforce the non-negotiables, and AI can explain failures in plain language so devs don’t rage-quit.
Here’s a tiny example of a CI job that runs policy checks and posts a human-friendly summary:
name: policy-checks
on: [pull_request]
jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install conftest
        run: |
          curl -L -o conftest.tar.gz https://github.com/open-policy-agent/conftest/releases/download/v0.55.0/conftest_0.55.0_Linux_x86_64.tar.gz
          tar -xzf conftest.tar.gz
          sudo mv conftest /usr/local/bin/
      - name: Test Kubernetes manifests
        run: conftest test k8s/ --policy policy/
If we want to learn policy-as-code properly, Open Policy Agent has approachable docs. Combine that with AI-based review comments, and we get both: guardrails and helpful explanations.
Cost and Capacity: Let AI Find Waste (Then We Confirm)
The cloud bill is the quiet incident that happens every month. AI can help spot anomalies and waste: a service whose CPU requests are 5x actual usage, a sudden egress jump after a deploy, or a forgotten test environment running “temporarily” since 2022.
The trick is to tie spend to ownership and change events. If our cost data isn’t tagged by service/team, AI can’t route the findings to the right people. If we don’t track deploy metadata, we can’t correlate cost spikes with changes. So we start with tagging standards and basic FinOps hygiene, then we add AI-based anomaly detection and recommendations.
What’s useful in practice:
– “This namespace’s memory requests are consistently under 20% utilised.”
– “This load balancer has near-zero traffic for 30 days.”
– “Egress jumped 40% after build 1.3.18—check new image pulls or analytics calls.”
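The first of those findings reduces to a simple threshold check once the data is aggregated. A sketch, assuming per-namespace requests and usage are already collected (the input shape is hypothetical):

```python
def underutilised(namespaces: dict[str, dict], threshold: float = 0.20) -> list[str]:
    """Flag namespaces whose memory usage sits well below requests.
    Expects {name: {"requested_mb": ..., "used_mb": ...}} aggregates."""
    findings = []
    for name, usage in sorted(namespaces.items()):
        if usage["requested_mb"] <= 0:
            continue  # skip namespaces with no requests set
        ratio = usage["used_mb"] / usage["requested_mb"]
        if ratio < threshold:
            findings.append(f"{name}: memory requests {ratio:.0%} utilised")
    return findings
```

The output is a candidate list for the weekly review, not an automatic right-sizing action.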
We still validate before acting. Automated cost “optimisation” can accidentally remove headroom and trigger performance incidents. We prefer a workflow where AI proposes candidates, we review in a weekly ops/cost session, and then we ship changes with the usual safety nets.
If you’re building this out, the Kubernetes docs on requests/limits are worth revisiting—most waste begins there. AI helps us find the outliers; good engineering practices keep us from “optimising” ourselves into an outage.
Governance Without the Fun Police: Access, Data, and Auditability
If we’re going to use AI in DevOps, we need a few boring rules to keep it from becoming a security incident generator. The goal isn’t to slow teams down; it’s to make sure we can answer: what data did we send, who approved it, and what did the tool output?
We can keep this practical:
– Use an approved gateway for AI calls so we can centralise logging, rate limits, and redaction.
– Classify data: “OK to share,” “internal only,” “never share.” On-call logs often contain secrets and customer data—assume they’re radioactive until proven otherwise.
– Store prompts and outputs for a limited time for auditing and incident review, then expire them.
– Require human confirmation for any action that changes production (tickets, config updates, rollbacks).
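A minimal sketch of the redaction step, to run before anything leaves the gateway (these three patterns are illustrative only; a real deployment would use a maintained secret scanner, not a handful of regexes):

```python
import re

# Illustrative patterns: token-like strings, emails, and key=value secrets.
PATTERNS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"), "[REDACTED_TOKEN]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"(?i)(password|secret)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]

def redact(text: str) -> str:
    """Scrub obvious credentials and PII from prompt text before sending."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Centralising this in the gateway means one place to add patterns, one place to audit, and no per-team drift.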
We also need to set expectations: AI output is advice. It can be wrong, outdated, or overly confident. That’s not scandalous; it’s normal. The fix is to demand evidence: links to dashboards, log queries, and diffs. If it can’t cite sources, it’s just vibes.
A lightweight governance doc plus a couple of technical controls usually beats a 40-page policy nobody reads. When we keep AI bounded—clear data rules, clear action boundaries, clear accountability—we get the benefits without introducing a new class of “whoops” incidents. And we can all go back to arguing about YAML like professionals.