AI In DevOps Without The Hype Or Headaches
Practical ways we can ship faster, safer, and saner with AI
What We Mean By “AI” In A DevOps Team
When someone says “AI,” half the room imagines a friendly robot SRE sipping coffee and fixing incidents before they happen. The other half imagines a compliance officer sharpening a pen. In practice, what we usually mean in DevOps is narrower and more useful: models that can summarize, classify, suggest, and sometimes generate drafts (code, configs, runbooks) based on patterns they’ve seen before.
We’re not “replacing engineers.” We’re reducing the amount of time we spend doing the least enjoyable parts of engineering: trawling logs, writing the first version of a postmortem timeline, translating a vague ticket into a concrete checklist, or scanning a change for obvious foot-guns. Think of AI as an eager junior teammate who’s fast, tireless, and occasionally confident about something completely wrong.
The key is deciding where we can tolerate mistakes. For example, “suggest a Terraform module refactor” is fine if we review it. “Auto-apply to prod” is… how do we put this politely… a career-limiting move. The win comes from pairing AI with the existing DevOps safety rails we already trust: version control, review, tests, policy checks, and gradual rollout.
If we treat AI output as untrusted input—just like we treat anything coming from the internet—then it becomes a useful tool rather than a risk magnet. That framing (untrusted, reviewable, testable) keeps us honest and keeps the benefits real.
The Best Places To Use AI In The Pipeline
Not every part of the delivery chain benefits equally. We’ve seen the best results in “text-heavy” work and “pattern-heavy” work. Text-heavy: incident notes, runbooks, change summaries, PR descriptions, release notes, and stakeholder updates. Pattern-heavy: log clustering, anomaly hints, alert deduplication, and suggesting remediation steps based on past incidents.
Where it shines in CI/CD is at the edges: before the code is merged and after it’s deployed. Before merge, AI can act like a second set of eyes: flagging risky changes (“this opens a security group to the world”), suggesting missing tests, or producing a clearer migration plan. After deploy, it can help reduce cognitive load by summarizing what happened: “These four services spiked latency; the common dependency was Redis; errors correlate with deploy X.”
We also like AI for knowledge retrieval—if we keep it grounded. Point it at internal docs, runbooks, ADRs, and past postmortems, and it can answer “How do we rotate Kafka credentials again?” without someone spelunking Confluence for 20 minutes. The trick is requiring citations or links so engineers can verify the source, not just trust vibes.
A helpful litmus test: if the output is something we can diff, review, or validate automatically, it’s a good candidate. If it’s something that silently changes state (like deleting resources), keep humans firmly in the loop.
For external reading on the guardrails mindset, NIST’s work on risk management is a solid reference: NIST AI Risk Management Framework.
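That litmus test can be made mechanical. Here is a minimal Python sketch of treating AI output as untrusted input: parse it, check it, and refuse anything destructive before a human even reviews it. The field names and action list are illustrative, not a standard.

```python
import json

# Actions we never accept from a model, even as a draft. Illustrative list;
# tune it to your own estate.
FORBIDDEN_ACTIONS = {"delete", "destroy", "terminate"}

def gate_ai_suggestion(raw: str) -> dict:
    """Parse an AI-suggested change plan and reject anything destructive.

    Malformed output fails fast at json.loads; destructive steps raise
    before the plan ever reaches review.
    """
    plan = json.loads(raw)
    for step in plan.get("steps", []):
        if step.get("action", "").lower() in FORBIDDEN_ACTIONS:
            raise ValueError(f"destructive action blocked: {step['action']}")
    return plan

# Safe suggestion passes through and can now be diffed and reviewed.
suggestion = '{"steps": [{"action": "update", "target": "s3_bucket_policy"}]}'
plan = gate_ai_suggestion(suggestion)
```

The point is not the specific checks; it is that the gate runs before any human attention is spent, the same way a linter runs before review.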
Prompting Like We Mean It: Reproducible, Reviewable Output
We don’t need prompt poetry. We need prompts that behave like build scripts: boring, repeatable, and easy to improve. The biggest change is to stop asking, “Can you help?” and start specifying: role, context, constraints, and the format of the answer.
A prompt we can reuse should include:
– Goal: what we’re trying to achieve (e.g., “draft a runbook step list”).
– Inputs: logs, configs, error messages, links.
– Constraints: “don’t assume unknown infrastructure,” “no destructive commands,” “Kubernetes v1.29.”
– Output format: Markdown checklist, JSON, diff-style patch, etc.
– Verification: “include 3 risks and how to test.”
Here’s a prompt template we’ve used for incident triage summaries that’s surprisingly consistent:
You are our on-call assistant. Summarize the incident data below.
Context:
- System: payments-api on Kubernetes
- Time window: 10:05–10:35 UTC
- Recent changes: release 2026.04.12-3 deployed at 10:02 UTC
Constraints:
- If you infer, label it as "Hypothesis"
- Do not recommend destructive actions
Output (Markdown):
1) Symptoms (bullet list)
2) Impact (who/what)
3) Timeline (5-10 timestamps if present)
4) Likely causes (ranked, with confidence)
5) Next checks (commands or dashboards to consult)
Data:
<PASTE LOGS, ALERT TEXT, AND DEPLOY NOTES>
This gives us something we can paste into Slack or a ticket with minimal editing. It also forces the model to separate facts from guesses, which is where most “AI went wrong” stories begin.
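Treating prompts like build scripts also means generating them from code rather than retyping them in a chat box. A sketch using Python’s `string.Template`, mirroring the triage structure above (the function and field names are placeholders, not a fixed interface):

```python
from string import Template

# A versioned prompt template: boring, repeatable, easy to improve in review.
TRIAGE_TEMPLATE = Template("""\
You are our on-call assistant. Summarize the incident data below.
Context:
- System: $system
- Time window: $window
- Recent changes: $changes
Constraints:
- If you infer, label it as "Hypothesis"
- Do not recommend destructive actions
Data:
$data
""")

def build_triage_prompt(system: str, window: str, changes: str, data: str) -> str:
    # substitute() raises if a placeholder is missing, so incomplete
    # prompts fail loudly instead of silently producing vague output.
    return TRIAGE_TEMPLATE.substitute(
        system=system, window=window, changes=changes, data=data
    )

prompt = build_triage_prompt(
    "payments-api",
    "10:05-10:35 UTC",
    "release 2026.04.12-3 deployed at 10:02 UTC",
    "<PASTE LOGS, ALERT TEXT, AND DEPLOY NOTES>",
)
```

Because the template lives in version control, prompt tweaks get reviewed and diffed like any other change.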
For general prompt patterns and evaluation ideas, Google’s SRE material is still a great north star: Site Reliability Engineering book.
AI-Assisted IaC Reviews With Policy Checks (Code Included)
IaC is where we can get real value because it’s already review-driven and testable. The model can propose changes, but our pipeline enforces sanity. The sweet spot is using AI to draft a fix and using policy-as-code to verify it.
A practical flow:
1. AI suggests a Terraform change (e.g., tighten an S3 policy).
2. We run static checks (tflint, tfsec, checkov).
3. We run policy checks (OPA/Conftest).
4. We review and merge like adults.
Here’s a minimal Conftest policy example that blocks overly permissive security groups in Terraform plan JSON. It’s not fancy, but it saves us from the classics.
package terraform.security

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
  rc.change.after.from_port <= 22
  rc.change.after.to_port >= 22
  msg := "Security group rule allows SSH from 0.0.0.0/0"
}
And a GitHub Actions snippet to run it:
name: policy-checks
on: [pull_request]
jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install conftest
        run: |
          curl -L -o conftest.tar.gz https://github.com/open-policy-agent/conftest/releases/download/v0.56.0/conftest_0.56.0_Linux_x86_64.tar.gz
          tar -xzf conftest.tar.gz
          sudo mv conftest /usr/local/bin/
      - name: Terraform plan (JSON)
        run: |
          terraform init
          terraform plan -out=tfplan
          terraform show -json tfplan > tfplan.json
      - name: Conftest
        run: conftest test tfplan.json -p policy/
Now, AI can draft the SG rule changes all day long—our pipeline still blocks the bad ones. For more on OPA and friends: Open Policy Agent.
Incident Response: Faster Triage, Fewer False Leads
During an incident, we’re juggling imperfect info, time pressure, and the fact that someone always suggests “let’s restart everything” a little too quickly. AI can help by compressing the chaos: summarize alerts, extract common error strings, and propose a short list of likely culprits based on symptoms.
The danger is confirmation bias. If the model says “it’s DNS,” we’ll notice every DNS-looking clue and ignore the rest. We counter that by requiring AI to produce:
– a ranked list of hypotheses,
– explicit confidence levels,
– and “disconfirming checks” (what would prove this is not the issue).
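One way to enforce that structure is to ask for the hypotheses as JSON and validate them before anyone reads them. A sketch with an illustrative schema (the field names and bounds are our own convention, not a standard):

```python
import json

# Every hypothesis must carry a confidence and a way to disprove it.
REQUIRED_KEYS = {"hypothesis", "confidence", "disconfirming_check"}

def parse_hypotheses(raw: str) -> list[dict]:
    """Validate a model's triage answer and rank it highest-confidence first."""
    items = json.loads(raw)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"hypothesis missing fields: {missing}")
        if not 0.0 <= item["confidence"] <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
    # Ranked output reads top-down during an incident.
    return sorted(items, key=lambda h: h["confidence"], reverse=True)

raw = json.dumps([
    {"hypothesis": "DNS resolution failures", "confidence": 0.2,
     "disconfirming_check": "dig the upstream from an affected pod"},
    {"hypothesis": "Redis saturation", "confidence": 0.7,
     "disconfirming_check": "check Redis p99 latency for the window"},
])
ranked = parse_hypotheses(raw)
```

If the model can’t fill in a disconfirming check, that’s a useful signal in itself: the hypothesis is probably a guess.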
We also like using it after the fact to produce a first-draft postmortem timeline. Not the conclusions—those need engineering judgement—but the tedious assembly of “what happened when,” pulled from Slack timestamps, alert payloads, and deploy events.
Where AI helps most is when it can be anchored in our own telemetry. If we give it log excerpts and metrics summaries, it can point out correlations humans miss under stress. If we give it nothing but a vague “latency is up,” we get generic advice like “check CPU.” Thanks, we hadn’t considered electricity.
If you’re building a more serious incident workflow, it’s worth reading up on how modern observability stacks structure data. Honeycomb’s write-ups are consistently practical: Honeycomb Observability Concepts.
Shipping Securely: Secrets, Data Boundaries, And Redaction
The fastest way to turn “AI adoption” into “AI incident” is to paste secrets into a chat box. We’ve all seen it happen: someone drops a kubeconfig, a token, or customer data into a prompt because they’re trying to be helpful. The model is not the problem there—our process is.
We need clear boundaries:
– What data can be used (public code, synthetic logs, redacted snippets).
– What data cannot (secrets, credentials, customer PII, proprietary keys).
– Where prompts are allowed (approved tools, approved accounts, auditable logs).
– Retention rules (how long prompts and outputs are stored).
If we’re using AI for production troubleshooting, we should have automated redaction. Even a basic scrubber for obvious patterns (JWTs, AWS keys, PEM blocks) reduces risk. We also recommend “link, don’t paste” where possible: reference a dashboard or a log query, and only paste a small redacted excerpt.
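A basic scrubber for those obvious shapes fits in a few lines of Python. The regexes below are deliberately rough sketches for the three patterns mentioned, not a complete secret detector:

```python
import re

# Order matters only for overlapping patterns; these three are independent.
# Patterns are illustrative and loose; tune them to your own estate.
PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED-AWS-KEY]"),
    # JWTs: three base64url segments, the first starting with "eyJ".
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
     "[REDACTED-JWT]"),
    # PEM private key blocks, including multi-line bodies (re.S).
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?"
                r"-----END [A-Z ]*PRIVATE KEY-----", re.S),
     "[REDACTED-PEM]"),
]

def scrub(text: str) -> str:
    """Replace obvious secret shapes before text reaches any prompt."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Run it as a mandatory step in whatever tool assembles prompts, not as an optional habit; habits fail at 3 a.m.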
And we should assume the output could be wrong in subtle ways. Security advice is notorious for being confident and slightly outdated. So we route AI-suggested security changes through the same checks we already trust: SAST, dependency scanning, and policy enforcement.
For baseline security guidance and controls language, ISO is paywalled, but OWASP remains a good free resource: OWASP Top 10. It’s not DevOps-specific, but it’s a solid reminder of what not to accidentally automate.
Adoption That Doesn’t Annoy Everyone: Metrics And A Small Rollout
If we roll out AI like a mandatory “productivity initiative,” engineers will rightfully roll their eyes and use it in private anyway. We get better results by treating it like any other tooling change: start small, measure, and iterate.
A rollout that’s worked for us:
1. Pick two workflows: e.g., PR description drafting and incident summaries.
2. Create approved templates (prompts + output formats).
3. Add light governance: where data can come from, what cannot be pasted.
4. Measure outcomes for a month.
What do we measure? Not “number of prompts.” That’s vanity. We measure:
– PR cycle time (did review get faster or slower?)
– On-call time-to-triage (did we get to first plausible hypothesis quicker?)
– Quality signals (number of rollbacks, post-deploy incidents)
– Engineer sentiment (quick monthly pulse: “helpful / neutral / annoying”)
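Most of these metrics fall out of data we already have. PR cycle time, for example, is one query against the repo host plus a median; the timestamps below are made up for illustration:

```python
from datetime import datetime
from statistics import median

def cycle_hours(opened: str, merged: str) -> float:
    """Hours from PR opened to merged, given ISO-ish timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(merged, fmt) - datetime.strptime(opened, fmt)
    return delta.total_seconds() / 3600

# (opened, merged) pairs; in practice these come from your repo host's API.
prs = [
    ("2026-04-01T09:00:00", "2026-04-01T15:00:00"),  # 6h
    ("2026-04-02T10:00:00", "2026-04-03T10:00:00"),  # 24h
    ("2026-04-03T08:00:00", "2026-04-03T12:00:00"),  # 4h
]
print(f"median PR cycle time: {median(cycle_hours(o, m) for o, m in prs):.1f}h")
```

We use the median rather than the mean because a single week-long PR shouldn’t drown out the signal from the other fifty.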
We also set expectations: AI output is a draft. If someone treats it like a source of truth, we correct the behaviour, not the person. (Okay, sometimes we correct both, gently.)
Once the team sees real wins—like shaving 15 minutes off every incident write-up—it becomes self-sustaining. The best sign adoption is healthy is when people share prompt tweaks the way they share shell aliases: slightly nerdy, oddly proud, and genuinely useful.