Build Boringly Reliable ai Into Your DevOps
Practical runbooks, configs, and metrics to ship ai without pager fatigue.
We Don’t “Adopt ai”; We Operate It
Let’s retire the fantasy that ai is a magical sidecar we bolt to our stack and call it a day. We don’t adopt ai; we operate it. That means SLOs, on-call expectations, telemetry, cost controls, and change management the same way we handle databases and message buses. Start by defining user-facing SLIs that reflect outcomes you care about. For a chat-assist feature, we’ve used p95 response latency < 1.2s, task success rate > 85% as judged by a deterministic rubric, and hallucination rate < 2% on a fixed evaluation set. “Vibes improved” doesn’t cut it.
Guardrails matter more than model selection. A safety budget, like an error budget, limits how much experimentation we allow before we clamp down. If the hallucination rate exceeds 2% for three consecutive hours, we automatically roll back the prompt or reroute traffic to a more conservative model. We also treat prompts and system instructions as code: version them, review them, test them. For testing, create a stable 1,000-sample eval corpus and run it nightly. When the model or prompt changes, you get a clean diff in outcome metrics.
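A minimal sketch of that nightly gate, assuming a JSONL eval corpus and a deterministic substring rubric; the file names, the generate() hook, and the thresholds are placeholders, not our actual harness:
import json

def passes_rubric(output: str, expected: str) -> bool:
    # Deterministic placeholder rubric; swap in your real grader.
    return expected.lower() in output.lower()

def run_eval(generate, corpus_path: str = "eval_set.jsonl") -> dict:
    n = passed = 0
    with open(corpus_path) as f:
        for line in f:
            case = json.loads(line)
            passed += passes_rubric(generate(case["input"]), case["expected"])
            n += 1
    return {"task_success_rate": passed / n}

def gate(current: dict, baseline_path: str = "eval_baseline.json", max_drop: float = 0.05) -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    # A drop beyond the budget blocks the prompt/model change and triggers rollback.
    if current["task_success_rate"] < baseline["task_success_rate"] - max_drop:
        raise SystemExit("eval regression detected: roll back the prompt/model change")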
Risk isn’t just reliability; it’s also ethics and compliance. We map our controls to the NIST AI RMF so security and legal don’t feel like referees—it’s just the same playbook we use elsewhere. Once you frame ai as a production service with SLIs and runbooks, your team stops chasing novelty and starts delivering repeatable value, calmly.
Wire ai Into CI: A Pull-Request Copilot
We love code review, but we don’t love 700-line diffs on a Friday. Let ai do the first pass. In CI, run a lightweight, deterministic model-assisted review that flags insecure patterns, migration risks, and missing tests. Keep it humble: it suggests; humans decide. The trick is to make its verdicts reproducible and auditable so we don’t argue with a black box during incident calls.
Here’s a minimal GitHub Actions workflow we’ve used to comment on PRs and fail only on high-risk findings. It caches the model hints in artifacts and records the prompt/version in logs so the result is explainable later:
name: pr-ai-review
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  ai_review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: pip install -q safety bandit
      - name: Analyze diff
        run: |
          git fetch origin ${{ github.base_ref }}
          git diff --unified=0 origin/${{ github.base_ref }}... > diff.patch
          bandit -r . -f json > bandit.json || true
          safety check --json > safety.json || true
      - name: Model-assisted summary
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
        run: |
          python .github/scripts/ai_review.py \
            --diff diff.patch \
            --static bandit.json safety.json \
            --out review.md
      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('review.md', 'utf8');
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body
            });
      - name: Fail on high risk
        run: python .github/scripts/score_exit.py review.md --threshold 0.9
If you’re starting from scratch, bookmark the GitHub Actions workflow syntax reference. Keep the failure threshold conservative at first (we use 0.9, as in the example above), and ratchet it down only after a few weeks of noise analysis to avoid developer eye-rolls.
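The score_exit.py gate isn’t shown above; here is one way it could work, assuming ai_review.py embeds a machine-readable line like "Risk-Score: 0.42" in the review (that format is an assumption, not a fixed contract):
import argparse
import re
import sys

def main() -> None:
    # Hypothetical sketch of .github/scripts/score_exit.py: parse a risk score
    # out of review.md and fail the job only when it exceeds the threshold.
    parser = argparse.ArgumentParser()
    parser.add_argument("review_file")
    parser.add_argument("--threshold", type=float, default=0.9)
    args = parser.parse_args()

    with open(args.review_file, encoding="utf-8") as f:
        text = f.read()
    match = re.search(r"^Risk-Score:\s*([0-9.]+)", text, re.MULTILINE)
    score = float(match.group(1)) if match else 0.0  # no score found => don't block

    if score >= args.threshold:
        print(f"High-risk findings (score {score:.2f} >= {args.threshold}); failing check.")
        sys.exit(1)
    print(f"Risk score {score:.2f} below threshold {args.threshold}; passing.")

if __name__ == "__main__":
    main()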
Observe the Invisible: Tokens, Latency, Drift
Observability for ai is different because “correctness” isn’t binary and inputs are messy. We focus on three pillars: live service metrics (latency, token counts, error rates), evaluation metrics (task success, hallucination rate), and lineage (which prompt/model/version produced which output). The first pillar looks like any microservice: we scrape metrics and trace request/response cycles. We prefer OpenTelemetry for traces because we can tag spans with prompt IDs, model routes, and experiment flags. The benefit is obvious when a perf spike happens and you can isolate it to “experiment=prompt_v17.”
Here’s a minimal Prometheus scrape for an internal LLM proxy that exposes standardized metrics like llm_tokens_total and llm_duration_seconds:
scrape_configs:
  - job_name: 'llm-proxy'
    scrape_interval: 10s
    static_configs:
      - targets: ['llm-proxy.svc.cluster.local:9090']
    metric_relabel_configs:
      - source_labels: [model, route]
        regex: '(.+);(.+)'
        target_label: model_route
        replacement: '$1:$2'
Pair metrics with traces. Emit spans for the end-to-end user request, with child spans for retrieval, generation, and post-processing. Use attributes like service.name, model.name, and prompt.version. The OpenTelemetry docs cover instrumenting HTTP clients and server middleware; wrap your ai calls there. For drift, keep a daily canary set—say 100 examples—and log the outputs for head-to-head comparison. When the canary task success drops by more than 5% day-over-day, fire an alert and pin the previous prompt or model until we understand what changed. We also snapshot embeddings for key queries weekly and compare cosine similarity distributions. It’s not perfect, but it’s a quick smoke test that catches silent regressions.
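As a sketch of that wrapping, assuming the OpenTelemetry SDK is already configured for the service; the call_model stub, its response shape, and the attribute values are placeholders:
from opentelemetry import trace

tracer = trace.get_tracer("llm-proxy")

def call_model(model: str, prompt: str) -> dict:
    # Stand-in for the real provider client; returns a canned response shape.
    return {"text": "...", "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 0}}

def generate(prompt: str, model: str = "small_model", prompt_version: str = "v23") -> str:
    # Tag the generation span so a perf spike can be isolated to a route or prompt version.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model.name", model)
        span.set_attribute("prompt.version", prompt_version)
        response = call_model(model, prompt)
        span.set_attribute("llm.tokens.prompt", response["usage"]["prompt_tokens"])
        span.set_attribute("llm.tokens.completion", response["usage"]["completion_tokens"])
        return response["text"]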
Put a Price on Every Prompt
Costs don’t explode; they creep—one verbose chain-of-thought at a time. We price every inference the same way we price a SQL query: tokens in, tokens out, latency, and downstream work. For a customer-support deflection bot, we discovered that truncating history to the last 6 messages cut average tokens by 41% with no measurable drop in solved-rate over 30 days. That was an easy win. Harder wins come from selective routing: ship easy tasks to a small, fast model; escalate only when confidence is low.
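In code, that routing can be a few lines; the model callables, the history truncation, and the 0.7 confidence cutoff below are illustrative, not our production router or any provider API:
from typing import Callable, Dict, List

def truncate_history(messages: List[dict], keep_last: int = 6) -> List[dict]:
    # Keep the system message plus the most recent turns; drop the rest.
    return messages[:1] + messages[1:][-keep_last:]

def answer(messages: List[dict],
           small_model: Callable[[List[dict]], Dict],
           large_model: Callable[[List[dict]], Dict],
           min_confidence: float = 0.7) -> str:
    trimmed = truncate_history(messages)
    draft = small_model(trimmed)                  # cheap first pass
    if draft.get("confidence", 0.0) >= min_confidence:
        return draft["text"]
    return large_model(trimmed)["text"]           # escalate only the hard cases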
Track cost like an SLI. If your proxy exports llm_tokens_total and llm_cost_usd_total, create a rolling budget and an alert that’s business-facing. A simple PromQL alert rule might look like:
groups:
  - name: llm-billing
    rules:
      - record: llm:cost:daily
        expr: sum(increase(llm_cost_usd_total[1d]))
      - alert: LlmDailyCostBudgetBreached
        expr: llm:cost:daily > 500
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM spend exceeded $500"
          description: "Review routing/prompt verbosity. Check model escalations and retry rates."
We set budgets per feature, not per cluster, so product managers see ownership. Also watch retries and timeouts—an 8% timeout rate can secretly double spend through automatic retries. If you store prompts/outputs, compress aggressively and define retention: 7 days for raw logs, 90 days for sampled evaluation sets, and forever for aggregated metrics only. Those numbers keep storage bills sane and make legal happy. Finally, quarantine verbose debugging; it’s useful in staging, expensive in prod.
Data Hygiene Beats Model Size
Data quality makes or breaks ai results. Before we debate model choices, we sanitize inputs, enforce schemas, and redact PII. You don’t want a customer’s credit card to become part of your “context.” We’ve had great results with a lightweight validation layer in the request path and daily batch checks on the source corpora. If the validator finds PII or corrupt markup, we either block the request or run a safe transformation—no exceptions in regulated environments.
You can use your favorite tools; we’ve leaned on Great Expectations for batch validation and a tiny Python guard for real-time checks. Here’s a compact Python example that enforces a schema and redacts emails/CC numbers before prompts hit the model:
import re
from typing import Dict

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

def validate_payload(p: Dict) -> Dict:
    assert isinstance(p.get("user_id"), str) and len(p["user_id"]) <= 64
    assert isinstance(p.get("query"), str) and 1 <= len(p["query"]) <= 4000
    q = EMAIL.sub("[email_redacted]", p["query"])
    q = CARD.sub("[card_redacted]", q)
    return {**p, "query": q}

# Example
payload = {"user_id": "u123", "query": "Hi, my card 4242 4242 4242 4242"}
safe = validate_payload(payload)
For batch pipelines, define expectations for null rates, duplicate ratios, and HTML well-formedness; block the run if thresholds exceed caps. If you need a solid starting point, the Great Expectations docs are practical and don’t require buying anything. Remember: a clean 10GB corpus beats a noisy 100GB one most days, and it’s cheaper to store, scan, and debug.
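If you’d rather defer adopting a framework, the same caps can be enforced with a short pandas script; the column name, file format, and thresholds below are examples only:
import sys
import pandas as pd

def check_corpus(path: str, max_null_rate: float = 0.01, max_dup_ratio: float = 0.02) -> None:
    # Batch checks on the source corpus: null rate, duplicate ratio, and a
    # crude HTML well-formedness probe (unbalanced angle brackets).
    df = pd.read_parquet(path)
    text = df["text"].fillna("")
    null_rate = df["text"].isna().mean()
    dup_ratio = df.duplicated(subset=["text"]).mean()
    unbalanced_html = (text.str.count("<") != text.str.count(">")).mean()

    failures = []
    if null_rate > max_null_rate:
        failures.append(f"null rate {null_rate:.3f} > {max_null_rate}")
    if dup_ratio > max_dup_ratio:
        failures.append(f"duplicate ratio {dup_ratio:.3f} > {max_dup_ratio}")
    if unbalanced_html > 0.05:
        failures.append(f"unbalanced HTML in {unbalanced_html:.1%} of rows")
    if failures:
        sys.exit("corpus checks failed: " + "; ".join(failures))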
Incidents With ai: A Real Postmortem Win
Here’s a real number from our pager logs. At 03:17 on a Thursday, our support-assist feature started returning off-topic answers after a schema change in the product knowledge base. Users saw “help” but got product marketing. Ouch. Normally, we’d pull two engineers into Zoom and spend 30–40 minutes collecting logs, screenshots, and trying to reproduce. Instead, our incident Slack channel auto-posted a live summary: p95 latency steady at 820ms, token usage normal, route=small_model, prompt.version bumped from v22 to v23 at 02:59, and evaluation canary success down 14%. A bot also attached five anonymized examples with links to their trace spans.
Two humans still did the thinking, but the summarizer shaved the “what changed?” hunt from 28 minutes to 6. We rolled back the prompt, pinned the model route, and restored quality by 03:45. Total MTTR: 28 minutes, down from our 90-day rolling average of 62 minutes. In the postmortem, we added a new guardrail: prompt version bumps auto-run the 1,000-sample nightly eval set and require a green check in production. We also added a pre-deploy check that validates knowledge base schema diffs against our retrieval templates.
The lesson wasn’t that ai magically fixed ai. It’s that we treated it like any other service: good telemetry, diffable changes, and tiny blast radiuses. The pager stayed quieter the next month, not because we “did ai,” but because we did ops.
Governance That Doesn’t Crater Velocity
Governance becomes tolerable when it’s codified, automated, and quick to change. We keep three lanes: what data can leave the boundary, who can deploy or tweak prompts/models, and how long we retain raw artifacts. If you’re on Kubernetes, put these rules near the workload with policy-as-code. For example, we block workloads from mounting LLM API keys outside whitelisted namespaces and require a “pii-scrubbed=true” label for anything that logs prompts. This prevents “just for testing” pods from quietly leaking data.
Here’s a sample Gatekeeper ConstraintTemplate and Constraint to give the idea:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sllmkeys
spec:
  crd:
    spec:
      names:
        kind: K8sLlmKeys
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sllmkeys
        violation[{"msg": msg}] {
          input.review.object.kind == "Deployment"
          ns := input.review.object.metadata.namespace
          not ns == "ai-approved"
          c := input.review.object.spec.template.spec.containers[_]
          e := c.env[_]
          startswith(e.name, "LLM_")
          msg := sprintf("LLM env var %v not allowed in namespace %v", [e.name, ns])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sLlmKeys
metadata:
  name: disallow-llm-env-outside-approved
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
We keep policy exceptions in version control with timeboxed expirations. For audit trails, we log prompts, outputs, and decisions with minimal PII and deterministic IDs. Retention is brutally short for raw data: 7 days live, 30 days cold. That’s enough to debug and re-run evals without hoarding risk. If you’re getting started, the OPA Gatekeeper docs show how to wire policies into admission so governance feels like merging code, not filing tickets.
What We’d Do Tomorrow Morning
– Publish SLIs/SLOs for your top ai feature: p95 latency, task success, hallucination rate.
– Add a CI step that posts model-assisted review comments but fails only on high risk.
– Instrument your ai calls with traces and tokens; expose metrics at /metrics and scrape every 10s.
– Create a 1,000-sample evaluation corpus and set a rollback threshold.
– Add a Prometheus alert for daily spend exceeding a business-owned budget.
– Enforce a simple PII redactor before prompts and require “pii-scrubbed=true” labeling in prod.
– Codify a tiny set of policies in Gatekeeper and timebox every exception.
None of this is flashy. It’s the boring discipline that lets us use ai without turning our weekends into incident retrospectives. When the graphs get dull and deploys get routine, we know we’re doing it right.