Ship AI With 99.9% Sanity: A DevOps Playbook
Practical guardrails, configs, and checklists to run AI safely in prod.
The Messy Truth: AI Is Probabilistic, Ops Isn’t
Let’s start with the uncomfortable bit: AI is probabilistic, and production ops is not. We’re used to reproducible builds, deterministic rollouts, and crisp SLOs. LLMs and generative models happily respond with “it depends,” and that “depends” changes over time with new tokens, model patches, and upstream provider tweaks. If we treat these systems like yet another microservice and ship them with the same mental model, we’ll get paged for vibes. We need seatbelts specific to stochastic behavior: versioning prompts and model weights, capping variability where it matters (temperature, top-p), and capturing the exact inputs and context that produced each response so we can debug Tuesday’s incident on Wednesday. The toolkit looks familiar—feature flags, canaries, shadow traffic, red/black deployments—but the targets are new: prompts, system messages, and retrieval pipelines.
We also have to accept that quality isn’t a single metric. Latency and availability still matter, but now we track hallucination rate, toxicity, cost per 1,000 tokens, and task success against a golden set. We’ll fail if we try to “unit test” a model into certainty; we win when we bound risk, monitor drift, and make rollbacks cheap. Borrow a page from SRE: define error budgets not just for uptime but for behavior quality, and tie release velocity to those budgets. The Google SRE book frames this trade-off well—swap “request errors” for “bad outputs,” and the logic holds. In short, treat AI’s uncertainty as a first-class citizen in our ops design, not a footnote we hope the pager ignores.
Control Planes Beat Heroics: Design the AI Backbone
We can scale heroics for a week, maybe two. After that, we need a control plane that understands AI components as declarative objects: prompts, models, embeddings, retrieval indexes, and policies. Think of it as our eek-to-okay translation layer. On the data path, keep inference stateless where possible and isolate non-determinism. Wrap the model behind a consistent API that enforces headers for version, temperature, safety level, and timeout. Those headers become trace attributes and metrics labels. On the control path, track the lineage: which prompt version shipped with which model checksum, which retrieval index snapshot, and which safety policy. That lineage is how we explain “what changed?” when an on-call graph goes jagged.
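A minimal sketch of that choke point, assuming a generic SDK client (the InferenceHeaders shape and the client.generate call are placeholders, not any specific vendor API):

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class InferenceHeaders:
    prompt_version: str    # e.g. "support-bot@v14" (illustrative)
    model_digest: str      # checksum of the checkpoint actually served
    temperature: float
    safety_level: str      # "low" | "medium" | "high"
    timeout_s: float

def call_model(client, prompt: str, headers: InferenceHeaders) -> dict:
    """Single choke point for inference: enforce caps, then attach lineage."""
    if headers.temperature > 0.7:
        raise ValueError("temperature above the cap agreed for this endpoint")
    # 'client.generate' stands in for whatever SDK we wrap behind this API.
    response = client.generate(prompt=prompt,
                               temperature=headers.temperature,
                               timeout=headers.timeout_s)
    # Lineage rides along with the output so logs, traces, and metrics share one schema.
    return {"output": response, "lineage": asdict(headers)}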
On Kubernetes, we like using a serving layer that supports canaries and autoscaling tuned to token-intensive workloads rather than CPU-only heuristics. Systems like KServe make model serving feel like deploying a Deployment—rollouts, scaling, and routing included. Even if we’re using a vendor API, we can mirror requests to a local or alternative model for shadow comparisons. We’ll also want an offline registry for prompts and models, not just code. A simple rule helps: if we change it in prod, it needs an artifact version and a changelog. That includes templates and retrieval query plans.
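One way to do the mirroring without touching the hot path, sketched with placeholder primary and shadow clients:

import concurrent.futures

_mirror_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def serve_with_shadow(primary, shadow, request: dict, log_comparison):
    """Answer from the primary model; mirror the same request to a candidate off the hot path."""
    live_answer = primary.generate(**request)        # user-facing path, unchanged

    def mirror():
        try:
            candidate = shadow.generate(**request)   # same inputs, different model
            log_comparison(request, live_answer, candidate)
        except Exception:
            pass                                     # shadow failures must never page anyone

    _mirror_pool.submit(mirror)                      # fire-and-forget
    return live_answer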
Finally, commit to separation of concerns. The model doesn’t get to call arbitrary tools without an orchestrator approving the plan. Tool use is powerful, but it’s also the fastest path to “LLM executed the wrong thing at 2 a.m.” Keep the blast radius small with scoped credentials and brokered calls.
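A hedged sketch of that broker, with a hypothetical tool allowlist and placeholder credential and execution helpers:

ALLOWED_TOOLS = {
    "search_docs":   {"scope": "docs:read",     "needs_approval": False},
    "create_ticket": {"scope": "tickets:write", "needs_approval": False},
    "shell_exec":    {"scope": "ops:exec",      "needs_approval": True},  # high risk
}

def mint_scoped_token(scope: str) -> str:
    """Placeholder: exchange for a short-lived, least-privilege credential."""
    return f"short-lived-token:{scope}"

def execute_tool(tool: str, args: dict, credential: str):
    """Placeholder: the only code path that actually touches external systems."""
    raise NotImplementedError

def broker_tool_call(tool: str, args: dict, approved_by_human: bool = False):
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        raise PermissionError(f"tool {tool!r} is not registered")
    if spec["needs_approval"] and not approved_by_human:
        raise PermissionError(f"tool {tool!r} requires explicit approval")
    return execute_tool(tool, args, mint_scoped_token(spec["scope"]))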
CI/CD For AI: Test Prompts, Models, And Rollbacks
Our pipeline should test more than code. We test prompts, safety policies, retrieval logic, and the ability to roll back fast. We can’t guarantee identical outputs, but we can freeze seeds where supported, set temperature to 0 so unit tests are as close to deterministic as the provider allows, and use statistical or golden tests for eval suites. When a change lands, we want three gates: regression checks on a curated eval set, cost/latency guard checks, and safety screens for jailbreaks or sensitive terms. If any gate yells, the merge waits—no exceptions because “demo is in two hours.”
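The gate check itself can be a small script that CI runs against the eval report; the report fields and thresholds below are illustrative, not a standard:

import json
import sys

# Illustrative thresholds; tune per endpoint and record them with the release.
THRESHOLDS = {"eval_pass_ratio": 0.92, "cost_dollars": 5.00, "safety_violations": 0}

def check_gates(report_path: str) -> int:
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    failures = []
    if report["eval_pass_ratio"] < THRESHOLDS["eval_pass_ratio"]:
        failures.append("eval regression")
    if report["cost_dollars"] > THRESHOLDS["cost_dollars"]:
        failures.append("cost guard exceeded")
    if report["safety_violations"] > THRESHOLDS["safety_violations"]:
        failures.append("safety screen tripped")
    for failure in failures:
        print(f"GATE FAILED: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check_gates(sys.argv[1]))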
Store prompt and model artifacts with the same rigor as containers. A registry like MLflow Model Registry (or your chosen tooling) records versions, stage tags (Staging/Production), and annotations for eval scores. Every CI run should produce a signed artifact—prompt template, tokenizer config, retrieval query—and a JSON manifest with metrics, ready to attach to a release.
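One possible shape for that manifest (field names are ours, not MLflow’s), roughly what a tools/package_manifest.py could emit; paths and digests here are illustrative:

import hashlib
import json
import time

def build_manifest(prompt_path: str, model_digest: str, eval_metrics: dict) -> dict:
    prompt_text = open(prompt_path, encoding="utf-8").read()
    return {
        "created_at": int(time.time()),
        "prompt": {"path": prompt_path,
                   "sha256": hashlib.sha256(prompt_text.encode()).hexdigest()},
        "model_digest": model_digest,          # checksum of the checkpoint or pinned API model
        "eval": eval_metrics,                  # pass ratio, cost, latency from the CI run
    }

if __name__ == "__main__":
    manifest = build_manifest("prompts/support.txt", "sha256:example",
                              {"eval_pass_ratio": 0.94, "cost_dollars": 3.12})
    json.dump(manifest, open("dist/manifest.json", "w"), indent=2)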
A sketch in GitHub Actions might look like this:
name: ai-ci
on: [push]
jobs:
  test-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - name: Unit tests (deterministic)
        run: pytest -q tests/unit --maxfail=1
        env:
          LLM_TEMPERATURE: "0"
      - name: Eval suite (golden set)
        run: python tools/run_eval.py --dataset data/golden.jsonl --budget 100000
      - name: Safety checks
        run: python tools/safety_scan.py --input data/adversarial.jsonl
      - name: Package artifacts
        run: python tools/package_manifest.py --out dist/manifest.json
      - name: Upload to registry
        run: python tools/publish.py dist/*
We’re not chasing perfection; we’re enforcing gates so “slightly worse but cheaper” or “much faster with same quality” becomes a conscious choice, recorded, and reversible.
Production Guardrails: Policies, Rate Limits, and Sanitizers
Once requests hit prod, we want strict bouncers at the door and sober chaperones inside. Rate limits per API key, per user, and per tenant stop accidental model meltdowns. Timeouts and circuit breakers prevent model slowdowns from cascading to our entire stack. Request sanitizers trim prompts, redact obvious PII, and normalize inputs before inference. On the outbound side, we cap response size, scrub secrets, and add a trailing safety classifier if the stakes are high. We treat the model as creative, not authoritative—anything actionable demands a second check.
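A deliberately naive sanitizer sketch; real deployments layer a proper PII detector on top of patterns like these:

import re

MAX_PROMPT_CHARS = 8_000
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED_EMAIL]"),
]

def sanitize_prompt(raw: str) -> str:
    """Trim, cap size, and redact obvious PII before anything reaches the model."""
    text = raw.strip()[:MAX_PROMPT_CHARS]
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text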
Policy engines help keep this sane. We like a sidecar with Open Policy Agent, evaluating a Rego policy for every request. Keep policies versioned, testable, and deployable separately from the app. Example policy snippet:
package ai.gateway

default allow = false

# Terms that should never reach the model verbatim
banned_terms := {"password", "ssn"}

# Hard caps
deny[{"msg": "tokens over limit"}] {
  input.request.max_tokens > 1024
}

# Block obvious secrets or banned terms
deny[{"msg": "banned content"}] {
  term := input.request.prompt_terms[_]
  banned_terms[term]
}

# Require safety level for dangerous tools
deny[{"msg": "unsafe tool invocation"}] {
  input.request.tool == "shell_exec"
  not input.request.safety_level == "high"
}

allow {
  count(deny) == 0
}
Place this in front of the model endpoint and log every decision with request IDs. Combine with per-tenant quotas to avoid a single integration burning your monthly budget by lunch. For public APIs, add captcha or signed nonces. It’s not glamorous, but it keeps us asleep at 3 a.m., which is our real SLO.
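On the gateway side, the check against the OPA sidecar is one HTTP call to OPA’s data API; the sidecar address below assumes a default local deployment:

import requests

OPA_URL = "http://127.0.0.1:8181/v1/data/ai/gateway"   # default sidecar address; adjust to your setup

def policy_allows(request_payload: dict, request_id: str) -> bool:
    decision = requests.post(OPA_URL,
                             json={"input": {"request": request_payload}},
                             timeout=0.2).json()
    result = decision.get("result", {})
    allowed = bool(result.get("allow", False))
    # Log every decision with the request ID so audits can replay who blocked what.
    print({"request_id": request_id, "allow": allowed, "deny": result.get("deny", [])})
    return allowed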
Observability That Understands Tokens, Latency, and Drift
If we can’t see it, we can’t fix it. AI workloads need observability that speaks tokens and prompts, not just CPU and 99th percentile latency. First, structure the logs: include prompt_id, prompt_version, model_name, model_digest, temperature, max_tokens, user/tenant IDs (hashed), and a correlation request_id. Pair inputs and outputs with hashed content to avoid storing raw PII while still enabling dedupe and trace linking. Second, expose metrics that matter: tokens_in_total, tokens_out_total, cost_dollars_total (labeled by tenant), generation_latency_ms, safety_block_count, and eval_pass_ratio by prompt_version. If we label these consistently, dashboards become useful instead of abstract art.
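Declared once with prometheus_client (or whatever metrics library we already run), those names and labels look like this; the bucket boundaries are illustrative:

from prometheus_client import Counter, Histogram

LABELS = ["tenant", "model_name", "prompt_version"]

TOKENS_IN = Counter("tokens_in_total", "Prompt tokens consumed", LABELS)
TOKENS_OUT = Counter("tokens_out_total", "Completion tokens produced", LABELS)
COST_DOLLARS = Counter("cost_dollars_total", "Accumulated spend in USD", LABELS)
SAFETY_BLOCKS = Counter("safety_block_count", "Requests blocked by safety policy", LABELS)
GENERATION_LATENCY = Histogram("generation_latency_ms",
                               "End-to-end generation latency (ms)",
                               LABELS,
                               buckets=(100, 250, 500, 1000, 2500, 5000, 10000))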
Traces help too. Wrap the inference call with spans for retrieval, model, and post-processing. Attach attributes like prompt_version and safety_policy_version so we can correlate spikes with a specific rollout. We don’t need to reinvent telemetry; use your existing stack with custom metrics. The OpenTelemetry docs are a solid base—just add AI-specific attributes and sampling rules. For storage, log bodies can get large; keep raw text in short-term storage and samples for longer windows, or store hashes plus pointers to encrypted blobs.
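A tracing sketch along those lines, with the retrieval, model, and post-processing steps passed in as placeholder callables:

from opentelemetry import trace

tracer = trace.get_tracer("ai.gateway")

def handle_request(query, retrieve, generate, postprocess,
                   prompt_version: str, safety_policy_version: str):
    """retrieve/generate/postprocess are whatever callables our pipeline already has."""
    with tracer.start_as_current_span("llm_request") as span:
        span.set_attribute("prompt_version", prompt_version)
        span.set_attribute("safety_policy_version", safety_policy_version)
        with tracer.start_as_current_span("retrieval"):
            context = retrieve(query)
        with tracer.start_as_current_span("model"):
            answer = generate(query, context)
        with tracer.start_as_current_span("post_processing"):
            return postprocess(answer)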
Finally, we need drift detectors. For retrieval, watch hit-rate and embedding similarity distributions; arrows pointing down usually precede “the bot forgot everything.” For model responses, watch output length, refusal rates, and cost-per-request. Alert on deltas, not absolute numbers. Models don’t break like web servers; they fade, wander, and then fall over. Our graphs should notice the wandering.
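A small delta-based detector along those lines, with made-up window sizes and thresholds as starting points:

from collections import deque

class RefusalDriftDetector:
    def __init__(self, baseline_size: int = 24, delta_threshold: float = 0.10):
        self.hourly_rates = deque(maxlen=baseline_size)   # trailing baseline, e.g. 24 hours
        self.delta_threshold = delta_threshold

    def observe(self, refusal_rate: float) -> bool:
        """Record one hourly rate; return True if it drifted beyond the threshold."""
        drifted = False
        if self.hourly_rates:
            baseline = sum(self.hourly_rates) / len(self.hourly_rates)
            drifted = abs(refusal_rate - baseline) > self.delta_threshold
        self.hourly_rates.append(refusal_rate)
        return drifted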
Cost And Latency: Make AI Affordable At 3 A.M.
AI is remarkable at two things: answering questions and generating bills. We keep both under control with a few habits. Cache aggressively at the right layers: response cache for deterministic Q&A, embedding cache for repeated chunks, and retrieval cache keyed by query fingerprints. It’s not cheating; it’s good manners. Set strict per-request budgets: max_tokens_out, max_context_bytes, and max_latency_ms. If a request threatens those limits, either refuse gracefully or route to a cheaper path. When latency matters, try batching small generations, precomputing summaries, or using streaming to delight users while the rest trickles in.
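A sketch of the budget-plus-cache path, with illustrative caps and an in-memory cache standing in for whatever store we actually use:

import hashlib
import json

BUDGET = {"max_tokens_out": 512, "max_context_bytes": 32_000, "max_latency_ms": 4_000}
_response_cache: dict[str, str] = {}

def fingerprint(prompt: str, context: str, prompt_version: str) -> str:
    blob = json.dumps({"p": prompt, "c": context, "v": prompt_version}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def answer(prompt: str, context: str, prompt_version: str, generate) -> str:
    if len(context.encode()) > BUDGET["max_context_bytes"]:
        context = context[: BUDGET["max_context_bytes"] // 2]   # or route to a summarizer
    key = fingerprint(prompt, context, prompt_version)
    if key in _response_cache:
        return _response_cache[key]                             # cache hit: zero tokens spent
    result = generate(prompt, context, max_tokens=BUDGET["max_tokens_out"])
    _response_cache[key] = result
    return result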
Pick model tiers with intent. High-end models are great for complex tasks; smaller or specialized models often do 80% of the job at 20% of the price. Maintain a fallback chain: try the preferred model; if unavailable, use the backup; if slow, degrade gracefully with a simpler template. Calculate cost per successful task, not per request. That encourages better prompts and retrieval instead of just spamming tokens. Your finance team will send cookies.
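The fallback chain can stay boring; a sketch with placeholder model clients:

def generate_with_fallback(request: dict, preferred, backup, canned_reply: str) -> str:
    """Try the preferred tier, drop to the backup on errors or slow responses, degrade last."""
    for model in (preferred, backup):
        try:
            return model.generate(**request)        # each tier enforces its own timeout
        except TimeoutError:
            continue                                # too slow: fall through to the next tier
        except Exception:
            continue                                # unavailable: same story
    return canned_reply                             # graceful degradation beats a 500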
Create a cost SLO per tenant and environment, and enforce it in the gateway. Simple alert: if cost_dollars_total{tenant="X"} crosses N per day, alert and cut rate limits until someone looks. The AWS Well-Architected cost pillar has timeless advice—apply the same thinking to tokens and context length. Also, stop freewheeling temperature 1.0 in prod unless you’re writing poetry. That’s for staging, demos, and late-night hackery, not high-traffic endpoints.
Incidents Without Drama: Rollbacks, Kill Switches, And Drift Hunts
Incidents with AI feel different. Nothing crashed, but suddenly the bot recommends pineapple on everything and calls it “nuanced.” Our playbook starts with blast-radius control. Every prompt/version/model combo must be routable via flags so we can flip back to the last known good without redeploying. Keep a universal kill switch for tools and a range-limited one for high-risk actions. Rollbacks should touch configs first, artifacts second, and code last—speed matters.
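A sketch of that flag-driven routing, with hypothetical version and model names; the point is that a rollback is a config flip, not a deploy:

ROUTING = {
    "active":          {"prompt_version": "v15", "model": "primary-large"},
    "last_known_good": {"prompt_version": "v14", "model": "primary-large"},
    "tools_enabled": True,             # universal kill switch for tool use
    "high_risk_tools_enabled": False,  # narrower switch for the scary ones
}

def resolve_route(rollback: bool = False) -> dict:
    """Return the prompt/model combo plus tool switches for this request."""
    route = ROUTING["last_known_good"] if rollback else ROUTING["active"]
    return {**route,
            "tools_enabled": ROUTING["tools_enabled"],
            "high_risk_tools_enabled": ROUTING["high_risk_tools_enabled"]}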
For detection, we rely on soft signals: rising refusal rates, shrinking output lengths, safety blocks jumping, or a cost spike. When any of those trip, we route 10% to the prior version and compare outcomes. If the old version behaves, we roll back traffic while we investigate. We also run scheduled offline evals nightly on a stable dataset to catch silent regressions. Use shadow traffic to continuously compare a candidate model against production without user impact.
Write runbooks that acknowledge probabilistic failure. “Reproduce locally” becomes “replay the same request with identical context and seed if supported.” “Root cause” is often “model update upstream” or “retrieval index drift”—call those out, link to the artifact diff, and file a vendor ticket if needed. We close the loop with a post-incident change: a new alert, a tighter policy, or a better eval set. Nothing fancy—just steady improvements. And yes, add a QA step to check for pineapple enthusiasm before we ship the next prompt.