Ship Reliable AI: 7 Painfully Practical DevOps Moves

Make AI deployments boring, safe, and fast without burning weekends.

1. Decide What “Good” Looks Like Before You Ship

Before we wire up GPUs and sprinkle transformers everywhere, let’s decide what “good” actually means. “Works on my laptop” won’t save us when a model turns a simple help prompt into a creative novella or blows past our latency budget. We start by writing clear, measurable SLOs: p95 latency under 400 ms for autocomplete, under 1.2 s for grounded Q&A; a per-request cost ceiling; and a safety target like “zero critical policy violations, fewer than 0.5% soft violations.” Tie those to dashboards first, not after launch. We also specify where AI is allowed to fail gracefully. If the model is down, autocomplete becomes a static suggestion list; if retrieval is stale, we temporarily suppress long-form responses and nudge users to search. Bias and safety aren’t hand-wavy either: we document use cases, non-goals, and failure modes in a short model card and we pin a minimum evaluation score the change must meet before production. If our use case has regulatory or reputational risk, we map it to an existing framework and write simple controls: data retention windows, acceptable training sources, allowed providers. For those looking for a vendor-neutral reference to anchor risk controls, the NIST AI Risk Management Framework is a solid starting point. We keep it pragmatic: one page of SLOs, one page of guardrails, one page of “what we’ll do when it breaks.” It sounds small because it should be. Clarity early pays down weeks of chaos later.
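That one page of SLOs works best when it is machine-readable and lives in the repo next to the dashboards it drives. A minimal sketch, assuming a homegrown schema (field names, numbers, and fallback labels below are illustrative, not a standard):

service: assist-api
slos:
  - name: autocomplete-latency
    objective: "p95 < 400ms"
    window: 28d
  - name: grounded-qa-latency
    objective: "p95 < 1.2s"
    window: 28d
  - name: safety
    objective: "0 critical violations, < 0.5% soft violations"
    window: 28d
  - name: cost
    objective: "p95 cost per request under the agreed ceiling"
    window: 7d
fallbacks:
  model_down: static-suggestion-list
  stale_retrieval: "suppress long-form responses, nudge to search"
promotion_gate:
  min_eval_score: 0.85      # example threshold; pin the real number in the model card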

2. Version Everything That Teaches Or Nudges The Model

We’ve learned the hard way that the fastest way to lose a week is to forget what changed. In AI land, “what changed” is anything that teaches or nudges the model: training data slices, prompt templates, system instructions, retrieval schemas, embeddings pipelines, tokenizer versions, and the model binary itself. We treat each as code. Prompts live next to code with unit tests. We commit small evaluation sets in-repo for quick signals, and keep larger benchmarks in object storage with content hashes and a manifest. Containers pin CUDA, cuDNN, tokenizer, and model runtime versions. Dependency files pin minor versions; we upgrade on purpose, not by accident. When a pull request tweaks a system prompt, we demand evals. We tag the results with the commit SHA and the model reference, then store both in a minimal registry (SQL is fine). For training or fine-tuning, we record the full lineage: data manifest, hyperparameters, seed, container image digest, and the evaluation results that justified promotion. We don’t push “latest”; we release immutable artifacts. For inference, we embed model metadata in responses—model ID, prompt template version, retrieval index snapshot—so that we can trace a user ticket back to the exact inputs. Finally, we keep the rollback path obvious: at any moment, we can point the traffic router back to a prior model tag, the old prompt template, or the previous index snapshot. If we can’t roll it back in five minutes, we haven’t versioned enough.
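A minimal sketch of one registry entry, assuming a homegrown schema (every field name and value here is illustrative; placeholders mark what the build pipeline fills in):

# release record stored in the registry alongside the immutable artifact
model_id: support-assistant
release_tag: 2024-07-rc3               # immutable; never "latest"
commit_sha: <SHA of the change that produced this release>
container_digest: sha256:<image digest pinned at build time>
data_manifest: <object-storage URI of the content-hashed manifest>
hyperparameters:
  seed: 42
  learning_rate: 2e-5
  epochs: 3
prompt_template: prompts/answer_v7.txt
retrieval_index_snapshot: <snapshot ID served at release time>
evaluation:
  suite: eval/benchmarks
  score: 0.87
  promotion_threshold: 0.85
rollback_to: 2024-06-rc9               # previous known-good release tag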

3. Put AI In CI: Run Cheap Tests Early

Shiny demos hide flaky edges. We force those edges to show up in CI, where they’re cheap. Our pipeline runs fast unit tests, a tiny evaluation suite, and a couple of safety checks against handcrafted adversarial prompts. The goal isn’t to solve safety in CI; it’s to block footguns. We test the glue code around the model, we lint prompts for hard-to-diff formatting changes, and we run a 50-example eval that catches obvious regressions in latency, grounding, and accuracy. On security, we include basic checks for prompt injection and data exfiltration patterns inspired by the OWASP Top 10 for LLM Applications. CI should tell us if the change is boring enough to stage.

name: ai-ci
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    env:
      # Resolve the model under test from a repository variable so the
      # "Mini eval" step below can reference it as $MODEL_REF.
      MODEL_REF: ${{ vars.MODEL_REF }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt') }}
      - run: pip install -r requirements.txt
      - run: pytest -m "not slow" -q
      - name: Mini eval (50 samples)
        run: python tools/run_eval.py --limit 50 --model $MODEL_REF --seed 42
      - name: Prompt lint
        run: python tools/promptlint.py prompts/
      - name: OWASP LLM checks
        run: python tools/threat_checks.py --categories prompt_injection,data_exfiltration

We keep the “slow” stuff in nightly runs: larger evals, full red-team suites, and fine-tuned model candidates. The pull request path should stay under ten minutes and still catch silly regressions.
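A scheduled companion workflow keeps the slow path honest without dragging down pull requests. A sketch, reusing the same helper scripts as above (how they behave at larger limits is an assumption about those tools):

name: ai-nightly
on:
  schedule:
    - cron: '0 3 * * *'        # run the slow suite at 03:00 UTC
  workflow_dispatch: {}        # allow a manual run before a big release
jobs:
  slow-suite:
    runs-on: ubuntu-latest
    timeout-minutes: 180
    env:
      MODEL_REF: ${{ vars.MODEL_REF }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pytest -m "slow" -q
      - name: Full eval
        run: python tools/run_eval.py --limit 5000 --model $MODEL_REF --seed 42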

4. Guardrails As Code: K8s Policies For GPUs And Egress

Cluster guardrails beat stern Slack messages. We make the cluster say “no” by default and “yes” only to the things we intend. For AI pods, that starts with resource quotas and limits. GPU nodes are expensive; “just one more experiment” can melt the budget by lunch. We set namespace-level quotas for GPU and memory, and we stop requests that try to sneak past. For egress, we deny everything and allow only the API endpoints our apps need. When someone tries to point a staging pod at a random external endpoint “just to test it,” the policy does the talking.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-quota
  namespace: ai-prod
spec:
  hard:
    # GPUs are extended resources: quota supports only the requests. prefix, not limits.
    requests.nvidia.com/gpu: "8"
    requests.cpu: "40"
    requests.memory: 128Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internal-apis-only
  namespace: ai-prod
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          network-segment: internal-apis
    ports:
    - protocol: TCP
      port: 443
  # A default-deny egress policy also blocks DNS; allow lookups to kube-dns
  # or the internal API hostnames above will never resolve.
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Quotas like the above are explained well in the Kubernetes docs on ResourceQuotas. For higher-level policy, we like codifying “you must label datasets,” “no privileged pods,” and “no internet egress” as reusable constraints with OPA Gatekeeper. We treat policy like application code: reviewed, tested, staged. If someone needs an exception, it goes through a pull request with an expiry date, not a quiet “kubectl apply” at 11 p.m.
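For the "you must label datasets" rule, a minimal Gatekeeper sketch looks like the pair below; the label key, matched kinds, and template name are our own conventions, not anything Gatekeeper prescribes:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequireddatalabel
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredDataLabel
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequireddatalabel

        # Flag any matched resource missing one of the required label keys.
        violation[{"msg": msg}] {
          required := input.parameters.labels[_]
          not input.review.object.metadata.labels[required]
          msg := sprintf("missing required label: %v", [required])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredDataLabel
metadata:
  name: datasets-must-declare-classification
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["PersistentVolumeClaim", "ConfigMap"]
    namespaces: ["ai-prod"]
  parameters:
    labels: ["data-classification"]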

5. Observe The Right Things, Not Everything

AI systems produce a staggering amount of telemetry. We’ll go broke if we collect it all, and we’ll still miss the signals that matter. So we select a lean set of golden metrics aligned to our SLOs and risks: p50/p95 latency per endpoint and per model tag; request failure rate per reason (provider error, timeout, guardrail block); cost per 1k tokens and per request; cache hit rate; safety violation counts; and “grounding score” or answerability rate. We tag every span and log with model ID, prompt template version, retrieval index snapshot, and user segment (anonymized). When we sample, we don’t sample away the weirdness: keep full traces for the top N slowest requests, first-time flows, and any request that triggered a guardrail. For drift, we monitor both input drift (distribution of key features changing) and output drift (answer length, refusal rate, toxicity score) and we alarm on deltas, not absolute numbers. A modest amount of anomaly detection helps, but explicit rules carry us far: “alert if p95 latency doubles for 10 minutes” is surprisingly effective. We prefer transparent alert definitions and link them to runbooks. If you need a refresher on expressive alert rules, the Prometheus alerting rules docs are crisp and worth the read. The punchline: fewer metrics, more tags, louder alerts, smaller dashboards.
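The "p95 doubles for 10 minutes" rule translates almost directly into a Prometheus alerting rule. A sketch; the metric name, job label, and model_tag label are assumptions about your instrumentation:

groups:
  - name: ai-serving
    rules:
      - alert: P95LatencyDoubled
        # Compare current p95 against the same window yesterday: alarm on the delta, not an absolute.
        expr: |
          histogram_quantile(0.95, sum by (le, model_tag) (rate(request_duration_seconds_bucket{job="ai-gateway"}[5m])))
            > 2 *
          histogram_quantile(0.95, sum by (le, model_tag) (rate(request_duration_seconds_bucket{job="ai-gateway"}[5m] offset 1d)))
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency for {{ $labels.model_tag }} is more than 2x yesterday's baseline"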

6. Make Privacy The Default Path, Not A Heroic Effort

Nothing ruins a demo like realizing we reflexively sent customer data to a third-party endpoint we never reviewed. So we pave the safe path. Outbound calls to external AI providers are behind a controlled proxy with allowlists and DLP-style checks; environment variables and secrets are scrubbed from prompts by default; and logs get masked for obvious PII before they ever land in storage. We don’t let developers reinvent redaction logic in a sidecar; there’s a library and a proxy they can import or declare. For training and evaluation, we tag datasets with data classifications and TTLs, and we enforce dataset access with the same rigor we use for production credentials. If data is too sensitive to leave the VPC, we don’t negotiate—we bring the model to the data, even if it means taking a slight latency hit with an on-prem or VPC-hosted serving stack. We keep prompt, response, and context captures short-lived: enough to debug and evaluate, not enough to become an archive of user secrets. For user-facing products, we give customers control: opt-outs, data deletion, and clear disclosures. Most importantly, we make the safe way the easy way: client libraries that default to “no remote provider,” kube templates that deny egress, and a well-documented process to request exceptions. Every manual exception costs us time; paved paths earn it back.
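What the paved path means in practice is mostly configuration the platform team owns. A sketch of an egress-proxy policy with an invented schema (no particular proxy product is implied; hosts, rule types, and retention values are placeholders):

# policy consumed by the outbound AI proxy; anything not listed is denied
deny_by_default: true
allow:
  - host: api.provider.example          # placeholder for a reviewed provider endpoint
    data_classification_max: internal   # restricted data never leaves the VPC
redaction:
  apply_to: [prompts, responses, logs]
  rules:
    - type: email
    - type: credit_card
    - type: secret_pattern
      pattern: "(?i)(api[_-]?key|bearer )"
capture:
  retention: 72h          # enough to debug and evaluate, not an archive of user secrets
  store: vpc-only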

7. Treat AI Incidents Like Production Incidents, With New Playbooks

When AI goes sideways, it’s still an outage. We page on-call, we gather facts, and we mitigate within 30 minutes. The twist is that “rollback” can mean swapping a model, reverting a prompt, changing a retrieval index, or reverting to a non-AI fallback. Our playbooks list each lever in order of speed and blast radius: first drop temperature or switch to a stricter safety mode; then route a portion of traffic back to the previous model tag; if retrieval is dirty, pin the last known-good index snapshot; if the model is hallucinating under load, temporarily enable a “just the facts” template that answers with citations or defers politely. We keep feature flags for model routing and prompt templates so we can flip them without redeploys. For confusing failures, we shadow traffic to a candidate model while still serving users from the stable one. We collect enough session context to reproduce the issue: the user path, prompt template version, retrieval hits, and model tag. After the fire is out, we do the boring things that prevent the next one: add an eval that would have caught it, tighten a guardrail, or extend the canary window. Yes, we also practice the incident: dry runs with a “prompt injection causes data exfil attempt” or “provider latency spikes 10x” help the team reach for the right knob under pressure. Predictable beats clever every time.
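The levers only help if they are one flag flip away, with no redeploy in between. A sketch of a routing flag file the gateway could evaluate per request; the schema is invented for illustration:

# model routing flags; flipping these is the rollback, no redeploy required
routes:
  grounded_qa:
    stable:
      model_tag: support-assistant:2024-06-rc9
      prompt_template: prompts/answer_v6.txt
      weight: 90
    candidate:
      model_tag: support-assistant:2024-07-rc3
      prompt_template: prompts/answer_v7.txt
      weight: 10
      shadow: false              # true mirrors traffic to the candidate without serving it
incident_overrides:
  strict_safety_mode: false      # flips to the "just the facts" template with citations
  pin_index_snapshot: null       # set to a known-good snapshot ID when retrieval is dirty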
