Ship Faster With ai: 7 Sane DevOps Patterns
Italic sub-headline: Practical playbook to add ai without breaking prod.
Define Measurable Wins Before We Touch a Model
Let’s start with the least glamorous thing in tech: success criteria. If we can’t name a win, we’re just stapling shiny tools to our delivery process and hoping it looks like progress. So we pick two or three boring, measurable targets before we even whisper about models. Examples we’ve used: cut flaky test failures by 40%, trim mean time to resolution by 30% with smarter incident context, or lop 20 minutes off average code review time for internal services. DORA metrics and SLOs still matter here; they’re how we’ll prove the ai work helped, not just looked cool.
From there, we establish baselines. How long do reviews take today? How many deployment rollbacks happen weekly? What’s our current cost per production inference (yes, measure even the pilot)? We write down a simple hypothesis — “If we add AI-assisted test generation to our CI, flaky test failures drop 40% in two sprints” — and we commit to an A/B or phased rollout. No heroics, no “trust us, it’s better,” just real measurements in a feature flag.
We also agree on a “no-go” cutline. If cost per inference exceeds X cents or the hallucination rate (failed acceptance tests) exceeds Y%, we pause. That last piece keeps us honest. It’s too easy to excuse bad outcomes early on. Decide which UX or safety failures are disqualifying, and stop the train when they happen. It’s less romantic than bold vision statements, but it’s how we keep prod upright while we experiment.
Wire ai Into CI/CD, Not as a Sidecar
A common failure pattern is spinning up ai as an isolated demo. It’s more effective to wire lightweight, testable ai steps into the delivery path we already trust. Code review suggestions become comments on pull requests. Test cases are generated and added to coverage reports. Threat-model prompts post into PR checks for high-risk services. Put it where devs live, not in a separate tab nobody opens after week two.
That means the CI pipeline needs to call an internal inference service (or a managed endpoint), capture outputs, and enforce gates. If the AI step flags a high-risk change, the job fails, and we keep the transcript for audit. Keep the steps deterministic where possible: fix temperature, pin model versions, and limit context size so runs are reproducible enough for debugging.
For shops on GitHub, the workflow glue is straightforward. The following example posts diffs to an internal review endpoint and fails the job on high-risk findings. See the GitHub Actions documentation for runner details and secrets handling.
name: ai-pr-review
on:
pull_request:
types: [opened, synchronize, reopened]
jobs:
review:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- uses: actions/checkout@v4
- name: Static Checks
run: make lint test
- name: Collect Diff
run: git diff --unified=0 origin/${{ github.base_ref }}... > diff.patch
- name: AI Review
env:
LLM_URL: ${{ secrets.LLM_URL }}
run: |
curl -s -X POST "$LLM_URL/review" \
-H "Content-Type: text/plain" \
--data-binary @diff.patch > review.json
- name: Fail on High-Risk
run: |
if jq -e '.risk == "high"' review.json >/dev/null; then
echo "::error::AI review flagged high risk"; exit 1
fi
We start small, measure, and expand: docs suggestions next, threat hints after that. The point is to make ai a helpful teammate inside our existing guardrails.
Treat Prompts Like Inputs and Threats Like Bugs
Prompts are inputs from untrusted sources. That sentence carries a lot of security gravity. We treat prompt injection the same way we treat SQL injection: validate and isolate. Our playbook looks like this. We redact secrets before sending anything off-box. We pin models and versions. We constrain context: strict allowlists for which files or tables can be referenced, and we avoid letting the model call arbitrary tools unless we can sandbox them. We also gate egress; inference services sit behind service meshes with domain allowlists and timeouts, and we log where requests go.
The bigger risks are subtle. Retrieved documents may contain “ignore the previous instructions” payloads. We mitigate by chunking, templating, and masking system prompts. We filter outputs too: profanity, PII leakage, and policy-violating content flows through deterministic checks before a response goes to a user or a pipeline. For user-facing features, we add a second-pass validator model only as a supplement to deterministic checks, never as the sole gate.
We threat model this surface explicitly. Start with the OWASP Top 10 for LLM Applications and adopt what’s relevant. Then we write unit tests for prompts, just like any API. Given input X, we expect Y structure and Z refusal behavior. We fail builds when structure isn’t met. Finally, we rate limit aggressively. Models look like magic until they take your platform offline; stick them behind a bounded queue, circuit breakers, and retry budgets. It’s not exciting, but it keeps phones quiet at 3 a.m.
Observe Inference Like a Service, Not a Science Fair
Production inference is a service with latency, cost, and correctness SLOs. We measure it like one. Baseline p50/p95 latency per route, token usage, cache hit rates, refusal rates, cost per successful response, and failure types split by model version. We tag everything with model, prompt template version, data snapshot version, and feature flags so we can correlate changes with outcomes. When users say “it got slower,” we can actually verify and fix.
We want traces that cross the boundary: inbound HTTP request, retrieval calls, model invocation, and outbound dependencies stitched together. OpenTelemetry makes this way less painful. We propagate trace context into the inference layer and expose spans for retrieval and LLM calls. The OpenTelemetry docs cover the plumbing; here’s a tiny example for annotating a model call:
from opentelemetry import trace
from time import perf_counter
tracer = trace.get_tracer(__name__)
def call_model(client, prompt, model="gpt-4o-mini"):
with tracer.start_as_current_span("llm.call") as span:
span.set_attribute("llm.model", model)
span.set_attribute("llm.prompt_template_version", "v7")
t0 = perf_counter()
resp = client.generate(model=model, input=prompt, temperature=0.2)
dt = perf_counter() - t0
span.set_attribute("llm.latency_ms", int(dt * 1000))
span.set_attribute("llm.tokens_prompt", resp.usage.prompt_tokens)
span.set_attribute("llm.tokens_output", resp.usage.output_tokens)
span.set_attribute("llm.cache_hit", bool(resp.headers.get("x-cache-hit")))
return resp
We add structured sampling to keep costs sane, and we export business-level metrics: “docs drafted,” “tickets triaged,” “false suggestions.” Accuracy gets test sets: nightly regression jobs run prompts against curated examples and compare to expected outputs. If accuracy dips or cost spikes, we’ve got alarms and a rollback plan just like any other service.
Data Plumbing Over Model Tinkering
We love a clever model tweak, but nine times out of ten, data plumbing beats model wrangling. If you’re doing retrieval-augmented generation, the freshness of your index, how you chunk text, and whether you dedupe noisy content matters more than fiddling temperature from 0.2 to 0.25. We’ve seen bigger gains from simply separating reference docs from tickets and tagging each chunk with a source, date, and permission set than from swapping models.
So we treat data like code. Schemas are versioned. RAG corpora have update cadences and publish steps. We test the retrieval layer with golden queries: “Given this prompt, we expect to retrieve these three docs,” and we fail the build if recall falls off. We also put guardrails on the retrieval boundary: if a user can’t see a document, the model’s context builder can’t either. That means indices are permission-aware and the application enforces filters at query time.
Caching saves real money. Response-level caches with short TTLs help for repetitive requests. Embedding caches are a must for anything that’s reprocessed. We favor simple cache invalidation schemes tied to data snapshots over fancy heuristics we can’t explain at 2 a.m. We also schedule index compaction and quality checks like we schedule backups — boring and lifesaving.
Finally, we keep a human-in-the-loop where it matters. Labeling small evaluation sets, curating top documents, and rejecting bad suggestions are cheap steps that produce outsized improvements. The trick is making those moments deliberate and quick, not a slog.
Scale Costs With Traffic: GPUs, Batch, and Budgets
ai workloads don’t have to blow up our cloud bill if we scale with intent. We split two classes: online inference needs low-latency, while batch jobs (embeddings, fine-tuning, offline scoring) can flex. Autoscaling queues and nodes for batch keep GPUs busy when there’s work and cheap when there isn’t. For online paths, we right-size: small models for simple tasks, quantized variants when latency is king, and carefully plan GPU capacity for the few endpoints that need it.
On Kubernetes, we set explicit requests, limits, and node selectors to land GPU jobs where they belong. We wire Horizontal Pod Autoscalers to p95 latency or tokens-per-second where possible, or CPU as a proxy. We also budget: dashboards show cost per 1,000 requests by endpoint and model. When a feature gets popular, we know the price tag. The official guide on scheduling GPUs in Kubernetes is worth a read before any cluster experiments.
Here’s a minimal Deployment that requests a single GPU and isolates onto a GPU node pool:
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference
spec:
replicas: 2
selector:
matchLabels:
app: llm-inference
template:
metadata:
labels:
app: llm-inference
spec:
nodeSelector:
accelerator: nvidia
containers:
- name: server
image: registry.example.com/llm-inference:1.8.3
resources:
limits:
nvidia.com/gpu: "1"
cpu: "2"
memory: "8Gi"
requests:
nvidia.com/gpu: "1"
cpu: "1"
memory: "4Gi"
env:
- name: MODEL_ID
value: "small-fast-v3"
ports:
- containerPort: 8080
Set a max autoscale and a circuit breaker. Bound the blast radius so a “fun” demo can’t starve payroll.
Make Guardrails Enforceable: Policies, Tests, and Audit Trails
We don’t want policies that live only in slide decks. We want guardrails we can test. First, we pin model and tool versions in config and block unapproved changes at deploy time. Second, we store prompts, templates, and retrieval rules in git alongside code, with reviews and automated checks. Third, we write policies that keep inference traffic inside lanes: allowed destinations, allowed models, and minimum logging fields.
Admission controllers make this real. In Kubernetes, policies can stop a bad deployment that tries to point to “whatever’s newest” or egress to random domains. Kyverno makes policy-as-configuration approachable; it’s expressive and ships with good samples. The Kyverno documentation covers installation and testing. Here’s a simple rule that only allows images from our internal registry for anything labeled inference:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: enforce-internal-images
spec:
validationFailureAction: enforce
rules:
- name: allowed-registry-for-inference
match:
resources:
kinds: ["Deployment", "StatefulSet"]
selector:
matchLabels:
app.kubernetes.io/component: inference
validate:
message: "Inference images must come from registry.example.com"
pattern:
spec:
template:
spec:
containers:
- image: "registry.example.com/*"
We pair policies with tests. Conftest in CI catches violations before they reach the cluster. We also keep an audit trail: prompts, retrieved IDs, model versions, and decisions stored with TTLs and redaction. Finally, we run a red-team suite against prompts monthly with seeded attack patterns and publish a short report. It feels formal, but it’s the only way to avoid policy drift and cargo-cult safety.
What We’ll Be Glad We Did a Year From Now
Twelve months from now, nobody will remember the “wow” demo. We’ll remember whether incidents got shorter, if docs got better faster, and if the bill matched the promise. The teams we’ve seen succeed kept their ai footprint sane: they picked obvious, measurable wins; they wired small, reliable steps into CI/CD instead of dropping a chatbot into the foyer; they treated prompts like inputs and wrote tests; they instrumented inference like any other production service; they invested in data plumbing, not endless model fiddling; they right-sized infra and budgets; and they made guardrails enforceable, not inspirational.
If you’re starting today, pick one use case that already has a feedback loop — code review helpers, test generation, incident context — and land a small win. Measure, share, and then move the boundary a little. You’ll learn what fits your stack, your people, and your uptime goals. Along the way, you’ll write less glue you regret, avoid midnight pages, and build a reputation for shipping useful ai, not just talking about it.
And when your CFO asks, “Is this worth it?”, you’ll have graphs that say “yes,” a policy repo that explains “how,” and a pipeline that makes it repeatable. That’s the bar. Not magic, just good engineering with a dash of modern tooling and a healthy respect for prod.