Ship AI 37% Faster Without Melting GPUs

Practical playbook for stable, affordable, and safe AI in production.

What Makes AI Different in Production

If we ship a web app and it’s slow, we add caches, scale horizontally, and go for coffee. With AI, the coffee goes cold. The workloads are spiky, the latency is erratic, and the cost curve looks like a staircase to nowhere. Models introduce new failure modes: prompt sensitivity, context-window cliff edges, token explosions, and non-determinism that makes “works on my laptop” sound like a punchline. Instead of a single SLA, we’re juggling three targets: latency SLOs for user experience, accuracy or utility SLOs for business value, and cost guardrails for not getting a “please explain” from finance. On top of that, drift isn’t just about traffic changes; it’s also data drift, model drift, and prompt drift. The smallest tweak in safety settings or a subtle change in retrieved documents can flip results.

We also inherit a new observability layer: prompts, responses, token counts, model versions, safety blocks, and provider-side rate limiting. Our old dashboards don’t speak “token” yet. And while we love autoscaling, GPU nodes don’t pop out of thin air. Provisioning ends up being a chess game involving quotas, backoff, and placement constraints. Rolling back isn’t trivial either: a previous model may have different tokenizer rules or safety rails, changing system behavior under pressure. Finally, incident response gets… lively. A spike in “content flagged” errors may be a settings change, not a code bug. Cue the runbook that starts with “Is the model behaving differently today?” Welcome to production AI, where solid engineering beats clever heroics every time.

Sane Guardrails: Budgets, SLOs, and Kill-Switches

We don’t manage what we don’t measure. For AI, that means setting SLOs for latency and utility, plus hard budgets on tokens and GPUs. We like to publish three top-line numbers weekly: p95 latency, pass@k or rubric-based score on a curated eval set, and cost per 1k requests. If any two wobble, we pause rollouts. For operational consistency, we define explicit error budgets for “degraded quality” events (e.g., hallucination rate above threshold) alongside traditional latency and availability. When the budget burns down, releases slow, experiments go behind flags, and the team focuses on reliability work. And yes, we keep a kill-switch: a single flag to switch traffic to the last known good model or disable optional AI features if spend or SLOs go off the rails.

A small slice of YAML carries a lot of weight here; we formalize expectations and controls in code:

# slo-rules.yaml
slo:
  latency_p95_ms: 1200
  cost_us_per_request: 4.0
  quality_pass_rate: 0.92
budgets:
  monthly_token_usd: 25000
  gpu_hours: 1800
feature_flags:
  ai_responder_enabled: true
  model_candidate: "gpt-xyz-2025-06-01"

We wire these to alerts and automated actions. If cost_us_per_request breaches its budget for more than N minutes, the deployment automation flips model_candidate back and reduces concurrency by 25%. The kill-switch toggles ai_responder_enabled to false, replacing responses with cached or heuristic fallbacks. Flags let us run controlled experiments without turning the whole product into a lab. Guardrails cut drama, not speed.
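
For illustration, the detection half can live in an ordinary Prometheus alerting rule. The metric name and thresholds below are placeholders for whatever our gateway actually exports; the deploy automation subscribes to the resulting alert and does the flag-flipping.

# cost-guardrail-rules.yaml (metric name is illustrative)
groups:
- name: ai-cost-guardrails
  rules:
  - alert: CostPerRequestBreached
    # assumes the gateway exports per-request cost as a gauge, in USD
    expr: avg_over_time(ai_cost_us_per_request[5m]) > 4.0
    for: 10m
    labels:
      severity: page
      action: rollback-model-candidate
    annotations:
      summary: "Cost per request above budget for 10 minutes"
      runbook: "Flip model_candidate to last known good; cut concurrency by 25%"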

Observability That Reads Prompts, Not Tea Leaves

We’ve learned that two dashboards matter most: user experience and token economics. For user experience, we trace the full path from prompt to response with model ID, temperature, system message hash, and any retrieval corpus versions. For token economics, we trend input and output tokens, cache hit rate, and reject reasons (provider throttling, safety filters, bad routing). Structured events beat free-text logs. We tag every request with a privacy-safe prompt hash and a redaction policy so we can correlate issues without hoarding sensitive data. For traces, we’ve had success extending our OpenTelemetry spans with custom attributes like llm.model, llm.provider, llm.input_tokens, and llm.safety_blocked. The OpenTelemetry Semantic Conventions offer a sensible starting point; we just add AI-specific fields we actually use.
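
To make that concrete, here is roughly what one request looks like as a structured event. The llm.* attributes are the ones named above; every other field name and all of the values are made up for illustration.

# one request as a structured event (values are illustrative)
llm.provider: "hosted-llm"
llm.model: "gpt-xyz-2025-06-01"
llm.input_tokens: 1342
llm.output_tokens: 208
llm.safety_blocked: false
app.prompt_hash: "sha256:9f2c41d0"      # privacy-safe hash; raw prompt never stored
app.retrieval_corpus_version: "news-2025-05-28"
app.cache_hit: false
app.latency_ms: 940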

Sampling is tempting, but it bites during incidents. We keep 100% tracing for errors and safety blocks, with dynamic sampling for normal traffic. Model-specific dashboards are crucial. p95 latency on one model may hide a bad tail on another due to batching or provider queuing. Also, we graph quality signals from offline evals next to production rates. If the offline rubric score drifts while production looks fine, we check data freshness in our retrieval layer. Finally, we put “red flags” near the top: spikes in prompt length, drops in cache hit rate, and sudden changes in average output tokens often predict spend explosions and timeouts. Think less crystal ball, more hard counters wired to pagers.
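
One way to get that split without custom code is the OpenTelemetry Collector’s tail_sampling processor. This sketch assumes we export llm.safety_blocked as a span attribute and that keeping 10% of healthy traffic is enough:

# collector excerpt: keep every error and safety block, sample the rest
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: keep-safety-blocks
      type: string_attribute
      string_attribute:
        key: llm.safety_blocked
        values: ["true"]
    - name: sample-healthy-traffic
      type: probabilistic
      probabilistic:
        sampling_percentage: 10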

Shipping With Confidence: Canaries and Offline Eval

Rolling a new model is not a leap of faith; it’s a well-lit crosswalk with a stop sign. We start with offline evals: a curated set of prompts, expected outcomes or grading rubrics, and adversarial cases (prompt injection, tricky context, multilingual edge cases). We gate candidates on a few metrics: rubric pass rate, side-by-side preference, toxicity/safety rates, and cost per success. Once a candidate clears the bar, it enters a canary phase: 1-5% of production traffic, guarded by budgets and fast rollback. We compare win/loss on matched prompts, not just aggregate averages, because averages lie. A model that’s great on short prompts may crater on long-context retrieval tasks.
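
In the same spirit as slo-rules.yaml above, we keep the gate itself in config so every candidate is measured against the same bar. The schema below is our own and purely illustrative; the thresholds echo the SLOs earlier in this piece.

# eval-gates.yaml (our own schema, shown for illustration)
candidate: "gpt-xyz-2025-06-01"
eval_set: "curated-v14"             # includes adversarial and multilingual cases
gates:
  rubric_pass_rate_min: 0.92
  side_by_side_win_rate_min: 0.55   # versus the current production model
  safety_violation_rate_max: 0.001
  cost_per_success_usd_max: 0.04
on_pass: "promote-to-canary"
on_fail: "archive-with-report"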

Argo Rollouts makes this predictable with good guardrails. We’ve used a config like this:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-responder
spec:
  selector:
    matchLabels:
      app: ai-responder
  template:
    metadata:
      labels:
        app: ai-responder
    spec:
      containers:
      - name: ai-responder
        image: ourco/ai-responder:2.4.0   # pod template trimmed to the essentials
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause:
            duration: 10m
        - setWeight: 25
        - pause:
            duration: 30m
      analysis:
        templates:
        - templateName: quality-check
        startingStep: 1
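
The quality-check template it references is a standard AnalysisTemplate. Ours uses Prometheus as the provider; the metric names and the Prometheus address below are stand-ins for whatever our gateway and monitoring stack actually expose.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: quality-check
spec:
  metrics:
  - name: good-answer-rate
    interval: 5m
    failureLimit: 1
    successCondition: result[0] >= 0.92
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        # rubric-scored "good answer" rate; metric names are illustrative
        query: |
          sum(rate(ai_good_answers_total{service="ai-responder"}[5m]))
          / sum(rate(ai_answers_total{service="ai-responder"}[5m]))
  - name: latency-p95-ms
    interval: 5m
    failureLimit: 1
    successCondition: result[0] < 1200
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          1000 * histogram_quantile(0.95,
            sum(rate(ai_request_duration_seconds_bucket{service="ai-responder"}[5m])) by (le))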

We wire the analysis template to live metrics: p95 latency, cost per request, and a “good answer” rate computed from lightweight rubrics. If any of them fail, Argo rolls back automatically. The Argo Rollouts documentation on analysis is a handy reference when tuning thresholds. The result: releases become boring, which is exactly what we want. Our team sleeps, users stay happy, and finance doesn’t ping us at 11 PM.

Cost Control Without Killing Creativity

Let’s talk money without killing momentum. Tokens are compute wearing a fashionable hat. We cap request size (system + prompt + retrieved context) and enforce max response tokens per endpoint. Cache aggressively: exact-match caches for deterministic tools, semantic caches for high-traffic queries, and pre-computed answers where freshness allows. Caching is the closest thing to “free speed” in AI. We also bias routing toward smaller, cheaper models for easy tasks, escalating only when confidence is low. That one change alone has saved us double-digit percentages without harming quality. When we do need bigger models, we batch where latency allows and prefer streaming partial outputs to keep users engaged.
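
We keep those knobs in config too. The schema below is our own and purely illustrative, but it captures the shape of the decision: small model first, escalate on low confidence, cache whatever we safely can.

# model-routing.yaml (our own schema, shown for illustration)
limits:
  max_input_tokens: 4096          # system + prompt + retrieved context
  max_output_tokens: 512
routing:
  default_model: "small-fast"
  escalate_to: "gpt-xyz-2025-06-01"
  escalate_when:
    confidence_below: 0.7
    task_tags: ["long-context", "multi-step"]
cache:
  exact_match_ttl: "24h"          # deterministic tool calls
  semantic_ttl: "1h"              # high-traffic queries
  semantic_similarity_min: 0.95
streaming: true                   # partial output beats a long spinner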

Budget governance doesn’t need to be stodgy. We publish per-team token budgets, daily spend forecasts, and a top offenders list (friendly shaming works). We reset limits before big launches and set temporary concurrency caps when backlogs grow. Tradeoffs are explicit: “We can keep long-context queries at 1,024 tokens this week or add a 300ms tail.” We also keep humans in the loop: editors can mark queries as cacheable or set higher quality thresholds for premium users. It aligns engineering and product without endless meetings. For architecture sanity, we revisit cost patterns quarterly using guidance like AWS Well-Architected to check we’re not inventing expensive, fragile snowflakes. Cost won’t fix itself; we make it visible, discuss it openly, and keep experiments behind flags until they earn their keep.

GPUs Are Not Unicorns: Scheduling and Quotas

GPUs are finite and moody. We treat them like a shared utility, not a free-for-all. Start with placement: taint GPU nodes and use node selectors and tolerations so CPU jobs don’t squat on precious silicon. Request full GPUs only when needed, and use fractional options like MIG when supported. We set per-namespace quotas so a single experiment can’t evict production. Also, aim for bin-packing: a few large GPU nodes with workload consolidation often beat a scatter of tiny ones, as long as you watch for noisy neighbors. For schedulers, Kueue or queue-based admission helps throttle batch jobs behind latency-sensitive services. And we never assume a cloud provider has infinite supply. We maintain a warm pool sized for p95 demand and burst via spot or on-demand nodes with graceful degradation.
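
The per-namespace quota part is plain Kubernetes. A ResourceQuota like this keeps an enthusiastic experiment from out-requesting production; the namespace and the number are placeholders:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-experiments
spec:
  hard:
    # for extended resources such as GPUs, quota is set on requests
    requests.nvidia.com/gpu: "4"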

Here’s a simple deployment requesting a GPU with sanity baked in:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-reranker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vector-reranker
  template:
    metadata:
      labels:
        app: vector-reranker
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - key: "accelerator"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: reranker
        image: ourco/reranker:1.8.2
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
          requests:
            nvidia.com/gpu: 1
            memory: "6Gi"
            cpu: "1"

Don’t forget to enable and monitor the GPU device plugin; the Kubernetes device plugins documentation covers the details. We also log queueing delay as a first-class metric; it’s the canary for capacity pain. When queueing grows, we scale replicas, downgrade models, or switch to CPU-friendly fallbacks temporarily.
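
When queueing delay is exported as a metric, we can scale on it directly. This sketch assumes a per-pod custom metric named queue_delay_seconds is available through a metrics adapter; the thresholds are illustrative.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vector-reranker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vector-reranker
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_delay_seconds   # assumes a custom metrics adapter exposes this
      target:
        type: AverageValue
        averageValue: "2"           # scale out when average delay passes ~2s per pod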

Security, Privacy, and Change Control for AI

We don’t ship features we can’t defend. With AI, that means treating prompts and outputs as sensitive. We redact PII at the edge and enforce egress rules so prompts don’t wander into unexpected services. Prompt injection and data exfiltration aren’t parlor tricks; they’re live risks. The OWASP LLM Top 10 is a practical list to bake into our threat model. We gate outbound calls from model tools with allowlists and timeouts. For retrieval, we version corpora and keep provenance: which documents, which embeddings, which index version. That audit trail turns “huh?” tickets into fixable issues.
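
Egress is one of the few places Kubernetes gives us a hard guarantee. A policy along these lines pins the responder’s outbound traffic to DNS plus an approved provider range; the CIDR and labels are placeholders:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-responder-egress
spec:
  podSelector:
    matchLabels:
      app: ai-responder
  policyTypes:
  - Egress
  egress:
  # allow DNS anywhere in the cluster
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
  # allow HTTPS to the approved model provider range (placeholder CIDR)
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24
    ports:
    - protocol: TCP
      port: 443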

Change control needs an upgrade. A model isn’t just a binary; it’s weights, tokenizer, safety settings, and prompts. We pin versions, create signed manifests, and treat them like artifacts with SBOM-like metadata. Policy checks catch footguns: no model goes to prod without eval results and data-scope tags. We’ve had success with admission control to enforce constraints (e.g., model X can’t access dataset Y). Open Policy Agent helps when we want policies as code, and Gatekeeper’s constraint templates are a good starting point. Finally, we audit human-in-the-loop decisions the same way we audit deployments. If someone overrides a safety filter or approves a pricey model route, it’s logged, attributed, and expires on schedule. Healthy paranoia plus tight feedback loops keeps creative features from turning into cleanup duty.
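
As a sketch of the “no eval results, no prod” rule, a constraint built on the common k8srequiredlabels template from the Gatekeeper library can refuse model deployments that arrive without eval and data-scope labels. The label keys and namespace here are ours and purely illustrative.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: models-must-carry-eval-metadata
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
    namespaces: ["ai-prod"]
  parameters:
    labels:
    - key: "eval-report"       # ID of the passing offline eval run
    - key: "data-scope"        # which datasets this model may touch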
