Ship Faster, Fail Less: Stubbornly Practical AI For Ops
Concrete patterns, config, and guardrails to run AI without drama.

Why We Put AI In The Runbook

We don’t add shiny tools just to put them on a slide. We add them because pages wake us up at 2 a.m. and the pager doesn’t care how fun the tech was to build. AI earns a spot in the runbook when it removes toil, smooths peaks, and cuts the kind of tail latency that makes dashboards blush. The trick is to treat AI not as a magical oracle but as another dependency with nasty failure modes and clear operating limits. If we do that, we see practical wins: faster root-cause hints for noisy incidents, adaptive autoscaling that respects real traffic patterns, and code suggestions that actually pass tests without “tweaking” the entire codebase.

Let’s set expectations. AI is probabilistic; it will hallucinate, time out, and eat tokens like snacks at standup. Our job is to wrap it with the same discipline we use for databases and caches: SLOs, rate limits, observability, staged rollout, and quick rollback. We’ll start tiny, scope the problem tightly, and collect numbers that survive a skeptical postmortem. Think helpdesk ticket summarization instead of “AI everywhere.” Think topology-aware capacity hints instead of “reinvent the scheduler.” When we frame AI like this, the questions become operational, not existential: What SLI captures usefulness? How do we control spending per environment? Which fallback keeps the app useful when models flake out?

In this post, we’ll share patterns we’ve shipped without breaking prod (much). We’ll use flags, pipelines, traces, and plain old timeouts. We’ll also poke fun at ourselves, because nothing keeps us honest like remembering that we once tried to fix a 500 with a coffee refill.

Ship AI Behind Flags, Not Behind Smoke

We’ve learned the hard way that “just shipping it” is a great way to discover new incident categories. AI features deserve the same progressive controls we use for risky migrations: flags, cohorts, kill switches, and staged rollouts that can freeze on a noisy metric. Instead of wiring prompts directly into the UI, we wrap them in a feature flag with a sensible default and a runtime override. That lets us test in shadow mode, target internal users, and flip traffic without redeploying when the model gets moody.

Flags aren’t only on/off. They can carry parameters: model name, temperature, timeouts, or token caps. That means we tune behavior live, observe impact, and revert instantly when a spike tells us we guessed wrong. We like open standards here because vendor lock-in pairs poorly with production on fire. The OpenFeature spec and SDKs give us a portable way to manage flags across languages and providers. Keep control-plane access behind strong auth, log every change, and require a second set of eyes for risky toggles.

A tiny example flag bundle:

# flags/ai.yml
ai_rewrite:
  enabled: false
  rollout: 0.1         # 10% of users
  params:
    model: "gpt-4o"
    max_tokens: 512
    temperature: 0.2
    timeout_ms: 1200
ai_assist:
  enabled: true
  groups:
    - "internal"
    - "beta"

We initialize these at process start, fetch dynamic updates via a secure provider, and cache locally so a control-plane hiccup doesn’t brick the app. Ship behind flags, sleep better.
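
For the curious, a minimal sketch of that load-and-cache path (the provider’s fetch_all method, the cache path, and the refresh interval are assumptions, not any particular SDK’s API):

# flags/loader.py
# Sketch only: the provider interface and cache location are illustrative.
import json
import time

CACHE_PATH = "/var/cache/app/flags.json"   # hypothetical local cache file
REFRESH_SECONDS = 30

class FlagStore:
    def __init__(self, provider):
        self.provider = provider            # e.g. an OpenFeature-backed client
        self.flags = self._load_cache()
        self.last_refresh = 0.0

    def _load_cache(self):
        try:
            with open(CACHE_PATH) as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}                       # no cache yet: callers fall back to defaults

    def refresh(self):
        # Pull from the control plane; on failure keep serving last-known-good flags.
        if time.time() - self.last_refresh < REFRESH_SECONDS:
            return
        try:
            self.flags = self.provider.fetch_all()   # assumed provider method
            with open(CACHE_PATH, "w") as f:
                json.dump(self.flags, f)
            self.last_refresh = time.time()
        except Exception:
            pass                            # control-plane hiccup: don't brick the app

    def get(self, name, default=None):
        self.refresh()
        return self.flags.get(name, default)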

Data Hygiene Beats Model Hype Every Time

Our teams tend to jump straight to “which model?” when the better question is “what data do we trust?” If prompts are the steering wheel, data is the road; potholes there will blow a tire at highway speed. Before we wire a model into prod, we inventory where input text comes from, how it’s sanitized, and which fields should never leave the VPC. We strip secrets, redact PII, and bound prompt length. We test the sanitizers like we test auth—think fuzzing and regression, not vibes. We also define retention and access policies in plain language: who can see logs that contain prompts, for how long, and for what purpose.
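
A minimal sketch of that kind of sanitizer, assuming our own redaction patterns and an 8 KB prompt cap (tune both to your data contract):

# ai/sanitize.py
# Illustrative redaction rules and prompt cap; adapt to your own data contract.
import re

MAX_PROMPT_BYTES = 8 * 1024

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),              # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),                   # US SSN-shaped numbers
    (re.compile(r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*\S+"), r"\1=[REDACTED]"),  # obvious secrets
]

def sanitize_prompt(text: str) -> str:
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    # Bound prompt size so one pathological input can't blow the token budget.
    return text.encode("utf-8")[:MAX_PROMPT_BYTES].decode("utf-8", errors="ignore")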

Model outputs need the same scrutiny. We log output metadata, not just the text: latency, tokens in/out, model version, temperature, and whether a guardrail filtered the result. When bad data sneaks in, we want to trace the wrong answer back to the prompt and upstream source fast. “Trust but verify” looks like sampling outputs for human review, running quality checks, and using small labeled sets to catch drift. It’s not glamorous, but a week of tidy data will beat a month of prompt gymnastics.
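
To make that traceability concrete, here is a sketch of an output-metadata record with our own field names (prompt_sha256 and source_id are assumptions; the point is walking a bad answer back to its source without storing raw text):

# ai/audit.py
# Sketch of an output-metadata record; field names are our own convention.
import hashlib
import json
import logging
import time

log = logging.getLogger("ai.audit")

def log_completion(prompt, source_id, model, model_version,
                   tokens_in, tokens_out, latency_ms, filter_triggered):
    log.info(json.dumps({
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "source_id": source_id,                # upstream record the prompt was built from
        "model": model,
        "model_version": model_version,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "filter_triggered": filter_triggered,  # did a guardrail block or rewrite the output?
    }))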

We also plan for data isolation across environments. Staging should never leak into prod, and prod data mustn’t end up in a dev sandbox just because someone toggled a feature. Use separate buckets, separate keys, and, if your provider allows it, separate projects or accounts. Finally, communicate loudly: put the data contract in the repo, near the code, where reviews happen. The model can be clever; the pipeline must be boring and correct.

CI/CD Where Models Are First-Class

Let’s treat models and prompts like any other artifact: versioned, tested, promoted, and rolled back with a button. That means checking prompts into the repo, pinning model identifiers, and wiring CI to run evaluation suites that don’t flake. We keep unit tests small (prompt formatting, schema conformance) and evaluation sets pragmatic: a handful of representative examples that catch regressions in tone, accuracy, or schema adherence. When a model upgrade improves two cases and breaks one, we see it before users do.

A minimal GitHub Actions pipeline might look like this:

name: ci-ai
on:
  pull_request:
    paths: ["prompts/**", "models/**", "ai/**"]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"   # pin the interpreter so eval runs are reproducible
      - run: pip install -r ai/requirements.txt
      - run: python ai/tests/test_prompts.py
      - run: python ai/eval/run_eval.py --dataset ai/eval/cases.json

We publish evaluation reports as artifacts and fail the PR if the score dips below threshold. For model registries, we like boring and visible. Tools such as MLflow give us versioned models, metadata, and promotion flows without building a shrine to YAML. Keep the model and prompt versions alongside app releases so rollbacks are clean: code v42 pairs with prompt v7 and model tag 2025-10-12.
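
The “fail the PR” part can be a tiny gate script; here’s a hypothetical sketch that assumes run_eval.py writes report.json with a mean_score field (the 0.85 floor is a placeholder):

# ai/eval/check_threshold.py
# Hypothetical gate, not shown in the pipeline above.
import json
import sys

THRESHOLD = 0.85   # assumed floor; tune against your own eval set

with open("ai/eval/report.json") as f:
    report = json.load(f)

score = report["mean_score"]
print(f"eval mean_score={score:.3f} threshold={THRESHOLD}")
sys.exit(0 if score >= THRESHOLD else 1)   # non-zero exit fails the PR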

Finally, we treat provider dependencies as infra. Changes to endpoints, quotas, or SDK versions go through the same review gates as database migrations. If we can’t reproduce a run locally or in a sandbox, it’s not ready for prod. Pipelines don’t need to be fancy; they need to be dull, fast, and repeatable.

Observability For Tokens, Prompts, And Drift

If we can’t see it, we can’t operate it. AI calls deserve first-class telemetry: spans around provider calls, structured logs for prompts and outputs (sanitized, of course), and metrics for latency, token usage, and error codes. We stitch these into traces that cross service boundaries so we can answer “Is the user slow or is the model slow?” in under a minute. We use standard libraries where we can, because custom tracing is a one-way door into yak-shaving. The OpenTelemetry ecosystem gives us instrumentation across languages and exporters, and it plays nicely with existing collectors.
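
A sketch of wrapping a provider call in a span with the OpenTelemetry Python API (the span and attribute names are our own conventions; call_model stands in for whatever provider wrapper you already have):

# ai/tracing.py
# Sketch using the OpenTelemetry Python API; attribute names are our own.
import time

from opentelemetry import trace

tracer = trace.get_tracer("ai.client")

def traced_completion(call_model, prompt, cfg):
    with tracer.start_as_current_span("ai.completion") as span:
        span.set_attribute("ai.model", cfg["model"])
        span.set_attribute("ai.prompt_size_bytes", len(prompt.encode("utf-8")))
        start = time.monotonic()
        result = call_model(prompt, cfg)   # provider call
        span.set_attribute("ai.latency_ms", (time.monotonic() - start) * 1000.0)
        span.set_attribute("ai.tokens_out", result.get("tokens_out", 0))
        return result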

We emit domain-specific metrics: tokens_in, tokens_out, prompt_size_bytes, filter_triggered, completion_retries, model_version, cache_hit, and evaluation_score. We set SLOs on latency and success rate, then alarm on error budgets rather than flapping on every spike. For logs, we sample aggressively. Full prompt logs are expensive and sensitive; we only keep enough to debug and evaluate, and we encrypt at rest with tight access.
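
The metrics side is a handful of declarations; sketched here with prometheus_client (the names mirror the list above and are our convention, not a standard):

# ai/metrics.py
# Sketch using prometheus_client; metric names are our own convention.
from prometheus_client import Counter, Histogram

TOKENS_IN = Counter("ai_tokens_in_total", "Prompt tokens sent", ["feature", "model"])
TOKENS_OUT = Counter("ai_tokens_out_total", "Completion tokens received", ["feature", "model"])
FILTERS = Counter("ai_filter_triggered_total", "Outputs blocked or rewritten by guardrails", ["feature"])
LATENCY = Histogram("ai_request_seconds", "Provider call latency in seconds", ["feature", "model"])

def record(feature, model, tokens_in, tokens_out, seconds, filtered=False):
    TOKENS_IN.labels(feature, model).inc(tokens_in)
    TOKENS_OUT.labels(feature, model).inc(tokens_out)
    LATENCY.labels(feature, model).observe(seconds)
    if filtered:
        FILTERS.labels(feature).inc()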

Drift sneaks up on teams. We set up weekly canaries: a stable evaluation set run on the current model and prompt versions, with results compared to last week’s. If quality drops, we pause rollouts and investigate prompt changes, provider shifts, or data anomalies. In dashboards, we put token cost next to latency and error rate. When you see cost per request climb alongside a timeout, it’s easier to decide between caching, prompt slimming, or a provider change. Observability isn’t a checkbox; it’s how we stay honest.

Cost And Rate Limits: Make 429 Your Friend

Nothing humbles a team faster than a surprise bill or a sea of 429s. We budget tokens like we budget CPUs: per environment, per service, and sometimes per user group. We cap tokens in both directions—input and output—and set a sane timeout so requests don’t turn into zombie connections. We also pre-warm caches for common prompts, because the cheapest token is the one we didn’t buy.
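
A sketch of that cache, using cachetools.TTLCache (the size and TTL are assumptions; keying on model plus sanitized prompt keeps a model swap from serving stale answers):

# ai/cache.py
# Sketch of a prompt cache; cachetools is one option, sizes are assumptions.
import hashlib

from cachetools import TTLCache

CACHE = TTLCache(maxsize=10_000, ttl=3600)   # one hour is plenty for stable prompts

def cached_completion(prompt, cfg, call_model):
    # Key on model + sanitized prompt so a model change never serves stale answers.
    key = hashlib.sha256(f"{cfg['model']}|{prompt}".encode("utf-8")).hexdigest()
    if key in CACHE:
        return CACHE[key]                    # the cheapest token is the one we didn't buy
    result = call_model(prompt, cfg)
    CACHE[key] = result
    return result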

Rate limits are not personal; they’re physics. We treat HTTP 429 as a normal response class with planned behavior. Per RFC 6585, a 429 Too Many Requests response may include a Retry-After header; when it’s there, we actually read it, back off, and add jitter. We layer client-side budgets on top: per-host concurrency and request caps that adjust when we detect queuing. Costs deserve visibility too. We tag requests with team and feature IDs, then roll up spend in the same place we track infra cost. The AWS Well-Architected Cost Optimization guidance is surprisingly applicable: measure, allocate, set guardrails, and automate shutoffs.
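
Reading Retry-After and backing off with jitter takes only a few lines; here’s a sketch with requests (the attempt cap and jitter window are assumptions, and we only handle the seconds form of the header):

# ai/retry.py
# Sketch of 429 handling with Retry-After, capped exponential backoff, and jitter.
import random
import time

import requests

MAX_ATTEMPTS = 5
MAX_SLEEP = 30.0   # never wait longer than this, whatever the header says

def post_with_backoff(url, payload, timeout=5.0):
    for attempt in range(MAX_ATTEMPTS):
        resp = requests.post(url, json=payload, timeout=timeout)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After when it's a number of seconds; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After", "")
        base = float(retry_after) if retry_after.isdigit() else float(2 ** attempt)
        time.sleep(min(base, MAX_SLEEP) + random.uniform(0.0, 0.5))   # jitter
    raise RuntimeError("still rate limited after retries; shed load or brown out")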

We’ve had luck with adaptive concurrency that drops QPS before we trip provider limits, plus a “brownout” mode that returns a simpler response when the model is slow or pricey. For backends with burst tokens, we schedule heavier jobs off-peak and batch them. If your provider offers quota increases, treat them like you treat storage expansions: last resort, not first move. Cost control isn’t about stinginess; it’s about never choosing between uptime and a CFO message in Slack.

Failure Modes, Fallbacks, And Fast Kill Switches

AI services fail in creative ways: weird outputs, slow tail latencies, spiky timeouts. We don’t debate if they’ll fail; we design how. First, quick detection: timeouts per call, content filters that reject broken schemas, and validators that check outputs before they hit the user. Second, graceful degradation: cached responses, simpler deterministic code paths, or “good enough” templates when the model flakes out. Third, fast escape hatches: flags that disable features, controls that swap models, and a playbook that’s been rehearsed.

A small in-process fallback can absorb a lot of pain:

# ai/client.py
# call_model, metrics, error_rate, and flag are the app's own helpers.
def rewrite(text, cfg):
    try:
        return call_model(text, cfg, timeout=1.2)
    except Exception:
        metrics.counter("ai_rewrite_errors").inc()
        # Auto kill switch: disable the feature when the rolling error rate tops 5%.
        if error_rate("ai_rewrite") > 0.05:
            flag.set("ai_rewrite.enabled", False)
        # Degrade to a deterministic rewrite instead of failing the request.
        return simple_rules_based_rewrite(text)

def simple_rules_based_rewrite(t):
    return t.replace("!!!", "!").strip()[:512]

We complement this with circuit breakers at the edge and low TTL DNS so we can reroute quickly. Write the “kill it” path as carefully as the happy path, and make it idempotent. For UX, be honest. If we’re in brownout mode, tell users the feature is limited; it beats silently collapsing. After the fire, we don’t just bump timeouts until the graph stops screaming—we fix root causes, tune concurrency, or redesign prompts that are too fragile. A good fallback isn’t an apology; it’s a product decision that respects users and sleep.

From Breadboards To Boring: Making AI Routine

Our happiest AI features have one thing in common: they feel routine. They live behind flags, deploy through the same pipelines as everything else, and show up in dashboards without new tabs or secret handshakes. We document the data contract in the repo, add runbooks to the wiki, and rotate on-call with the same rules. When a new team wants to add an AI call, they copy a tiny library, not a 30-page doc. That’s the goal: boring excellence, not a hero project that only three people dare touch.

Let’s be clear about adoption. We start with a small, contained use-case and establish a review cadence that doesn’t drift into “set and forget.” We automate evals nightly, cap spend by environment, and rehearse failure drills twice a quarter. We give product managers the knobs that matter (rollout percent, quality thresholds) and hide the ones that don’t (temperature bikeshedding, anyone?). We bake in privacy from day one, not as a sprint at the end. And we keep trade-offs visible: what we gained, what we added in complexity, and what we’ll revisit next quarter.

This approach scales. The first feature takes a sprint; the next three take days. As we accumulate patterns, we say “no” more often to ideas that fight the grain. AI becomes another helpful tool in our box, not a personality trait for the company. When it works, users get snappier experiences, support gets better summaries, and we get fewer 3 a.m. puzzles. That’s enough magic for us.
