Sharpen SRE Instincts With 47-Minute Incident Drills
Practical patterns, configs, and metrics to make uptime less theatrical.

Reliability Is a Product Feature, Not a Vibe
Let’s say the quiet part out loud: SRE isn’t about sprinkling dashboards on top of outages and calling it maturity. It’s an engineering discipline that treats reliability as a shipped feature with specs, constraints, budgets, and trade-offs. If we can’t quantify “good enough,” we’re stuck arguing during incidents or, worse, guessing. This is why service level objectives (SLOs), error budgets, and an explicit appetite for risk are our cornerstones. We pick what users actually experience—availability, latency, or correctness—and represent it with a crisp SLI. We set the SLO to a number customers care about and our teams can afford. We defend that SLO with guardrails, not vibes.

We also accept that incidents aren’t a moral failing; they’re a tax we pay for velocity. The point of an error budget is to make that tax visible. We spend it intentionally on releases and experiments, not by accident at 3 a.m. During “good times,” we move fast. When the budget burns too hot, we pause risky changes, prioritize reliability work, and earn the right to go fast again. That’s it—no drama.

If you want a solid, vendor-neutral primer that’s aged well, the Google SRE book’s chapter on Service Level Objectives is still the clearest frame. Keep your SLOs customer-facing, measurable, and reviewed as often as your roadmap. If we’re debating a number every sprint, it’s probably a process problem; if we’re never debating it, it’s probably not connected to user pain. Either way, treating reliability like a feature keeps it from becoming a trust-eroding surprise.

SLOs You Can Explain in an Elevator
We’ve seen heroic SLO spreadsheets that would confuse a math professor. Let’s aim for something we can explain between floors: “For API requests, 99.9% should complete under 300 ms measured over 30 days.” That’s one sentence with a clear SLI, threshold, and window. The 30-day window keeps it meaningful, the latency threshold maps to user experience, and 99.9% expresses a budget: 0.1% of requests can miss the target—call that your “oops allowance.”
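
To make the “oops allowance” concrete, it helps to translate the budget into time and request counts. A quick back-of-envelope in shell, assuming the example SLO above and an illustrative 1,000 requests per second:

# 0.1% of a 30-day window, expressed as full-outage minutes
awk 'BEGIN { printf "budget: %.1f minutes\n", 30*24*60*0.001 }'            # 43.2 minutes
# The same budget counted as failed requests at an assumed 1,000 req/s
awk 'BEGIN { printf "budget: %d bad requests\n", 30*24*3600*1000*0.001 }'  # 2,592,000 requests

That 43 minutes is the whole month’s allowance for a full outage; partial degradations burn it more slowly, but they still burn it.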

Then we make it concrete. Record the SLI in your metrics system as both numerator and denominator, not just a pre-baked percentage. If we only have a ratio, we can’t ask “how many” or separate “few users badly hurt” from “many users slightly annoyed.” For HTTP, track total requests and “good” requests under your latency threshold. For queues, track messages accepted and messages processed within deadline. For batch jobs, track runs attempted and runs completed within the SLA window.

Here are simple Prometheus recording rules for a latency SLI and a rolling SLO ratio (tweak label names to taste):

groups:
- name: sli-recording
  rules:
  # Good events: requests <= 300ms
  - record: job:http_request_good:rate5m
    expr: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
  # Total events: all requests
  - record: job:http_request_total:rate5m
    expr: sum(rate(http_request_duration_seconds_count[5m])) by (job)
  # SLI ratio over 5 minutes
  - record: job:sli_ratio:5m
    expr: job:http_request_good:rate5m / job:http_request_total:rate5m

Stitch these into longer windows (1h, 6h, 30d) for trend and burn calculations. If you’re also tracing, map spans to SLIs so “slow” has a fingerprint. The OpenTelemetry traces docs are a good reference—attach service, route, user segment, and outcome so SLO misses can be traced to a specific path or dependency. The elevator test keeps SLOs human; the recording rules make them actionable.
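
One way to build those longer windows is to average the 5-minute ratio with avg_over_time; here’s a sketch, with rule names of our own choosing. Note that this weights every 5-minute slice equally rather than by request volume; that’s fine for trending, but recompute from the raw counters if you need request-weighted precision.

groups:
- name: sli-windows
  rules:
  # Hour- and day-scale views of the same SLI, for burn-rate math and dashboards
  - record: job:sli_ratio:1h
    expr: avg_over_time(job:sli_ratio:5m[1h])
  - record: job:sli_ratio:6h
    expr: avg_over_time(job:sli_ratio:5m[6h])
  - record: job:sli_ratio:30d
    expr: avg_over_time(job:sli_ratio:5m[30d])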

Alerting That Maps Directly to User Pain
If our alerts don’t map to SLOs, we’re either paging for noise or missing what matters. Tie alerts to burn rate, not raw error counts. A fast-burn alert says, “We’re consuming error budget so quickly we’ll be empty within a day or two,” which is a pager for humans. A slow-burn alert says, “We’ll deplete the budget this week if this continues,” which is a ticket for daylight. That split saves sleep without ignoring smoldering fires.

Start simple. Suppose the SLO is 99.9% over 30 days. That’s a 0.1% budget. We’ll page when the short-window error fraction exceeds 14.4 times budget (a pace that burns about 2% of the month’s budget per hour and empties it in roughly two days), and ticket when the long-window fraction exceeds 6 times budget (empty in about five days). Prometheus makes this straightforward. The official alerting best practices explain the math, but here’s a usable skeleton:

groups:
- name: slo-burn
  rules:
  - alert: APIHighBurnShort
    expr: (1 - job:sli_ratio:5m{job="api"}) > (0.001 * 14.4)
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "API SLO burning fast (short window)"
      runbook: "https://internal/docs/runbooks/api-slo"

  - alert: APIHighBurnLong
    expr: (1 - avg_over_time(job:sli_ratio:5m{job="api"}[1h])) > (0.001 * 6)
    for: 1h
    labels:
      severity: ticket
    annotations:
      summary: "API SLO burning (long window)"
      runbook: "https://internal/docs/runbooks/api-slo"

Note the annotations: every page should have a runbook, a likely-cause list, and an action. Don’t page on host metrics unless they’re proven, persistent user-impact signals. CPU spikes aren’t a page; 10% of requests failing in the EU is. Finally, set expectations: one paging alert per symptom, not five clones per region, pod, and method. Aggregate where you can. Humans debug; machines count.
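
If you run Alertmanager, routing can do most of that aggregation for you; here’s a sketch of a route that collapses per-pod and per-region clones into a single notification (the receiver name is a placeholder):

route:
  receiver: pager
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: pager
  # wire this receiver to your paging integration of choice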

Runbooks, Automation, and the Gift of Sleep
Runbooks should feel like a helpful co-pilot, not a wiki labyrinth. We treat them as scripts with commentary, living next to the service code, tested in staging, and versioned. If the “fix” requires magic commands from someone’s memory, we codify them. If an action is reversible, we script the undo as well. We’d rather have a boring, mostly-correct tool than a perfect document nobody reads under pressure.

A healthy runbook answers three questions fast: How do I confirm impact? What’s the smallest safe mitigation? What data do I capture before the system recovers and hides evidence? That’s why we include “one-liners” to snapshot logs, grab a few relevant metrics, and note the request IDs of recent failures. If your stack emits traces and correlations, link them so responders don’t spelunk multiple UIs. The OpenTelemetry traces signal is especially useful: a sampled, tagged trail is better than a haystack of logs when the clock is ticking.
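
For instance, a couple of capture one-liners that could sit next to the runbook; the namespace, label selector, and log field names below are assumptions about your setup:

# Snapshot recent logs before a restart or rollback hides them
kubectl -n prod logs -l app=api --since=10m --tail=500 > "/tmp/api-logs-$(date +%s).txt"

# List request IDs of recent failures (assumes JSON logs with a numeric "status" and a "request_id" field)
kubectl -n prod logs -l app=api --since=10m | jq -r 'select(.status >= 500) | .request_id' | sort -u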

Here’s a tiny, boring mitigation helper we’ve shipped more than once:

#!/usr/bin/env bash
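# Roll back a deployment if its rollout looks unhealthy, then capture a quick snapshot.
# Usage: <script> [service] [namespace]   (defaults: api, prod)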
set -euo pipefail

svc="${1:-api}"
ns="${2:-prod}"

echo "[INFO] Checking rollout status for ${ns}/${svc}"
if ! kubectl -n "${ns}" rollout status deploy/"${svc}" --timeout=2m; then
  echo "[WARN] Rollout unhealthy; initiating rollback"
  kubectl -n "${ns}" rollout undo deploy/"${svc}"
  echo "[INFO] Rolled back ${ns}/${svc}"
fi

echo "[INFO] Capturing quick snapshot"
kubectl -n "${ns}" get pods -l app="${svc}" -o wide
kubectl -n "${ns}" top pods -l app="${svc}" || true

Not fancy, but it restores service fast and captures a tiny breadcrumb trail. Put these helpers in PATH for on-call. In a page, we prefer muscle memory over interpretive dance.

Shipping Safely: Flags, Canaries, and Probes
We don’t need a service mesh to reduce deployment risk. We need smaller changes, progressive exposure, and a signal that tells us when to stop. Feature flags let us decouple deploy from release. Canaries let us validate in production with blast radius set to “small.” Probes tell traffic routers whether to trust a pod. When we combine the three, rollouts feel routine instead of Russian roulette.

If you’re on Kubernetes, readiness and liveness probes are the unsung heroes. Liveness restarts crashed or wedged containers. Readiness gates traffic until the app is actually ready, not just “container started.” If we misuse them, we either blackhole traffic or mask failures. The Kubernetes probes guide is excellent; here’s a sensible baseline:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 1

Make the readiness endpoint check downstream dependencies that are critical for serving (e.g., DB connection pool, caches). Keep it fast; add a “last-10s error rate” gate if you must, but don’t embed full-blown diagnostics. Pair this with a canary: ship to 1% of pods or traffic, watch SLO-adjacent metrics for 5–10 minutes, then proceed. Feature flags let us stop rollout without redeploying. The art is deciding the stop condition: tie it to the same SLI you use for SLOs. If your canary is “green” but the SLI is “red,” trust the SLI.
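
One way to encode that stop condition is to compare the canary’s good-request ratio against the stable fleet and halt when the gap is material. A sketch, assuming a version label distinguishes canary pods from stable ones and reusing the latency SLI from earlier; the one-point gap, 10-minute window, and alert name are illustrative:

groups:
- name: canary-gate
  rules:
  - alert: CanarySLIWorseThanStable
    # Stop the rollout if the canary's good-request ratio trails stable by more than one point
    expr: |
      (
        sum(rate(http_request_duration_seconds_bucket{le="0.3",version="canary"}[10m]))
          / sum(rate(http_request_duration_seconds_count{version="canary"}[10m]))
      ) < (
        sum(rate(http_request_duration_seconds_bucket{le="0.3",version="stable"}[10m]))
          / sum(rate(http_request_duration_seconds_count{version="stable"}[10m]))
      ) - 0.01
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Canary SLI trails stable; stop the rollout"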

Incidents: Roles, Radios, and Real-Time Clarity
Incidents go sideways when we improvise roles, tools, and updates. We can keep it simple: a single incident commander (IC) who coordinates, an ops lead who runs mitigations, a comms lead who updates stakeholders, and a scribe who writes down what actually happened. No heroics required—just fewer mouths talking over each other. We don’t need ten tools either. One chat channel, one video room if needed, and a lightweight status page. Practice on a Tuesday. Our future 3 a.m. selves will send thanks.

We also write time in the open. The IC should narrate: “At 14:22 we paged. At 14:26 we mitigated. Current impact is 5% of requests in EU failing.” That cadence lets us both coordinate and generate the outline of a solid post-incident review. Updates go out at a predictable frequency—say, every 10 minutes—and avoid surprises. Stakeholders don’t need the stack trace; they need a shared reality that we’re stabilizing and a window for the next update.

Tooling helps if it’s boring and consistent. Pre-created chat channels, incident templates, and slash commands get the admin out of the way. The free playbook from PagerDuty Incident Response is worth adopting even if you don’t use their product; the structure translates. And we timebox escalations. If a mitigation isn’t bearing fruit in five minutes, we try the next one. We can always do root cause analysis after we’ve stopped the bleeding. First we protect users, then we learn.

The 47-Minute Drill: Practicing Without Burning People Out
We don’t need a day-long game day to build reflexes. A 47-minute drill—even monthly—can keep our muscles warm without chewing calendars. Here’s how we structure it. In minute zero, someone on the host team picks a realistic but bounded failure: dependency timeout, misconfigured flag, a noisy neighbor, expired cert. We pretag a fake “customer impact” to make the stakes clear. We page the on-call, spin up the incident channel, and start the clock.

The first 10 minutes are about detection and scoping. Can we verify impact through our SLI dashboards? If not, that’s a gap to fix. The next 20 minutes are mitigation and evidence capture. We run the playbook, capture baseline metrics, and decide whether a rollback or a flag flip is appropriate. The last 10 minutes are a debrief: what helped, what slowed us down, what we’ll tune before the next drill; the few minutes left over absorb paging lag and channel setup. We always write down the one thing we’ll fix this week—an alert threshold, a flaky script, a missing graph—and actually fix it. The goal isn’t theatrics; it’s making the real incident feel strangely familiar.

Keep it humane. Treat drills as practice, not surprise exams. Rotate times so nobody gets singled out. If drills are boring, we’ve basically won. If they’re thrilling, we’ve got work to do—but in a controlled setting. We’re not chasing perfection; we’re building habits. Reliability is a contact sport. Short, consistent reps beat grand gestures.

Cost, Risk, and the Error Budget We Can Afford
The grown-up part of SRE is trading money and time for reliability on purpose. Margins aren’t infinite. We can buy multi-region, but not everywhere. We can overprovision, but not forever. The error budget turns this into a visible dial. If the budget is consistently untouched, we’re overspending on reliability; bump feature velocity or tighten the SLO to serve bigger customers. If we’re always in deficit, we invest in reliability improvements and slow down risky changes. The key is to agree on the policy before emotions run hot.
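
In practice, “the policy” can be a short, checked-in document that names thresholds and the pre-agreed response; here’s a sketch, with illustrative numbers:

# error-budget-policy.yaml (illustrative)
service: api
slo: "99.9% of requests under 300 ms, 30-day window"
policy:
  - when: "budget remaining > 50%"
    then: "normal release cadence; experiments welcome"
  - when: "budget remaining 10-50%"
    then: "risky changes need sign-off from the on-call lead"
  - when: "budget remaining < 10% or exhausted"
    then: "freeze risky changes; reliability work only until burn rate drops below 1x"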

We track the budget alongside booking targets and product goals. When we plan a quarter, the product team asks for headcount to ship features. The SRE voice asks for time to reduce the most expensive classes of incidents: expensive in impact, toil, or human sleep. We don’t need a master spreadsheet to begin—just a top-five list of reliability tasks tied to clear outcomes: shave 20% tail latency by fixing N+1 in service X; cut paging alerts for Y by adopting backpressure; halve cold starts. If the list is too vague to estimate, it’s probably not the biggest problem.

We also weigh where redundancy actually pays off. Multi-AZ is cheap-ish insurance. Multi-region is expensive to build, test, and operate; we use it for services where minutes of outage equals unacceptable loss, and we measure the operational drag. For many systems, fast restore plus good comms beats exotic architectures. If that makes finance smile and users happy, we’ve landed in the sweet spot. Reliability isn’t about saying “yes” to every risk; it’s about saying “yes” to the right ones.

The Boring Road to Continuous Reliability
The trick to keeping reliability high is to make improvements feel routine. We bake reviews into the calendar: SLO review monthly, incident trend review biweekly, and a quarterly reset of error budget policies. We retire alerts that didn’t trigger action. We add exactly one new test or guardrail per incident class, and we delete one thing per week that no longer serves us—an unused dashboard, a zombie cron, a pet graph. Deleting is underrated reliability work.

We also make status visible. A small “reliability card” per service in our engineering readout shows current SLO status, burn rate trend, top two risks, and the one active reliability task. That card tempers planning and prevents arguments. If a team’s card has been red three weeks straight, they get cover to slow product delivery in favor of stability. If it’s green for a quarter, they’ve earned leeway to ship bold changes—still under the watch of canaries and flags, of course.
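
The card itself can be tiny; here’s a sketch of the fields we’d keep per service (names and values are illustrative):

# reliability-card.yaml (illustrative)
service: api
slo_status: "99.93% measured vs 99.9% target, 30-day window"
burn_trend: flat
top_risks:
  - "single-zone Redis dependency"
  - "no load shedding on /search"
active_reliability_task: "add backpressure to the ingest workers"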

Finally, we welcome boredom. Doing the basics well—good SLOs, sensible alerts, crisp runbooks, safe rollouts, and practiced incident management—outperforms silver-bullet tooling. If we can explain our reliability approach to a new hire without sighs or Shakespearean metaphors, we’re on the right track. And when an outage inevitably arrives, it won’t be a cliff; it’ll be a speed bump. We’ll handle it, learn one thing, and make the next one a little less exciting. That’s the kind of progress our customers actually notice.
