Shockingly Boring SRE Practices That Cut Toil By 40%

Practical routines, scripts, and guardrails we use to keep systems calm.

Make SLOs Real With Budgets And Gates

SLOs are only helpful if they change our behavior. We treat them like contracts—clear, human-readable promises customers can feel. We define a small set of SLIs (latency, availability, correctness) and attach an error budget that funds our risk-taking. When the budget’s healthy, we ship faster. When it’s burning too fast, we slow down and fix production. The trick is to wire that contract into the toolchain so it’s not optional. We set alerts on burn rates, show the budget status on PRs, and block risky deploys automatically if the budget’s red. Rather than argue about “feelings,” we point to the numbers and decide. If you want the theory and examples, Google’s write-up on Service Level Objectives is the most practical starting point we’ve found.

We’ve had luck standardizing SLO definitions as code. Sloth lets us keep SLOs versioned next to the service:

# Sloth SLO: 99.9% success rate with burn-rate alerts (30d rolling window by default)
version: "prometheus/v1"
service: "payments-api"
slos:
  - name: "availability"
    objective: 99.9
    labels:
      team: payments
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="payments",code!~"2.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payments"}[{{.window}}]))
    alerting:
      name: PaymentsAvailabilityBurn
      annotations:
        runbook: https://runbooks/payments/availability
      page_alert:
        labels:
          severity: page

We then wire a CI job that reads the generated Prometheus rules and enforces a deploy gate: if the 2-hour burn rate predicts budget exhaustion inside the window, we block merging. Sounds harsh, but it’s actually relaxing; instead of arguing release timing, we ask a calmer question: what fixes buy back budget? That turns debates into backlog items we can close.
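
Here is a minimal sketch of that gate as a CI step. It recomputes the 2-hour burn rate straight from the SLI queries above instead of reading the generated recording rules, assumes Prometheus is reachable at PROM_URL, and treats 14.4 (the usual fast-burn multiplier) as a tunable starting point rather than gospel.

#!/usr/bin/env bash
# CI deploy gate: block the merge when the 2h burn rate says the error
# budget is being spent too fast. PROM_URL and the 14.4 threshold are
# assumptions; swap the query for your generated recording rules if you
# prefer not to duplicate the SLI here.
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"
OBJECTIVE="0.999"
THRESHOLD="14.4"   # fast-burn multiplier; tune per SLO window

# Error ratio over the last 2h, divided by the allowed error rate (1 - objective).
QUERY="sum(rate(http_requests_total{service=\"payments\",code!~\"2..\"}[2h]))
  / sum(rate(http_requests_total{service=\"payments\"}[2h]))
  / (1 - ${OBJECTIVE})"

BURN=$(curl -fsS --data-urlencode "query=${QUERY}" "${PROM_URL}/api/v1/query" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "2h burn rate for payments-api availability: ${BURN}"

if awk -v b="${BURN}" -v t="${THRESHOLD}" 'BEGIN { exit !(b >= t) }'; then
  echo "Burn rate ${BURN} >= ${THRESHOLD}: error budget at risk. Blocking merge."
  exit 1
fi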

Incidents Without Panic: Roles, Radios, And Runbooks

We don’t “rise to the occasion”; we fall to our training. Incidents get calmer when the choreography isn’t reinvented under stress. We use lightweight Incident Command: one Incident Commander (IC) who isn’t typing fixes, a single comms channel, and clear owner handoffs. We keep a short bench of ICs on rotation, with a cheat sheet for their first five minutes. During the event, we fight for clarity: timestamped updates, hypotheses separated from facts, and explicit “hold” on risky changes until the IC buys in. The secret ingredient is scripts we can paste without thinking.

Our incident channel starts with a pinned template:

Incident: payments-api latency spike
Start: 2025-05-14T09:32Z | IC: @oncall
Impact: 40% of checkout requests > 2s (SLO at risk)
Hypothesis: DB connection pool exhaustion after deploy 1.42.0
Actions:
- [ ] Enable feature flag rollback to 1.41.x (owner: @dev1)
- [ ] Increase pool from 50->100 temporarily (owner: @sre2) - get IC approval
Comms:
- External status: yellow (owner: @comms)
- Executive update every 15m
Next Update: 09:47Z

We also prewrite “radio checks”: quick pings to confirm who’s listening before we fling changes. Paging policies, escalation paths, and post-incident steps live in one place, and we actually rehearse them. If you’re building your first playbook, the PagerDuty Incident Response guide is a solid field-tested reference. Keep it small, put it where your team already works, and trim it after each incident. We’re aiming for boring, predictable, and repeatable: like buckling a seatbelt, just with fewer beeps.
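
The radio check itself can live in the toolkit as something we paste without thinking. This is only a sketch: it assumes a Slack-style incoming webhook stored in INCIDENT_WEBHOOK, and the wording is ours to tune.

#!/usr/bin/env bash
# Radio check: confirm who is listening before changes start flying.
# Assumes a Slack-style incoming webhook in $INCIDENT_WEBHOOK; adjust the
# payload for whatever chat tool you actually use.
set -euo pipefail

MSG="Radio check for ${1:-unnamed incident}: IC, comms, and service owner, please ack in the next 2 minutes. Hold all risky changes until the IC confirms."

curl -fsS -X POST -H 'Content-Type: application/json' \
  -d "$(jq -n --arg text "$MSG" '{text: $text}')" \
  "${INCIDENT_WEBHOOK}"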

Kill Toil Week After Week: Automate The Boring

Toil is repetitive, manual work that scales with service size. It steals time from improvements and gives nothing back except carpal tunnel. We track it like any other debt: timeboxed, tagged, and burned down. Each sprint, we grab a few repeat offenders (rotating credentials, restarting stuck jobs, patching base images) and automate or eliminate them. Not glamorous, but every hour we save this week we save again next week. That’s how we got to that suspiciously round “40% less toil” claim: we measured hours spent on recurring tasks and watched the chart go down once the scripts landed. The trick is to stop debating whether something is “real toil” and measure it: count occurrences, average duration, and the interrupt cost. Numbers end arguments.
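
Measuring it doesn’t need tooling, either. Here’s a rough sketch that tallies occurrences and hours from a hypothetical toil.csv log (one line per interruption: date, task, minutes); the file format is an assumption, the habit of logging is the point.

#!/usr/bin/env bash
# Toil tally: occurrences and hours per recurring task, worst offenders first.
# Assumes a hypothetical toil.csv with lines like: 2025-05-12,rotate-credentials,25
set -euo pipefail

printf "%-28s %6s %8s\n" "task" "count" "hours"
awk -F, '
  { count[$2]++; minutes[$2] += $3 }
  END { for (t in count) printf "%-28s %6d %8.1f\n", t, count[t], minutes[t] / 60 }
' toil.csv | sort -k3,3 -nr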

We keep a tiny “ops toolkit” repo with scripts, Make targets, and snippets that are safe-by-default. One example: a guarded rollout helper for hotfixes that refuses to run if the error budget is red or the incident channel is active.

#!/usr/bin/env bash
set -euo pipefail

# Guarded rollout helper: refuses to act when it could make things worse.
# ./budget and ./incident are small status wrappers from the same toolkit.
if [[ $# -ne 2 ]]; then
  echo "Usage: $0 <deployment> <namespace>"; exit 64
fi

# Never deploy against a spent error budget.
if ./budget status | grep -q "RED"; then
  echo "Error budget RED. Aborting deploy."; exit 1
fi

# Never deploy into an active incident.
if ./incident status | grep -q "ACTIVE"; then
  echo "Active incident. Aborting deploy."; exit 1
fi

kubectl rollout restart deploy "$1" -n "$2"
kubectl rollout status deploy "$1" -n "$2" --timeout=5m

It’s simple, but it encodes our “don’t make it worse” instinct into a habit. Every script like this turns loud opinions into quiet guardrails and gives us back focus for the harder problems.

Safer Changes: Canaries, Flags, And Fast Rollbacks

We like changes small, observable, and reversible. That means progressive delivery by default: a canary takes a slice of traffic, we watch a few SLO-proxy metrics (error rate, tail latency), and we promote or abort automatically. Feature flags handle risk at the code path level, keeping deploys and releases separate. You don’t need a moonshot to start: just make the safe path the easy path, so rushing is no longer faster. Over time, the change failure rate drops, the on-call sleeps better, and product keeps shipping without the usual knot of dread.

For Kubernetes, we’ve had good mileage with Flagger. It integrates with your gateway and Prometheus to run canaries with sane defaults. The sample below shifts traffic in 5% steps, rolls back on elevated errors or tail latency, and carries a link to the runbook as an annotation:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-api
  namespace: prod
  annotations:
    runbook: https://runbooks/payments/canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  service:
    port: 8080
  analysis:
    # Check metrics every minute; roll back after 10 failed checks.
    interval: 1m
    threshold: 10
    # Shift traffic in 5% steps, up to 50% before promotion.
    stepWeight: 5
    maxWeight: 50
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m

It’s worth reading the Flagger README for examples and the knobs that matter. Whatever tool you pick, keep the signal tight: 95th/99th percentiles, success rate, and error budget burn. If your canary relies on five dashboards and vibes, it’s not a canary; it’s a weather report.
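
When we want a second opinion outside the canary tool, we pull those signals by hand. A sketch that reuses the metric names from earlier examples; the track="canary" label is an assumption about how your gateway or mesh tags canary traffic.

#!/usr/bin/env bash
# Canary signal check: success rate and p99 latency for the canary subset only.
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"

q() { curl -fsS --data-urlencode "query=$1" "${PROM_URL}/api/v1/query" \
        | jq -r '.data.result[0].value[1] // "n/a"'; }

SUCCESS='sum(rate(http_requests_total{service="payments",track="canary",code=~"2.."}[5m]))
  / sum(rate(http_requests_total{service="payments",track="canary"}[5m]))'
P99='histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="payments",track="canary"}[5m])) by (le))'

echo "canary success rate: $(q "${SUCCESS}")"
echo "canary p99 seconds:  $(q "${P99}")"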

Observability That Guides Decisions, Not Screensavers

Pretty dashboards don’t help if they arrive late or drown us in noise. We invest in traces and structured logs that answer two questions fast: “What’s slow?” and “Who broke what?” OpenTelemetry gives us a portable way to instrument without tying the codebase to a single vendor. We’re careful about sampling: head-based sampling alone can’t know which traces will matter, so we use tail sampling to keep errors, rare events, and the slowest traces while dropping the routine bulk to control cost. That way, when the pager fires, the traces we need already exist; we’re not chasing ghosts through “most popular” panels.

A minimal OpenTelemetry Collector tail-sampling config we like:

processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500
      - name: key-routes
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/checkout", "/pay"]

We align logs with trace IDs, echo SLO state in spans, and tag deploy versions. It sounds fussy, but the payoffs are crisp: “The 99th percentile jumped on route /checkout after version 1.42.0; the slow span is ‘charge-card’ talking to payments-db.” From there, it’s a short path to a rollback or a targeted fix. If you’re starting instrumentation, the OpenTelemetry docs are practical and grounded. Keep an eye on ingestion limits and default verbosity; nothing hurts credibility like observability causing the outage.

Capacity Planning You Can Actually Explain To Finance

Capacity doesn’t have to be mystical. We pin it to SLOs and historical demand, then add a buffer sized by variability, not vibes. We estimate the concurrency headroom needed to keep the 99th percentile within bounds, note the busiest hour, and run load tests to validate. To simplify explanations, we translate fancy models into a tiny set of numbers: peak RPS last 30 days, p99 latency at peak, error rate at peak, and spare headroom percentage. Then we tie spend to outcomes: “This extra node keeps checkout under 500 ms at Black Friday load.” Engineers nod; Finance nods; the pager takes a nap.

Prometheus makes it easy to turn this into daily checkups. Two queries we use a lot:

# Latency headroom: p99 as a fraction of the 0.5s budget (1.0 means the budget is gone)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  / 0.5  # 0.5s p99 budget

# Saturation signal: top 5 most saturated DB connection pools
topk(5, avg_over_time(db_active_connections[10m]) / db_max_connections)

We also keep an eye on the cloud bill in the same breath—waste is a risk, too. Scheduled scale-to-zero for non-prod, rightsizing, and reserved capacity are not exciting, but they’re measurable. When in doubt, we revisit the AWS Well-Architected performance and cost sections to sanity-check our trade-offs. The mantra is simple: keep SLOs green with the fewest dollars necessary, and show the math.
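
Scale-to-zero is the easiest of those to automate. A sketch meant for a scheduled job (cron or a Kubernetes CronJob); the env=staging namespace label is an assumption about how non-prod environments are tagged in your clusters.

#!/usr/bin/env bash
# Nightly scale-to-zero for non-prod: measurable savings, no heroics.
set -euo pipefail

for ns in $(kubectl get namespaces -l env=staging -o jsonpath='{.items[*].metadata.name}'); do
  echo "Scaling deployments in ${ns} to zero"
  kubectl scale deployment --all --replicas=0 -n "${ns}"
done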

Blameless Practices That Reduce Pager Fatigue By 30%

Culture isn’t fluff; it’s the habits a team repeats until they become systems. We’ve seen pager fatigue drop by roughly a third when we make a few habits non-negotiable. First, blameless post-incident reviews that produce two kinds of fixes: local (patch the service) and systemic (fix the guardrails). We write them within a week while details are fresh, we assign owners, and we track completion just like feature work. Second, we invest in on-call health: clear escalation ladders, predictable rotations, load balancing across teams, and explicit “no-meeting recovery time” after a rough night. Third, we prune alerts mercilessly. If an alert hasn’t driven a decision or action in 90 days, it’s on probation. If it wakes someone without telling them what to do, it’s out.
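
To find probation candidates, we start from data rather than memory. A sketch that asks Prometheus how often each alert actually fired; it assumes retention covers the window you query (a true 90-day look-back needs long-term storage such as Thanos or Mimir), and any configured alert missing from the output never fired at all in that window.

#!/usr/bin/env bash
# Alert probation report: total firing samples per alert over the window,
# a rough proxy for how long and how often each alert actually fired.
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"
WINDOW="${WINDOW:-30d}"

QUERY="sort_desc(sum by (alertname) (count_over_time(ALERTS{alertstate=\"firing\"}[${WINDOW}])))"

curl -fsS --data-urlencode "query=${QUERY}" "${PROM_URL}/api/v1/query" \
  | jq -r '.data.result[] | "\(.metric.alertname)\t\(.value[1])"'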

We measure what matters. Alert volume per person per week. Mean time in incident. Interruptions during sleep hours. Percentage of alerts auto-remediated. These numbers make it safe to talk about pain without anyone needing to be a hero. We also budget time for resilience work in the open—no “spare time” fantasies. If a team can’t keep an SLO without ruining their evenings, we reduce scope or increase staffing. We’d rather be blunt than burn people out.

A final nudge: narrate what you learn. Short internal notes—“We killed 12% of noisy alerts; here’s how”—build trust and make it easier for neighboring teams to adopt the same practices. Fewer dings on the phone, fewer meetings about the dings, more time to build the stuff we’re proud of.

What We’ll Do By Friday

If we were starting from scratch this week, we’d pick one service and do four things. Define one availability SLO and error budget in code and wire a deploy gate to it. Add a tiny incident template to the on-call channel and rehearse it once over coffee. Flip one risky change to a small canary via a tool we already have, even if it’s a crude weight change. And pick one recurring chore and script it with a safety check. That’s all. Within a month, we’ll have fewer arguments and fewer surprises, and we’ll stop pretending adrenaline is a plan. None of this requires a platform team, a new vendor, or a ritual. It just requires us to write down how we want the system—and ourselves—to behave, then let the tools nudge us toward it every day. Boring on purpose. That’s the point.
