Cut 37% More Pager Noise With Pragmatic SRE Habits

We’ll trade heroics for boring reliability—without slowing delivery.

SRE Isn’t a Team Name; It’s a Set of Promises

If we say “SRE” and everyone pictures a separate squad with darker dashboards and lighter sleep, we’ve already lost a little. Site Reliability Engineering works best when it’s a way we make promises—about availability, latency, and recovery—and then keep them with repeatable practices. The trick is making those promises explicit enough to measure, but not so rigid that product work grinds to a halt. That’s where SLOs, error budgets, and a healthy respect for failure come in. When we treat reliability as an engineering problem (not a personality trait), we stop rewarding the colleague who “saved prod at 2am” and start rewarding the system that didn’t need saving.

A useful mental model: operations asks “is it up?”; SRE asks “is it delivering the level of service users expect, and can we prove it?” That difference forces us to define what “good” looks like from the user’s point of view. If we’re not careful, we’ll measure CPU and call it a day. But users don’t buy CPU. They buy “checkout completes” and “search returns results quickly.” So we anchor on user journeys, define a service level indicator (SLI) for each, and set a target (SLO) that matches the business reality. Google’s classic framing still holds up: Google SRE Book — Service Level Objectives. The payoff is clarity: when we’re inside the SLO, we ship; when we’re outside, we fix. No debates, no vibes, just math with a human purpose.

Start With One SLO That Actually Hurts

If we try to roll out SLOs for every endpoint, every microservice, and every shade of “kinda slow,” we’ll drown in spreadsheets and lose goodwill. Our favourite way to begin is picking one SLO that’s both meaningful and slightly uncomfortable—something that forces us to face trade-offs. For many teams, that’s the “golden path” of login → browse → checkout, or “API requests succeed within X ms.” The key is to define the SLI in user terms, with a clear event boundary and a clean numerator/denominator. Then we choose a target that’s realistic enough to hit most weeks, but strict enough that we feel it when we’re sloppy.

For HTTP services, a common SLI is the proportion of “good” requests. “Good” might mean 2xx/3xx responses under a latency threshold. We should be explicit about what counts as “bad”: timeouts, 5xx, and maybe even “technically successful but painfully slow.” Then we timebox the first target. A 30-day window keeps the math understandable, and it matches how people think about “this month was rough.” If we want a standards reference for the shape of the measurement, the telemetry world has matured nicely—OpenTelemetry’s docs are a solid anchor for how we instrument and emit signals without painting ourselves into a vendor corner: OpenTelemetry Documentation.
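To make the numerator/denominator concrete, here is a minimal Python sketch of the classification rule. The field names, sample data, and 400 ms threshold are illustrative, not a real schema—the point is that “good” is a single, explicit predicate:

```python
# Hedged sketch: counting "good" requests for a request-based SLI.
# Field names and sample data are hypothetical, not a real schema.

def is_good(status: int, latency_ms: float, threshold_ms: float = 400) -> bool:
    """A request is 'good' if it's a 2xx/3xx response under the latency
    threshold; timeouts and 5xx fail one of the two conditions."""
    return 200 <= status < 400 and latency_ms <= threshold_ms

# (status, latency_ms): fast success, slow success, server error, fast redirect
requests = [(200, 120), (200, 950), (503, 80), (302, 300)]
good = sum(is_good(s, l) for s, l in requests)
sli = good / len(requests)
print(f"SLI = {sli:.2f}")  # SLI = 0.50
```

Notice that the slow-but-successful request counts as bad—exactly the “technically successful but painfully slow” case called out above.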

Here’s what a first SLO spec can look like when we keep it plain, short, and reviewable:

service: checkout-api
slo:
  name: "Successful checkouts"
  window: 30d
  sli:
    type: request-based
    good: "http_status in [200,299] AND latency_ms <= 400"
    total: "all requests to POST /checkout"
  target: 99.5

We’ll refine thresholds later. The early win is simply agreeing on what we’re promising.

Error Budgets: The Only “No” That Engineers Accept

Once we have an SLO, the error budget becomes the lever that makes SRE feel fair instead of preachy. An error budget is just the allowed amount of unreliability within the SLO window. If the target is 99.5% over 30 days, we’re allowed 0.5% “badness.” That budget is shared: product can spend it by shipping risky changes faster; engineering can preserve it by hardening systems and slowing releases when things get wobbly. The beauty is that it turns subjective arguments (“this feels unsafe”) into an explicit trade (“we’ve burned 80% of the budget; maybe we stop deploying on Fridays and fix the flake”).
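The arithmetic is small enough to sketch in a few lines of Python. The function names are illustrative (not from any particular tool), using the 99.5%/30-day example above:

```python
# Hedged sketch: the arithmetic behind a 99.5% / 30-day error budget.
# Function names are illustrative, not from any particular tool.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed 'bad' minutes if we view the budget as downtime."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_burned(good: int, total: int, slo_target: float) -> float:
    """Fraction of a request-based budget consumed so far."""
    allowed_bad = (1 - slo_target) * total
    return (total - good) / allowed_bad if allowed_bad else float("inf")

print(round(error_budget_minutes(0.995, 30)))              # 216 minutes (~3.6 hours)
print(round(budget_burned(997_000, 1_000_000, 0.995), 2))  # 0.6 -> 60% of budget gone
```

That 216 minutes is the whole month’s allowance—which is why burning 60% of it in a week is a signal worth a policy, not a shrug.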

We don’t need fancy tooling to start—just clear policy. For example: if we burn more than 50% of the monthly budget in a week, we trigger a reliability focus period. That might mean pausing non-critical releases, doing targeted load testing, or paying down a few gnarly operational debts. This isn’t punishment; it’s restoring the contract with users. And it protects engineers from the worst kind of pressure: the push to ship while systems are actively on fire.

This approach also forces leadership maturity in a good way. We can’t demand 99.99% outcomes while funding 99.5% engineering time. If we want to ground these ideas in a widely accepted reference, the original Google treatment remains the clearest explanation of why budgets work socially, not just mathematically: Google SRE Book — Error Budgets. When we implement error budgets well, we’re not saying “no” to shipping; we’re saying “yes, as long as we’re meeting our promises.” That’s a no engineers can live with—because it’s really a conditional yes.

Instrumentation That Doesn’t Make Us Hate Ourselves

Observability is where many SRE initiatives go to die—not because it’s unimportant, but because we overcomplicate it. We don’t need 400 dashboards to know users are sad. We need a small set of signals that map directly to SLOs, plus enough context to debug quickly when we fall out of bounds. A practical stack starts with the classics: metrics for rates/errors/latency, logs for forensic detail, and traces for distributed pain. The discipline is in choosing what we instrument and naming it consistently so we can query it under pressure without reciting an incantation.

For a Prometheus-style world, we like to express the SLI in terms that can be computed from counters/histograms. That means ensuring we actually emit the right labels and buckets. Latency especially needs care: “average latency” is a liar, and p95 without context can be a drama queen. We usually start with p50/p95 plus a threshold-based “good events” counter to compute the SLO cleanly. The Prometheus docs explain histogram trade-offs better than most internal wiki pages ever will: Prometheus Histograms and Summaries.

A minimal PromQL-ish sketch for a latency-and-status SLI might look like this:

good = sum(rate(http_requests_total{service="checkout-api",route="/checkout",code=~"2.."}[5m]))

total = sum(rate(http_requests_total{service="checkout-api",route="/checkout"}[5m]))

sli = good / total

Timeouts and 5xx fall outside code=~"2..", so they count against the SLI automatically; a separate timeout counter only needs subtracting if timed-out requests are also recorded with 2xx codes.

Then we add a latency condition using histogram buckets:

good_latency = sum(rate(http_request_duration_seconds_bucket{le="0.4",service="checkout-api",route="/checkout"}[5m]))
total_latency = sum(rate(http_request_duration_seconds_count{service="checkout-api",route="/checkout"}[5m]))
latency_sli = good_latency / total_latency

We keep it boring: if we can’t explain the query during an incident, it’s too clever.

On-Call That Doesn’t Ruin Anyone’s Week

On-call is where SRE becomes real. It’s also where we accidentally create a very expensive employee retention problem. Our rule is simple: pages must be actionable, urgent, and tied to user impact. Anything else is a ticket. If a page wakes us up but doesn’t require a human within minutes, it’s not a page—it’s a notification cosplaying as an emergency. The fastest way to reduce burnout is to get ruthless about what triggers a pager and to make “fix the alert” an acceptable, celebrated piece of work.

We like to start by auditing alert volume for a two-week slice: how many pages, how many were useful, and how many were duplicates or non-actionable. Then we pick the top three noisy alerts and fix them properly. “Properly” might mean changing thresholds, adding a for: clause so we don’t page on blips, or replacing low-signal infrastructure alerts with SLO-based alerts. The best alerts are about user experience: “checkout success rate below SLO burn rate” beats “CPU is 82%” almost every time.
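As a concrete sketch, an SLO burn-rate alert in Prometheus rule syntax might look like the following. The metric labels reuse the earlier checkout-api examples, and the 14.4 multiplier is the widely used “fast burn” threshold (it exhausts roughly 2% of a 30-day budget per hour)—treat both numbers as starting points to tune:

```yaml
# Hedged sketch: a burn-rate alert for the 99.5% checkout SLO.
# Metric/label names follow the earlier examples; thresholds are starting points.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="checkout-api",code=~"2.."}[5m]))
              /
              sum(rate(http_requests_total{service="checkout-api"}[5m]))
            )
          ) > (14.4 * 0.005)
        for: 2m            # ride out blips before paging
        labels:
          severity: page
        annotations:
          summary: "checkout-api is burning its error budget ~14x too fast"
```

One alert like this can retire a whole family of CPU, memory, and per-instance error alerts, because it pages on the thing we actually promised.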

In Alertmanager-style config, tiny changes can save a lot of sleep. For example, grouping and inhibition prevent a cascade of pages when one upstream dependency falls over:

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

inhibit_rules:
  - source_matchers:
      - alertname="ServiceDown"
    target_matchers:
      - alertname=~".*LatencyHigh|.*ErrorRateHigh"
    equal: ['service']

We should also maintain runbooks that match alerts one-to-one. If the runbook begins with “check dashboards,” we can do better. It should start with a decision.

Incident Response: Practice the Boring Parts

We’re surprisingly bad at the boring parts of incidents: declaring them early, setting roles, writing updates, and keeping a timeline. The technical work often gets the spotlight, but coordination is what prevents a 20-minute outage from becoming a two-hour mess with a side of confusion. A lightweight incident process is an SRE superpower because it scales with stress. We don’t need theatre; we need repeatable moves that reduce cognitive load when everyone’s adrenaline is doing laps.

Our baseline is: one incident commander (IC), one communications lead, and one or more investigators. The IC doesn’t debug; they keep the system moving—making sure we have a hypothesis, an owner for each action, and a clear “next update in 15 minutes.” The communications lead posts user-facing updates so investigators can focus. If we run public services, we also align with industry guidance on how to communicate and learn from incidents. The NIST incident handling guide is old enough to have seen some things, but it’s still a solid reference for roles and phases: NIST SP 800-61r2.

We also practise. Not with weekly chaos extravaganzas, but with occasional, scoped game days where we rehearse the mechanics: paging, declaring, handing off, and rolling back. And we make rollback boring. If rollback feels scary, it won’t happen under pressure. That’s why we invest in safe deploy patterns (canaries, feature flags) and verify them outside incidents. Incident response maturity isn’t about never having incidents; it’s about containing blast radius, shortening MTTR, and learning without blame. If we can consistently produce a clean timeline and a handful of concrete follow-ups, we’re already ahead of most teams.

Postmortems That Lead to Fewer Repeat Incidents

A postmortem that ends with “be more careful” is just a guilt document wearing a lab coat. In SRE, postmortems exist to change systems, not people’s personalities. We want a write-up that captures what happened, why it made sense at the time, and what we’ll change so the same failure mode is less likely—or at least less damaging. The win isn’t the document; it’s the follow-through. If action items die quietly in a backlog, we’re doing literature, not reliability.

We keep postmortems blameless and specific. “Engineer deployed broken code” is neither. Better: “Our deploy pipeline allowed a config change to bypass integration tests, and we lacked a canary metric on checkout failures.” Then we add a small set of fixes that map to the failure chain: guardrails in CI/CD, better alerts, and maybe a runbook update. We also tag each action item with an owner and an expected date. Not because we love bureaucracy, but because unowned work doesn’t get done, and undone reliability work has a habit of paging us later.

A helpful framing is to ask: what would have prevented this entirely, what would have detected it sooner, and what would have reduced impact? That keeps us from over-investing in one kind of fix. We also track repeat offenders: if the same class of incident happens twice in a quarter, it’s a signal our “fix” was cosmetic. Finally, we tie postmortem learnings back to error budgets. If we burned budget due to a known issue we’d previously postponed, we say that out loud. The goal isn’t shame; it’s making trade-offs visible so we can choose better next time.
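One lightweight way to keep action items honest is a small structured record per item. This shape is purely hypothetical—the field names are illustrative, not any tracker’s schema—but it maps each fix to the prevent/detect/mitigate framing and forces an owner and a date:

```yaml
# Hypothetical action-item record; field names and values are illustrative.
action_items:
  - id: pm-checkout-outage-01
    fix_type: prevent          # prevent | detect | mitigate
    description: "Config-only deploys must run integration tests"
    owner: platform-team       # unowned work doesn't get done
    due: 2024-08-01            # placeholder date
    status: open
```

Reviewing open items by fix_type each quarter also surfaces the repeat-offender pattern: if everything is tagged mitigate, our “fixes” are cosmetic.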
