Practical SRE Habits That Actually Reduce Pager Noise

Small changes, fewer 2 a.m. surprises, and happier on-calls.

Start With “What Woke Us Up?” Not “Who Broke It?”

If we want our SRE effort to stick, we’ve got to begin with the right reflex: when something pages, we don’t go hunting for a culprit—we go hunting for the mechanism. Most teams say they do blameless work; fewer practice it when the incident channel is spicy and leadership is watching. The practical move is to make every page answer two questions: “What condition triggered this?” and “What user pain did it represent?” If we can’t answer both, the alert probably isn’t ready to page humans.

We’ll also want a lightweight incident loop that’s boring on purpose. A simple template beats a heroic memory. For example: impact, timeline, contributing factors, detection gaps, and follow-ups with owners and due dates. Keep it short enough that people actually complete it, and strict enough that you can compare incidents over time. The killer feature is tagging follow-ups by category: alerting, capacity, dependency, deploy, data, and “unknown.” Over a month, patterns show up quickly.
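The monthly pattern check on those tags can be as simple as a tally. A minimal sketch (the incident records and their categories here are illustrative, using the template’s tags):

```python
from collections import Counter

# Hypothetical follow-up records; "category" uses the template's tags:
# alerting, capacity, dependency, deploy, data, unknown.
followups = [
    {"incident": "INC-101", "category": "alerting"},
    {"incident": "INC-102", "category": "deploy"},
    {"incident": "INC-103", "category": "alerting"},
    {"incident": "INC-104", "category": "dependency"},
    {"incident": "INC-105", "category": "alerting"},
]

# Tally follow-ups by category so monthly patterns show up at a glance.
counts = Counter(f["category"] for f in followups)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

If “alerting” tops the list three months running, that’s the backlog telling you where the reliability work should go.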

A subtle SRE habit: every post-incident action item should land in the same backlog as product work, not in “that doc nobody checks.” If we treat reliability as optional homework, it’ll always be late. We don’t need a huge process—just the discipline to trade something else off when we add reliability work in.

If you want a shared baseline for incident language and severity, the Google SRE workbook is still one of the clearest references we’ve got.

Make SLOs Small Enough To Use Weekly

SLOs get a bad reputation because teams try to boil the ocean: a 40-page doc, five dashboards, and zero decisions made differently. Let’s keep it practical. One service, one user journey, one SLI, one SLO. If we can’t explain it in a few sentences, it’s not helping.

A good starting point is a user-visible outcome: “Requests complete successfully within X ms” or “Jobs finish within Y minutes.” Then pick an error budget window you can act on—28 or 30 days is common. The trick is to use the error budget in weekly planning. If we’re burning it too fast, we slow down risky changes. If we’re healthy, we ship. That’s the whole deal.
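The weekly planning call gets easier when the budget arithmetic is explicit. A sketch with illustrative numbers for a 99.9% SLO over a 30-day window:

```python
# Error budget arithmetic for a 99.9% availability SLO over a 30-day
# window. All the numbers below are illustrative.
slo = 0.999
window_days = 30

budget_fraction = 1 - slo  # fraction of requests allowed to fail
budget_minutes = window_days * 24 * 60 * budget_fraction  # ~43.2 min of full downtime

# Suppose 10 days in, we've averaged a 0.04% error rate.
observed_error_rate = 0.0004
days_elapsed = 10

# Share of the total budget consumed so far.
consumed = (observed_error_rate * days_elapsed) / (budget_fraction * window_days)
# Healthy if we've consumed less budget than the fraction of the window elapsed.
on_track = consumed <= days_elapsed / window_days

print(f"{budget_minutes:.1f} min budget; {consumed:.0%} consumed; on track: {on_track}")
```

If `on_track` is false, risky changes slow down this week. If it’s true, we ship. Same deal as above, just with numbers attached.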

We should also be honest about what we can measure. If we don’t have reliable instrumentation, we can begin with a proxy SLI (like HTTP 5xx rate) while we build better signals. The goal isn’t perfection; it’s repeatable decision-making.

Here’s a minimal Prometheus-style recording rule approach, so we compute a clean availability SLI and can build alerts off the budget burn, not random thresholds:

groups:
- name: sli-rules
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (service)

  - record: job:http_requests_errors:rate5m
    expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)

  - record: job:availability_sli:rate5m
    expr: 1 - (job:http_requests_errors:rate5m / job:http_requests:rate5m)

For SLO math and alerting patterns, we often point folks to Google’s SRE site on SLOs and the practical ideas in The Art of Monitoring (not exclusively SRE, but very grounded).

Alert On Symptoms, Then Rate-Limit Our Panic

Most pager noise comes from alerts that detect internal feelings rather than user pain. CPU at 90% can be fine; 2% of log lines containing “error” can be normal. We want symptoms: elevated latency, increased error rate, saturation leading to dropped work, and missing heartbeats for scheduled jobs. Then we tune for “actionable.” If the alert doesn’t have a clear first step, it shouldn’t page.

We also need to stop treating paging like a group chat. Pages should be rare, specific, and routed to the right place. Everything else is a ticket or a dashboard. Our favourite question when reviewing alerts: “Would you wake up for this if it happened every week?” If not, downgrade it.

A great SRE habit is using multi-window, multi-burn-rate alerting. Instead of paging because “availability dipped below 99.9% for five minutes,” we page when we’re burning through the error budget fast enough that we’ll miss the SLO if it continues.

Here’s an example Prometheus alert pattern (adapt as needed). The key is two windows: fast burn (page) and slow burn (ticket).

groups:
- name: slo-alerts
  rules:
  - alert: SLOFastBurn
    expr: (1 - job:availability_sli:rate5m) > (14.4 * (1 - 0.999))
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Fast error budget burn for {{ $labels.service }}"
      runbook: "https://internal/runbooks/{{ $labels.service }}/slo-fast-burn"

  - alert: SLOSlowBurn
    expr: (1 - job:availability_sli:rate5m) > (3 * (1 - 0.999))
    for: 2h
    labels:
      severity: ticket
    annotations:
      summary: "Slow error budget burn for {{ $labels.service }}"

If you want a solid, vendor-neutral read on alert quality, Google’s Monitoring chapter is worth the time. And yes, it will gently shame a few of our “CPU is high” pages. That’s healthy.
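As for where that 14.4 multiplier comes from: it’s the burn rate at which roughly 2% of a 30-day budget disappears every hour. A quick sketch of the arithmetic:

```python
# Burn rate = actual error rate / allowed error rate.
# At burn rate B, the whole budget lasts window / B.
window_hours = 30 * 24  # 720 h in a 30-day window

burn_rate = 14.4
# Fraction of the 30-day budget consumed per hour at this burn rate:
consumed_per_hour = burn_rate / window_hours
print(f"{consumed_per_hour:.1%} of the budget per hour")  # 2.0%

# At this pace the entire budget is gone in:
hours_to_exhaustion = window_hours / burn_rate
print(f"budget exhausted in ~{hours_to_exhaustion:.0f} hours")  # ~50 hours
```

That’s why a fast burn pages: left alone, the month’s budget is gone in about two days.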

Ship Smaller, Safer Changes (We’re Not Allergic To Speed)

A lot of reliability pain isn’t “production is fragile,” it’s “our changes are too chunky.” Big releases create big blast radii, which create big incident calls, which create big tired humans. The SRE move is to make changes smaller, boring, and reversible.

Let’s focus on three mechanics: progressive delivery, quick rollback, and feature flags. Progressive delivery means we don’t go from 0% to 100% in one leap; we ramp and watch. Rollback means the previous version is always one button away (and actually works). Feature flags mean we can disable risky behaviour without redeploying.

We can do this without fancy tooling. Even a simple staged rollout in Kubernetes plus a health-based pause is a huge upgrade. If we do have service meshes and canaries, great—but the principle is the same.

Also: define “release success” in advance. If latency increases by 10% but stays under the SLO, is that acceptable? What’s the abort condition? If we decide during the incident, we’ll decide poorly.
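One way to make the abort condition mechanical is to encode it before the rollout starts. A hypothetical sketch (the function name, thresholds, and numbers are all illustrative, not recommendations):

```python
def should_abort(baseline_p95_ms: float, canary_p95_ms: float,
                 max_relative_increase: float = 0.10,
                 absolute_limit_ms: float = 250.0) -> bool:
    """Pre-agreed abort rule: decided before the rollout, not during it.

    Abort if the canary's p95 latency is more than 10% above baseline
    AND over the absolute limit we consider user-visible pain.
    """
    relative_breach = canary_p95_ms > baseline_p95_ms * (1 + max_relative_increase)
    absolute_breach = canary_p95_ms > absolute_limit_ms
    return relative_breach and absolute_breach

# A small bump that stays within 10% of baseline is acceptable...
print(should_abort(200, 218))  # False
# ...but a larger regression that also breaches the absolute limit is not.
print(should_abort(200, 260))  # True
```

The specific rule matters less than the fact that it exists in writing before anyone is tired and arguing in an incident channel.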

A good rule of thumb we’ve used: if a change can’t be rolled back in under 10 minutes, we’re accepting a reliability tax. That doesn’t mean “never,” but it does mean we should acknowledge the cost.

For a pragmatic deployment philosophy, we still like the ideas behind Accelerate—not for the slogans, but for the measurable links between small batch sizes and fewer failures.

Build Runbooks That Start With “Check This First”

Runbooks are where SRE meets reality. They’re not meant to be literature; they’re meant to reduce thinking when thinking is hardest. The best runbooks assume the on-call is tired, new, and slightly annoyed. So we keep them short, command-heavy, and ordered by probability.

A practical template: what the alert means (in plain language), quick checks, likely causes, safe mitigations, escalation path, and “how to verify recovery.” If a runbook needs a diagram to work, we’ve probably overcomplicated it.

We also want runbooks to be close to the service, not buried in a wiki maze. Put the link in the alert. Put the “how to” in version control. And once a month, do a runbook drill where someone who didn’t write it tries to follow it. You’ll find the missing steps immediately—usually at the exact point where the doc says “just restart it.” (Restart what, Karen? There are 12 things.)

Here’s a simple Markdown runbook skeleton we’ve used:

# Alert: SLOFastBurn - payments-api

## Meaning
User requests are failing fast enough to miss the 99.9% SLO if it continues.

## Quick Checks (5 minutes)
1. Dashboard: https://grafana/... (errors, latency, saturation)
2. Recent deploys: `kubectl rollout history deploy/payments-api`
3. Dependency health: `curl -sf https://status.vendor.com`

## Safe Mitigations
- Roll back: `kubectl rollout undo deploy/payments-api`
- Reduce load: enable rate limit flag `PAYMENTS_RATE_LIMIT=true`
- Scale up: `kubectl scale deploy/payments-api --replicas=10`

## Verify Recovery
- Error rate < 0.1% for 10 minutes
- p95 latency back under 250ms

## Escalation
- #payments-oncall
- DBA on-call if database latency > 50ms p95

If you need inspiration for “what good looks like,” the incident write-ups on the Cloudflare blog often show strong timelines and clear mitigations.

Fix The Top Three Repeat Offenders With Error Budgets

We don’t get reliability by fixing everything; we get it by fixing the things that break repeatedly. A concrete SRE move is to track “repeat incidents” as a first-class metric. If the same alert fires three times in a month, it should trigger a reliability task automatically. Not a discussion. A task.
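The “three strikes” rule is easy to automate against pager history. A hypothetical sketch (`repeat_offenders` and the alert names are illustrative; in practice you’d pull firings from your alerting tool’s API):

```python
from collections import Counter

def repeat_offenders(firings, threshold=3):
    """Return alert names that fired at least `threshold` times this period."""
    counts = Counter(firings)
    return [name for name, n in counts.items() if n >= threshold]

pager_history = [
    "SLOFastBurn:payments-api", "DiskFull:etl-worker",
    "SLOFastBurn:payments-api", "CertExpiry:edge",
    "SLOFastBurn:payments-api",
]

# Each repeat offender should open a reliability task automatically.
for alert in repeat_offenders(pager_history):
    print(f"file reliability task for: {alert}")
```

Wire the output into whatever files tickets for you, and the “not a discussion” part takes care of itself.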

We can make this concrete by running a monthly “top three” review: which incident types cost us the most user pain or engineer time? Then we fund fixes proportionally. The fix can be anything: caching, better indexes, circuit breakers, queue backpressure, retries with jitter, or just turning a noisy page into a ticket and improving the dashboard.

Error budgets help keep this fair. If we’re within budget, we can spend more time on feature work. If we’re out of budget, we focus on stability until we’re back in the green. This is the reliability version of “we can’t keep adding floors to the building while the foundation is cracking.” Not dramatic. Just physics.

One move we like: create a “reliability tax” label and track it alongside product work. If leadership sees that 25% of the team’s time is going to toil and incident cleanup, it becomes easier to justify investments that reduce it.

Also, let’s not forget dependencies. Vendor issues, DNS weirdness, and cloud service hiccups will happen. We can’t stop them, but we can reduce blast radius with timeouts, bulkheads, and graceful degradation.
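The timeout-plus-fallback pattern is small enough to sketch. Assuming a dependency reached over HTTP (the URL and fallback payload here are illustrative):

```python
import urllib.request

def fetch_with_fallback(url: str, timeout_s: float = 2.0,
                        fallback: bytes = b"{}") -> bytes:
    """Call a dependency with a hard timeout; degrade instead of hanging.

    The point is the shape: every external call gets a deadline and a
    safe default, so a vendor outage shrinks to a degraded feature
    rather than a full incident.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:  # covers timeouts, DNS failures, refused connections
        return fallback

# A slow or dead vendor now returns the fallback instead of stalling:
# fetch_with_fallback("https://status.vendor.com", timeout_s=1.0)
```

Bulkheads and circuit breakers are the same idea with more state; start with the deadline and the default.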

Reduce Toil Like We Mean It (Automate The Boring Parts)

Toil is the sneaky villain of SRE: repetitive, manual, and not getting better over time. The goal isn’t “automate everything,” it’s “automate the stuff that steals focus and doesn’t teach us anything anymore.” If we’re doing the same operational task weekly, it’s a candidate.

Start by listing toil sources from on-call notes: manual restarts, log spelunking, one-off database cleanups, certificate renewals, access requests, capacity bumps. Then pick one and build a small automation. The best toil automations are unglamorous but immediate: a script that gathers context for an incident, a one-click rollback, automated cache warmups, or a scheduled cleanup job with guardrails.
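A context-gathering script is a good first automation. A hypothetical sketch (the `kubectl` commands are illustrative; swap in whatever your service actually needs):

```python
import subprocess

# Read-only commands we'd want at the top of every incident; the exact
# list is illustrative and would be maintained per service.
CONTEXT_COMMANDS = [
    ["kubectl", "rollout", "history", "deploy/payments-api"],
    ["kubectl", "get", "pods", "-l", "app=payments-api"],
    ["kubectl", "top", "pods", "-l", "app=payments-api"],
]

def gather_context(commands=CONTEXT_COMMANDS) -> str:
    """Run each command and collect output into one paste-ready report."""
    sections = []
    for cmd in commands:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=30).stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            out = f"(failed: {exc})"
        sections.append(f"$ {' '.join(cmd)}\n{out}")
    return "\n\n".join(sections)

# print(gather_context())  # paste the result into the incident channel
```

Thirty seconds saved per page, times every page, times every on-call. It adds up.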

We also like “golden signals” dashboards per service: latency, traffic, errors, saturation. When a page fires, the dashboard should answer “is it real?” in 30 seconds. If it can’t, the dashboard needs love.

Finally, protect the on-call. Rotate fairly, cap after-hours load, and enforce recovery time. There’s no prize for being the most exhausted team with the most “learning opportunities.” We’d rather be boring and well-rested.

For deeper background and shared terminology across teams, Google’s Site Reliability Engineering book remains a solid common reference.
