Practical SRE Habits That Keep Services Boring
Simple routines we can adopt to ship faster and sleep more.
Define “Good” With SLOs (Not Vibes)
If we want reliability without endless debates, we need a shared definition of “good.” That’s what SLOs (Service Level Objectives) give us: an agreed target for user-facing reliability, expressed as a percentage over a time window. Not “the API feels slow today,” but “99.9% of requests under 300ms over 28 days.” The trick is to keep SLOs tied to user journeys—login, checkout, search—not internal metrics that only we love.
We start by picking a few SLIs (Service Level Indicators): latency, availability, correctness, freshness. Then we set SLOs that reflect what users actually notice. If we’re early-stage, we can begin with one or two SLOs per critical service. Too many SLOs become a spreadsheet hobby.
Once we have SLOs, we get something even more valuable: an error budget. If our SLO is 99.9% over 30 days, we’re allowed ~43 minutes of “badness.” That budget is a decision-making tool. Spending it on a risky launch might be worth it. Spending it on repeated flaky deploys… less cute.
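The budget arithmetic is worth keeping handy. A quick sketch in plain Python for turning an SLO target and window into an allowed-downtime budget:

```python
# Convert an SLO target and time window into an error budget.
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed 'badness' in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes

# 99.9% over 30 days is about 43 minutes of budget.
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
# A looser 99% target over 28 days buys far more room.
print(round(error_budget_minutes(0.99, 28), 1))   # 403.2
```

Running the numbers like this before committing to a target is sobering: each extra nine divides the budget by ten.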
We also write down what happens when we blow the budget: freeze feature releases, prioritize reliability work, or run a stability sprint. This is where SRE stops being theory and starts being a team sport.
Helpful references we keep bookmarked:
– Google’s overview of SRE principles: https://sre.google/
– A practical intro to SLOs and error budgets: https://sre.google/workbook/implementing-slos/
Make Error Budgets a Release Gate (Politely)
Error budgets are only useful if they influence behaviour. The gentlest way is to make them part of our release process—like a seatbelt, not a police checkpoint. We don’t need a giant committee. We need a simple rule: if we’re out of budget, we slow down and fix reliability; if we’re within budget, we can ship.
In practice, we implement this as a release gate in CI/CD. When a PR is ready to deploy, the pipeline checks whether the relevant SLO is currently healthy. If we’re burning budget too fast, we require an explicit override (and a human explaining why).
Here’s a sketch of what that can look like with a lightweight “budget check” step. The exact implementation depends on your metrics stack, but the pattern is the point:
# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check error budget
        env:
          SLO_API: ${{ secrets.SLO_API }}
          SERVICE: checkout-api
        run: |
          curl -fsS "$SLO_API/v1/budget?service=$SERVICE" -o budget.json
          remaining=$(jq -r '.remaining_ratio' budget.json)
          burnrate=$(jq -r '.burn_rate_1h' budget.json)
          echo "Remaining budget ratio: $remaining"
          echo "1h burn rate: $burnrate"
          awk "BEGIN {exit !($remaining > 0.1 && $burnrate < 2.0)}" \
            || (echo "Budget too low / burn too high. Blocking deploy." && exit 1)
      - name: Deploy
        run: ./scripts/deploy.sh
This doesn’t have to be perfect. Even a crude gate forces the right conversation at the right time. And if we’re worried about blocking too often, that’s usually a sign we need to invest in safer deploys (we’ll get to those) rather than removing the gate.
If you want a deeper runbook-style approach, the Google SRE Workbook is still one of the most usable references out there.
Treat Alerts Like a Product (Reduce Noise Relentlessly)
Most teams don’t have an alerting problem—they have a noise problem. Alerts should be actionable, urgent, and tied to user impact. If an alert wakes someone up, it should be because users are hurting (or will hurt soon), and the on-call can do something about it.
We get there by making alert rules boring and consistent:
– Symptom > cause: Alert on “checkout error rate high,” not “CPU 85%.”
– Page less, ticket more: Paging is for immediate action. Everything else becomes a ticket or dashboard note.
– Route to owners: If nobody owns it, it’s just ambient anxiety.
One practice that works well: a weekly “alert triage” where we kill or fix the noisiest 5 alerts. Not discuss them. Fix them. Add missing thresholds, adjust time windows, dedupe, or downgrade to non-paging.
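Finding the noisiest five doesn’t need special tooling. A sketch, assuming we can export recent alert firings (name plus timestamp) from Alertmanager or our paging tool into a list; the sample data here is made up:

```python
from collections import Counter

# Hypothetical export: one entry per alert firing over the last week.
firings = [
    {"alert": "High5xxRate", "ts": "2024-05-01T02:10:00Z"},
    {"alert": "DiskAlmostFull", "ts": "2024-05-01T03:00:00Z"},
    {"alert": "High5xxRate", "ts": "2024-05-01T04:12:00Z"},
    {"alert": "NodeCPUHigh", "ts": "2024-05-02T01:00:00Z"},
    {"alert": "NodeCPUHigh", "ts": "2024-05-02T01:30:00Z"},
    {"alert": "NodeCPUHigh", "ts": "2024-05-02T02:00:00Z"},
]

def noisiest(firings, top_n=5):
    """Rank alerts by firing count so triage starts with the worst offenders."""
    counts = Counter(event["alert"] for event in firings)
    return counts.most_common(top_n)

for name, count in noisiest(firings):
    print(f"{name}: {count} firings")
```

Bring the output to the weekly triage and work top-down; the head of the list is usually obvious once it’s counted.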
Here’s an example Prometheus alert rule set that aims for sanity. Notice the use of a short “for:” to avoid flapping, and grouping on service labels:
# prometheus-alerts.yml
groups:
  - name: sre-symptoms
    rules:
      - alert: High5xxRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx rate for {{ $labels.service }}"
          description: "5xx rate > 2% for 10m. Check recent deploys and upstream dependencies."
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.4
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "p95 latency high for {{ $labels.service }}"
          description: "p95 > 400ms for 15m. Investigate DB, cache, and dependency latency."
A good sanity check: if an alert triggers and the runbook says “look at graphs,” we’re not done. The runbook should say what graph, what “bad” looks like, and what lever we can pull.
For more on pragmatic alerting, the Prometheus docs are solid: https://prometheus.io/docs/alerting/latest/overview/
Build Runbooks That Work at 3 a.m.
Runbooks are not documentation theatre. They’re what we reach for when our brains are half awake and our coffee is still negotiating terms. A good SRE runbook is short, specific, and biased toward actions.
We aim for:
– Trigger: what alert fired, what it means.
– Impact: what users see.
– Immediate actions: safe steps to stop the bleeding.
– Diagnosis: where to look next (links to dashboards/log queries).
– Escalation: who to call and when.
– Rollback/mitigation: known good options.
We also keep them close to the code. If the service lives in a repo, the runbook lives there too (README, /runbooks, or docs/ops). When we fix an incident, we update the runbook in the same PR as the code change that prevented it. That’s the difference between “we should update the runbook” and actually updating it.
One small trick we like: include a “first five minutes” section at the top. It’s amazing how calming that is when Slack is on fire. Another: include copy-paste commands (safe ones) so nobody improvises under stress.
If you’re building this practice from scratch, it helps to standardize a template and enforce it during service onboarding. The goal isn’t perfect prose. The goal is: can someone unfamiliar with the service reduce user impact quickly?
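Enforcing the template can be as simple as a CI check. A sketch: the required sections mirror the bullets above, and the heading style is an assumption about how our runbooks are written.

```python
import re

# Mirrors the runbook structure we aim for; tune to your own template.
REQUIRED_SECTIONS = [
    "Trigger", "Impact", "Immediate actions",
    "Diagnosis", "Escalation", "Rollback",
]

def missing_sections(runbook_text: str) -> list:
    """Return required sections that don't appear as headings in the runbook."""
    return [s for s in REQUIRED_SECTIONS
            if not re.search(rf"^#+\s*{re.escape(s)}", runbook_text,
                             re.IGNORECASE | re.MULTILINE)]

sample = """# Checkout API runbook
## Trigger
High5xxRate fired.
## Impact
Users see failed payments.
## Immediate actions
Roll back the last deploy.
"""
print(missing_sections(sample))  # sections still to write
```

Wire this into the repo’s CI and a half-finished runbook fails the build instead of failing the on-call.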
We also link runbooks directly from alerts. If an alert has no runbook link, it’s not done. It’s just yelling.
If we need inspiration, incident write-ups from companies that share openly can help us shape our own style. The Cloudflare blog has some good examples: https://blog.cloudflare.com/tag/postmortem/
Ship Safer With Progressive Delivery
Most reliability pain isn’t caused by servers having a bad day. It’s caused by us, deploying changes with a bit too much confidence and not enough safety rails. SRE doesn’t mean “don’t deploy.” It means “deploy in ways that limit blast radius.”
We like progressive delivery patterns:
– Canary releases: send a small percentage of traffic to the new version and watch key SLIs.
– Blue/green: switch traffic between two environments with an easy rollback.
– Feature flags: decouple code deploy from feature exposure.
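The canary check itself can stay simple: compare the canary’s error rate against the baseline and bail out if it’s meaningfully worse. A sketch, assuming we can query request and error counts per version from our metrics stack; the thresholds are illustrative, not prescriptive:

```python
def canary_healthy(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Pass the canary only if its error rate isn't much worse than baseline."""
    if canary_total < min_requests:
        return True  # not enough traffic yet to judge; keep watching
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    # Allow some slack: tiny absolute rates are noisy.
    return canary_rate <= max(baseline_rate * max_ratio, 0.001)

print(canary_healthy(50, 10_000, 8, 1_000))   # 0.5% vs 0.8% -> promote
print(canary_healthy(50, 10_000, 30, 1_000))  # 0.5% vs 3.0% -> roll back
```

Even this crude comparison, run automatically after each canary step, catches the bad deploys that matter most: the ones that are obviously worse than what’s already running.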
The operational payoff is huge: smaller incidents, faster rollbacks, and fewer “all-hands” moments. The cultural payoff is also real—engineers stop fearing deploys, which means we deploy more often, which means each change is smaller, which means fewer surprises. It’s a virtuous cycle that feels suspiciously like common sense.
We also insist on a rollback path for every change. If rollback is hard, we’re not “being brave,” we’re borrowing stress from future us. Rollback might be redeploying the previous container tag, flipping a flag, or reverting a config change. Whatever it is, we make it rehearsed.
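Flipping a flag is often the fastest rehearsed rollback. A minimal in-process sketch with a default-off fallback; a real setup would read from a flag service, and the flag name here is hypothetical:

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a feature flag from the environment; unset or junk means 'default'.

    Defaulting to off means rollback can be as simple as flipping the
    variable to 0 and restarting -- no redeploy needed.
    """
    raw = os.environ.get(f"FLAG_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}

if flag_enabled("NEW_CHECKOUT_FLOW"):
    print("serving new checkout flow")
else:
    print("serving stable checkout flow")
```

The design choice that matters is the fail-safe default: if the flag source is unreachable or misconfigured, users get the old, known-good path.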
One more boring but effective tactic: deploy during business hours when possible. Yes, global teams exist, but most orgs still have a “most awake” window. Shipping during that window gives us more eyes and quicker fixes. Late-night deploys should be intentional, not habitual.
For feature flags and progressive delivery, LaunchDarkly’s learning resources can be useful even if you don’t use their product: https://launchdarkly.com/blog/
Run Postmortems Without the Blame Olympics
Incidents happen. What matters is whether we learn anything besides new ways to panic. Blameless postmortems aren’t about being soft—they’re about being accurate. If we blame individuals, we miss the system issues that set them up to fail: unclear ownership, risky deploy patterns, missing tests, confusing dashboards, noisy alerts, or insufficient capacity.
Our postmortems focus on:
– Timeline: what happened, with timestamps.
– Customer impact: what users experienced and for how long.
– Contributing factors: technical and organisational.
– Detection: how we noticed (and how we should’ve noticed sooner).
– Mitigations: what stopped the impact.
– Follow-ups: small, owned actions with due dates.
We keep postmortems lightweight and consistent. A two-page write-up that leads to real changes beats a ten-page novel nobody finishes. We also track follow-ups like real work—because they are real work. If we don’t schedule them, we’re just collecting regrets.
One practice we’ve found useful: tag follow-ups as either “reduce likelihood” or “reduce blast radius.” Both matter. It’s not always possible to prevent a class of failure, but we can often make it less damaging.
If you want a canonical reference, Google’s postmortem culture write-up is worth reading: https://sre.google/sre-book/postmortem-culture/
Automate the Toil (So We Can Stay Sane)
Toil is the kind of work that’s repetitive, manual, and doesn’t get better the more we do it. SRE isn’t about heroics; it’s about removing the need for heroics. That means we aggressively automate toil: repetitive deploy steps, manual log gathering, hand-built dashboards, and “SSH and poke around” routines.
We start by tracking toil openly. In retros or weekly ops reviews, we ask: what work did we do that we never want to do again? Then we pick the top candidate and automate it. Not everything needs a big platform rewrite. Sometimes a small script, a CI job, or a Terraform module is enough to retire a recurring annoyance.
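For example, replacing an “SSH and poke around” routine with a one-command health sweep. A sketch, assuming each service exposes an HTTP health endpoint; the registry and URLs below are made up:

```python
import json
from urllib import request, error

def sweep(services: dict, timeout: float = 3.0) -> dict:
    """One command instead of N SSH sessions: poll every health endpoint."""
    results = {}
    for name, url in services.items():
        try:
            with request.urlopen(url, timeout=timeout) as resp:
                results[name] = "ok" if resp.status == 200 else f"http {resp.status}"
        except error.HTTPError as exc:
            results[name] = f"http {exc.code}"
        except OSError as exc:
            results[name] = f"unreachable ({exc})"
    return results

# Hypothetical registry -- swap in your real health endpoints.
SERVICES = {
    "checkout-api": "http://127.0.0.1:9/healthz",
    "search-api": "http://127.0.0.1:9/healthz",
}

print(json.dumps(sweep(SERVICES, timeout=1.0), indent=2))
```

Twenty lines like these, checked into the repo and reviewed, retire a recurring annoyance and give everyone on the team the same view.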
We also invest in self-service. If developers can provision a staging environment, view service dashboards, and roll back safely without begging for access, we reduce handoffs and delays. That’s good for delivery and good for reliability, because fewer manual steps mean fewer mistakes.
A subtle but important habit: automation must be owned. Scripts that “someone wrote once” tend to rot. We treat automation like any other code: reviewed, tested, versioned, and documented.
Finally, we keep an eye on operational load. If on-call is constantly busy, we’re probably underinvesting in reliability work. Error budgets should help justify that investment. If the budget is consistently low, it’s a signal: we need to fix the system, not just get better at suffering.


