Practical SRE Habits We Actually Keep Doing
Small, repeatable moves that reduce pager noise and raise confidence.
Start With Error Budgets, Not Hero Budgets
If we had to pick one SRE habit that pays rent every month, it’s treating reliability as a product feature with a budget—not a vague vibe. Error budgets turn the conversation from “Ops is blocking releases” into “We agreed on the risk we’re willing to take.” The trick is to keep it simple enough that people will use it without a committee and a three-week spreadsheet pilgrimage.
We usually begin with one user-facing SLI (e.g., successful requests) and one supporting SLI (e.g., latency). Then we pick an SLO that matches what the business actually needs. Not “five nines because it sounds expensive,” but “what level of failure do customers notice and care about?” When the service spends its error budget too fast, we slow down changes and invest in reliability work. When we’re comfortably under budget, we ship features with less hand-wringing.
A practical workflow:
– Define SLOs per user journey, not per microservice trivia.
– Review error budget burn weekly with engineering + product.
– Tie high-severity incident follow-ups directly to budget impact.
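The arithmetic behind the budget is small enough to keep in a script, which helps during the weekly review. A minimal sketch (the 30-day window and function names are our illustration, not a standard):

```python
# Error budget arithmetic: how much downtime a given SLO allows,
# and how much of that budget is still unspent. Numbers are illustrative.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total minutes of allowed failure over the window."""
    allowed_failure_fraction = 1 - slo_percent / 100
    return window_days * 24 * 60 * allowed_failure_fraction

def budget_remaining(slo_percent: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - bad_minutes / budget

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of full outage.
print(error_budget_minutes(99.9))    # ≈ 43.2
print(budget_remaining(99.9, 10.0))  # ≈ 0.77: about 77% of budget left
```

This is also the number that settles the weekly “can we ship?” debate: ten bad minutes against a 43-minute budget is a very different conversation than forty.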
Two helpful references we still point folks to:
– Google’s SRE book for the core concepts.
– The SLO documentation in the SRE workbook for more hands-on examples.
The real win: error budgets create a neutral referee. Instead of arguing feelings, we argue math. It’s much harder to “win” an argument against a graph.
Write SLOs People Can Read (And Automate Them)
An SLO that requires a sacred PromQL incantation known only to one engineer is not an SLO; it’s a single point of failure with punctuation. We aim for SLO definitions that are readable, versioned, and reviewable like code. That means storing them in Git, keeping naming consistent, and making the “what” obvious even if the “how” is complex.
If you’re on Prometheus, the Prometheus docs are great, but teams often benefit from a thin layer that standardises SLOs. We’ve used Sloth successfully because it turns intent into Prometheus recording/alert rules without too much ceremony.
Here’s an example Sloth SLO spec that we’d happily review in a pull request:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-api
spec:
  service: checkout
  labels:
    owner: payments
  slos:
    - name: availability
      objective: 99.9
      description: "Successful checkout requests over 30d."
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total{job="checkout"}[5m]))
      alerting:
        name: CheckoutAvailability
        labels:
          severity: page
        annotations:
          summary: "Checkout availability burning error budget"
Why this works for us:
– It says “availability” and “99.9%” in plain terms.
– It’s easy to attach ownership (owner: payments).
– It’s reviewable alongside application changes.
If we can’t explain an SLO in one sentence, it’s usually too complicated—or the system is.
Make Alerts Boring (In a Good Way)
Alerting is where good SRE intentions go to die, usually at 3:17 AM. Our rule: pages must be actionable, urgent, and tied to user impact. Everything else is either a ticket, a dashboard, or a Slack notification that people can mute without guilt.
A few habits that reduced our noise:
– Alert on symptoms, not causes. “High 500 rate” beats “CPU 87%” nine times out of ten.
– Use multi-window, multi-burn alerts for SLOs. Fast burn catches big outages; slow burn catches creeping issues.
– Require runbooks for paging alerts. If there’s no runbook link, it’s not ready to page.
Here’s a Prometheus alert rule pattern we like for SLO burn (simplified for readability):
groups:
  - name: checkout-slo-alerts
    rules:
      - alert: CheckoutErrorBudgetBurnFast
        expr: |
          (1 - (sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="checkout"}[5m]))))
          > 0.01
        for: 10m
        labels:
          severity: page
          service: checkout
        annotations:
          summary: "Fast burn: elevated 5xx rate"
          runbook: "https://internal.example.com/runbooks/checkout-5xx"
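That flat 1% threshold is really a burn rate in disguise: it answers “at this error ratio, how fast does the whole budget disappear?” A sketch of the arithmetic behind a multi-window policy (the 14.4x/6x burn rates follow the commonly cited SRE-workbook pattern; the 99.9% objective is our assumption):

```python
# Translate SLO burn rates into alert thresholds on the error ratio.
# A burn rate of 1 means spending exactly one 30-day budget in 30 days;
# 14.4 means the whole budget would be gone in about two days.

SLO = 0.999             # 99.9% availability objective (assumed)
ERROR_BUDGET = 1 - SLO  # fraction of requests allowed to fail

def alert_threshold(burn_rate: float) -> float:
    """Error-ratio threshold corresponding to a given burn rate."""
    return burn_rate * ERROR_BUDGET

def budget_exhausted_hours(burn_rate: float, window_days: int = 30) -> float:
    """At this burn rate, hours until the entire budget is spent."""
    return window_days * 24 / burn_rate

# Fast burn (page): 14.4x over a short window -> error ratio above 1.44%
print(alert_threshold(14.4))
# Slow burn (ticket): 6x over a long window -> error ratio above 0.6%
print(alert_threshold(6))
print(budget_exhausted_hours(14.4))  # ≈ 50 hours to an empty budget
```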
We also keep a “no shame” policy for improving alerts. If someone says “this page was pointless,” we treat it as a bug report on our monitoring—not a personal failure. The goal isn’t to prove we’re tough. The goal is to sleep and still run reliable systems.
Build Runbooks That Work Under Stress
Runbooks shouldn’t read like a novel. They should read like a checklist written by someone who’s half-awake and mildly annoyed (we’ve all been there). Our best runbooks are short, structured, and ruthlessly practical. They answer: “What’s broken, how do we confirm it, what’s the safest first action, and how do we escalate?”
We standardise on a simple format:
1. Impact statement (what users see)
2. Immediate checks (links to dashboards/log queries)
3. Safe mitigations (restart this, scale that, toggle feature flag)
4. Risky mitigations (with big warnings)
5. Escalation path (names/roles, not just a team name)
6. Aftercare (what to capture for the postmortem)
We’ve learned to include copy-paste commands and exact links. “Check Grafana” is not a step. “Open dashboard X and look at panel Y” is a step.
When people ask why we invest time here, we point out a boring truth: incidents are a throughput problem. The faster we get from “alert fired” to “known failure mode,” the less time we spend guessing and the fewer mistakes we make. This is also where we quietly improve onboarding—newer folks can contribute during incidents without needing telepathy.
If we’re using Kubernetes, we’ll often include a tiny “known good” snippet like:
– kubectl -n payments get deploy,po
– kubectl -n payments describe pod <pod>
– kubectl -n payments logs <pod> --since=10m
Runbooks are how we turn tribal knowledge into team knowledge. Also: they’re the difference between “We fixed it” and “We fixed it again next week.”
Reduce Risk With Progressive Delivery
We like shipping. We also like not detonating production. SRE doesn’t mean shipping slows down; it means we get disciplined about how we ship. Progressive delivery—canaries, feature flags, and gradual rollouts—lets us detect issues while the blast radius is still small enough to manage without dramatic sighing.
Our basic playbook:
– Canary new versions with a small percentage of traffic.
– Watch SLO/SLI dashboards during rollout (not just CPU graphs).
– Automatically abort on clear user-impact signals.
– Use feature flags for risky changes, especially schema or behaviour shifts.
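The “automatically abort” step is ultimately just a comparison between the canary and the stable baseline, plus a margin. A toy version of the gate (thresholds and function names are ours; in practice this is an analysis query against the metrics backend):

```python
# A toy canary gate: compare the canary's error rate against the stable
# baseline and decide whether the rollout may continue. Thresholds are
# illustrative, not recommendations.

def canary_healthy(canary_error_rate: float,
                   baseline_error_rate: float,
                   absolute_cap: float = 0.02,
                   relative_margin: float = 2.0) -> bool:
    """Abort if the canary is bad in absolute terms, or clearly worse
    than the baseline even if both numbers look small."""
    if canary_error_rate > absolute_cap:
        return False
    if canary_error_rate > baseline_error_rate * relative_margin:
        return False
    return True

print(canary_healthy(0.004, 0.003))  # True: within margin of baseline
print(canary_healthy(0.050, 0.003))  # False: over the absolute cap
```

The two-sided check matters: a relative margin alone misses a canary that is “only” 1.5x worse during an incident, and an absolute cap alone misses a canary that quietly doubles a normally tiny error rate.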
If you haven’t tried Argo Rollouts, it’s a solid option for Kubernetes-based canaries, and its documentation is well done.
A minimal Argo Rollout example:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 8
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: our-registry/checkout:1.2.3
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
We’ll pair that with a clear “what do we watch?” checklist: error rate, latency, saturation, and any business metric that screams when things break (checkout conversion is a classic). The payoff is huge: fewer rollbacks, smaller incidents, and a team that’s less afraid of deploying on a Thursday afternoon.
Do Postmortems That Don’t Feel Like Court
If postmortems feel like blame sessions, people stop bringing the truth. Then we get shallow fixes, and the same incident comes back wearing a fake moustache. We keep postmortems blameless and practical. The question isn’t “Who did it?” It’s “How did our system and process allow this to happen?”
Our structure is consistent:
– Timeline with key detection/decision points
– Customer impact in plain language
– Root causes and contributing factors (often multiple)
– What went well / what didn’t
– Action items with owners and due dates
– Follow-up validation (how we’ll know it worked)
We’ve also learned to separate:
– Learning (what we understand now)
– Remediation (what we’ll change)
– Prevention (what we’ll automate/guardrail)
Good action items are specific and testable. “Improve monitoring” becomes “Add alert on 5xx rate > 1% for 10m with runbook link” and “Add synthetic checkout probe every 1 minute.” We also limit the number of action items; drowning in 37 tasks is how we guarantee none get done.
Finally, we track repeat incidents. If the same class of failure happens twice, we treat it like technical debt with interest. That’s the SRE part: we don’t just patch symptoms, we change the system so it’s harder to fail in the same way again.
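Tracking “same class of failure, twice” doesn’t need tooling beyond tagging and counting. A minimal sketch (the incident records and failure-class labels here are invented for illustration; real data would come from the incident tracker):

```python
# Count incidents per failure class and flag repeats. The records below
# are invented examples; real input comes from an incident tracker export.
from collections import Counter

incidents = [
    {"id": "INC-101", "failure_class": "db-connection-pool-exhaustion"},
    {"id": "INC-114", "failure_class": "bad-config-rollout"},
    {"id": "INC-130", "failure_class": "db-connection-pool-exhaustion"},
]

counts = Counter(i["failure_class"] for i in incidents)
repeats = {cls: n for cls, n in counts.items() if n >= 2}

print(repeats)  # {'db-connection-pool-exhaustion': 2}
```

The hard part is consistent failure-class labels, not the counting; we agree on a small taxonomy in the postmortem template so the labels stay comparable.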
Keep Toil on a Diet (And Measure It)
Toil is the stuff we do that’s manual, repetitive, automatable, and not getting better with time. If we’re not careful, toil expands to fill every quiet moment—like a cat that has discovered a warm laptop. SRE teams that don’t manage toil become ticket routers, not reliability engineers.
We handle toil with two habits:
1. Track it. We do lightweight toil logging for a couple of weeks each quarter. Nothing fancy—just categories and time spent.
2. Budget it. If toil exceeds our agreed cap (say 30-40%), we stop and automate or simplify.
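The budget check itself is one division over the logged sample. A sketch, with the categories and a 35% cap as our assumptions:

```python
# Lightweight toil accounting: sum the hours logged per category during
# the quarter's sample weeks and compare the toil share against the cap.
# Categories, hours, and the cap are illustrative.

toil_hours = {
    "manual-provisioning": 6.0,
    "what-changed-investigations": 4.5,
    "one-off-db-fixes": 3.0,
}
total_engineering_hours = 80.0
toil_cap = 0.35  # assumed cap, within the 30-40% range mentioned above

toil_share = sum(toil_hours.values()) / total_engineering_hours
over_cap = toil_share > toil_cap

print(f"toil share: {toil_share:.0%}")
print("over cap" if over_cap else "under cap")
```

The per-category breakdown is what makes the exercise useful: it tells us which automation to build first, not just that we’re busy.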
Common toil sources we’ve eliminated:
– Manual user provisioning (moved to self-serve with approvals)
– Repeated “what changed?” investigations (standardised release notes + deployment annotations)
– One-off database fixes (added guardrails and safer migrations)
We also try to design systems that reduce operational foot-guns:
– Prefer “paved road” templates over bespoke snowflakes.
– Standardise logging fields (trace IDs, request IDs) so debugging is less of a scavenger hunt.
– Automate the boring parts of incident response: opening an incident channel, assigning roles, capturing context.
The humour here is that nobody joins DevOps or SRE because they love copying values between dashboards. We want time for the work that actually moves the needle: designing safer systems, improving deploys, and making outages rarer and shorter. Toil management is how we protect that time.