Practical SRE Habits That Keep Systems Boring


Less drama, fewer pages, more sleep for everyone.

Start With Service Promises, Not Heroics

If we’re doing SRE well, our systems feel… boring. Not because nothing ever breaks, but because we’ve agreed what “good enough” looks like and we’re organised when it isn’t. The quickest way to drain joy from ops is to run on vibes: “it should be fast,” “it should be up,” “customers will yell if it isn’t.” Instead, we write down service level indicators (SLIs) and service level objectives (SLOs) that reflect user experience. A classic starting point is availability and latency for a handful of key endpoints—measured from the edge, not from a server that’s having a great day in its own little world.

We also pick a time window that matches reality: 28 days is popular because it smooths weird weekends without hiding month-long pain. Then we turn SLOs into error budgets. That’s the bit that makes this operationally useful: it’s a shared “budget” for how much unreliability we can tolerate while shipping changes. When the budget is healthy, we move faster; when it’s burning down, we slow down and fix things. Not as punishment—more like putting the kettle back on before the kitchen catches fire.
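To make the 28-day window and “budget” idea concrete, here’s a quick sketch of the arithmetic, assuming a hypothetical 99.9% availability SLO (the function names and numbers are illustrative, not a standard):

```python
# Error-budget arithmetic for an availability SLO.
# Hypothetical example: a 99.9% SLO over a 28-day window.

def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Total minutes of unreliability the SLO tolerates in the window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, bad_minutes: float, window_days: int = 28) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    total = error_budget_minutes(slo, window_days)
    return (total - bad_minutes) / total

budget = error_budget_minutes(0.999)                  # ~40.3 minutes per 28 days
remaining = budget_remaining(0.999, bad_minutes=10)   # ~0.75 of the budget left
```

A 99.9% SLO sounds strict until you see it as roughly 40 minutes of allowed pain per month—that framing is what makes the ship/slow-down conversation tractable.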

If you need a crisp model, Google’s SRE book is still a solid reference: Site Reliability Engineering (free online). For SLO implementation patterns, the SRE Workbook is the practical companion.

The punchline: SRE starts by agreeing what we’re optimising for—then we measure it like we mean it.

Make Alerts Earn Their Keep

Let’s be honest: most alerting starts as a well-intentioned cry for help and turns into a haunted house of beeps. In SRE, we want alerts that are actionable, urgent, and tied to user impact. If an alert doesn’t require a human to do something right now, it’s not a page. It might still be a ticket, a dashboard widget, or a “watch this” notification—but paging is sacred because humans are squishy and need sleep.

A good trick is to route everything through SLOs: page on symptoms that threaten the objective, not on every CPU wiggle. High CPU can be fine; high latency for checkout is not. We also keep alert thresholds stable. An alert that flaps teaches people to ignore it; an alert that’s always on teaches people to mute it.

Here’s a minimal Prometheus-style example in the spirit of multi-window burn-rate alerting. It pages when the fraction of fast checkout requests drops below target over both a short and a long window—the short window catches fast burn, the long window confirms it’s sustained:

groups:
- name: slo-burn
  rules:
  - alert: CheckoutLatencySLOBurn
    expr: |
      # Short window (5m): fraction of checkout requests served under 500 ms.
      (
        sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.5"}[5m]))
        /
        sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
      ) < 0.99
      and
      # Long window (1h): confirms sustained burn, not a momentary blip.
      (
        sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.5"}[1h]))
        /
        sum(rate(http_request_duration_seconds_count{service="checkout"}[1h]))
      ) < 0.995
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Checkout latency SLO burn"
      runbook: "https://runbooks.example.com/checkout-latency"

If you’re newer to Prometheus alerting patterns, the upstream docs are clear and pragmatic: Prometheus Alerting. The rule of thumb we use: fewer pages, better pages, faster fixes.

Ship Reliability Like Code (Because It Is)

We wouldn’t accept “hand-edited production code” as a lifestyle, so we shouldn’t accept hand-tweaked production infrastructure either. In SRE, reliability work is engineering work: versioned, reviewed, tested, and repeatable. That doesn’t mean we never click buttons; it means the button-clicking isn’t the only copy of the truth.

A practical place to start is with runbooks and operational checks in the repo alongside the service. When someone changes a dependency, the runbook updates in the same pull request. When we add a new queue, we add the dashboards and alerts in the same change. This reduces the “tribal knowledge tax,” where the most experienced person becomes the single point of exhaustion.

We also add reliability gates to CI/CD. Not “block the world forever,” but lightweight checks that catch obvious problems: missing health endpoints, broken metrics, misconfigured timeouts, unsafe database migrations. We’ve found it helps to keep these gates boring and predictable; overly clever gates become another flaky system you’ll need to… maintain. (Yes, we see the irony.)

Here’s a simple GitHub Actions example that runs a smoke test and validates Kubernetes manifests before deploy:

name: reliability-checks
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate manifests
        run: |
          kubectl kustomize k8s/ | kubeconform -strict -ignore-missing-schemas
      - name: Run unit + smoke tests
        run: |
          make test
          make smoke

If you want a strong baseline for delivery hygiene, DORA’s research is a useful compass (and refreshingly evidence-based). The goal is simple: make the safe path the easy path, so reliability doesn’t rely on someone having a “good feeling” on a Tuesday.

Treat Incidents Like Product Feedback

Incidents aren’t moral failures; they’re feedback from the system, delivered with all the subtlety of a brick. The SRE move is to handle incidents with consistent process so we learn quickly and reduce repeat pain. That starts with roles (incident commander, communications, operations), clear severity definitions, and pre-written comms templates. When stress is high, decision fatigue is real; checklists help.

Then we run blameless postmortems. “Blameless” doesn’t mean “no accountability.” It means we focus on conditions and decisions that made sense at the time, given what people knew. The output we want is concrete follow-ups: fix the unsafe default, add the missing alert, document the confusing dependency, make rollback fast. If a postmortem ends with “be more careful,” we’ve essentially written a fortune cookie and called it engineering.

We also keep the postmortem short enough that it actually gets written. A two-page write-up with timeline, impact, contributing factors, and action items beats a 20-page epic that never leaves draft. And we make action items visible: in the backlog, with owners, with due dates. Otherwise, we’re just collecting PDFs like they’re Pokémon.

If you need a canonical reference for incident management and organisational learning, the Google SRE Workbook chapters on incidents are a great starting point. For a broader view of learning from failure, this classic is worth reading: How Complex Systems Fail.

In SRE, we don’t aim for “no incidents.” We aim for “incidents that teach us something and don’t repeat.”

Build Observability Around Questions We Actually Ask

Dashboards that look impressive in screenshots are rarely the ones that help at 3 a.m. The dashboards we love are the ones that answer specific questions: “Is it broken for users?”, “What changed?”, “Where’s the bottleneck?”, “Is it getting worse?” That’s why we start with the golden signals (latency, traffic, errors, saturation) and then add service-specific context.

We also treat logs, metrics, and traces as a single story. Metrics tell us something’s wrong, traces tell us where, logs tell us why (usually). If we only have one of the three, we’ll spend a lot of time guessing. And guessing is fun in pub quizzes, not in outages.

A very practical SRE habit: standardise instrumentation. Same label keys, same HTTP metric names, same trace propagation. Otherwise, every service becomes a bespoke mystery novel. If you’re on OpenTelemetry, lean into its conventions and auto-instrumentation where it makes sense. The spec and docs are solid: OpenTelemetry.

We also keep cardinality under control. If we label metrics with user IDs, we’ll have a fun time explaining the bill. Better to aggregate at meaningful boundaries (endpoint, status code class, region, tenant tier) and use exemplars/traces for deep dives.
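A small sketch of what bounded-cardinality labelling can look like; the label names and helper functions here are illustrative, not a standard:

```python
# Collapse high-cardinality values into bounded buckets before they
# become metric labels. Illustrative sketch: label names are hypothetical.

def status_class(code: int) -> str:
    """Map an HTTP status code to its class: 200 -> '2xx', 503 -> '5xx'."""
    return f"{code // 100}xx"

def metric_labels(endpoint: str, code: int, region: str) -> dict:
    """Labels we keep: bounded sets only—never user IDs or request IDs."""
    return {
        "endpoint": endpoint,
        "status_class": status_class(code),
        "region": region,
    }

labels = metric_labels("/checkout", 503, "eu-west-1")
# {'endpoint': '/checkout', 'status_class': '5xx', 'region': 'eu-west-1'}
```

The point is that every label value comes from a small, known set, so the metric series count stays predictable as traffic grows.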

Finally, we practice “debuggability drills.” Once a sprint, we pick a realistic failure mode—slow DB queries, dropped messages, throttling—and see if the current telemetry gets us to root cause quickly. If not, we improve it. This turns observability from a one-time project into a habit, which is very on-brand for sre.

Engineer Capacity and Resilience (Before You Need It)

Performance issues are reliability issues wearing a different hat. If the system falls over during a marketing event, users don’t care that our error rate SLO was technically about “availability.” They just know the button didn’t work. So we plan capacity intentionally: set targets, load test critical paths, and understand where the cliffs are.

We like to keep a simple capacity model per service: expected peak RPS, CPU/memory per request, database IOPS expectations, queue depth behaviour, and third-party rate limits. Then we watch leading indicators: saturation, thread pool utilisation, connection pool exhaustion, and any retry storms. Retries deserve special mention: they’re great until they’re catastrophic. The SRE trick is bounded retries with jitter, plus circuit breakers and timeouts that reflect reality.
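A minimal sketch of bounded retries with full jitter; the attempt count and backoff constants are illustrative values, not recommendations:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry fn() on exception with capped exponential backoff and full
    jitter; re-raise once the retry budget is spent (bounded, not forever)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # bounded: the last failure propagates to the caller
            # Full jitter: sleep a random amount up to the capped backoff,
            # so synchronized clients don't retry in lockstep (retry storms).
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters as much as the bound: without it, a fleet of clients that failed together retries together, which is how a blip becomes a storm.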

Resilience is similar: we decide what failures we tolerate and design for them. Multi-AZ is table stakes for many workloads, but it doesn’t magically fix a bad deployment or a poisoned cache. We need safe rollouts (canary, blue/green), feature flags for rapid mitigation, and backups that are tested—not just “configured.” A backup you’ve never restored is a comforting story, not a plan.

If you want a strong mental model, this reference is still gold: The Twelve-Factor App. Not because it’s trendy, but because it nudges us toward statelessness, config discipline, and clean separation—things that make reliability easier.

In SRE, we don’t buy resilience with one big redesign. We earn it with small, consistent choices that keep failure boring.

Make Reliability a Shared Deal, Not a Separate Team

The final habit is cultural, but it shows up in tickets and calendars. SRE fails when reliability becomes “that team’s job.” It succeeds when product, dev, and ops share the same scoreboard and the same trade-offs. Error budgets help because they create a neutral mechanism for decision-making. When the budget is fine, we ship. When it’s not, we pay down reliability debt. No drama, no blame—just prioritisation with numbers.
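As a sketch, that neutral mechanism can be as small as one function; these thresholds and postures are hypothetical policy choices, not a standard:

```python
# Map remaining error-budget fraction to a release posture.
# Illustrative policy: the 0.5 / 0.0 thresholds are hypothetical.

def release_policy(budget_remaining: float) -> str:
    """Decide the release posture from the unspent budget fraction."""
    if budget_remaining > 0.5:
        return "ship normally"
    if budget_remaining > 0.0:
        return "ship with extra review; prioritise reliability work"
    return "freeze features; spend the time on reliability"
```

What matters isn’t the exact thresholds—it’s that the rule is written down in advance, so nobody argues about it mid-incident.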

We also keep a visible reliability backlog. Not a vague “tech debt” bucket, but concrete items tied to incidents, near-misses, or SLO gaps: “add timeout to payment client,” “index hot query,” “reduce deploy time,” “remove flaky dependency,” “add synthetic checks.” Then we budget time for it. If reliability work only happens “when there’s time,” it won’t happen—because there’s never time, only choices.

We’ve had good results with a simple operating rhythm:
– Weekly: SLO review + top alert review (delete at least one noisy alert)
– Per-incident: postmortem within 5 business days
– Per-sprint: one reliability improvement per critical service
– Quarterly: game day / failure drill

None of this requires a massive org chart reshuffle. It requires consistency, a bit of humility, and the willingness to delete things that don’t help. (Especially alerts. We can’t stress that enough.)
