SRE That Actually Works In Production

Practical reliability habits for teams who carry the pager

Why SRE Matters Once Real Users Show Up

We all love a clean architecture diagram. It’s neat, symmetrical, and somehow every arrow points in the right direction. Production, sadly, has other hobbies. Once real users arrive, they click the wrong button, retry too fast, upload odd files, and show up all at once after a product launch. That’s where SRE stops being a fancy acronym and becomes a survival skill.

At its core, site reliability engineering is about applying engineering discipline to operations work. Instead of treating outages as random bad luck, we build systems, tooling, and habits that reduce the chance of failure and improve recovery when things go sideways. Google’s original SRE book made this idea mainstream, but the value isn’t limited to hyperscale companies with endless headcount and suspiciously tidy dashboards.

For most teams, SRE gives us a way to answer practical questions. What level of reliability do users actually need? Which alerts matter at 2 a.m.? What should be automated before we hire another person to stare at graphs? It helps us draw a line between “important problem” and “background noise pretending to be urgent.”

The big shift is this: we stop chasing perfection and start managing risk. A service does not need 100% uptime to be useful. It needs predictable behaviour, fast recovery, and clear priorities. That mindset changes how we prepare for incidents, release software, monitor systems, and talk to stakeholders who think “just make it stable” is a technical strategy. It isn’t. We’ve checked.

Start With Service Level Objectives, Not Vibes

If we want SRE to work, we need a clear definition of “good enough.” Otherwise every slowdown becomes a crisis, every outage becomes a political argument, and every team invents reliability from scratch. This is why service level objectives, or SLOs, matter. They turn reliability into something measurable.

An SLO defines the target performance of a service over time. That usually means availability, latency, correctness, or some combination. The key is choosing metrics that reflect user experience, not infrastructure vanity. CPU at 40% may look comforting, but users do not care. They care whether the page loads and whether the API responds before they give up and make tea.

A healthy setup often starts with SLIs, the indicators behind the objective. For an API, we might measure successful requests under a latency threshold. Then we set an SLO such as 99.9% of requests completing successfully within 300 milliseconds over 30 days. That’s concrete. We can discuss trade-offs from there.
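To make that concrete, here is a minimal Python sketch of the SLI and SLO described above. The `Request` record and both function names are assumptions made up for illustration; in practice the numbers would come from a metrics system rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request, in milliseconds


def sli_good_ratio(requests: list, threshold_ms: float = 300.0) -> float:
    """Fraction of requests that succeeded (non-5xx) within the latency threshold."""
    if not requests:
        # No traffic: we treat this as "nothing went wrong" (a policy choice).
        return 1.0
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= target


# Hypothetical sample: two good requests, one too slow, one server error.
sample = [Request(200, 120.0), Request(200, 450.0), Request(503, 80.0), Request(204, 299.0)]
sli = sli_good_ratio(sample)
print(f"SLI={sli:.3f} meets_slo={meets_slo(sli)}")
```

The point of the sketch is that a slow success counts against the indicator just like an error does, because the user experiences both as failure.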

Error budgets are where this gets useful. The budget is the unreliability the SLO permits: a 99.9% target leaves 0.1% of requests free to fail. Every failure spends some of it. If the budget is gone, we slow releases, fix risk, and stop pretending reliability can be wished into existence. This approach helps teams balance feature delivery and operational safety without endless debate. Google explains the model well in the SRE Workbook, and the Nobl9 guide is also a decent practical reference.
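The arithmetic behind an error budget is short enough to sketch in Python. The function names and the traffic figures below are invented for illustration; only the 99.9% target and the 30-day window come from the example above.

```python
def _check_target(slo_target: float) -> None:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO tolerates over the window."""
    _check_target(slo_target)
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    _check_target(slo_target)
    budget = (1.0 - slo_target) * total_requests  # requests allowed to fail
    if budget <= 0:
        raise ValueError("no traffic in the window, so there is no budget to measure")
    return 1.0 - failed_requests / budget


# Hypothetical month: 10 million requests, 4,000 of them failed.
print(f"{allowed_downtime_minutes(0.999):.1f} minutes of downtime allowed")
print(f"{budget_remaining(0.999, 10_000_000, 4_000):.0%} of the budget left")
```

A 99.9% target over 30 days works out to roughly 43 minutes of full downtime, which tends to focus a planning conversation quickly.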

The nice part is cultural as much as technical. SLOs give product, engineering, and operations a shared language. Instead of “the system feels flaky,” we can say what failed, by how much, and what it cost.

Build Alerting That Wakes Us Up For Real Reasons

Bad alerting is one of the quickest ways to make SRE feel like theatre. If every threshold breach creates a page, the team stops trusting alerts. If nothing pages until customers are already furious, we learn about downtime from social media, which is never a proud moment.

Good alerting starts with one uncomfortable truth: not everything broken is urgent. Some issues need a ticket, some need a Slack message, and only a small set should wake someone up. Paging should be reserved for symptoms that indicate user harm right now or very soon. That means alerts should be tied to service health, not random component behaviour.
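One way to picture that triage is as a small routing function. This is only a sketch: the `Route` names and the one-hour and 24-hour cut-offs are assumptions chosen for illustration, and each team should pick its own.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    PAGE = "page"      # wake someone up
    CHAT = "chat"      # post to the team channel
    TICKET = "ticket"  # handle during working hours


def route_alert(user_harm_now: bool, hours_to_user_harm: Optional[float] = None) -> Route:
    """Pick a destination based on how soon users are (or will be) affected."""
    if user_harm_now:
        return Route.PAGE
    if hours_to_user_harm is None:
        return Route.TICKET   # broken, but no user impact expected
    if hours_to_user_harm <= 1.0:
        return Route.PAGE     # harm is "very soon" (assumed cut-off)
    if hours_to_user_harm <= 24.0:
        return Route.CHAT     # needs attention today, not tonight
    return Route.TICKET


print(route_alert(True))        # users are failing right now
print(route_alert(False, 8.0))  # e.g. a disk that fills up this afternoon
print(route_alert(False))       # a degraded replica with no user impact
```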

A common mistake is threshold overload. Disk at 80%, CPU at 75%, queue depth above arbitrary number 14. These can be useful signals, but not all of them belong in a pager rotation. We prefer multi-window, multi-burn-rate alerts based on error budget consumption. They catch serious degradation quickly without spamming the team over tiny blips. The Prometheus documentation on alerting practices is worth reading if we’re building this ourselves.

Here’s a simple example for an availability burn-rate alert:

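groups:
  - name: api-slo-alerts
    rules:
      - alert: HighErrorBudgetBurn
        # 0.001 is the error budget of a 99.9% SLO; 14.4 is the burn-rate factor.
        # Both the long (1h) and short (5m) windows must be over the threshold,
        # so a brief blip does not page and a recovered service stops paging.
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{job="api",status!~"5.."}[1h])) /
              sum(rate(http_requests_total{job="api"}[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_requests_total{job="api",status!~"5.."}[5m])) /
              sum(rate(http_requests_total{job="api"}[5m]))
            )
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API is burning error budget too fast"
          description: "User-facing failures exceed the 14.4x burn threshold over both the 1h and 5m windows."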

That expression isn’t exactly bedtime poetry, but it aligns alerts to risk. Add clear runbooks, ownership, and noise reviews, and suddenly on-call becomes manageable instead of mildly haunted.
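The 14.4 factor in that rule can look arbitrary, so here is a small Python sketch of the arithmetic behind it, assuming a 99.9% SLO measured over 30 days. The function names are made up for illustration, and the two-window check mirrors the multi-window idea mentioned earlier rather than any particular tool's behaviour.

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1.0 - slo_target)


def budget_spent(rate: float, window_hours: float, slo_window_hours: float = 30 * 24) -> float:
    """Fraction of the whole budget consumed if `rate` holds for `window_hours`."""
    return rate * window_hours / slo_window_hours


def should_page(long_error_ratio: float, short_error_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold."""
    return (burn_rate(long_error_ratio, slo_target) > threshold
            and burn_rate(short_error_ratio, slo_target) > threshold)


# One hour at a 14.4x burn uses about 2% of a 30-day budget.
print(f"{budget_spent(14.4, 1):.1%} of the budget spent in one hour")
print(should_page(long_error_ratio=0.02, short_error_ratio=0.03))   # sustained and ongoing
print(should_page(long_error_ratio=0.02, short_error_ratio=0.001))  # already recovering
```

Requiring both windows is the design choice that matters: the long window shows that the problem is significant, and the short window shows that it is still happening.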

Incident Response Is A System, Not A Hero Story

Every team says they care about incident response. Fewer teams treat it as a repeatable practice. Without structure, incidents become chaotic group chats where twelve people guess loudly while one brave soul pokes production and hopes for the best. We can do better, and SRE gives us the scaffolding.

A solid incident response model starts with roles. We usually want an incident commander, a communications lead, and one or more responders. The commander coordinates; they do not debug everything personally. That distinction matters because decision-making falls apart when the same person is both steering and deep in logs. During stressful moments, a bit of role clarity saves us from turning a bad outage into an interpretive dance.

Communication should be boring and predictable. Use a dedicated incident channel, timestamp major actions, capture hypotheses, and note customer impact clearly. If status pages are part of the process, update them early and honestly. Atlassian’s incident guide offers a decent operational template, and the PagerDuty incident response overview covers the basics without too much chest-beating.
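A timeline does not need special tooling to be useful. As a rough sketch, assuming nothing more than the Python standard library, an incident log could be as small as the class below. The `IncidentLog` name, its fields, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    title: str
    customer_impact: str
    entries: list = field(default_factory=list)

    def record(self, kind: str, text: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry; `kind` might be 'action' or 'hypothesis'."""
        self.entries.append((at or datetime.now(timezone.utc), kind, text))

    def render(self) -> str:
        """Return the timeline in chronological order, one entry per line."""
        lines = [f"{self.title} | impact: {self.customer_impact}"]
        for at, kind, text in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{at:%H:%M:%SZ} [{kind}] {text}")
        return "\n".join(lines)


log = IncidentLog("Checkout errors", "some customers cannot pay")
log.record("hypothesis", "connection pool exhausted",
           at=datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc))
log.record("action", "rolled back the latest deploy",
           at=datetime(2024, 1, 1, 14, 2, tzinfo=timezone.utc))
print(log.render())
```

Entries are sorted when rendered, so notes added late during a busy incident still land in the right place for the postmortem.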

After the incident, the real work begins. A blameless postmortem is not about avoiding accountability; it’s about finding system weaknesses that allowed the failure. We ask what happened, what signals were missed, what assumptions failed, and what changes will reduce recurrence. If the outcome is “engineer should be more careful,” we haven’t learned enough.

Incidents are expensive teachers. We may as well take notes while they’re here.

Automation Should Remove Toil, Not Add Mystery

One of the more useful ideas in SRE is toil reduction. Toil is repetitive, manual, automatable work that scales linearly with service growth. If we’re doing the same operational task every week, by hand, with no lasting improvement, that’s a candidate for automation. The trick is automating in a way that reduces effort instead of building a fragile machine nobody understands.

We’ve all seen “automation” that creates a bash script graveyard and a deeply held fear of touching anything in /ops/legacy/final_v2_reallyfinal. Proper automation should be versioned, observable, documented, and safe to run repeatedly. It should also have clear ownership. If a script can reboot half the platform, we should know who maintains it and how it fails.
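"Safe to run repeatedly" usually means idempotent: the second run should find nothing left to do. Here is a minimal Python sketch of that property, using a made-up `ensure_line` helper and a throwaway file rather than any real configuration.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str, dry_run: bool = False) -> bool:
    """Make sure `line` is in the file. Return True if a change was (or would be) made."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # already in the desired state, so do nothing
    if not dry_run:
        separator = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + separator + line + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.conf"  # hypothetical file, for illustration only
    setting = "max_connections = 200"
    print(ensure_line(target, setting, dry_run=True))  # reports a change, writes nothing
    print(ensure_line(target, setting))                # first real run makes the change
    print(ensure_line(target, setting))                # second run is a no-op
```

The dry-run flag and the boolean return value are what make a script like this observable: we can see what it would do before it does it, and whether it did anything afterwards.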

A great place to start is common response tasks: scaling a service, rotating credentials, validating backups, restarting failed jobs, or creating standard incident artefacts. Infrastructure as code helps here because it turns manual changes into reviewable changes. Terraform is a familiar option, and Ansible remains handy for procedural tasks.

A tiny example of reducing toil with a Kubernetes CronJob might look like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verification
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
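            - name: verify
              # Placeholder image: stock alpine does not ship verify-backup.sh,
              # so use an image that contains the script and the backup tooling.
              image: alpine:3.20
              command: ["/bin/sh", "-c"]
              args:
                - "echo verifying backup && ./verify-backup.sh"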
          restartPolicy: OnFailure

Simple, visible, repeatable. That’s the goal. We don’t need clever automation. We need trustworthy automation that gives the team time back and lowers the odds of human error during busy periods.

Capacity Planning Beats Last-Minute Panic

Reliability problems are not always caused by bugs. Sometimes the service works exactly as designed, right up until traffic doubles and the database starts making sad noises. Capacity planning is one of the least glamorous parts of SRE, which is probably why it gets ignored until dashboards resemble a medical drama.

Good capacity planning means understanding demand, resource limits, and growth patterns before the cliff edge. We want to know which parts of the stack saturate first, what normal peak looks like, and how much headroom is actually needed. This does not require mystical powers. It requires trends, load tests, and a willingness to look beyond today’s average traffic.
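As a rough illustration of "trends, not mystical powers", the Python sketch below fits a straight line to weekly peak usage and estimates when it would cross a capacity limit. The function names and the sample numbers are invented, and a linear trend is an assumption that real traffic often violates, so treat the result as a prompt for discussion rather than a forecast.

```python
from typing import Optional


def headroom(peak_usage: float, capacity: float) -> float:
    """Fraction of capacity still unused at peak."""
    return 1.0 - peak_usage / capacity


def weeks_until_saturation(weekly_peaks: list, capacity: float) -> Optional[float]:
    """Weeks from the latest sample until a least-squares trend line hits capacity.

    Returns None when usage is flat or falling.
    """
    n = len(weekly_peaks)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_peaks) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_peaks))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    week_at_capacity = (capacity - intercept) / slope
    return max(0.0, week_at_capacity - (n - 1))


# Hypothetical database connections at weekly peak, against a limit of 200.
peaks = [100.0, 110.0, 120.0, 130.0]
print(f"headroom now: {headroom(peaks[-1], 200.0):.0%}")
print(f"weeks until saturation: {weeks_until_saturation(peaks, 200.0):.1f}")
```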

We usually begin with a few practical questions. What are the critical user journeys? Which dependencies are hardest to scale? Are there fixed quotas hiding in managed services? What happens during batch jobs, regional failover, or a successful marketing campaign that nobody mentioned to engineering? Capacity planning is where these awkward questions become cheaper than outages.

This is also where performance testing earns its keep. Synthetic load is imperfect, but it helps reveal bottlenecks before customers do. Tools like k6 and Grafana make it easier to connect load, latency, and saturation into one picture. We’re not looking for an imaginary maximum; we’re looking for safe operating ranges and warning signs.

A useful habit is to review capacity as part of release planning, not after incidents. New features change traffic shape, storage patterns, and background processing loads. If we treat scale as an afterthought, production will eventually send us a stern reminder. Production enjoys those.

SRE Works Best As A Team Habit

The biggest mistake we see with SRE is treating it like a job title instead of a way of working. Hiring one “SRE person” and pointing them at every outage, dashboard, and YAML file is not a reliability strategy. It is a fast route to burnout with a side of organisational confusion.

SRE works best when reliability is shared, even if specialist roles exist. Product teams should understand their SLOs. Developers should help design observability. Operations-minded engineers should influence architecture before release day. Leadership should accept that reliability has a cost and that sometimes the right answer is slowing down to pay operational debt. None of this is glamorous, but it is effective.

This shared model also improves handoffs. If the team that builds the service understands how it behaves under failure, on-call quality improves. If the people carrying the pager can influence code, runbooks, and deployment safety, they can fix recurring issues rather than repeatedly absorbing them. That is a much healthier loop than “throw over wall, then apologise during incident review.”

It also helps to be honest about maturity. Not every team needs elaborate platform engineering, custom error budget math, or five layers of policy. Early on, a few strong practices go a long way: useful metrics, actionable alerts, documented runbooks, blameless postmortems, and automation for repetitive work. Start there. Add complexity when it solves a real problem.

In the end, SRE is less about prestige and more about discipline. We build systems that fail more gracefully, teams that respond more calmly, and processes that produce fewer nasty surprises. That’s not flashy, but it keeps production from becoming a full-contact sport.