SRE That Actually Works In Production
Practical habits, sane metrics, and fewer 3 a.m. surprises
Why SRE Exists Beyond Fancy Job Titles
If we strip away the conference talks and heroic incident stories, SRE exists for a very boring reason: production systems are messy, and somebody has to make them reliably useful. That’s the whole game. We can build fast, deploy often, and automate everything in sight, but if users can’t log in, pay, search, or save their work, none of that counts for much.
We tend to think of SRE as the bridge between software engineering and operations, but that description is a bit too neat. In practice, SRE is about making trade-offs visible. How much risk can we tolerate? What level of service do users actually need? Where should we spend engineering effort: shipping new features or improving reliability? Those aren’t abstract questions; they shape roadmaps, staffing, and pager fatigue.
Google’s original write-up on Site Reliability Engineering gave the industry a vocabulary for this work, and that’s still useful. But the real value comes when we stop treating SRE like a badge and start treating it like a discipline. We’re not here to worship uptime charts. We’re here to create systems that fail gracefully, recover quickly, and don’t require folklore to operate.
A healthy SRE practice also gives teams permission to say “not yet” when reliability is already stretched. That’s not obstruction. That’s basic adult supervision for production. If a service is one deploy away from chaos, adding more traffic and features won’t somehow make it sturdier. It’ll just make the postmortem longer.
Start With Service Level Objectives, Not Vibes
A lot of teams say they care about reliability, but then measure it with whatever happened to be easy to graph. That’s how we end up with dashboards full of CPU percentages and almost no clarity about user experience. SRE starts with a more disciplined question: what does “good enough” service look like from the user’s point of view?
That’s where Service Level Indicators and Service Level Objectives come in. An SLI is the thing we measure, like request success rate or latency for a core transaction. An SLO is the target we agree to meet for that SLI over a defined window. If 99.9% of checkout requests should complete successfully over a 30-day window, that’s a meaningful reliability promise. It’s concrete, testable, and tied to what users care about.
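To make the SLI/SLO distinction concrete, here is a minimal sketch in Python. The `Slo` class, the function names, and the request counts are all hypothetical; in practice the counts would come from whatever metrics system we already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one SLI over a fixed window (illustrative structure)."""
    name: str
    target: float      # 0.999 means 99.9% of requests should succeed
    window_days: int


def success_rate(good: int, total: int) -> float:
    """The SLI: fraction of requests that succeeded in the window."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return good / total


def meets_slo(slo: Slo, good: int, total: int) -> bool:
    """Compare the measured SLI against the agreed target."""
    return success_rate(good, total) >= slo.target


checkout_slo = Slo(name="checkout-success", target=0.999, window_days=30)

# Hypothetical request counts for one 30-day window.
print(meets_slo(checkout_slo, good=999_200, total=1_000_000))  # True
print(meets_slo(checkout_slo, good=998_500, total=1_000_000))  # False
```

The SLI is just a ratio; the SLO is the line we draw through it. Keeping the two separate makes it easier to change the target later without touching the measurement.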
The key is resisting the urge to measure everything. Good SLOs focus on a few user-critical journeys, not every internal moving part. The Google SRE workbook is still one of the better practical guides here, especially when teams are trying to avoid inventing ten “important” metrics that nobody will use. We also like the framing from Nobl9’s SLO guide, which keeps the conversation grounded in outcomes rather than dashboard decoration.
One more thing: SLOs should create decisions. If your target is routinely missed and nothing changes, it’s just a sad number in a panel. If your target is always met with huge margin, maybe it’s too loose. The point of an SLO isn’t to look professional in a quarterly review. It’s to force useful trade-offs before production forces them for us.
Error Budgets Keep Everyone Honest
Once we’ve defined an SLO, the next useful idea is the error budget. This is where SRE stops being theory and starts becoming a governance tool. If our availability target is 99.9%, then we’re effectively saying 0.1% of requests can fail or degrade within the measurement window. That margin is the budget. Spend it wisely.
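The arithmetic behind the budget is small enough to sketch. The function names and the request volume below are assumptions for illustration; the 99.9% target and 30-day window are the ones used earlier in this piece.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window without breaching the SLO."""
    return round((1 - slo_target) * total_requests)


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """The same budget expressed as minutes of total outage in the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; a negative value means overspent."""
    allowed = (1 - slo_target) * total_requests
    if allowed <= 0:
        raise ValueError("no error budget: check the target and request count")
    return 1 - failed_requests / allowed


# Hypothetical month: one million checkout requests, 400 of them failed.
print(error_budget_requests(0.999, 1_000_000))            # 1000
print(f"{error_budget_minutes(0.999, 30):.1f}")           # 43.2
print(f"{budget_remaining(0.999, 1_000_000, 400):.0%}")   # 60%
```

Seeing 99.9% as "about 43 minutes a month" tends to make the trade-off discussion more honest than the percentage alone.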
What’s nice about error budgets is that they replace emotional arguments with a shared rule. Product teams want to ship. Operations folks want stability. Both instincts are reasonable, and neither side needs to become the villain. The budget tells us whether we can move fast, need to slow down, or should freeze risky changes until reliability recovers. It’s a traffic light, not a moral judgement.
This works best when the policy is simple. If a service burns through its monthly budget in a week, we pause non-essential launches and focus on fixes. If burn rate is healthy, we keep shipping. No dramatic speeches required. The SRE workbook’s chapter on alerting and error budgets is useful for turning that into a real operating model rather than a poster on the wall.
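A policy like that can be reduced to a few lines. This is a sketch under stated assumptions: the three outcomes mirror the traffic-light idea above, but the exact thresholds and the names `burn_rate` and `release_policy` are ours, not a standard.

```python
def burn_rate(budget_consumed: float, window_elapsed: float) -> float:
    """Budget spent relative to time elapsed, both as fractions of the window.

    1.0 means we are on course to spend exactly the whole budget.
    """
    if window_elapsed <= 0:
        raise ValueError("window_elapsed must be positive")
    return budget_consumed / window_elapsed


def release_policy(budget_consumed: float, window_elapsed: float) -> str:
    """Return 'freeze', 'slow', or 'ship' (thresholds are illustrative)."""
    if budget_consumed >= 1.0:
        return "freeze"  # budget gone: pause non-essential launches, focus on fixes
    if burn_rate(budget_consumed, window_elapsed) > 1.0:
        return "slow"    # on course to run out before the window ends
    return "ship"        # healthy burn: keep shipping


# The whole monthly budget burned in the first week of a 30-day window.
print(release_policy(budget_consumed=1.0, window_elapsed=7 / 30))   # freeze
# Half the budget gone a quarter of the way through.
print(release_policy(budget_consumed=0.5, window_elapsed=0.25))     # slow
# A fifth of the budget gone at the halfway point.
print(release_policy(budget_consumed=0.2, window_elapsed=0.5))      # ship
```

The value is less in the code than in agreeing on the thresholds before anyone is under pressure to argue about them.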
There’s also a cultural benefit. Error budgets make reliability a team sport. Developers don’t get to toss code over the fence and hope observability will sort it out. Platform teams don’t get to block every release because something might go wrong. Everyone can see the same budget, the same trends, and the same consequences. That transparency matters.
And yes, sometimes the budget conversation reveals that leadership wants five nines on a two-nines budget. That’s always a fun meeting. Better to discover that mismatch in a spreadsheet than during a customer incident.
Monitoring Should Answer Questions, Not Just Collect Data
We’ve all seen “monitoring” setups that are really just large-scale metric hoarding. Thousands of time series, dozens of dashboards, and still no fast answer to the only questions that matter during an incident: what’s broken, who’s affected, and what changed? SRE asks more of observability than data collection for its own sake.
Useful monitoring starts with user-facing signals. Latency, traffic, errors, and saturation, the Four Golden Signals, are still a solid base alongside availability. They remain popular because they’re practical, not because they sound grand. From there, we add logs, traces, and domain-specific events where they genuinely help us reduce detection and diagnosis time.
Alerting needs the same discipline. We should alert on symptoms users feel, not every internal twitch. Paging someone because CPU hit 82% for two minutes is not reliability engineering; it’s workplace vandalism. Better alerts are tied to SLO burn rates, sustained error spikes, or severe latency regressions on critical paths. Everything else can route to a ticket, a chat channel, or a daylight-hours review.
Here’s a simple Prometheus-style alert example that uses a sustained 5xx ratio as a rough stand-in for burn rate:
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # Ratio of 5xx responses to all checkout requests over the last 5 minutes.
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="checkout"}[5m]))
          ) > 0.02
        # Fire only if the condition holds continuously for 10 minutes.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout service is burning error budget too quickly"
          description: "5xx rate exceeded 2% for 10 minutes."
That’s not magical, but it’s actionable. It says something meaningful about customer impact. For a broader observability foundation, Prometheus and OpenTelemetry are worth a look, especially if we want portable instrumentation without locking ourselves into one vendor’s idea of truth.
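To relate the 2% threshold in the rule above to an error budget, we can convert an error ratio into a burn rate. The sketch below assumes the 99.9% target and 30-day window from the earlier checkout example; the function names are hypothetical, and the 14.4 paging threshold is an illustrative choice (it corresponds to spending 2% of a 30-day budget in one hour). Requiring both a long and a short window to agree is one common way to stop paging once a problem has already cleared.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def days_until_exhausted(rate: float, window_days: int) -> float:
    """Days a full budget would last if this burn rate were sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows are burning faster than the threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


rate = burn_rate(0.02, 0.999)
print(f"{rate:.1f}")                            # 20.0
print(f"{days_until_exhausted(rate, 30):.1f}")  # 1.5

# Long window still bad, short window still bad: page.
print(should_page(0.02, 0.03, 0.999))           # True
# Long window still bad, short window recovered: do not page.
print(should_page(0.02, 0.001, 0.999))          # False
```

Under those assumptions, a sustained 2% error ratio would use up a month of budget in about a day and a half, which is why it earns a page rather than a ticket.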
Incidents Are Inevitable, Chaos Is Optional
No mature SRE team believes incidents can be eliminated. Complex systems fail. Dependencies wobble. Human beings misread diffs. DNS decides to become the main character. The goal isn’t perfection; it’s controlled failure, quick recovery, and learning that sticks.
That means having clear incident response mechanics before we need them. Who’s the incident commander? Who handles communications? Where do we track timeline and actions? Which changes get frozen, and who can override that? During an incident, ambiguity is expensive. People duplicate effort, miss handoffs, or start “helping” in ways that generate fresh surprises. We’ve all seen the message thread with twenty people and one useful update.
A simple runbook culture helps more than giant binders nobody reads. For high-risk services, we want documented first steps: how to confirm impact, roll back safely, shift traffic, disable non-critical features, or fail over to a secondary dependency. The PagerDuty incident response guide is a decent reference for teams building basic operating discipline without turning every outage into theatre.
We also need to normalise communication. Internal updates should be regular, short, and specific. External updates should be honest and calm. Users can handle “we’re investigating elevated checkout failures” much better than silence. Tools like Statuspage exist for a reason.
The underrated bit is recovery ergonomics. If restoring service depends on one engineer remembering a sequence of shell commands from six months ago, that’s not resilience. That’s a hostage situation. SRE pushes us to make common recovery paths reproducible, tested, and boring.
Automation Helps, But Guardrails Matter More
Automation is one of SRE’s best tools, and also one of its favourite ways to make mistakes at machine speed. We absolutely should automate repetitive operational work: provisioning, rollouts, restarts, backups, certificate renewal, scaling actions, and routine policy checks. But automation without guardrails is just faster chaos.
The best automation removes toil. In SRE terms, toil is repetitive, manual, low-value work that scales linearly with service growth. If we’re still hand-editing configs or manually validating every deploy for a common service, we’re paying a reliability tax forever. Infrastructure as code and deployment pipelines help us make the right thing the easy thing.
Here’s a small example with a Kubernetes deployment using a rolling strategy and readiness checks:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod below the desired count during a rollout
      maxSurge: 1         # at most one pod above the desired count during a rollout
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example/checkout:1.4.2
          ports:
            - containerPort: 8080
          # Traffic is only routed to a pod once /ready succeeds.
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
Nothing fancy there, and that’s the point. Safe defaults beat clever tricks. We also like progressive delivery patterns, policy checks in CI, and automatic rollback triggers when health indicators degrade. The Kubernetes documentation and Terraform docs are solid references if we’re building these foundations.
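As a rough illustration of an automatic rollback trigger, the sketch below compares health samples from a new release against a baseline. Everything here is an assumption for the example: the `HealthSample` fields, the function name, and the tolerance values would all need tuning per service, and a real pipeline would read these numbers from its monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    """Health indicators for one version over the same time window."""
    error_ratio: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def should_roll_back(baseline: HealthSample, candidate: HealthSample,
                     max_error_increase: float = 0.01,
                     max_latency_factor: float = 1.5) -> bool:
    """True when the candidate is clearly worse than the baseline."""
    error_regressed = (
        candidate.error_ratio - baseline.error_ratio > max_error_increase
    )
    latency_regressed = (
        candidate.p99_latency_ms > baseline.p99_latency_ms * max_latency_factor
    )
    return error_regressed or latency_regressed


stable = HealthSample(error_ratio=0.002, p99_latency_ms=300.0)

print(should_roll_back(stable, HealthSample(0.030, 310.0)))  # True: errors jumped
print(should_roll_back(stable, HealthSample(0.003, 900.0)))  # True: latency tripled
print(should_roll_back(stable, HealthSample(0.004, 320.0)))  # False: within tolerance
```

Comparing against a baseline rather than a fixed number keeps the trigger meaningful for services whose normal error rate is not zero.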
Automation should widen the safe path, not eliminate human judgement. If a script can delete production faster than a human can say “hang on,” we’ve built a trap, not a platform.
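One way to picture a guardrail is a check that runs before any destructive step. This is a hypothetical sketch, not a real tool: the function, the `Decision` type, the default limit of five targets, and the "type the environment name" confirmation are all illustrative choices.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Decision:
    proceed: bool
    reason: str


def check_destructive_action(environment: str, targets: Sequence[str], *,
                             dry_run: bool = True, confirmation: str = "",
                             max_targets: int = 5) -> Decision:
    """Decide whether a delete may go ahead. Defaults favour doing nothing."""
    if dry_run:
        return Decision(False, f"dry run: would delete {len(targets)} resource(s)")
    if len(targets) > max_targets:
        return Decision(False, f"blast radius too large: {len(targets)} > {max_targets}")
    if environment == "production" and confirmation != environment:
        return Decision(False, "production requires typing the environment name")
    return Decision(True, "ok")


volumes = ["vol-a", "vol-b"]

# Default call is a dry run, so nothing is deleted.
print(check_destructive_action("production", volumes))
# A real run against production without confirmation is refused.
print(check_destructive_action("production", volumes, dry_run=False))
# Explicit confirmation and a small blast radius are both required.
print(check_destructive_action("production", volumes, dry_run=False,
                               confirmation="production"))
```

The design choice worth copying is that the safe behaviour is the default and the dangerous one takes deliberate extra effort.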
Postmortems Should Improve Systems, Not Assign Blame
A bad postmortem is basically a formal way to waste everyone’s time. It lists timestamps, points vaguely at “human error,” and ends with action items nobody owns. A good postmortem is one of the most useful habits in SRE because it turns incidents into system improvements rather than folklore and finger-pointing.
Blameless doesn’t mean consequence-free or reality-free. It means we assume people acted with the context, tools, and incentives they had at the time. If someone ran the wrong command, we ask why that was easy to do, hard to detect, and possible to execute in production. Maybe naming was confusing. Maybe confirmation steps were weak. Maybe the runbook was stale. Maybe the system allowed too much blast radius. The answer is almost never “be more careful” and call it a day.
The postmortem should capture impact, detection, timeline, contributing factors, what worked, what didn’t, and clear remediation items with owners and due dates. We also want to distinguish between fixes that reduce likelihood and fixes that reduce impact. Both matter. Better validation might prevent one class of incident, while faster rollback and isolation reduce damage when something new breaks anyway.
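The fields listed above map naturally onto a small record type, which also makes it easy to catch action items with no owner or due date. The class and field names below are hypothetical, and the sample incident is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class FixKind(Enum):
    REDUCES_LIKELIHOOD = "reduces likelihood"
    REDUCES_IMPACT = "reduces impact"


@dataclass
class ActionItem:
    description: str
    kind: FixKind
    owner: str = ""
    due: Optional[date] = None


@dataclass
class Postmortem:
    title: str
    impact: str
    detection: str
    timeline: List[str]
    contributing_factors: List[str]
    what_worked: List[str]
    what_did_not: List[str]
    actions: List[ActionItem] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List gaps that would make the remediation items unactionable."""
        issues = []
        if not self.actions:
            issues.append("no remediation items")
        for item in self.actions:
            if not item.owner:
                issues.append(f"'{item.description}' has no owner")
            if item.due is None:
                issues.append(f"'{item.description}' has no due date")
        return issues


# Invented example incident.
pm = Postmortem(
    title="Checkout failures after config push",
    impact="Elevated checkout errors for 25 minutes",
    detection="SLO burn-rate page",
    timeline=["14:02 config pushed", "14:09 page fired", "14:27 rolled back"],
    contributing_factors=["config change skipped validation"],
    what_worked=["rollback was fast"],
    what_did_not=["no canary for config changes"],
    actions=[
        ActionItem("Validate config in CI", FixKind.REDUCES_LIKELIHOOD,
                   owner="platform-team", due=date(2030, 1, 15)),
        ActionItem("Canary config pushes", FixKind.REDUCES_IMPACT),
    ],
)
print(pm.problems())
```

Tagging each action with the kind of fix keeps the likelihood-versus-impact distinction visible when the items are reviewed later.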
Writing these up well also helps future responders. Over time, postmortems become a reliability knowledge base, not just an archive of embarrassing Tuesdays. The Google SRE book’s section on postmortems remains a strong reference because it treats learning as part of operations, not an optional ceremony after the pain has faded.
If we finish a nasty incident and only update the slide deck, we’ve learned nothing. If we change the system, the alerting, or the rollout path, then the outage has at least paid some rent.
Building An SRE Practice Without Overcomplicating It
The easiest way to get SRE wrong is to turn it into a separate priesthood with its own language, dashboards, and opinions about everybody else’s code. That usually creates distance instead of reliability. We’ve had better results when SRE is introduced as a set of operating practices that product and platform teams can actually use together.
Start small. Pick one important service. Define two or three meaningful SLIs. Set an SLO that reflects real user expectations. Measure it properly. Create a simple error budget policy. Tighten alerts so the pager reflects user pain, not metric anxiety. Write a couple of runbooks for the failure modes we already know about. Then review incidents and toil honestly for a quarter. That alone will expose plenty.
We also need to be realistic about team shape. Not every company needs a large central SRE function. Some need a small enablement team. Some need embedded reliability engineers. Some just need better production discipline across existing engineering teams. The model matters less than the outcomes: fewer noisy pages, faster recovery, clearer priorities, and shared accountability.
One practical test is whether SRE work changes planning. If reliability goals don’t affect release pacing, tech debt prioritisation, capacity planning, or platform investment, then we’re probably doing reliability theatre. Nice dashboards, no steering wheel.
SRE at its best is refreshingly unglamorous. It helps us make better promises, keep more of them, and recover sensibly when we don’t. That’s not flashy, but production rarely rewards flashy for long. It rewards teams that are prepared, measured, and just a little bit suspicious of anything described as “totally safe.”



