Practical SRE Habits That Keep Teams Sane
Small changes in process and tooling that save big weekends.
Define Reliability Like We Mean It (Not Like a Poster)
If “reliability” is just a warm feeling, we’ll end up arguing in circles during incidents. In SRE, we treat reliability as a product feature with measurable targets, trade-offs, and consequences. The simplest way to start is to write down what “good” looks like for users: page loads in under X seconds, checkout succeeds Y% of the time, alerts fire only when customers are actually impacted. That turns opinion into something we can manage.
A practical anchor is the SLO (Service Level Objective). We pick a user-facing metric (availability, latency, freshness, correctness), set a target, and measure it over a window. Then we track the error budget—the allowed unreliability. Error budgets stop us from swinging between two unhelpful extremes: shipping recklessly and freezing change forever “for stability.” When the budget’s healthy, ship. When it’s burned, slow down and fix what’s hurting.
We also need a service catalog entry for each system: owners, dependencies, runbooks, dashboards, and SLOs. When incidents happen (not “if”), we shouldn’t be playing detective just to find who owns the thing. The goal isn’t bureaucracy; it’s to reduce surprise.
If you’re pitching this internally, keep it grounded: “We want fewer customer-impacting outages and fewer 2 a.m. pages.” If anyone asks where this came from, Google’s SRE material is still one of the clearest references for the core ideas: Google SRE book.
Start With Observability That Answers Real Questions
Our monitoring shouldn’t be a museum of graphs. In SRE practice, we aim for observability that helps us answer: “Is the user impacted?”, “What changed?”, and “Where do we look next?” The easiest trap is alerting on everything we can measure. That creates noise, people mute alerts, and the only time we notice a problem is when Slack catches fire.
A dependable baseline is the classic “four golden signals”: latency, traffic, errors, saturation. They’re not magic, but they map well to how systems fail and how users feel it. You don’t need 50 dashboards; you need a small set you can trust, tied to SLOs and customer impact. We also like separating symptoms from causes: page on symptom alerts (SLO burn, elevated error rate), and route cause alerts (CPU high, queue depth) to tickets or lower-severity notifications.
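To make the symptom/cause split concrete, here’s a sketch of two Prometheus-style rules. Metric names and thresholds are illustrative, not from any real service: the first pages a human because users are probably affected; the second files a ticket with useful context.

```yaml
# Symptom alert: user-visible errors -> page a human.
- alert: CheckoutErrorRateHigh
  expr: |
    sum(rate(checkout_requests_total{code=~"5.."}[5m]))
      / sum(rate(checkout_requests_total[5m])) > 0.02
  for: 10m
  labels:
    severity: page

# Cause alert: a resource signal -> ticket queue, not a 3 a.m. page.
- alert: WorkerQueueDepthHigh
  expr: sum(worker_queue_depth) > 10000
  for: 30m
  labels:
    severity: ticket
```

The routing itself (page vs ticket) is then a matter of matching on the severity label in your Alertmanager config.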
For logs, standardise structure early—consistent fields like service, env, trace_id, user_id (where appropriate), and error.kind. For traces, don’t chase 100% sampling on day one; sample intelligently and increase when debugging. If you’re on Kubernetes, make sure we can correlate pods to deployments to commit SHAs, otherwise we’ll do archaeology every time a rollout misbehaves.
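As a sketch, one log event under that shared field convention might look like this (field names are the ones suggested above; every value is made up):

```yaml
# One structured log event (illustrative values)
timestamp: "2024-05-01T12:34:56Z"
service: checkout-api
env: production
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
user_id: "u_12345"        # only where appropriate for privacy
error.kind: UpstreamTimeout
message: "payment provider timed out after 2s"
```

The point isn’t the exact schema; it’s that every service emits the same fields, so one query works everywhere.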
If you need a north star, the OpenTelemetry project is a solid, vendor-neutral place to start for metrics, logs, and traces.
Use SLOs and Error Budgets to Set Shipping Pace
This is the bit that makes SRE feel “real” to product and engineering: we connect reliability to delivery decisions. Without that link, SLOs become another spreadsheet that nobody reads until the postmortem.
We start by choosing one or two SLOs per service that represent what users care about. For a public API, that might be “99.9% of requests succeed” and “p95 latency under 300ms.” For a data pipeline, “freshness under 15 minutes.” Then we calculate the error budget and agree what happens when it’s burned: freeze risky changes, do reliability work, or roll back certain features. The key is pre-agreement. During an incident is a terrible time to negotiate.
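To make the budget tangible, here’s the arithmetic for that API example, written as a commented config sketch (the spec format is illustrative, not a real tool’s schema):

```yaml
slo:
  service: api
  sli: request_success_ratio
  objective: 99.9      # percent of requests that must succeed
  window: 30d
# Error budget = (100% - 99.9%) of the window:
#   0.001 * 30 days * 24 h * 60 min = 43.2 minutes
# i.e. roughly 43 minutes of total failure (or the equivalent
# spread of partial failures) per rolling 30 days before the
# budget is spent and risky changes should pause.
```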
Here’s a minimal Prometheus-style example of an SLO burn alert that pages only when we’re trending toward missing the objective. It’s intentionally blunt; you can refine it later.
```yaml
# prometheus alert rule (example)
groups:
  - name: slo-burn
    rules:
      - alert: ApiErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m]))
          ) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error rate burning SLO fast"
          description: "Errors >1% for 10m. Check recent deploys and upstream deps."
```
You can also use purpose-built tooling like Sloth to generate SLO rules from concise definitions, which keeps us from hand-crafting fragile queries.
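If you go the Sloth route, a definition looks roughly like this (check the project’s docs for the current schema; the service name and queries are illustrative):

```yaml
version: "prometheus/v1"
service: "api"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of API requests succeed."
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api"}[{{.window}}]))
    alerting:
      name: ApiAvailabilityBurn
      page_alert:
        labels:
          severity: page
```

From a short file like this, the tool generates the multiwindow burn-rate recording and alerting rules for you.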
The cultural win: error budgets make reliability a shared constraint, not a punishment. Teams don’t get blamed for shipping—they get a measured guardrail.
Build Incident Response That Works at 3 a.m.
The true test of our SRE setup is whether it helps when we’re tired, stressed, and slightly regretting our life choices. Incident response should be boring, repeatable, and easy to follow. The goal is not heroics; it’s consistency.
We like a simple incident structure: an incident commander (IC), a communications lead, and a primary investigator. One person can do multiple roles for small incidents, but naming the roles helps avoid the “everyone debugging, nobody coordinating” failure mode. We also standardise severity levels and what they mean: who gets paged, what the update cadence is, and when we open a customer status update.
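Whatever tool holds it, the severity definitions are worth writing down somewhere machine-readable so nobody debates them mid-incident. A sketch (the levels, cadences, and rotation names are examples to adapt, not a standard):

```yaml
severities:
  sev1:
    meaning: "Customer-facing outage or data loss"
    page: [primary-oncall, secondary-oncall, ic-rotation]
    update_cadence: 30m
    status_page: required
  sev2:
    meaning: "Degraded service; workaround exists"
    page: [primary-oncall]
    update_cadence: 60m
    status_page: if customer-visible
  sev3:
    meaning: "No current customer impact; fix within hours"
    page: []          # ticket only, nobody woken up
    update_cadence: none
    status_page: "no"
```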
Runbooks matter, but only if they’re used. The best runbooks are short, linked from alerts, and start with “How to confirm impact” plus “Top 3 likely causes.” If a runbook is longer than the incident, it’s a novel, not a runbook. During an event, we also want a shared timeline doc where we log actions and observations in real time. That timeline becomes gold for the postmortem.
If we publish status updates, keep them factual and time-bound: what’s affected, what we’re doing, when the next update is. No speculation. Tools help, but process matters more. If you need inspiration, PagerDuty’s incident response resources offer practical templates and role definitions you can adapt.
Finally, practise. A lightweight game day once a quarter is enough to reveal missing dashboards, unclear ownership, and the one alert that pages the intern for a non-issue.
Make Deployments Safer With Small Steps and Fast Rollbacks
Most outages aren’t caused by meteors; they’re caused by change. So in SRE, we make change safer and easier to undo. That doesn’t mean “never change”—it means we reduce blast radius and shorten detection time.
Our favourite pattern is boring: small deployments, feature flags, and progressive delivery. Ship code dark, enable gradually, watch SLOs, then roll forward or back quickly. If the system is containerised, make rollbacks a first-class path, not an emergency improvisation. That means keeping backward-compatible migrations, versioning APIs, and testing rollback in staging. Yes, testing rollback feels weird until the first time we actually need it.
Here’s a simple Kubernetes Deployment snippet that nudges us toward safer rollouts with readiness checks and a rolling update strategy:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0   # never drain a good pod before a new one is ready
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
          ports:
            - containerPort: 8080
          readinessProbe:        # gates traffic until the pod can serve
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:         # restarts pods that wedge after startup
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```
This won’t prevent every issue, but it avoids the classic “we deployed broken pods and drained all the good ones” catastrophe. If you want to go further, tools like Argo Rollouts support canary and blue/green strategies with metrics-based promotion.
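For flavour, Argo Rollouts expresses a canary as an ordered list of steps. This sketch (trimmed, with illustrative names and weights) shifts 10% of traffic, pauses so we can watch the SLO dashboards, then continues:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  strategy:
    canary:
      steps:
        - setWeight: 10            # 10% of traffic to the new version
        - pause: {duration: 10m}   # watch error rate and latency
        - setWeight: 50
        - pause: {duration: 10m}   # promote to 100% after this if healthy
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.2.4
```

Pair the pauses with an analysis step driven by Prometheus metrics and the promotion decision stops depending on someone staring at a graph.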
The punchline: safe deploys are cheaper than incident cleanups—and way less character-building.
Treat Toil Like a Bug (Then Delete It)
Toil is the sneaky morale killer in SRE: repetitive, manual work that doesn’t create long-term value. Think “restart the stuck job,” “clean up disk on node 12,” “hand-hold every deploy,” or “answer the same alert that never changes.” A little toil is normal; a lot of toil is a sign we’re using humans as a control plane.
We track toil the way we track defects. Put it on a board. Tag it. Measure it. If we’re spending 30% of the week doing manual ops, we should be uncomfortable. The goal isn’t to shame anyone—most toil exists because the system asked for it, not because people chose it.
The antidote is automation with guardrails: self-service scripts, runbook automation, and making the “right” path the easy path. We also stop treating operational improvements as “nice to have.” If the on-call load is high, reliability work becomes product work because it directly impacts delivery speed and quality.
A simple technique: every time on-call does something twice, we consider automating or eliminating it. Every time we get an alert that doesn’t lead to action, we either fix the alert or delete it. Ruthlessly. Noise is not harmless; it trains us to ignore real problems.
If you need a sanity check for what counts as toil and why it matters, the Google SRE workbook has pragmatic examples and exercises.
Over time, lowering toil is what makes on-call sustainable—and keeps good engineers from mysteriously “wanting to focus on other opportunities.”
Run Blameless Postmortems That Actually Change Things
Postmortems can be healing or they can be theatre. In SRE, we aim for blameless postmortems that produce concrete changes, not guilt and vague promises. “Blameless” doesn’t mean “no accountability.” It means we focus on systems and decisions, not personal flaws. People generally did what made sense with the information they had at the time.
A useful postmortem answers a few questions clearly: What was the impact? What happened (timeline)? Why did it happen (contributing factors, not a single “root cause” fairy tale)? How did we detect it? What slowed us down? What are we changing? Then we track actions to completion, with owners and dates. If actions don’t get done, we’re just writing incident fan fiction.
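Those questions translate directly into a skeleton we can copy for every incident. Here’s one as a commented sketch (the incident ID and fields are illustrative; keep whatever format your team will actually fill in):

```yaml
postmortem:
  incident: "INC-042 (hypothetical)"
  impact: "Who was affected, how badly, for how long"
  timeline: []              # timestamped actions and observations
  contributing_factors: []  # plural on purpose; no single root-cause hunt
  detection: "How we found out (alert, customer report, luck)"
  what_slowed_us_down: []   # missing dashboards, unclear ownership, etc.
  actions:
    - owner: "name"
      due: "date"
      change: "what we're changing, tracked to completion"
```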
We also look for themes across incidents: same alert patterns, same dependency failures, same missing runbooks. That’s where the big wins are. One well-chosen reliability improvement can prevent five future outages. The best postmortems also update operational artefacts: dashboards, alerts, runbooks, and capacity plans. The incident should leave the system slightly more resistant to chaos than before.
And yes, we keep postmortems readable. No one wants a 14-page PDF with screenshots of graphs that nobody can zoom. A solid one-pager in Markdown is often enough.
If you want a public example of transparency done well, many teams learn from how large providers communicate and learn, like the Cloudflare outage postmortems.