Cut Pages by 38%: ItOps That Actually Sleeps
Practical, measurable practices to calm the pager and speed change.
What We Mean by ItOps in 2025
Let’s get aligned before we touch a single alert. When we say ItOps, we mean the very practical craft of keeping production healthy while helping product teams ship changes fast. We’re not a cost center; we’re the force that turns change into a routine event instead of a root-cause hunting expedition. We measure success with a small handful of numbers we’d happily put on a kitchen fridge: time to detect, time to restore (MTTR), change failure rate, and the page rate per on-call engineer. We also watch SLO burn as a leading indicator: if an error budget is evaporating before lunch, something’s off. If you’ve never set SLOs, start with a simple availability or latency target and iterate using the guidance in Google’s SRE writing on Service Level Objectives.
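If burn-rate alerting is new to you, here’s a minimal sketch of what “SLO burn as a leading indicator” can look like as a Prometheus rule for a 99.9% availability target. It assumes an http_requests_total counter with a code label (swap in your own metric names), and real setups usually pair this long window with a shorter one for faster detection:
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate over 1h would spend a 30-day error budget in about two days
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning roughly 14x faster than sustainable"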
Let’s add two people metrics too: on-call fairness (the same few folks shouldn’t carry the pager every weekend) and page impact (pages during sleep hours are expensive in ways the budget sheet can’t show). A practical baseline we’ve used: keep critical pages under two per week per engineer on average, and keep most non-urgent noise out of the pager entirely. This isn’t about perfection. It’s about making failure cheap to detect and fast to fix. If we can trim 30–40% of pages by getting the basics right (signal hygiene, routing, and release tactics), we free up real time for the work we want to do: automating, scaling, and removing sharp edges. That’s ItOps that actually sleeps.
Build a Quiet, Trustworthy Signal Path
Noisy telemetry ruins good on-call rotations. We start by instrumenting what matters and muting what doesn’t. Metrics, logs, and traces should be structured and correlated. OpenTelemetry gives us a neutral way to emit data from most stacks, then pipe it anywhere we like; the project’s docs are solid and vendor-agnostic. For logs, stick to structured JSON and predictable fields. If you’re shipping classic syslog, follow RFC 5424 for consistent severity and facility tagging. The hard part isn’t collecting; it’s deciding what we ignore. High-cardinality labels (user IDs, sessions) turn simple metrics into time-series fire hoses. Decide what’s countable (requests, errors), what’s measured (latency), and what’s sampled (traces).
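To show what “predictable fields” means in practice, a structured log line might look like the one below; the exact field names are ours, not a standard, and the point is that severity, service, version, environment, and a trace ID show up on every line so logs correlate with metrics and traces:
{"ts": "2025-03-14T02:17:09Z", "severity": "error", "service": "checkout", "version": "v2.3.1", "env": "prod", "trace_id": "4bf92f3577b34da6", "msg": "payment provider timeout", "duration_ms": 4200}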
On the metrics side, we keep histograms for latency, counters for rates, and gauges sparingly for resources. Then we prune labels at scrape time. Here’s a tiny Prometheus example that keeps the request-duration histogram family and drops user-level labels that create spray-and-pray series:
scrape_configs:
  - job_name: 'apps'
    static_configs:
      - targets: ['app-a:9100', 'app-b:9100']
    metric_relabel_configs:
      # Drop identity-style labels that explode cardinality
      - action: labeldrop
        regex: "user_id|session_id|request_id"
      # Keep the request-duration histogram family (buckets plus _sum and _count for rates)
      - source_labels: [__name__]
        regex: "http_request_duration_seconds_(bucket|sum|count)"
        action: keep
We also tag everything with service, version, and environment so we can group alerts by blast radius. The aim: a signal path we trust because it’s intentional. Once we control inputs, alert design becomes a scalpel, not a siren.
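For the tagging piece, one common pattern with static targets is to attach service and environment labels at scrape time; a sketch with placeholder values (the version label is usually better emitted by the application itself, for example as a build-info style metric):
scrape_configs:
  - job_name: 'apps'
    static_configs:
      - targets: ['app-a:9100']
        labels:
          service: app-a
          environment: prod
      - targets: ['app-b:9100']
        labels:
          service: app-b
          environment: prod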
On-Call That Lets Humans Be Human
Healthy on-call starts with the rule we all wish someone had written years ago: only wake humans for things a human must do in the next few minutes. Everything else goes to a ticket queue or a chat channel. That means we need strict severity and routing rules, grouping to avoid notification hailstorms, and inhibition so warnings don’t page when a critical alert already did. We make this concrete in Alertmanager. Keep pages for severity=critical, route warnings to a queue, and group by service so one outage triggers one page, not fifteen. The Prometheus team’s Alertmanager documentation is the canonical reference.
Here’s a practical starter:
route:
  receiver: oncall-critical
  group_by: ['service', 'region']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - matchers: ['severity="critical"']
      receiver: oncall-critical
      group_wait: 0s
    - matchers: ['severity="warning"']
      receiver: backlog-tickets
      continue: false

receivers:
  - name: oncall-critical
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: backlog-tickets
    webhook_configs:
      - url: https://ticketing.example.com/hooks/alerts

inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['service', 'region']
Two more guardrails: paging budgets and auto-silences. We cap critical pages per week and auto-silence known noisy alerts during deployments or failovers (if they’re “normal”, they shouldn’t page). And yes, we write down time-of-day policies: prod-only at night, non-prod never at night. If we need proof, we report weekly page counts by service. Sunlight is a great noise reducer.
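For the time-of-day policy, recent Alertmanager releases let it live in the config rather than in tribal knowledge. A sketch, assuming targets carry an environment label like the ones we attach at scrape time; merge the child route into the routing tree above:
time_intervals:
  - name: nights
    time_intervals:
      - times:
          # Two ranges because a single range should not cross midnight
          - start_time: '22:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '07:00'

route:
  routes:
    # Non-prod alerts stay quiet overnight
    - matchers: ['environment!="prod"']
      receiver: backlog-tickets
      mute_time_intervals: ['nights']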
Ship Smaller, Fail Smaller: Release Tactics
Most midnight pages hide earlier sins in how we ship. Our ItOps stance: make releases boring and breakage small. We favor short-lived branches, pre-merge tests, and fast rollouts behind probes and gates. Probes catch the “it works on my laptop” fallacy before users do, and progressive techniques (canary or 10% chunks) mean we only hurt a few users for a few minutes when something goes sideways. Kubernetes makes the basics easy (readiness and liveness probes, zero-downtime rolling updates), and its docs are clear.
A minimal deployment spec that helps us sleep:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web   # required for apps/v1; must match the pod template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
        version: v2
    spec:
      containers:
        - name: web
          image: example/web:v2.3.1
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet: { path: /livez, port: 8080 }
            initialDelaySeconds: 20
Pair this with SLO guardrails: halt or roll back automatically if P95 latency or the error rate breaches its SLO threshold for, say, 10 minutes. We tag alerts with deployment metadata (version, commit) so a bad rollout is obvious at 3 a.m. Small releases, quick rollbacks, and probes that tell the truth are the cheapest way to reduce pages.
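As a concrete sketch of that guardrail, an alerting rule like the one below can page on-call or feed an automated rollback; the 0.5s threshold stands in for whatever your latency SLO says, and the version label is whatever your deploy pipeline propagates:
groups:
  - name: rollout-guardrails
    rules:
      - alert: RolloutLatencyRegression
        # P95 latency above the SLO threshold for 10 minutes, broken out by version
        expr: |
          histogram_quantile(0.95,
            sum by (le, service, version) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "P95 latency above SLO for {{ $labels.service }} {{ $labels.version }}"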
Runbooks That Run Themselves (Gradually)
If we can write it down, we can script it. We’re not trying to automate every fix on day one; we’re putting the common stuff on rails while keeping humans in charge. Start by inventorying the top ten manual runbooks from the last quarter. For each, define inputs (service, instance, region), safety checks (SLO budget not burning, primary healthy), and a reversible action. Then lift them into code with a human-in-the-loop toggle. A classic example: the cache hot-fix. The wiki says, “Flush the cache if hit rate < 70% for 10 minutes.” Great. The script checks the metric, drains one instance at a time, flushes, warms the cache with a synthetic read, and re-enters the pool. The page goes out as a warning with a “Run the fix playbook?” button in chat. If on-call approves, the job runs and posts a link to the logs.
We do the same for node drains, stuck queues, and pool downsizes at night. For safety, every automated step must be idempotent, logged, and reversible. Keep rules in git, not a mystery box; reviews should look like code reviews, with tests for the guardrails. Document the failure cases first; it’s where the dragons live. After a few weeks, we’ll see a theme: the pager stops ringing for work a computer can do, and the human work becomes deciding which fix to run, not copying commands out of a wiki at 3 a.m.
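For reference, here’s roughly what the cache hot-fix from the previous paragraph could look like once it leaves the wiki. This is a hypothetical schema, not any particular automation product, and the field names are ours:
# Hypothetical runbook-as-code sketch; illustrative schema, not a real tool's format
runbook: cache-hot-fix
trigger:
  alert: CacheHitRateLow            # arrives as a warning, never a page
  condition: hit_rate < 0.70 for 10m
preconditions:                      # safety checks evaluated before any action
  - error_budget: not_burning
  - primary: healthy
approval: chat-button               # human-in-the-loop: on-call clicks to run
steps:                              # each step idempotent, logged, and reversible
  - drain: { instances: one_at_a_time }
  - flush_cache: {}
  - warm: { method: synthetic_read }
  - rejoin_pool: {}
on_failure: stop_and_page           # hand back to a human if any guardrail trips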
Capacity, Cost, And Performance: One Set of Numbers
We’ve all watched “temporary” autoscaling rules turn into budget gremlins. Let’s get disciplined and simple: tie capacity to the same SLOs we page on, and pick one utilization target per key resource. For CPU-bound services, we like 60–70% average CPU at steady state, keeping P95 latency under the SLO. For I/O-bound or GC-heavy services, latency is the boss; CPU is just a hint. We watch saturation (queue depth, run queue, connection pool waits) and treat it as an early warning. If latency spikes first, we scale horizontally. If saturation creeps up while latency stays flat, we tune batch sizes or concurrency.
Autoscaling is not a personality test. Keep it deterministic: scale on a stable metric (P95 latency or RPS per pod) with conservative cool-downs. Schedule load tests monthly (or before big launches) to re-baseline. Cloud bills get the same treatment as error budgets: we set a spend SLO that tracks cost per 1,000 requests or per tenant, not just the total. If spend-per-thing rises for a week, we investigate. Often the fix is boring: turn off debug logging, lower the scrape rate for ephemeral metrics, or right-size instance types. We also encode cost visibility into dashboards. If an engineer can correlate a toggle to a $ value in a minute, they’ll use it wisely. We don’t need perfect forecasting; we need enough headroom and fast feedback to avoid paging people or finance.
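To ground the autoscaling advice, here’s a Kubernetes HorizontalPodAutoscaler for the web deployment from earlier, targeting roughly 65% average CPU with a conservative scale-down window; the numbers are starting points, not gospel:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 6
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      # Conservative cool-down: require ten minutes of sustained low load before shrinking
      stabilizationWindowSeconds: 600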
The 90-Day ItOps Reset: A Realistic Sequence
Here’s how we make that “38% fewer pages” sticker look plausible. Days 1–30: we harden the signal path. Instrument critical paths with OpenTelemetry where it’s easy, drop high-cardinality labels, and agree on severity levels. Write down three SLOs that matter (availability, P95 latency, error rate), wire burn-rate alerts, and demo dashboards that show “are we okay?” at a glance. We also pick the top five noisiest alerts and either fix or delete them. Days 31–60: we rewire on-call. Cut paging down to critical-only, add grouping and inhibition, and pilot a paging budget per team. While we’re here, we bake readiness and liveness probes into every service and switch to smaller, more frequent deploys. The rule: if you can’t roll back in one command, you can’t go to prod on Friday.
Days 61–90: we ship the first three self-serve runbooks and connect them to alerts with approval gates. We baseline capacity and add a simple autoscaling rule for one pilot service, then schedule a one-hour load test to verify it. Throughout, we publish weekly page counts, MTTR, and change failure rate. The win we want by day 90 is simple: pages-per-engineer down 30–40%, MTTR down 20%, and fewer “mystery spikes” in latency. If a service misses the mark, it’s not a shame parade—it’s a clue that our signals or release tactics need more love. Then we rinse and repeat by service. Quiet systems are built the boring way, one sharp edge at a time.