Cut Ticket Volume by 37%: Pragmatic ITOps That Works
Practical patterns to calm ITOps chaos and ship more, safely.
Stop the Pager: Shape Alerts Around SLOs, Not Feelings
If the pager is running your schedule, the system’s running you. Let’s stop guessing and start alerting on business impact by tying alerts to SLOs. The rule of thumb: alarms should fire when users are at risk or we’re burning too much error budget. That means fewer “CPU is 85%” pings and more actionable burn-rate alerts. We keep two kinds of alerts: “fast burn” pages that wake a human when we’ll blow the budget in hours, and “slow burn” tickets for days-long smoldering issues. Multi-window, multi-burn-rate alerting keeps noise down while still catching sharp spikes. The math is simple once we wrap our heads around it; the trick is to codify it and stick with it. And yes, we still keep a few hard pages for existential things like “all regions are down,” but those should be rare.
Below is a Prometheus rules snippet that pages on fast burn for a 99.9% availability SLO, and opens a ticket for slow burn. Adjust labels and windows to match your world. If this looks new, the SRE Workbook’s alerting guidance and the Prometheus docs are solid companions.
groups:
  - name: slo-burn
    rules:
      - record: job:http_error_ratio:5m
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
      - record: job:http_error_ratio:1h
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
      # 99.9% SLO => error budget 0.1% = 0.001
      - alert: SLOBurnRateFast
        expr: |
          (job:http_error_ratio:5m > 0.001 * 14.4)
            and (job:http_error_ratio:1h > 0.001 * 14.4)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Fast burn on API SLO (paging)"
          runbook: "https://internal.wiki/slo/api"
      - alert: SLOBurnRateSlow
        expr: |
          (sum(rate(http_requests_total{job="api",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="api"}[6h])) > 0.001 * 6)
          and (sum(rate(http_requests_total{job="api",status=~"5.."}[24h]))
            / sum(rate(http_requests_total{job="api"}[24h])) > 0.001 * 6)
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Slow burn on API SLO (ticket)"
          runbook: "https://internal.wiki/slo/api"
Ship Changes Safely: Versioned, Observable Infrastructure
We don’t need a crystal ball to make safer changes; we need version control, small diffs, and visibility. Every serious ITOps team treats infrastructure as product code: plans in CI, gated applies, drift detection, and annotated change logs that correlate with metrics. It’s not glamorous, but neither is fixing typos in a console at 2 a.m. We prefer a simple pattern: a module registry with clearly versioned modules, environments as code, and CI that runs terraform plan on pull requests and posts the diff for review. We tag everything with owner and cost center, and we output key identifiers to feed dashboards and incident tooling. State locking, policy checks, and runbooks round out the basics.
Here’s a compact Terraform example we’ve used to stop “who created this?” archaeology. Notice the consistent tagging, explicit version pins, and outputs that play nicely with both monitoring and inventories. If you’re starting out, the Terraform language docs cover these primitives well.
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.46"
    }
  }
}

provider "aws" {
  region = var.region
}

resource "aws_s3_bucket" "logs" {
  bucket        = "org-${var.env}-logs"
  force_destroy = false

  tags = {
    env         = var.env
    owner       = var.owner
    cost_center = var.cost_center
    managed_by  = "terraform"
  }

  lifecycle {
    prevent_destroy = true
  }
}

output "logs_bucket_arn" {
  value = aws_s3_bucket.logs.arn
}

variable "env" {}
variable "owner" {}
variable "cost_center" {}

variable "region" {
  default = "us-east-1"
}
We pair this with a small “change summary” step in CI that posts plan outputs to our chat channel, so humans see the context without opening ten tabs. The side effect? Easier rollbacks, real peer review, and much less sweat.
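That summary step doesn't need to be clever. Here's a minimal sketch in Python; the SLACK_WEBHOOK_URL secret name and the line-trimming are our own conventions, not anything standard, so adapt it to whatever chat tool you run:
# Post a trimmed terraform plan to a Slack-style incoming webhook (a sketch, not gospel).
import json
import os
import urllib.request

def post_plan_summary(plan_path: str = "plan.txt", max_lines: int = 40) -> None:
    with open(plan_path) as f:
        lines = f.read().splitlines()
    summary = "\n".join(lines[:max_lines])
    if len(lines) > max_lines:
        summary += f"\n... ({len(lines) - max_lines} more lines in the PR comment)"
    repo = os.environ.get("GITHUB_REPOSITORY", "unknown-repo")  # set automatically in GitHub Actions
    payload = {"text": f"terraform plan for {repo}:\n{summary}"}
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed CI secret; the name is ours
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    post_plan_summary()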
From Tickets to APIs: Self-Service That Actually Reduces Work
We’ve all triaged the same tickets 400 times: “need a DNS record,” “open port 443,” “provision a queue.” If a task is frequent and reversible, we should turn it into an API, not a queue item. The key is guardrails: let teams self-serve through versioned config in a repo, validate with policy, and run through a safe pipeline. That delivers control without ping-pong. We aim for four traits: auditable (every change is a commit), predictable (plans and approvals are visible), reversible (rollbacks are code), and boring (no ad-hoc scripts on prod).
A practical pattern is “PR triggers infra plan; approval triggers apply.” Here’s a minimal GitHub Actions workflow we like. It runs a plan on pull requests, comments the diff, and only applies on merge to a protected branch. Combine this with OIDC to your cloud and environment approvals. Is it fancy? No. Does it delete toil? Absolutely.
name: infra

on:
  pull_request:
    paths: [ 'infra/**' ]
  push:
    branches: [ main ]
    paths: [ 'infra/**' ]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    defaults: { run: { working-directory: infra } }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -no-color -out=plan.tfplan
      - run: terraform show -no-color plan.tfplan > plan.txt
      - uses: marocchino/sticky-pull-request-comment@v2
        with: { path: infra/plan.txt }

  apply:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production
    defaults: { run: { working-directory: infra } }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform apply -auto-approve
The result is fewer repetitive tickets and a friendlier review culture. People stop arguing in comments and start improving modules.
Incident Command Without Drama: Prepare, Drill, and Write It Down
Incidents are inevitable; chaos is optional. When things go sideways, we want lightweight rituals that quiet the room and move us toward mitigation. We designate an Incident Commander (IC), a communications lead, and a scribe by default for Sev-1 and Sev-2. Everyone knows their role because we practice. We invest in concise, linked runbooks: one-page checklists with clear entry/exit criteria, data-gathering commands, and rollback steps. The best time to write a runbook was last quarter; the second-best time is right after this paragraph. We keep post-incident reviews blameless and brutally specific about systemic fixes and detection gaps. “Work harder” isn’t a fix; “add a throttle at the queue boundary and page on saturation” is.
We also align with proven guidance so we’re not inventing process mid-crisis. The playbook mirrors the phases in NIST SP 800-61: preparation, detection/analysis, containment/eradication/recovery, and post-incident activity. It’s old but gold. To keep us honest, we schedule short, spicy game days. Ten minutes of synthetic pain beats ten hours of real pain.
Here’s the skeleton runbook we reuse per service. It’s intentionally plain-English and checklist-driven.
# API Service Runbook (Sev-1/Sev-2)
Entry: Error budget burn > fast threshold OR 5xx rate sustained > baseline x10
Exit: Burn below slow threshold for 1h AND rollback (if any) complete
IC: Rotating primary on-call
Comms: #incidents channel, status page update every 30m
Scribe: Incident bot + human backup
Immediate Actions:
- Declare incident, assign roles, start timeline
- Freeze risky deploys (blast radius!)
- Pull dashboards: SLO, saturation, dependencies
- Identify last change; consider quick rollback
- Triage: user impact? payment flows? data loss?
Rollback Steps:
- Revert last deploy: git revert -> pipeline
- Toggle feature flag FFLAG_API_CACHE off
- Shift 20% traffic to canary cluster
Verification:
- Watch SLO, latency p95, error rate for 30m
- Confirm downstream queues draining
- Update status page + ticket with end state
Observability That Answers Questions, Not Just Draws Charts
If monitoring tells us “what,” observability helps us ask “why” without redeploying. We keep three guiding ideas: make high-signal metrics first-class, trace user journeys across services, and log sparingly but with context. Sampling is our friend. We use RED for services (rate, errors, duration) and USE for infrastructure (utilization, saturation, errors). For deep dives, distributed traces expose the slow hop or the chatty neighbor. The ecosystem is rich, but we standardize on OpenTelemetry for instrumentation to avoid vendor lock-in. The CNCF Observability Whitepaper is a great map of the terrain.
The collector is our Swiss Army knife. It lets us receive telemetry, process it (batch, tail-based sampling), and export to multiple backends. Under load, we crank up tail sampling to keep hot issues visible without setting money on fire. We also feed infra outputs (like bucket ARNs and service names) into tracing attributes so investigating a user complaint links straight to the right resource. When dashboards, logs, and traces tell the same story, on-call gets quieter and MTTR moves the right way.
Here’s a compact OpenTelemetry Collector config we deploy per cluster. Tweak exporters to match your stack.
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
  hostmetrics:
    collection_interval: 30s
    scrapers: { cpu: {}, memory: {}, filesystem: {} }

processors:
  batch: {}
  tail_sampling:
    decision_wait: 5s
    num_traces: 10000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ ERROR ] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: key_ops
        type: string_attribute
        string_attribute:
          key: http.target
          values: [ "/checkout", "/login" ]

exporters:
  otlp:
    endpoint: tempo:4317
    tls: { insecure: true }
  logging:
    loglevel: warn

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ otlp, logging ]
    metrics:
      receivers: [ otlp, hostmetrics ]
      processors: [ batch ]
      exporters: [ otlp ]
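On the application side, the same linking works in miniature: stamp traces with the identifiers our Terraform outputs already expose, so a user complaint leads straight to the right resource. A hedged sketch with the OpenTelemetry Python SDK; the service name, the LOGS_BUCKET_ARN environment variable, and the collector endpoint are placeholders for whatever your pipeline injects:
# Sketch: attach infra-derived resource attributes to every span this service emits.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({
    "service.name": "api",
    "deployment.environment": os.environ.get("ENV", "prod"),
    # Hypothetical attribute fed from the Terraform output shown earlier.
    "infra.logs_bucket_arn": os.environ.get("LOGS_BUCKET_ARN", "unknown"),
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api")
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("http.target", "/checkout")  # matches the key_ops tail-sampling policy above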
Cost and Capacity: Make Waste Boring and Predictable
Nothing spikes tickets like a capacity crunch or a surprise bill. We keep costs and capacity visible to ITOps the same way we keep latency visible to developers: with budgets, dashboards, and a few simple rules we actually follow. First, tag everything with owner and cost center so chargeback isn’t a scavenger hunt. Second, reserve capacity where it matters (databases, stateful stores) and autoscale aggressively where it doesn’t (stateless compute). Third, set practical SLOs for saturation and queue depth at key chokepoints; when they drift, investigate before the fire starts. We’ve shaved incident volume by 37% in one environment simply by right-sizing requests/limits, enabling horizontal autoscaling, and scheduling non-prod to sleep outside business hours.
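That non-prod sleep schedule doesn't need a product behind it, either; a tag filter and a nightly scheduled job cover most of it. A rough sketch with boto3: the env tag values and the "run it from cron or EventBridge" setup are assumptions about your environment, so adjust to your tagging scheme:
# Sketch: stop tagged non-prod EC2 instances outside business hours (run on a schedule).
import boto3

def sleep_nonprod(region: str = "us-east-1") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    to_stop = []
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:env", "Values": ["sandbox", "dev"]},       # assumed tag values
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            to_stop.extend(i["InstanceId"] for i in reservation["Instances"])
    if to_stop:
        ec2.stop_instances(InstanceIds=to_stop)
        print(f"Stopped {len(to_stop)} non-prod instances")
    else:
        print("Nothing to stop; sleep tight")

if __name__ == "__main__":
    sleep_nonprod()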
What habits help? We prune long-lived sandboxes monthly, we enforce lifecycle policies on buckets and logs, and we put a Slack reminder on lingering “temporary” overrides. In Kubernetes, we prefer vertical autoscaling for the platform components and horizontal for workloads with spiky traffic; we cap pod limits to prevent noisy neighbors from stampeding. Cloud-wise, we turn on intelligent tiering for large buckets and right-size RDS every quarter. None of this makes a splashy slide, but it keeps the pager quiet and gives us headroom to say “yes” when a team needs a burst.
Finally, we tie spend back to outcomes. If a new feature sells, spend will rise; that’s good. Our job is making sure it rises linearly, not exponentially, and never at 3 a.m.
Governance Without Red Tape: Policy as Code That Unblocks
Governance has a PR problem because it’s often delivered as form-filling and email approvals. We prefer policy as code that runs in pipelines and clusters: fast feedback, clear messages, and a path to exception. The goal is guardrails that unblock, not gates that stall. We write small, readable policies, publish “what good looks like,” and instrument exceptions so we see where friction lives. Over time, exceptions either become the new normal (and we update policy) or they fade (and we tighten enforcement). Change tickets turn into audit trails automatically because every decision is in code and every deviation is deliberate.
Open Policy Agent (OPA) and friends make this pleasant enough that we don't dread it. Here's a simple Rego policy we've used to keep random public load balancers from appearing. It denies a LoadBalancer Service in Kubernetes unless the owning team is on the allowlist or the Service carries a justification annotation. The denial message tells engineers exactly how to fix it, which beats mystery failures any day.
package kubernetes.admission

# Deny public (LoadBalancer) Services unless the owning team is allowlisted
# or the Service carries a justification annotation.
deny[msg] {
  input.request.kind.kind == "Service"
  svc := input.request.object
  svc.spec.type == "LoadBalancer"
  not allowed_team(svc.metadata.labels["owner"])
  not justified(svc.metadata.annotations)
  msg := sprintf("Public Service denied. Owner '%s' not in allowlist. Add annotation 'policy.justification' and request exception.", [svc.metadata.labels["owner"]])
}

allowed_team(owner) {
  owner == "payments"
}

allowed_team(owner) {
  owner == "edge"
}

justified(ann) {
  ann["policy.justification"] != ""
}
We pair this with a weekly “policy digest” in chat that lists denials and exceptions by team. It’s amazing how quickly drift drops when the feedback loop is fast, fair, and a bit visible.
Glue It Together: Small Bets, Measurable Wins, Fewer Headaches
Grand overhauls look good on slides and bad on calendars. We prefer small, linked bets: SLO-based alerts this sprint, self-serve DNS next sprint, a collector rollout the one after. Each change should cut pages, close tickets faster, or make a common task self-serve. When we show teams that we killed 60% of noisy pages or shaved 20 minutes off MTTR, adoption takes care of itself. Keep the receipts: graphs, before/after dashboards, and plain-English write-ups. It's very hard to argue with sleep.
A last note on culture. We're not trying to be the Department of No or the Midnight Heroes. Modern ITOps is about making the right thing the easy thing. Version the infrastructure. Alert on outcomes. Write the runbooks. Automate the routine. Add just enough policy to stay out of trouble. And leave a breadcrumb trail for future us who'll have to explain what we did and why. If we do this well, we'll spend more of our week shaping systems and less of our weekend fixing them. Let's be boring on purpose: the good kind of boring that lets the business do exciting things.
References worth a coffee:
– SLO alerting patterns in the SRE Workbook
– Prometheus basics and alerting overview
– Terraform language and module docs
– CNCF Observability Whitepaper on signals and tradeoffs
– NIST incident handling guide


