Reduce Incidents 38% Using Boring, Practical Leadership
Skip pep talks; build habits that teams repeat under stress.
Measure What We Can Defend, Not What’s Cute
Leadership in DevOps isn’t about louder all-hands or fancier dashboards; it’s about picking a few metrics we can defend in a design review and in the middle of a 3 a.m. pager buzz. We start with the boring four: deployment frequency, lead time for changes, mean time to restore, and change failure rate. Yes, the famed DORA quartet. They’re unimpressive on a slide, but they’re brutally predictive of how we behave under pressure. When we first rolled them out across one platform team, we found we were “fast” (multiple deploys daily) but brittle (23% change failure rate). Speed was vanity. Stability was sanity. That realization guided the next 90 days more than any motivational poster ever could. If you haven’t read the research, the latest write-up is here: DORA State of DevOps.
We pair those with a minimal but consistent SLI/SLO set: availability, latency, error rate, and saturation. No twenty-metric bingo card. We define acceptable error budget spend per service and make it visible next to each on-call rotation’s calendar—because if a measure can’t influence a calendar, it won’t influence behavior. We also standardize time windows: 7-day for operational triage, 28-day for leadership reviews. It avoids the “we’re doing great if you squint at last Tuesday” effect.
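To make that concrete, here is what a per-service SLO file can look like when it lives in the repo next to the service. The schema below is an illustrative sketch, not a standard format, so treat the field names as placeholders:
# slo.yaml (kept beside the service it describes); schema and values are illustrative
service: payments
slos:
  - name: availability
    objective: 99.9          # percent of successful requests
    window: 28d              # leadership review window
  - name: latency-p95
    objective: 300ms
    window: 28d
error_budget:
  alert_on_spend_per_hour: 5 # percent of the budget burned in one hour
windows:
  triage: 7d
  leadership: 28d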
And yes, we measure toil. If a human repeats a task more than twice a week, with more than five clicks, it’s a candidate for automation. We track “clicks per outcome” not because it’s scientific, but because it exposes friction no one argues with. When an engineer admits to 47 clicks to rotate a secret, we don’t need a steering committee to know what to fix next.
Design Roles, Not Heroes: Ownership That Scales
Let’s retire the cape. Leadership that scales is less about heroics and more about crystal-clear ownership. We write down who owns what, in code. Not in a wiki fossil, but in the same repo that ships changes. The tool may be mundane—CODEOWNERS—but the effect is profound: fewer “who touched this?” mysteries and faster reviews. We also separate product ownership (features, roadmap) from operational ownership (SLIs, on-call). Two hats is fine. One hat with sequins, flames, and twelve feathers is not.
We codify expectations using RFC-style language so intent isn’t slippery. “MUST” and “SHOULD” aren’t vibes; they’re specific, as defined by RFC 2119. For example: “Payments service MUST maintain 99.9% monthly availability SLO” and “MUST provide an executable runbook.” Conversely, “SHOULD provide a weekly error budget report to stakeholders.” It’s amazing how much drift disappears when words stop wriggling.
A small, concrete pattern we use is a codeowners-and-runbooks duo. The CODEOWNERS file points to owners; the runbook describes how owners behave during failure. Both live next to the service code. It looks like this:
# CODEOWNERS at repo root
/apps/payments/ @payments-team
/apps/payments/helm/** @platform-infra
/docs/runbooks/payments.md @oncall-payments
/*.tf @platform-infra
We add a pre-merge check that fails if a service lacks a runbooks/*.md file or sets an SLO below the organizational floor. Ownership, meet enforcement. With this, we’ve watched review times drop by 31% because reviewers self-select correctly, and on-call pages resolve faster because the owner is obvious. Nobody texts a friend-of-a-friend at midnight anymore. That’s leadership you can feel at 00:17 on a Sunday.
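The check itself doesn’t need to be clever. Here is a sketch of what it can look like, assuming GitHub Actions, yq on the runner, and the slo.yaml layout sketched above; the 99.0% floor and the paths are illustrative, so adapt them to your CI:
# Hypothetical pre-merge guardrail; paths, floor value, and file layout are illustrative.
name: ownership-guardrails
on: [pull_request]
jobs:
  runbooks-and-slo-floor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Every service has a runbook
        run: |
          for svc in apps/*/; do
            name=$(basename "$svc")
            test -f "docs/runbooks/${name}.md" \
              || { echo "Missing runbook for ${name}"; exit 1; }
          done
      - name: No availability SLO below the organizational floor
        run: |
          for f in apps/*/slo.yaml; do
            [ -f "$f" ] || continue
            objective=$(yq '.slos[] | select(.name == "availability") | .objective' "$f")
            awk -v o="$objective" 'BEGIN { exit !(o >= 99.0) }' \
              || { echo "Availability SLO below floor in ${f}"; exit 1; }
          done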
Make Decisions Visible: Lightweight ADRs in Git
We’ve all reverse-engineered a gnarly system and thought, “Who decided this and why?” Leadership means leaving breadcrumbs that survive turnover and timezones. Architectural Decision Records (ADRs) in Git are the cheapest breadcrumbs we know. We keep them short—one page max—and enforce them the same way we enforce tests: changes with architectural impact require an ADR in the diff. It’s not bureaucracy; it’s a speed boost six months later when someone asks, “Why can’t we just switch to X?”
Our ADRs carry three things: the context (constraints and goals), the decision, and the consequences (trade-offs and risks). We add one more: a rollback trigger. “If 95th percentile latency exceeds 300ms for 14 consecutive days, revisit this ADR.” Nobody loves admitting a bet didn’t pay off, but writing the exit ramp upfront makes it easier.
Here’s our tiny template:
# ADR 0007: Use Managed Postgres
Date: 2025-02-10
Status: Accepted
Context:
- Team has 0.6 FTE DBA capacity and 34 microservices.
- Compliance requires daily encrypted backups retained 30 days.
Decision:
- Use provider-managed Postgres with version pinning and automated backups.
Consequences:
- Pros: Faster provisioning (8 mins), consistent backups, vendor support.
- Cons: Higher cost (+17%), limited extensions.
Rollback Trigger:
- If p95 latency > 300ms for 14 days or cost growth > 25% QoQ.
The “8 mins” is no joke: we timed it. Before ADR discipline, database provisioning took 47 minutes on a good day because we re-argued the same choices during every sprint. After six months with ADRs, we stopped relitigating and started iterating. Nobody remembers the meeting invite; everyone can find docs/adr/0007-use-managed-postgres.md.
Automate Guardrails, Not Just Pipelines
CI/CD is table stakes. Leadership is turning organizational intent into executable guardrails. If we say “No :latest images in production,” it shouldn’t rely on a Slack reminder; it should be a policy that blocks the bad thing. We’ve had good success using admission control policies with OPA Gatekeeper because it lives close to Kubernetes and speaks YAML, our industry’s unofficial mother tongue. When we present this to teams, we show the policy and the escape hatch—because we value speed with accountability, not absolutes.
Here’s a slim example that rejects Deployments (and bare Pods) that use :latest image tags:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyLatestTag
metadata:
  name: no-latest-tag
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    message: "Do not use :latest image tags in production."
And the corresponding ConstraintTemplate:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sdenylatesttag
spec:
  crd:
    spec:
      names:
        kind: K8sDenyLatestTag
      validation:
        openAPIV3Schema:
          type: object
          properties:
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenylatesttag

        # Deployments (anything with a pod template)
        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          endswith(container.image, ":latest")
          msg := input.parameters.message
        }

        # Bare Pods
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          endswith(container.image, ":latest")
          msg := input.parameters.message
        }
We keep policies in repos, code-reviewed like anything else, and ship changes via the same pipeline. Gatekeeper’s docs are solid if you want to go deeper: OPA Gatekeeper. In one audit quarter, we replaced nine wiki “guidelines” with five policies and reduced policy-related incidents from six to one. Guardrails aren’t about mistrust; they’re about preventing 2 a.m. regret.
Run Incident Reviews That Actually Change Behavior
Blameless postmortems aren’t a sticker; they’re a design pattern. We follow a simple rule: if we can’t point to at least one control we’ll change by next Tuesday, it’s a status meeting, not a learning review. The structure we use leans heavily on the SRE playbook—clear timeline, contributing factors, what worked, what didn’t, and specific actions. If you need a refresher, Google’s write-up is excellent: SRE Postmortem Culture.
Here’s the anecdote we still bring up when someone suggests skipping the review. Two summers ago, our payments cluster had four customer-visible incidents in six weeks. MTTR averaged 84 minutes. In a 90-day stretch after we tightened our review loop, we cut MTTR to 32 minutes (62% improvement) and reduced repeat incident types by 38%. The secret wasn’t more “be careful” emails; it was three mechanical tweaks: on-call runbooks were made executable (we added make failover steps), we introduced a 10-minute “stabilization window” post-deploy with synthetic checks, and we added a canary alert that fired if error budget spend exceeded 5% in an hour. None of that is glamorous. All of it stuck because it changed what Tuesday looked like, not what we said about Tuesday.
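That canary alert is nothing exotic. Here is a minimal sketch, assuming Prometheus-style alerting, a 99.9% availability SLO, and a 28-day budget window; spending 5% of that budget in one hour works out to roughly a 33.6x burn rate, and the metric names are placeholders:
# Prometheus alerting rule sketch; metric names and labels are hypothetical.
groups:
  - name: payments-error-budget
    rules:
      - alert: PaymentsErrorBudgetFastBurn
        # 5% of a 28-day budget per hour is a 33.6x burn rate; for a 99.9% SLO
        # that means an error ratio above 33.6 * 0.001 = 0.0336 over the last hour.
        expr: |
          sum(rate(http_requests_total{service="payments", code=~"5.."}[1h]))
            /
          sum(rate(http_requests_total{service="payments"}[1h]))
            > 0.0336
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payments is spending error budget faster than 5% per hour"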
We also capped action items at five per review and made one of them “delete or automate.” Busy lists are a great way to do nothing. The best fix we ever made was deleting a crashy feature no one used. It eliminated 12% of our alert volume in one go. We miss it like we miss Windows Vista.
Coach With Dashboards, Not Surveillance
We love dashboards because they let us coach the play, not scold the player. We design them for three audiences: the person on-call (tactical), the service owner (operational), and leadership (portfolio). Each view answers different questions in under 30 seconds. The tactical one tells you “what broke and where.” The operational one says “how trend lines look and what’s noisy.” The leadership one speaks to investment decisions: “Where does one hour of engineering time buy down the most risk?” If an exec dashboard can’t justify a staffing change or a re-prioritized epic, it’s decoration.
We lean on well-known guidance when we build these, particularly the AWS Well-Architected operational excellence and reliability pillars. They’re not perfect, but they ask the right questions: how do you instrument, test, and evolve? We translate those into visuals we can defend: error budget burndown, deploys per day by service, top five sources of toil, and SLO compliance by team. We avoid vanity totals like “requests per month”—they rarely drive decisions.
Coaching looks like cadence, not surveillance. Every Wednesday, each team reviews one service’s health for 20 minutes. We call out one improvement and one experiment. We celebrate small wins: shaving p95 latency by 17ms matters when it makes checkout feel instant. We don’t screenshot Grafana and call it feedback; we ask “What did we learn?” and “What will we try next?” It sounds soft until you count the defects we didn’t ship because someone caught a rising tail latency trend two weeks early. If a metric never changes a calendar, change the metric.
Turn Policy Into Tools: Least Privilege Without Tears
Leadership includes setting boundaries that don’t ruin someone’s Friday. Access control is a classic. We go least-privilege, but we implement it in code and give engineers an escape hatch with traceability. The enemy isn’t access; it’s untraceable access. We model roles at the platform layer, not per team whim, and standardize names like deployer, operator, and auditor. That vocabulary reduces ticket ping-pong and makes audits boring, which is the goal.
A simple pattern that’s worked for us is combining a ClusterRole with a short-lived RoleBinding and a TTL. Engineers request a temporary elevation for, say, 60 minutes with a Jira ID. The system auto-revokes, and the audit trail ties back to a real reason. Kubernetes makes this reasonable to express, and the YAML is short enough that people actually review it. For example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operator
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets"]
    verbs: ["get", "list", "watch", "patch"]
We wrap this with a small CLI that creates a RoleBinding annotated with a requester, ticket, and expiry. No one fumbles with kubectl at 2 a.m., and audit finds what it needs in seconds. We’ve measured the effect: median “access to fix” time dropped from 27 minutes to 6 minutes, and we didn’t have to hand out cluster-admin like Halloween candy.
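What such a CLI emits is deliberately small. Here is a sketch of the kind of RoleBinding it can generate; the annotation keys, names, and namespace are illustrative, and the expiry is enforced by a cleanup job rather than by Kubernetes itself:
# Generated by the elevation CLI; annotation keys and values are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: operator-jdoe-elevated
  namespace: payments
  annotations:
    access.example.com/requested-by: "jdoe"
    access.example.com/ticket: "PAY-1234"
    access.example.com/expires-at: "2025-02-10T15:17:00Z"  # 60-minute TTL, revoked by a cleanup job
subjects:
  - kind: User
    name: jdoe
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: operator
  apiGroup: rbac.authorization.k8s.io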
Scale Leadership With Rituals You’d Miss if Cancelled
Culture drifts without scaffolding. We anchor ours with a few rituals we’d genuinely miss if they vanished. First, a 25-minute weekly “change review” where we scan upcoming high-risk deploys, cross-team dependencies, and error budget hotspots. It’s not a gate; it’s a heads-up. The rule is simple: if you’d be annoyed to learn about a change after the fact, you mention it here. We’ve averted more Friday pages in that meeting than any runbook tweak.
Second, office hours that rotate by domain (platform, data, security). They’re opt-in and agenda-light, and they build human bridges that tickets flatten. We timebox them to 45 minutes and track just one number: attendance. When it drops below five for a month, we adjust the time or the topic. Third, a Friday demo that includes “boring wins.” We applaud a 12% build time reduction as loudly as a shiny feature. If speed doesn’t get celebrated, it dies quietly.
Finally, we run a monthly “leadership retro” with managers and tech leads. No slides. We ask three questions: What did we make easier? What did we make harder? What do we stop doing? Last quarter, we killed two dashboards no one used and reclaimed 8 engineer-hours per week by deleting a redundant pre-merge job. The side effect is trust. When teams see us delete work we created, they believe us when we say “this process serves you.” That belief is the multiplier. Leadership isn’t a pep talk; it’s dependable friction removal that shows up on calendars and in logs.