Practical DevOps Leadership Without the Drama

How we lead calmly, ship safely, and keep our humans intact

Set Expectations Like You’re Writing an On-Call Runbook

DevOps leadership starts with clarity, not charisma. If people don’t know what “good” looks like, they’ll invent it—and usually at 2 a.m. during an incident. We’ve had the best results when we treat expectations the way we treat operational docs: explicit, discoverable, and boring (boring is beautiful).

We define outcomes in plain language: uptime targets, error budgets, delivery cadence, and what “done” means. Then we add constraints: security gates, compliance checks, and the fact that we will not “just hotfix prod” unless it’s truly necessary. This isn’t about being strict; it’s about removing guesswork so teams can move faster without stepping on rakes.

We also make expectations bidirectional. Leaders should be clear about what they need from the team, and the team should be clear about what they need from leadership (tools, staffing, time to pay down debt, fewer surprise projects). A simple practice: every new initiative gets a one-page “working agreement” that includes who’s on point, how decisions get made, and what happens when priorities collide.

Finally, we write down escalation paths. Nothing says “we don’t actually have leadership” like people arguing in Slack while the database smoulders. Decide in advance: who declares incidents, who can roll back, and who talks to stakeholders. If you want a handy reference for incident roles, Google’s SRE incident guidance is still one of the clearest.

Build Trust With Small Promises, Kept Relentlessly

Trust is the currency of DevOps leadership, and it’s earned in tiny transactions: we say we’ll review a PR today, and we actually do. We say we’ll protect focus time, and we don’t book over it with “quick syncs” that turn into courtroom dramas.

One practical trick: we keep a visible “leadership backlog.” Not a vague list of aspirations—real items the team cares about: “fix flaky CI,” “reduce pager noise,” “replace that mystery VM named ‘do-not-delete’.” When we complete items, trust goes up. When we ignore them, trust leaks out the sides.

We also treat transparency as a default. Teams can handle bad news (“this migration is risky,” “we don’t have headcount”), but they can’t handle surprises. When we’re uncertain, we say so. When priorities change, we explain the why. The goal isn’t to win arguments; it’s to help people make good local decisions without needing a permission slip every time.

And yes, we apologise quickly when we mess up. Not the corporate “sorry you feel that way,” but the real kind: “We pushed a deadline without asking, that was on us, we’ll fix the process.” It’s amazing how far that goes.

If we want a north star for this style of leadership, Accelerate’s research consistently points back to culture and psychological safety as performance drivers—less fear, more learning, fewer blame-fuelled meetings.

Make Incident Leadership Boring (In a Good Way)

Incidents are where leadership gets stress-tested. Our job isn’t to look heroic; it’s to make the system—and the response—predictable. We aim for “calm, procedural competence,” like an airline cockpit, minus the hats.

During an incident, we separate three things: coordination, communication, and technical work. Mixing them creates chaos. One person runs the incident (keeps time, assigns owners, watches risk). Another handles stakeholder updates. Everyone else fixes the problem. This reduces the classic failure mode where the best engineer becomes an involuntary project manager while trying to debug packet loss.

We also keep a bias for reversible actions: roll back, disable the feature flag, shed load. If we’re not sure, we reduce blast radius first, then investigate. And we log decisions in real time—because post-incident memory is a creative writing exercise.

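Real-time decision logging needs almost no tooling. Here’s a minimal sketch in Python (the class and incident names are ours, invented for illustration, not from any incident platform):

```python
from datetime import datetime, timezone

class DecisionLog:
    """Minimal in-memory decision log for one incident.

    Entries are timestamped as they happen, so the retro works from a
    real timeline instead of reconstructed memory.
    """

    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.entries = []

    def record(self, who, decision, rationale=""):
        # Capture the decision with a UTC timestamp at the moment it is made.
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "who": who,
            "decision": decision,
            "rationale": rationale,
        })

    def timeline(self):
        # Chronological, human-readable lines for the post-incident review.
        return [f"{e['at']} {e['who']}: {e['decision']}" for e in self.entries]

# Hypothetical usage during an incident:
log = DecisionLog("INC-2041")
log.record("IC", "roll back deploy 1.14.2", "error rate doubled after release")
log.record("Comms", "posted status-page update", "15-minute cadence")
```

Even a shared doc with timestamps achieves the same thing; the point is that decisions get written down when they happen, not after.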
Here’s a lightweight incident checklist we’ve used to keep things steady:

INCIDENT CHECKLIST (quick)
1) Declare incident + severity
2) Assign roles: IC / Comms / Ops
3) Stabilize: rollback, feature flag off, scale, rate-limit
4) Establish facts: dashboards, recent deploys, error spikes
5) Communicate every 15-30 mins (even if "no change")
6) Capture timeline + decisions
7) Close only when metrics confirm recovery
8) Schedule blameless retro within 48 hours

For the retro, we stick to “what in the system made this likely?” rather than “who touched it last?” If you need a solid template, Atlassian’s incident postmortem guide is a decent starting point—then we simplify it until people actually use it.

Use Metrics That Help Humans, Not Just Dashboards

Metrics can be leadership tools or leadership theatre. We’ve all seen the dashboard that looks impressive and answers exactly zero questions. The goal is to choose a small set of measures that drive better decisions and healthier systems.

We typically start with DORA metrics—deployment frequency, lead time, change failure rate, and MTTR—not because they’re trendy, but because they force useful conversations. If lead time is high, we ask: is it review queues, flaky tests, or oversized changes? If MTTR is high, we ask: do we lack runbooks, observability, or safe rollbacks?

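Computing the four DORA metrics from deploy records is straightforward. A sketch, assuming records shaped like the ones below (field names are ours; real data would come from your CI/CD and incident tooling):

```python
from datetime import datetime

# Hypothetical deploy records: when the change was committed, when it was
# deployed, whether it failed in production, and minutes to restore if it did.
deploys = [
    {"committed": datetime(2024, 5, 1, 9),  "deployed": datetime(2024, 5, 1, 15), "failed": False, "restore_mins": 0},
    {"committed": datetime(2024, 5, 2, 10), "deployed": datetime(2024, 5, 3, 11), "failed": True,  "restore_mins": 42},
    {"committed": datetime(2024, 5, 4, 8),  "deployed": datetime(2024, 5, 4, 12), "failed": False, "restore_mins": 0},
]

def dora_summary(deploys, window_days=7):
    # Lead time: commit to running in production, in hours.
    lead_times = sorted(
        (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deploys
    )
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time_h": lead_times[len(lead_times) // 2],
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_mins": (
            sum(d["restore_mins"] for d in failures) / len(failures)
            if failures else 0
        ),
    }

summary = dora_summary(deploys)
```

The numbers themselves matter less than the questions they trigger: a 25-hour lead time in that sample points at review queues or batch size, not typing speed.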
But we don’t stop at delivery. We also measure toil (time spent on repetitive operational work) and pager noise (pages per week, after-hours pages, repeat offenders). If the team is always interrupted, shipping will slow down no matter how many “be more agile” posters we hang.

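Pager noise is just as measurable. A sketch under the same caveats (the event shape and alert names are invented; real events would come from your paging tool’s export):

```python
from collections import Counter
from datetime import datetime

# Hypothetical page events: when each page fired and which alert fired it.
pages = [
    {"at": datetime(2024, 5, 6, 3, 12),  "alert": "disk-usage-high"},
    {"at": datetime(2024, 5, 6, 14, 5),  "alert": "payments-latency"},
    {"at": datetime(2024, 5, 7, 2, 40),  "alert": "disk-usage-high"},
    {"at": datetime(2024, 5, 8, 11, 0),  "alert": "disk-usage-high"},
]

def pager_noise(pages, work_start=9, work_end=18):
    # After-hours pages are the ones that cost sleep, so count them separately.
    after_hours = [p for p in pages if not (work_start <= p["at"].hour < work_end)]
    # Repeat offenders: the alerts that page most often are the best
    # candidates for a fix (or deletion).
    repeat_offenders = Counter(p["alert"] for p in pages).most_common(3)
    return {
        "pages": len(pages),
        "after_hours": len(after_hours),
        "repeat_offenders": repeat_offenders,
    }

noise = pager_noise(pages)
```

An alert that fires three times a week and wakes people twice is a backlog item with a name on it, not background noise.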
The leadership move is to treat metrics as signals, not weapons. If metrics are used to punish, people will game them. If metrics are used to learn, people will improve them.

Here’s a simple example of how we might codify error-budget policy so it’s not a vibe-based argument every sprint:

# error-budget-policy.yml
service: payments-api
slo:
  availability: 99.90
window: 28d
error_budget_actions:
  - when_burn_rate_gt: 2.0
    for: 2h
    actions:
      - "pause non-essential releases"
      - "assign incident review owner"
  - when_budget_remaining_lt: 30   # percent of budget left
    actions:
      - "focus sprint on reliability work"
      - "require rollback plan for all changes"

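The policy above hinges on burn rate, and the arithmetic is worth spelling out. For a 99.90% availability SLO, the error budget is 0.1% of requests over the window; burn rate compares the observed error rate to that allowance, so a burn rate of 2.0 means we’re spending budget twice as fast as the SLO permits. A sketch (the request counts are invented for illustration):

```python
def burn_rate(errors, requests, slo=99.90):
    # Allowed error rate for the SLO: 1 - 0.9990 = 0.001 for 99.90%.
    allowed_error_rate = 1 - slo / 100
    observed_error_rate = errors / requests
    # Ratio > 1 means the budget is being consumed faster than the SLO allows.
    return observed_error_rate / allowed_error_rate

# 240 failed requests out of 120,000 in the last 2 hours:
# observed rate 0.002 against an allowance of 0.001, so a burn rate of 2.0,
# which trips the "pause non-essential releases" action in the policy above.
rate = burn_rate(errors=240, requests=120_000)
```

Putting the threshold in a file means the Friday-afternoon argument is with the policy, not with each other.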
For SLO thinking, Google’s SRE Workbook remains one of the most practical references we can keep on the shelf.

Grow Engineers by Giving Them Real Ownership

DevOps leadership isn’t just about shipping; it’s about growing people who can ship without us hovering. The fastest way to stall a team is to centralise every decision in “the lead.” The fastest way to scale is to distribute ownership with guardrails.

We aim to give engineers problems, not tasks. “Implement these five Terraform resources” is a task. “Make environment creation take under 30 minutes with auditability” is a problem. When people own problems, they learn to reason about trade-offs, not just follow instructions.

A few tactics that work well:
- Rotating tech lead for a quarter: not a title, a responsibility.
- Decision records (ADRs): small, written decisions that explain “why,” so future-us doesn’t curse present-us.
- Paved roads: make the right thing the easy thing—templates, golden paths, sensible defaults.

We also pair ownership with safety. If someone owns a service, they need good observability, a rollback plan, and clear escalation. Ownership without tools is just stress with a fancy ribbon on it.

And we’re deliberate about feedback. We don’t save it for annual reviews like it’s a rare wine. We give quick, specific notes: “Your incident write-up was clear and actionable,” or “Next time, pull in comms earlier.” This builds capability without turning every interaction into a performance evaluation.

If you’re looking for a practical approach to structuring teams around ownership, Team Topologies has a lot of sensible language for it—then we can adapt it to our reality.

Standardise the Boring Stuff With Automation (and Consent)

Standardisation is a leadership choice: we decide which things are “one way,” so teams don’t waste time debating tabs versus spaces while prod is on fire. But we also avoid standardising for sport. If it doesn’t reduce risk or cognitive load, it’s probably not worth a policy.

The best candidates for standardisation are repeatable workflows: CI pipelines, environment provisioning, secret handling, dependency updates, and release processes. When these are inconsistent, teams spend energy relearning basics instead of solving customer problems.

Here’s a minimal GitHub Actions workflow that demonstrates a clean, repeatable pipeline: lint, test, build, and only then deploy. It’s not fancy—and that’s the point.

# .github/workflows/ci-cd.yml
name: ci-cd

on:
  push:
    branches: [ "main" ]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run lint
      - run: npm test

  deploy:
    if: github.ref == 'refs/heads/main'
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh

The “and consent” part matters: we involve teams in defining the standard, and we keep an escape hatch for legitimate exceptions. Leadership isn’t “because I said so.” It’s “because it measurably reduces pain, and we all agree the trade-off is worth it.”

Lead Up, Lead Across, and Protect the Team’s Focus

A big chunk of DevOps leadership is translating between worlds: product wants features, security wants controls, finance wants predictability, and engineering wants fewer surprise “priority zero” projects. Our role is to make these needs compatible—or at least to stop them from colliding at full speed.

We lead up by bringing options, not complaints. “We can launch this by Friday if we accept higher incident risk, or we can launch next Wednesday with a staged rollout and rollback plan.” Executives understand trade-offs; they don’t understand “it’s complicated” (even when it is).

We lead across by building relationships with adjacent teams before we need them. The worst time to meet security is during a breach. The worst time to meet networking is during an outage. We set regular touchpoints and keep them short: what’s changing, what’s risky, what needs help.

And we protect focus like it’s production data. Context switching kills throughput. We’ve had success with two simple rules: no-meeting blocks a few times a week, and a single intake path for “urgent” requests—so “urgent” doesn’t mean “whoever shouted loudest in chat.”

The fun part: when we do this well, we don’t look like heroes. Things just work. Which, honestly, is the dream.
