Practical DevOps Leadership Without the Heroics

How we lead calmly, ship steadily, and keep our weekends.

Leadership Starts With Fewer Surprises, Not Louder Opinions

Leadership in DevOps looks less like big speeches and more like quietly removing “gotchas” from everyone’s week. We’re not here to win debates in incident channels; we’re here to make outcomes boring (in the best way). That means we lead by reducing uncertainty: clear priorities, predictable pipelines, and fewer “only Sam knows how that works” systems.

A helpful mental shift: our job isn’t to be the smartest person in the room—it’s to make the room work without requiring a genius on call. When we do that, teams move faster because they’re not tip-toeing around fragile processes. We also avoid the trap of “leadership theatre,” where we mistake activity for progress.

We can start small. Ask: what are the top three recurring surprises this month? Maybe it’s flaky tests, unclear ownership, or deployments that behave differently across environments. Then pick one and fix it with the team, visibly. People follow leaders who turn chronic pain into solved problems.

This is also where we set tone: calm curiosity over blame, written decisions over hallway lore, and “show your work” over “trust me.” If we want a reference point for the culture piece, Google’s old-but-still-useful write-up on Site Reliability Engineering is a solid reminder that reliability is mostly about systems, not superheroes.

The punchline: good leadership makes the easy path the safe path.

Set Direction With Guardrails, Not Micromanagement

If we’re constantly answering “can I deploy this?” we don’t have a team—we have a queue. Leadership means setting direction and boundaries so people can make decisions without waiting for permission. The trick is to be specific about what must be true, and flexible about how we get there.

We like to define guardrails in three layers:

1) Product and risk priorities: what matters most right now (latency, cost, feature delivery, uptime).
2) Operating constraints: compliance requirements, data handling rules, change windows if they exist.
3) Engineering defaults: standard pipeline steps, required checks, and rollback expectations.

Then we write it down. A one-page “How we ship” doc beats a dozen tribal rules. If someone new joins and can’t find the answer in 10 minutes, our process isn’t a process—it’s folklore.

Where teams get stuck is ambiguity: “move fast, but don’t break things” is not guidance. “All services must have a rollback path and a monitored SLO before going to production” is guidance.
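Guidance like that can even be checked mechanically before a deploy. Here's a minimal sketch in Node; the manifest shape (`rollback`, `slo` fields) is an assumption, not a standard — adapt it to whatever your services actually declare.

```javascript
// Hypothetical pre-deploy guardrail check: a service manifest must declare
// a rollback path and a monitored SLO before it may target production.
// The manifest field names here are illustrative assumptions.
function checkGuardrails(manifest) {
  const problems = [];
  if (!manifest.rollback || !manifest.rollback.strategy) {
    problems.push("no rollback path declared");
  }
  if (!manifest.slo || typeof manifest.slo.target !== "number") {
    problems.push("no monitored SLO declared");
  }
  return { ok: problems.length === 0, problems };
}

// Example: this manifest would be blocked from production.
const result = checkGuardrails({ rollback: { strategy: "blue-green" } });
console.log(result.ok, result.problems); // false [ 'no monitored SLO declared' ]
```

The value isn't the ten lines of code; it's that "must be true" conditions become a checklist the pipeline runs, not a rule someone has to remember.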

It also helps to make trade-offs explicit. For example: “We accept a 5% cost increase this quarter to eliminate manual deploy steps.” When the team understands the trade, they can make consistent calls without escalation.

If we need inspiration for crisp written standards, the DORA research is useful—not as a scoreboard, but as evidence that clear delivery practices correlate with healthier outcomes. Guardrails aren’t red tape; they’re how we scale trust.

Make Incidents a Leadership Moment (Without the Drama)

Incidents are where leadership shows up, whether we planned to attend or not. The goal isn’t to be the loudest voice on the call. The goal is to keep the system and the humans stable enough to recover quickly and learn something real afterward.

During an incident, we like three roles: Incident Commander, Operations/Comms, and Investigators. Even in a small team, naming who’s doing what prevents the classic failure mode: five people chasing symptoms while nobody updates stakeholders.

The tone matters. “What changed?” beats “Who did it?” every single time. When we blame, we get silence. When we stay curious, we get signal. A good leader protects the channel from side quests, ensures breaks happen, and calls time when things are spiralling.

After the dust settles, we do a blameless review that’s actually blameless—meaning we look for contributing factors: missing alerts, confusing dashboards, risky deploy patterns, unclear ownership. If your postmortems always end with “engineer needs to be more careful,” congratulations: you’ve built a system that will keep failing in creative ways.

For a solid model of incident learnings, Atlassian’s incident practice notes are a good read, even if you don’t use their tools: Atlassian Incident Management. The point isn’t the template; it’s the habit of turning pain into prevention.

Leadership here is simple: keep people safe, keep information flowing, and make the next incident smaller.

Codify Standards: A Tiny Pipeline Policy That Saves Hours

We can talk about consistency all day, or we can encode it into the delivery path. Leadership that scales is usually written in YAML somewhere (unromantic, but effective). When our standards live in pipelines, we stop re-litigating basics in every PR.

Here’s an example GitHub Actions workflow that enforces a few “grown-up” defaults: linting, tests, and a policy check that blocks merges when coverage drops below a threshold. It’s not fancy—just dependable.

name: ci

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install
        run: npm ci

      - name: Lint
        run: npm run lint

      - name: Unit tests
        run: npm test -- --ci

      - name: Coverage gate
        run: |
          COVERAGE=$(node scripts/read-coverage.js)
          echo "coverage=$COVERAGE"
          node -e "process.exit($COVERAGE < 80 ? 1 : 0)"

The leadership move isn’t the coverage number; it’s the shared agreement that quality checks are automatic and non-negotiable. Teams argue less when the system is the referee.

If you want a broader view on why automated checks improve flow, Martin Fowler’s continuous integration guidance is still sharp: Continuous Integration. The more we automate the “should we?” questions, the more time we have for the “what’s next?” questions.

We can start with one or two checks. The win is making the default path safe—so nobody needs permission to do the right thing.

Use Infrastructure as Code to Clarify Ownership

Leadership also means making ownership visible. If production is a maze of manually tweaked settings, nobody truly owns it—and everyone fears touching it. Infrastructure as Code (IaC) turns “I think that’s how it works” into “here’s the diff.”

A small Terraform example shows the idea: we define a service’s baseline—tags, logging, and environment separation—in code. This makes reviews possible and drift obvious.

terraform {
  required_version = ">= 1.6.0"
}

variable "env" {
  type    = string
  default = "staging"
}

locals {
  common_tags = {
    service = "payments"
    env     = var.env
    owner   = "platform-team"
  }
}

resource "aws_cloudwatch_log_group" "app" {
  name              = "/apps/payments/${var.env}"
  retention_in_days = 30
  tags              = local.common_tags
}

This isn’t about Terraform worship. It’s about clarity: who owns the service, where logs live, and what “normal” looks like. When someone asks, “Where are the logs for staging?” we don’t answer with folklore—we point to code.

Leadership-wise, IaC reviews are also coaching moments. We can teach patterns: naming conventions, tagging standards, retention defaults, and least privilege. We can also make cross-team contributions safer because the blast radius is visible in pull requests.
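Those tagging standards can also be enforced rather than just taught. One way is a small check over the machine-readable plan that `terraform show -json` emits; the sketch below assumes the standard plan format (resource_changes[].change.after.tags) and the three tags from the example above.

```javascript
// Tag-policy sketch over Terraform's JSON plan output
// (`terraform show -json plan.out`). Required tags mirror the example above.
const REQUIRED_TAGS = ["service", "env", "owner"];

function missingTags(planJson) {
  const violations = [];
  for (const rc of planJson.resource_changes || []) {
    const after = (rc.change && rc.change.after) || {};
    const tags = after.tags || {};
    const missing = REQUIRED_TAGS.filter((t) => !(t in tags));
    if (missing.length > 0) {
      violations.push({ address: rc.address, missing });
    }
  }
  return violations;
}

module.exports = { missingTags };
```

A real version would skip resource types that don't support tags, but even this crude form turns "please tag your resources" from a review comment into a failing check.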

If your team is new to IaC discipline, HashiCorp’s intro docs are straightforward and practical: Terraform Documentation. The goal isn’t perfection; it’s reducing mystery.

When systems are legible, on-call gets easier, onboarding gets faster, and trust grows quietly—our favourite kind of growth.

Measure What Hurts, Then Fix the Bottleneck

Leadership loves metrics until metrics start tattling. We don’t need dashboards that look like a spaceship; we need a few indicators that tell us where time and joy are leaking out of the process.

We typically track:

  • Lead time to production (not “time in Jira”)
  • Deployment frequency (per service, not per org)
  • Change failure rate (rollbacks, hotfixes, incidents tied to deploys)
  • Mean time to restore (how fast we recover, not how fast we panic)
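These stay honest when they're computed from raw records rather than self-reported. A sketch of two of them; the record shapes (`rolledBack`, `causedIncident`, `startedAt`/`restoredAt` timestamps) are hypothetical, stand-ins for whatever your deploy and incident logs actually contain.

```javascript
// Two of the indicators above, derived from raw records.
// A deploy counts as "failed" if it was rolled back or caused an incident.
function changeFailureRate(deploys) {
  if (deploys.length === 0) return 0;
  const failed = deploys.filter((d) => d.rolledBack || d.causedIncident).length;
  return failed / deploys.length;
}

// Mean time to restore, from incident start/restore timestamps (ms since epoch).
function meanTimeToRestoreMinutes(incidents) {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.restoredAt - i.startedAt), 0);
  return totalMs / incidents.length / 60000;
}
```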

But we use them for diagnosis, not for punishment. The moment metrics become a performance weapon, people will game them. Leadership is creating an environment where the numbers help us argue for better tooling, fewer manual steps, and clearer ownership.

A practical approach: pick one bottleneck per quarter. If lead time is high, is it code review backlog, flaky tests, or slow environments? If change failure is high, do we lack canaries, or are configs drifting? Then fix one thing and re-measure. Improvement is usually a chain of small wins.

We also recommend linking metrics to concrete interventions: “reduce flaky tests by quarantining and fixing top 10 failures,” or “reduce MTTR by improving alert routing and runbooks.” Make the work boring and specific.

If we need a sanity check on which metrics matter, the DORA metrics overview is a good anchor. We don’t need to worship the numbers—we just need to let them point us to the next most annoying problem.

Grow People With Clear Expectations and Safe Feedback

Tools don’t quit; people do. Leadership is keeping the team healthy enough to ship over the long haul. That means clarity: what “good” looks like, what’s expected at each level, and how we give feedback without turning it into theatre.

We’ve had the best outcomes with three habits:

1) Role clarity: not rigid job descriptions, but a shared understanding of responsibilities (build ownership, on-call expectations, documentation, reviewing).
2) Regular 1:1s: not status updates—those belong elsewhere. 1:1s are for obstacles, growth, and the stuff people won’t say in standups.
3) Small, frequent feedback: “When X happened, it caused Y. Next time, try Z.” No grand performance monologues.

We also treat documentation as a leadership artefact. If someone solved a problem at 2 a.m. and didn’t write it down, they didn’t finish the job (kindly said, but still true). Writing things down is how we respect future teammates.

Finally, we protect focus time. Constant interruptions destroy both productivity and morale. A leader’s calendar is often chaotic; the team’s calendar shouldn’t be. We can set office hours, rotate support duty, and keep Slack from becoming an always-on firehose.

For a good reference on healthy team dynamics and psychological safety, this overview of Google’s research is a decent starting point: re:Work – Psychological Safety. The aim isn’t to be “nice.” It’s to be effective without burning people out.
