Practical DevOps Leadership Without the Drama


How we build trust, ship changes, and keep sleep intact.

Leadership Starts With “What Problem Are We Solving?”

DevOps leadership gets oddly mystical on the internet. We’d rather keep it grounded: our job is to help the team solve real problems—faster, safer, and with fewer 2 a.m. surprises. When we lead well, people don’t feel “managed”; they feel unblocked. The simplest move is also the most underused: before we approve a tool, a process, or a project, we ask, “What problem are we solving, and how will we know it’s solved?”

That question does three things. First, it turns vague pain (“deploys are scary”) into something measurable (“deploy failure rate is 18%, rollbacks take 40 minutes”). Second, it exposes when we’re about to start a hobby project disguised as strategy (we’ve all been there). Third, it gives the team permission to say “no” to work that doesn’t connect to outcomes.

We like a lightweight “problem brief” we can write in 10 minutes: context, impact, current workaround, and acceptance criteria. The leadership trick is making this culturally safe—no eye-rolling, no “why didn’t you think of that,” no performative debate. If someone can’t articulate the problem yet, that’s fine; we help them sharpen it.
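To make the brief concrete, here is a minimal Python sketch of its shape. The field names and the readiness rule are our own convention for illustration, not a standard format:

```python
from dataclasses import dataclass, field

@dataclass
class ProblemBrief:
    """Ten-minute problem brief; fields mirror our convention, not a standard."""
    context: str     # what's happening and where
    impact: str      # measurable pain, e.g. "deploy failure rate is 18%"
    workaround: str  # how the team copes today
    acceptance: list[str] = field(default_factory=list)  # "done" criteria

    def is_ready(self) -> bool:
        # Ready for discussion once impact is stated and at least
        # one acceptance criterion exists.
        return bool(self.impact.strip()) and len(self.acceptance) > 0

brief = ProblemBrief(
    context="Deploys to the checkout service feel risky",
    impact="Deploy failure rate 18%; rollbacks take 40 minutes",
    workaround="Deploy only on Tuesdays with two engineers watching",
    acceptance=["Failure rate under 5%", "Rollback under 10 minutes"],
)
print(brief.is_ready())  # True
```

The point isn't the code; it's that "ready" is checkable, so a half-formed brief triggers a conversation instead of an approval.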

If you want a solid reference point, DORA metrics provide a pragmatic frame for what “better” can look like: lead time, deployment frequency, time to restore, and change failure rate. Not because metrics are magic—because shared language prevents endless opinion wars.
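The four metrics fall out of plain deploy records. A small sketch, assuming a hypothetical record shape (the fields below are illustrative, not a standard schema):

```python
# Each record describes one deploy — illustrative shape, not a standard schema.
deploys = [
    {"lead_time_h": 20, "failed": False, "restore_min": 0},
    {"lead_time_h": 48, "failed": True,  "restore_min": 40},
    {"lead_time_h": 12, "failed": False, "restore_min": 0},
    {"lead_time_h": 28, "failed": True,  "restore_min": 24},
]
days_observed = 7

deploy_frequency = len(deploys) / days_observed                    # deploys/day
lead_time = sum(d["lead_time_h"] for d in deploys) / len(deploys)  # avg hours
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)
time_to_restore = sum(d["restore_min"] for d in failures) / len(failures)

print(f"{deploy_frequency:.2f}/day, lead {lead_time:.0f}h, "
      f"CFR {change_failure_rate:.0%}, MTTR {time_to_restore:.0f}min")
# → 0.57/day, lead 27h, CFR 50%, MTTR 32min
```

Twenty lines of arithmetic won't settle strategy, but it does end the "deploys feel fine to me" debate with a shared number.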

The Leadership Move We Underuse: Make Work Visible

If work is invisible, leadership becomes guesswork. And guesswork becomes… meetings. Our best teams don’t run on heroics; they run on clarity. Visibility isn’t about surveillance—it’s about reducing surprises and helping everyone make better decisions with the same information.

We aim for three layers of visibility:

1) Flow visibility: What’s in flight, what’s blocked, and why.
2) Risk visibility: What changes are risky, where dependencies sit, and what could break.
3) Ownership visibility: Who’s on point when something goes sideways.

A practical approach is a single “operations board” (Jira, GitHub Projects, whatever we can keep updated). The leadership part is insisting that the board reflects reality. If it’s not updated, we don’t shame anyone—we adjust the workflow until updating it is the path of least resistance.

We also like a short weekly written update (yes, written). Nothing fancy: top outcomes, top risks, asks. Writing forces clarity, and reading scales better than another call. If you want inspiration for written-first culture, GitLab’s handbook approach is a goldmine—even if we don’t copy it wholesale.

Finally, we normalize “red is a colour, not a career-limiting event.” When people can mark something as blocked without fear, issues surface early. That’s how we avoid the classic DevOps anti-pattern: everything is “fine” until it’s on fire.

Set Standards That Remove Decisions, Not Autonomy

DevOps leadership isn’t about controlling people; it’s about controlling chaos. The best standards remove repetitive decisions so teams can spend their brainpower on the hard stuff. We don’t need a 40-page policy doc—just a few defaults that make the safe path the easy path.

A good standard has three traits: it’s small, enforced by tooling, and easy to explain. For example: “Every service has health checks, resource limits, and a rollback plan.” That’s not bureaucracy; that’s self-defense.

Here’s a tiny Kubernetes baseline we’ve used as a starting point—opinionated enough to help, flexible enough to evolve:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
      - name: app
        image: example/app:1.2.3
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet: { path: /ready, port: 8080 }
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          requests: { cpu: "100m", memory: "128Mi" }
          limits:   { cpu: "500m", memory: "512Mi" }
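"Enforced by tooling" can start very small. Here's a hedged sketch of a lint check over an already-parsed Deployment manifest; the field paths follow the Kubernetes Deployment schema, while the rules themselves are our own baseline:

```python
def lint_deployment(manifest: dict) -> list[str]:
    """Flag containers missing our baseline: probes and resource limits."""
    problems = []
    containers = (manifest.get("spec", {})
                          .get("template", {})
                          .get("spec", {})
                          .get("containers", []))
    for c in containers:
        name = c.get("name", "<unnamed>")
        if "readinessProbe" not in c:
            problems.append(f"{name}: missing readinessProbe")
        if "livenessProbe" not in c:
            problems.append(f"{name}: missing livenessProbe")
        if not c.get("resources", {}).get("limits"):
            problems.append(f"{name}: missing resource limits")
    return problems

# A container that would fail the check:
manifest = {"spec": {"template": {"spec": {"containers": [
    {"name": "app", "image": "example/app:1.2.3"}]}}}}
for problem in lint_deployment(manifest):
    print(problem)
```

Wire something like this into CI and the standard enforces itself; nobody has to play manifest cop in code review.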

Leadership is also knowing when not to standardize. If a standard creates constant exceptions, it’s either wrong or too early. We treat standards like code: versioned, reviewed, and retired when they no longer help. If you’re looking for a sensible north star, the SRE book does a great job describing how reliability practices become repeatable without becoming oppressive.

Incident Leadership: Calm Beats Clever Every Time

When production breaks, leadership shows up in tone more than words. We can’t outsmart an outage with a dramatic speech. What works is calm, clarity, and a predictable routine. Our goal is to lower cognitive load so engineers can think again.

We like a simple incident structure:

  • Incident commander (IC): keeps the timeline and assigns tasks
  • Tech lead: investigates and proposes actions
  • Comms: updates stakeholders and users
  • Scribe: documents what happened and decisions made

The leadership trick: we don’t let the most senior person automatically become IC. The best IC is the one who can coordinate and stay calm—not necessarily the one who can debug the fastest. We rotate the role, train it, and treat it as a skill.

Comms matters more than we think. Stakeholders can handle bad news; they can’t handle silence. We keep updates short, time-boxed, and honest: what we know, what we’re doing, when the next update lands.

Here’s a Slack-style incident update template we’ve found works well:

[INCIDENT] SEV-2 | Checkout errors increased

Status: Investigating
Impact: ~12% of checkout requests failing (5xx)
Start: 14:05 UTC
Current hypothesis: DB connection pool exhaustion after deploy 2026.03.15.2
Actions:
- Rollback in progress (ETA 6 min) @alex
- DB metrics review @sam
Next update: 14:20 UTC
Links: dashboard | logs | runbook

Afterwards, we run blameless reviews that focus on conditions, not character. If you need a reference for doing this well, PagerDuty’s incident response resources are practical and battle-tested. The real leadership win is turning “we got lucky” into “we got better.”

Feedback as a System, Not a Mood

We’ve all worked in places where feedback only appears when someone’s annoyed. That’s not feedback; that’s weather. Leadership means building a system where feedback is frequent, specific, and safe—so nobody has to guess how they’re doing.

We keep it simple:

  • Weekly 1:1s focused on obstacles, not status
  • Monthly growth check-ins: skills, scope, next steps
  • Post-project reviews: what worked, what didn’t, what we’ll try next

In 1:1s, we ask three repeatable questions:
1) What’s slowing you down?
2) What’s one thing you’re proud of this week?
3) If you were me, what would you change?

We also try to make praise as operationally useful as critique. “Great job” is nice; “Great job because you reduced deploy risk by adding a canary and documenting rollback steps” teaches the team what good looks like.

When we need to deliver tough feedback, we keep it anchored in observable behaviour and impact. No mind-reading, no labels. That’s not just kinder—it’s more actionable. If someone’s missing expectations, we co-design the next two weeks: what “good” will look like, what support we’ll provide, and how we’ll check progress.

For a thoughtful lens on creating safety and learning culture, Amy Edmondson’s work on psychological safety is worth reading. We don’t need to turn it into a slogan—we just need to build routines where people can speak up before problems become outages.

Hiring and Onboarding: Leadership That Pays Off Quarterly

DevOps leadership isn’t only about today’s sprint—it’s about building a team that can still ship six months from now without burning out. Hiring and onboarding are compounding investments: do them well, and everything gets easier; do them poorly, and we’ll “move fast” straight into a wall.

In hiring, we optimize for three things: systems thinking, debugging habits, and collaboration under ambiguity. Tool knowledge matters, but tools change. We’d rather hire someone who can reason through a messy failure than someone who memorized every CLI flag. We also look for writing ability—because incident notes, runbooks, and design docs are how we scale ourselves.

Onboarding is where leadership can quietly transform a team. We aim for a 30/60/90 plan that includes:
– environments and access (day 1 wins)
– one small production change in week 1–2
– pairing sessions with infra, app, and security folks
– ownership of a service area by day 60–90

We also keep an onboarding “golden path” doc: how to get a dev environment running, how to deploy safely, how to find logs, how to request changes. The doc is never “done”—new hires improve it. That’s leadership too: letting the system evolve with reality.

If you want a pragmatic model for reducing toil and making work sustainable, Google’s SRE concept of toil is a helpful framing. It keeps us honest about whether we’re building a team—or just feeding a queue.

Metrics, Guardrails, and the Courage to Say “No”

We can’t lead operations with vibes alone. Metrics don’t replace judgment, but they do prevent arguments powered purely by confidence. Our leadership job is to choose a small set of metrics that reflect outcomes and pair them with guardrails that prevent “progress” from becoming chaos.

We like a basic scorecard:
– DORA metrics (delivery performance)
– availability/SLO health (reliability)
– cost trend (sustainability)
– operational load (on-call pages, ticket volume, toil)
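One way to keep the scorecard glanceable is a simple traffic-light grading. The thresholds below are illustrative assumptions to show the mechanism, not official DORA bands:

```python
# (green at or below, red above) — illustrative thresholds; tune to your context.
THRESHOLDS = {
    "change_failure_rate": (0.05, 0.15),
    "mttr_minutes":        (30, 120),
    "pages_per_week":      (5, 15),
}

def grade(metric: str, value: float) -> str:
    green, red = THRESHOLDS[metric]
    if value <= green:
        return "green"
    return "red" if value > red else "amber"

scorecard = {
    "change_failure_rate": 0.12,  # amber: rising, not on fire
    "mttr_minutes": 25,           # green
    "pages_per_week": 22,         # red: on-call load is the real problem
}
for metric, value in scorecard.items():
    print(f"{metric}: {value} -> {grade(metric, value)}")
```

The grading is trivial on purpose: the leadership work is agreeing on the thresholds once, so reviews argue about actions instead of colours.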

Then we connect those to decisions. Example: if change failure rate climbs, we don’t yell “be careful.” We change the system: smaller batches, better tests, staged rollouts, clearer ownership.

This is also where leadership means saying “no” to work that looks exciting but weakens the platform. If a team wants to introduce a new database, runtime, or CI system, we ask them to fund the operational cost: on-call readiness, dashboards, runbooks, and a migration plan. If they can’t, it’s not a moral failing—it’s just not ready.

A lightweight way to encode guardrails is policy-as-code. Even a simple check for required tags or approved regions can prevent expensive surprises. We’re not trying to police; we’re trying to avoid late-night archaeology.
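Real OPA policies are written in Rego, but the idea fits in a few lines of any language. A Python sketch of a required-tags and approved-regions check, with the policy values as illustrative assumptions:

```python
REQUIRED_TAGS = {"owner", "cost-center"}          # illustrative policy values
APPROVED_REGIONS = {"eu-west-1", "us-east-1"}

def check_resource(resource: dict) -> list[str]:
    """Return policy violations for one cloud resource description."""
    violations = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append(f"missing tags: {sorted(missing)}")
    region = resource.get("region")
    if region not in APPROVED_REGIONS:
        violations.append(f"region not approved: {region}")
    return violations

print(check_resource({"tags": {"owner": "payments"}, "region": "ap-south-1"}))
# → ["missing tags: ['cost-center']", 'region not approved: ap-south-1']
```

Run a check like this in the provisioning pipeline and the "who owns this mystery instance?" archaeology dig never has to happen.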

For a solid overview of policy-as-code patterns, Open Policy Agent is a good starting point. The leadership punchline: constraints are kindness when they prevent predictable pain.
