Quietly Effective Leadership for Busy DevOps Teams
How we lead without speeches, posters, or heroic late nights
Lead With Clarity, Not Volume
In DevOps teams, “leadership” isn’t the person with the loudest opinions in the incident channel. It’s the person (or group) that makes work clearer when everything feels messy. We’ve all seen the opposite: priorities shift daily, tickets multiply like gremlins, and “urgent” becomes a permanent state of being. The fix usually isn’t another meeting. It’s clarity—delivered calmly, repeated often, and backed up by visible decisions.
We can start with three habits. First, name the goal in plain language: “Reduce deploy failures” beats “improve reliability posture.” Second, set a small number of priorities and actually defend them. If everything is top priority, nothing is. Third, make trade-offs explicit. When we accept a feature request that increases operational risk, we should say so, write it down, and decide intentionally—rather than discovering it at 2:13 a.m.
Clarity also means making it easy to know “what good looks like.” Define a few measures the team trusts: deploy frequency, change failure rate, MTTR, on-call load. Not as a stick—more like the dashboard on the car. We don’t stare at the speedometer to feel judged; we glance at it to avoid doing something silly.
A practical trick: write short “decision notes” in the repo or wiki. Two paragraphs: what we decided, and why. It reduces re-litigation, helps new joiners ramp up, and keeps us honest when we’re tempted to rewrite history.
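A sketch of what one of those notes might look like (the decision itself is invented):

```text
DECISION: Deploy only via the pipeline; no manual kubectl applies to prod
DATE: YYYY-MM-DD   OWNER: platform team

What we decided: All production changes go through the CI/CD pipeline,
including "quick fixes." Break-glass access exists but triggers a review.

Why: Two recent incidents came from untracked manual changes. We are
trading a little speed on hotfixes for an audit trail and repeatability.
```

Two paragraphs, a date, an owner. That’s enough to stop the same argument from happening three more times.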
Make Work Visible (So It Stops Being Magical)
A lot of leadership work is reducing invisible labour. When toil hides in private DMs and heroic context-switching, the team can’t improve it—and the person doing it quietly burns out. Visibility isn’t about surveillance; it’s about giving the team a shared map so we can choose better routes.
We’ve had success with a simple public “Ops Backlog” that includes recurring tasks (certificate renewals, patching cycles, access reviews) alongside incident follow-ups and reliability work. When that backlog is visible, product partners understand why “just one more change” isn’t free. It also helps us spot patterns: why are we rotating credentials manually? Why do we keep doing the same Kubernetes cleanup every sprint? That’s usually automation waiting to happen.
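As an example of “automation waiting to happen,” here’s a minimal Python sketch that flags certificates expiring soon. The inventory is hard-coded for illustration—in practice you’d pull expiry dates from your secrets manager or the certificates themselves:

```python
from datetime import date, timedelta

# Hypothetical inventory: cert name -> expiry date.
CERTS = {
    "api.example.com": date(2025, 3, 1),
    "internal-ca": date(2025, 1, 10),
}

def expiring_soon(certs: dict[str, date], today: date,
                  window_days: int = 30) -> list[str]:
    """Return cert names that expire within the window (or already have)."""
    cutoff = today + timedelta(days=window_days)
    return sorted(name for name, expiry in certs.items() if expiry <= cutoff)

print(expiring_soon(CERTS, today=date(2025, 1, 1)))  # ['internal-ca']
```

Run it on a schedule, post the result to the team channel, and a recurring manual check becomes a visible, shared task.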
We should also make incident work visible without turning it into theatre. Post-incident reviews are useful when they produce actions and learning—not when they produce blame and long essays. A lightweight template plus a fixed timebox keeps it sane.
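One possible shape for that lightweight template (adapt freely):

```text
POST-INCIDENT REVIEW (timebox: 45 minutes)
- Summary: one paragraph, plain language, no blame
- Timeline: key events only (detected, mitigated, resolved)
- Impact: who was affected, what broke, for how long
- What helped / what hurt during the response
- Actions: 3-5 max, each with an owner and a due date
```

If it can’t fit on one screen, it’s an essay, not a review.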
If you want a solid reference point for what to measure and why, the DORA research is still one of the clearest summaries we’ve got. And for cultivating learning-focused incidents, Etsy’s classic piece on blameless postmortems remains worth bookmarking.
Visibility gives us leverage: it turns “random pain” into “known work,” and known work can be prioritised, shared, automated, or deleted.
Turn On-Call From Punishment Into Practice
On-call is where leadership either shows up—or quietly evaporates. If we treat on-call as a rite of suffering, we’ll get churn, cynicism, and a team that’s one alert away from updating their LinkedIn. If we treat it as a practice—like fire drills, but with fewer burned bagels—we can make it sustainable.
First, we set expectations. What’s the response time? What counts as an incident? When do we wake people up? If “severity” is decided by whoever shouts first, that’s not severity—it’s adrenaline. Second, we invest in runbooks that actually help. A runbook that begins with “check logs” is technically true, but spiritually unhelpful. Good runbooks include “what success looks like,” safe rollback steps, and links to dashboards.
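A sketch of a runbook skeleton that follows those rules (service names and thresholds are placeholders):

```text
RUNBOOK: <service> high error rate
- What success looks like: error rate back under <threshold> on <dashboard>
- First checks: recent deploys, upstream dependency status, saturation graphs
- Safe mitigations: scale out; disable <feature flag>; roll back via <pipeline>
- Rollback steps: exact commands, expected duration, how to verify
- Escalate if: no improvement in 15 minutes -> page secondary on-call
```

The “what success looks like” line matters most: it tells a tired responder when to stop.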
Third, we rotate fairly and support the primary. Secondary on-call should be real support, not ceremonial. And leaders should occasionally take a shift (yes, even managers). Nothing sharpens our prioritisation like feeling the consequences of noisy alerts at midnight.
Here’s a minimal example of an escalation policy as code (using Terraform-ish pseudocode for readability). The point isn’t the provider—it’s the principle: on-call rules are reviewable, versioned, and not stuck in someone’s head.
resource "oncall_schedule" "primary" {
  name     = "platform-primary"
  timezone = "UTC"

  rotation {
    type         = "weekly"
    start_day    = "monday"
    participants = ["alice", "ben", "chandra", "diego"]
  }
}

resource "oncall_escalation" "sev1" {
  name = "sev1-escalation"

  step { notify = oncall_schedule.primary   timeout_minutes = 10 }
  step { notify = "platform-secondary"      timeout_minutes = 10 }
  step { notify = "incident-commander"      timeout_minutes = 10 }
  step { notify = "engineering-director"    timeout_minutes = 10 }
}
When on-call is deliberate, we learn faster, sleep more, and stop normalising pain.
Lead Incidents With Calm, Repeatable Mechanics
During incidents, leadership looks like creating order without creating panic. We don’t need a heroic “war room voice.” We need a simple system: roles, timelines, and communication that stakeholders can trust. The best incident leaders we’ve seen do three things well: they slow the room down, they protect focus, and they keep updates flowing.
A lightweight structure helps. We assign an Incident Commander (IC) to coordinate, a Tech Lead to drive investigation, and a Comms lead to update stakeholders. That’s it. We don’t need a cast of thousands; we need clear lanes. The IC’s job isn’t to debug. It’s to keep the team from tripping over itself.
We also standardise communication. Stakeholders don’t need play-by-play logs; they need current impact, what we’re doing, and when the next update will land. If we can’t estimate resolution, we can still estimate the next update. Reliability is often just disciplined communication.
This is where a Slack “/incident” workflow (or equivalent) pays off. Here’s a simple incident channel checklist we can paste as the first message:
INCIDENT TEMPLATE
- Severity: SEV-?
- Impact: Who/what is affected? How many users? Which regions?
- Start time: UTC
- Current status: Investigating / Mitigating / Monitoring / Resolved
- IC: @name
- Tech Lead: @name
- Comms: @name
- Links: dashboard, logs, recent deploy, runbook
- Next update: in 15 minutes at HH:MM UTC
For broader context on reducing risk in complex systems, Google’s SRE book is still a great reference—especially around incident response and error budgets. The trick is to adopt the parts that fit, not to cosplay as a hyperscaler.
Calm mechanics don’t remove stress, but they stop stress from turning into chaos.
Build Trust With Boring Consistency
Trust isn’t built in offsites. It’s built when we do the small things consistently: we follow through, we admit uncertainty, and we don’t punish people for telling the truth. In DevOps leadership, trust is the currency that buys speed. Without it, everything becomes slow: reviews drag, decisions stall, and people hoard information “just in case.”
One of the most practical trust builders is predictable decision-making. If we’re changing direction, we say why. If we’re declining a request, we explain what we’re prioritising instead. The goal isn’t to win popularity contests; it’s to reduce confusion. Confusion is expensive.
We also treat “bad news” as a signal, not a sin. When someone reports that a migration is riskier than we thought, the correct response is “thanks—let’s adjust,” not “why are you blocking progress?” The fastest teams we’ve led weren’t fearless; they were honest early.
And we don’t do “private praise, public blame.” Public blame teaches everyone to hide mistakes. Public learning teaches everyone to surface them. If we want fewer incidents, we need more truth, not better scapegoats.
A helpful model for team health is to look at cognitive load and focus—if the team is constantly context-switching, trust erodes because nobody can deliver predictably. The Team Topologies approach is a practical lens here, especially for deciding what the platform team owns versus what product teams should handle.
Boring consistency isn’t glamorous, but it’s the reason teams can take on big changes without falling apart.
Coach With Guardrails, Not Gatekeeping
A classic leadership failure mode in DevOps is becoming the human API gateway. Every decision routes through one person “for safety,” and soon we’ve invented a single point of failure with a coffee addiction. Coaching is how we scale without cloning ourselves.
We can start by writing down guardrails: what’s allowed without approval, what needs review, and what’s prohibited. Then we teach people how to operate inside them. Guardrails should be specific enough to guide decisions, but not so strict that they turn into a straitjacket.
For example: “Any change affecting customer authentication requires two reviewers and a staged rollout.” That’s a guardrail. “All changes require the platform lead to approve” is gatekeeping disguised as caution.
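One concrete way to encode part of that guardrail is a GitHub CODEOWNERS file, which routes review of sensitive paths to the right people (the paths and team names here are invented). The “two reviewers” and staged-rollout requirements live in branch protection rules and deploy tooling, not in this file:

```text
# .github/CODEOWNERS
/services/auth/   @org/identity-team
/infra/iam/       @org/platform-security
```

The guardrail is now reviewable and versioned—changing who guards what is itself a visible decision.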
We also invest in review quality. Code review isn’t just bug-catching; it’s skill transfer. Instead of rewriting someone’s work, we ask questions that teach: “What’s the rollback plan?” “What’s the failure mode if DNS is slow?” “Can we make this idempotent?” Over time, those questions become the team’s default thinking.
Pairing during risky operations is another great coaching tool. Not forever—just when the blast radius is large. It spreads knowledge and reduces the “only Sam knows how the deploy pipeline works” problem (and poor Sam deserves a holiday).
Finally, we celebrate deletions. Removing a brittle script, consolidating pipelines, shrinking the surface area of permissions—these are leadership wins. Less stuff means fewer surprises, and fewer surprises means happier humans.
Make Improvement a Habit With Small, Real Automation
DevOps leadership gets real when we protect time for improvement. Not as a quarterly “innovation week” that disappears the moment deadlines loom, but as a steady habit. The best trick we’ve used is to reserve a slice of capacity every sprint for reliability and toil reduction—and treat it like first-class work.
Automation is a great forcing function because it requires clarity: what exactly are we doing repeatedly, and why? But we don’t need to automate everything. Start with the annoying, frequent, low-risk tasks. If we automate one pain per week, the compounding effect is huge.
Here’s an example: a GitHub Actions workflow that runs Terraform plan on pull requests and applies on main, with an explicit manual approval step via environments. This is “leadership” because it creates safer, more repeatable change without relying on one person’s memory.
name: infra

on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: ["main"]
    paths: ["infra/**"]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform fmt -check
      - run: terraform validate
      - run: terraform plan -no-color

  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production  # configure required reviewers in repo settings
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve
We keep it simple, we keep it reviewable, and we keep humans in the loop where it matters. That’s how improvement becomes normal work, not a miracle.