Ship 30% Faster with Pragmatic Kanban for Busy Ops

Turn chaos into flow using simple limits, metrics, and visible work.

Ops Reality Check: Why Kanban Beats Wishful Planning

Most ops work doesn’t queue politely. Tickets tumble in sideways, incidents land at midnight, and someone always needs a small change “real quick.” That’s why kanban feels like a comfortable hoodie for ops: it starts with what we’re already doing, makes work visible, and adds just enough constraints to keep us from drowning in our own good intentions. We don’t need to guess two weeks of work or negotiate with an empty calendar. We need flow.

Kanban’s core ideas are simple: visualize work, limit work-in-progress (WIP), and improve via small, continuous tweaks. Visualization makes hidden queues and half-started tasks visible. WIP limits force trade-offs—do we want more started work or more finished work? And continuous improvement lets us tune policies and limits as we learn. It’s not fancy, but it’s effective.

For ops, we also care about different classes of work. Incidents are interrupts. Requests and small changes are standard flow. Bigger projects and migrations are special: they’re multi-step and compete with interrupts. Kanban handles all of these with explicit policies (e.g., “we always prioritize active incidents”) and a single board that shows reality instead of a fantasy backlog.

We’ll keep the mechanics lightweight. No ceremonies for ceremony’s sake. Short standups near the board, weekly replenishment to choose what’s “Ready,” and real conversations around policies when something hurts. The benefits are immediate and practical: fewer half-done tasks, shorter lead times, and less frantic context-switching. We’ll ship a steady stream of small wins while still making room for the big rocks that actually move the needle.

Design the Board: Mirror the Way Work Actually Moves

Let’s make a board that reflects our work instead of the way we wish it worked. Seven columns usually cover ops without overfitting: Intake, Triage, Ready, In Progress, Review, Blocked, Done. Intake is the inbox. Triage decides priority and class of service. Ready is our “commitment point” queue where options become obligations. In Progress is active work only. Review captures peer checks, security gates, or change approvals. Blocked is a quarantine for stuff that can’t move. Done is boring, which is the goal.

Each column gets a WIP limit that matches team capacity and pain tolerance. Triage is tight, because triaging is expensive and should be quick. In Progress is tighter relative to demand, to enforce finishing. Review is also limited, or else we’ll just pile work into someone else’s inbox. We use an expedite lane for true incidents with a one-at-a-time policy, and we track blocked items’ age to spot systemic issues (vendor delays, brittle environments, missing permissions).

If your tool supports config-as-code, a simple YAML helps enforce the basics:

board:
  columns:
    - name: Intake
      wip: 999
    - name: Triage
      wip: 2
    - name: Ready
      wip: 6
    - name: In Progress
      wip: 5
    - name: Review
      wip: 3
    - name: Blocked
      wip: 5
    - name: Done
      wip: 999  # effectively unlimited, same convention as Intake
policies:
  expedite:
    limit: 1
    service_level: "start-to-finish <= 24h"
  pull-criteria:
    ready-definition:
      - user story or ticket linked
      - env validated
      - rollback noted
      - reviewer available

We’ll tweak these as we gather data. The point is to track the real path, surface blockers, and stop pretending “in progress” includes everything we’ve glanced at this week. Spoiler: it doesn’t.
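If the tool can’t enforce limits itself, a few lines of script can at least report when a column is over. This is a minimal sketch, not any tool’s real API: the `WIP_LIMITS` table, the `wip_violations` helper, and the card IDs are all hypothetical, and `None` stands for “no cap.”

```python
from collections import Counter

# Hypothetical limits loosely mirroring the board config; None means "no cap".
WIP_LIMITS = {
    "Intake": None,
    "Triage": 2,
    "Ready": 6,
    "In Progress": 5,
    "Review": 3,
    "Blocked": 5,
    "Done": None,
}


def wip_violations(cards, limits=WIP_LIMITS):
    """Return {column: (count, limit)} for every column over its limit.

    `cards` is an iterable of (card_id, column_name) pairs.
    """
    counts = Counter(column for _, column in cards)
    return {
        column: (counts[column], limit)
        for column, limit in limits.items()
        if limit is not None and counts[column] > limit
    }


# Made-up cards: three items sitting in Triage against a limit of two.
cards = [
    ("OPS-1", "Triage"), ("OPS-2", "Triage"), ("OPS-3", "Triage"),
    ("OPS-4", "In Progress"), ("OPS-5", "Review"),
]
print(wip_violations(cards))  # {'Triage': (3, 2)}
```

A report like this is a conversation starter at standup, not a gate; the limit only works if the team agrees to act on it.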

Pick WIP Limits That Hurt a Little (On Purpose)

WIP limits should create helpful friction. Not agony, not theater; just enough sting to force finishing before starting. We can pick initial limits using a nice bit of queueing math: Little’s Law. It says that, for a reasonably stable system, average WIP = throughput × average lead time. If we know two of those, we can estimate the third. For example, say we finish 20 items per week with an average lead time of five working days (~1 week). That implies about 20 items of WIP across the whole system. With seven columns, we’re not giving “In Progress” 15 slots; we’re carving those 20 across Triage, Ready, In Progress, Review, and Blocked. Then we nudge down until it hurts just enough to improve flow. The formal statement is L = λW, where L is the average number of items in the system, λ is the average arrival rate, and W is the average time an item spends in the system.
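The arithmetic is small enough to keep in a scratch script. The sketch below restates the example in code; the function names are ours, and the “cap WIP at 15” scenario is an illustrative assumption that throughput holds steady, which real teams should verify rather than take for granted.

```python
def average_wip(throughput, lead_time):
    """Little's Law: average WIP = throughput x average lead time.

    Both arguments must use the same time unit (here: weeks).
    """
    return throughput * lead_time


def expected_lead_time(wip, throughput):
    """Little's Law rearranged: average lead time = average WIP / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# The example from the text: 20 items/week at ~1 week of lead time.
print(average_wip(20, 1.0))        # 20.0 items in the system on average

# Hypothetical: cap total WIP at 15 and assume throughput stays at 20/week.
print(expected_lead_time(15, 20))  # 0.75 weeks
```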

We’ll also scale WIP to the number of humans actually available. If three people are on-call and meetings eat half our day, “In Progress” being five might be optimistic. Try assigning one slot per active engineer plus one “team” slot for pairing, then hold the line. If someone asks to raise the limit, we ask what we’ll drop instead. Limits make the cost of context switching visible.

Finally, remember the expedite exception is not a backdoor. True incidents bypass Ready and In Progress limits, but we keep just one expedite slot. If it’s occupied, we swarm and finish it before starting another. That’s the social contract that keeps expedites rare and meaningful instead of the default.

Measure Flow Like Engineers: Lead Time, Age, Throughput

If we don’t measure flow, we’ll argue based on vibes. The basics are simple: lead time (from “entered Intake” to “Done”), cycle time (from “left Ready” to “Done”), throughput (items finished per time), and WIP age (how long an item has been in its current state). Flow efficiency—active time divided by lead time—helps spot handoff or approval slowness, but we can start with the first four.

We can scrape our tool and compute the numbers. GitHub issues? You can do a quick-and-dirty export with gh + jq. It won’t be perfect, but it’ll get us trends:

# gh returns only 30 issues by default; raise the cap to cover the history we want.
gh issue list --state all --limit 1000 --json number,labels,createdAt,closedAt \
  | jq '[.[] | {id: .number,
                type: (if (.labels|map(.name)|index("incident")) then "incident" else "work" end),
                created: .createdAt,
                closed: .closedAt}]'

If we’ve instrumented timestamps per column, even better. Load them into a tiny SQLite database or a spreadsheet, make a weekly chart, and watch for fat tails and stuck items. For a more structured approach, the open-source Four Keys project from Google gives a reference pipeline for change metrics, which can be adapted for flow metrics too.
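With per-column timestamps in hand, the four basic metrics are a few lines each. This is a sketch under assumptions: the field names (`intake`, `started`, `done`, `entered_state`) are hypothetical stand-ins for whatever your tool exports, `started` means “left Ready,” and throughput is simply the count of finished items in the sample.

```python
from datetime import datetime
from statistics import mean


def days_between(start, end):
    """Elapsed time between two datetimes, in fractional days."""
    return (end - start).total_seconds() / 86400


def flow_metrics(items, now):
    """Compute lead time, cycle time, throughput, and WIP age for a sample."""
    finished = [i for i in items if i["done"] is not None]
    open_items = [i for i in items if i["done"] is None]
    lead = [days_between(i["intake"], i["done"]) for i in finished]
    cycle = [days_between(i["started"], i["done"]) for i in finished]
    return {
        "avg_lead_days": mean(lead) if lead else None,
        "avg_cycle_days": mean(cycle) if cycle else None,
        "throughput": len(finished),
        # Age = time spent in the item's current column.
        "wip_age_days": {
            i["id"]: days_between(i["entered_state"], now) for i in open_items
        },
    }


# Made-up sample data, for illustration only.
items = [
    {"id": "OPS-1", "intake": datetime(2024, 3, 1, 9), "started": datetime(2024, 3, 2, 9),
     "done": datetime(2024, 3, 4, 9), "entered_state": datetime(2024, 3, 4, 9)},
    {"id": "OPS-2", "intake": datetime(2024, 3, 1, 9), "started": datetime(2024, 3, 4, 9),
     "done": datetime(2024, 3, 6, 9), "entered_state": datetime(2024, 3, 6, 9)},
    {"id": "OPS-3", "intake": datetime(2024, 3, 5, 9), "started": None,
     "done": None, "entered_state": datetime(2024, 3, 5, 9)},
]
metrics = flow_metrics(items, now=datetime(2024, 3, 8, 9))
print(metrics)
```

Averages hide the fat tails mentioned above, so once this works it is worth looking at the full distribution of lead times too.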

As we collect a few weeks of data, we’ll see the usual suspects: “Review” acting like a black hole, “Blocked” filling up with third-party approvals, and “Ready” growing into a secret backlog. That’s good news: we can fix what we can see. The rule of thumb: reduce WIP if lead time balloons, add reviewers if Review waits dominate, and kill zombie tickets older than twice your average lead time.
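The zombie rule is easy to automate once we know the average lead time. A minimal sketch, assuming we already have each open item’s age in days since Intake; the `zombie_ids` helper and the ticket IDs are hypothetical.

```python
def zombie_ids(open_ages_days, avg_lead_days, factor=2.0):
    """Return IDs of open items older than `factor` x the average lead time."""
    threshold = avg_lead_days * factor
    return sorted(
        card_id for card_id, age in open_ages_days.items() if age > threshold
    )


# Made-up ages in days since Intake; the average lead time here is 4 days,
# so the threshold is 8 days and only strictly older items are flagged.
open_ages = {"OPS-7": 3.0, "OPS-9": 9.5, "OPS-12": 8.0}
print(zombie_ids(open_ages, avg_lead_days=4.0))  # ['OPS-9']
```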

Make Policies Explicit So Tuesday You Still Agree

Kanban without explicit policies turns into “we’ll remember.” We won’t. Let’s write policies that humans can follow even on a Tuesday afternoon after two incidents and one lukewarm coffee. Examples that help:

Pull criteria for “Ready”: the work is clear, dependencies are settled, rollback plan drafted, reviewer available, and any change process is known. This prevents “surprise approvals” that stall at Review.
Definition of Done: deployed, logs clean, monitors green, docs updated, and ticket closed. If the runbook isn’t updated, it’s not Done.
Expedite rules: one slot, incidents only, clock starts immediately, stop-the-world swarm. No “my boss pinged me” exceptions.
Blocker protocol: if an item sits blocked for 24 hours, we page the owner of the dependency or escalate. The longer it sits, the more we shine a light on it.
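Policies like the pull criteria are easier to follow when a script can answer “what’s missing?” A hedged sketch: the criterion keys below are our own shorthand for the pull criteria listed above, and the card is a plain dictionary rather than any tracker’s real data model.

```python
# Hypothetical keys, one per pull criterion for "Ready".
READY_CRITERIA = (
    "work_clear",
    "dependencies_settled",
    "rollback_drafted",
    "reviewer_available",
    "change_process_known",
)


def missing_ready_criteria(card):
    """Return the pull criteria a card has not met yet (empty list = Ready)."""
    return [name for name in READY_CRITERIA if not card.get(name)]


# Made-up card: clear and unblocked, but no rollback plan or reviewer yet.
card = {
    "id": "OPS-21",
    "work_clear": True,
    "dependencies_settled": True,
    "change_process_known": True,
}
print(missing_ready_criteria(card))  # ['rollback_drafted', 'reviewer_available']
```

The same pattern works for the Definition of Done; the point is that the checklist lives somewhere other than our memory.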

To reduce toil, codify some of this in code and docs. For example, a CODEOWNERS file automatically requests reviews from the right people for sensitive areas (and, combined with branch protection, can require their approval) without debates in chat. And if a task recurs more than a handful of times, it’s a candidate for automation or deletion. The SRE book’s guidance on eliminating toil pairs nicely with kanban’s “finish what you start” spirit. We’re not trying to be rigid; we’re trying to make the invisible costs visible so we can choose wisely and consistently.

Handle Incidents, Projects, and Fixed Dates Without Melting

Mixing interrupts and projects on one board can feel chaotic, but it’s better than hiding work in separate tools. We use classes of service and lanes to make the trade-offs explicit. Incidents get the Expedite lane, limited to one active item; they bypass Ready but still pass through Review if the change is risky—pairing helps here. Standard work flows through the main lane. Fixed-date items get tagged and gently pulled earlier to hit the date without last-minute panic.

For projects, break them into thin slices that can move through the same columns as standard work. A dozen cards that ship weekly beats one “epic” that blocks Review for a month. If we can’t slice, at least carve clear integration points: decisions, approvals, environment readiness, and test data. Those become explicit cards with their own acceptance criteria, so they don’t become lurking blockers.

We also reserve capacity. If incidents average 30% of our time, pretending we can run at 100% project throughput is self-sabotage. We set Ready’s limit to protect a slice of capacity for standard work while leaving slack for interrupts. That slack isn’t waste; it’s our buffer against variability. If interrupts are quiet, we pull more standard work. If they spike, we stay inside limits rather than blowing up the system.
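The reservation is simple arithmetic, but writing it down stops us from quietly planning at 100%. In this sketch the team size and focus hours are invented numbers, and `planned_capacity` is a hypothetical helper; only the 30% interrupt share comes from the example above.

```python
def planned_capacity(engineers, focus_hours_per_week, interrupt_share):
    """Hours per week left for planned work after reserving interrupt slack."""
    if not 0 <= interrupt_share < 1:
        raise ValueError("interrupt_share must be in [0, 1)")
    total_hours = engineers * focus_hours_per_week
    return total_hours * (1 - interrupt_share)


# Hypothetical team: 4 engineers with ~30 focus hours each, 30% lost to incidents.
hours = planned_capacity(engineers=4, focus_hours_per_week=30, interrupt_share=0.30)
print(round(hours, 1))  # 84.0
```

If the measured interrupt share drifts over a few weeks, that is the signal to revisit Ready’s limit rather than to absorb the difference with overtime.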

Finally, track blocked-age and expedite frequency. If expedites are common, we either misclassify work or we’re under-staffed. If blocked-age grows, we have a dependency problem—fix the upstream process, not just the card.

Automate the Boring Guardrails: Bots, Checks, Nudges

We don’t need a PM police squad to enforce WIP. A couple of guardrails keep us honest without nagging. Start with a Slack/Teams bot that posts daily WIP status and highlights the oldest item in each column. Surface the pain where we talk. Add a scheduled script that comments on any item in Review for more than two days. Keep it friendly, not punitive.
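The message itself is the easy part. Below is a sketch of building the daily digest text; it deliberately stops short of calling any chat API, and the card fields (`id`, `column`, `entered`), the `wip_digest` helper, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def wip_digest(cards, limits, now, review_nag=timedelta(days=2)):
    """Build a plain-text WIP summary plus friendly nudges for stale reviews."""
    by_column = {}
    for card in cards:
        by_column.setdefault(card["column"], []).append(card)

    lines = []
    for column, limit in limits.items():
        in_column = by_column.get(column, [])
        if not in_column:
            continue
        # The oldest item is the one that entered the column first.
        oldest = min(in_column, key=lambda c: c["entered"])
        age_days = (now - oldest["entered"]).days
        lines.append(
            f"{column}: {len(in_column)}/{limit} (oldest: {oldest['id']}, {age_days}d)"
        )

    for card in by_column.get("Review", []):
        if now - card["entered"] > review_nag:
            lines.append(f"Nudge: {card['id']} has waited in Review for more than 2 days")
    return "\n".join(lines)


# Made-up board state for illustration.
now = datetime(2024, 3, 8, 9)
cards = [
    {"id": "OPS-4", "column": "In Progress", "entered": datetime(2024, 3, 7, 9)},
    {"id": "OPS-5", "column": "Review", "entered": datetime(2024, 3, 5, 9)},
    {"id": "OPS-6", "column": "Review", "entered": datetime(2024, 3, 7, 15)},
]
digest = wip_digest(cards, {"In Progress": 5, "Review": 3}, now)
print(digest)
```

Posting the returned string is then a matter of whatever webhook or bot framework the team already uses.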

We can also push back on starting new work when WIP is exceeded. On GitHub, an action can’t prevent a PR from being opened, but a small one can fail a check on a new PR if the author already has too many “In Progress” items. It’s not perfect, but it nudges behavior in the right direction without manual gatekeeping. Example:

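name: WIP Guard
on:
  pull_request:
    types: [opened, ready_for_review]
# Least privilege: the job only reads issues (and the repo, for checkout).
permissions:
  contents: read
  issues: read
jobs:
  check-wip:
    runs-on: ubuntu-latest
    steps:
      # Checkout gives gh a repository context to query.
      - uses: actions/checkout@v4
      - name: Count In-Progress Issues
        run: |
          # Count open issues assigned to the PR author that carry the "In Progress" label.
          COUNT=$(gh issue list --assignee "$ACTOR" --label "In Progress" --state open --limit 100 --json number | jq 'length')
          if [ "$COUNT" -gt 2 ]; then
            echo "::error::WIP limit exceeded (${COUNT} > 2). Finish work before starting new."
            exit 1
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass the actor through the environment instead of inlining it into the script.
          ACTOR: ${{ github.actor }}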

It’s worth reading the docs for the platform you use so you don’t reinvent the wheel. For GitHub, the starting point is the Actions documentation. Treat these automations like guardrails on a curvy road: they won’t drive the car, but they’ll keep us out of the ditch.

What We’ll Notice After 60 Days: Boring Speed

The first two weeks feel awkward. We’ll hit a WIP limit and stare at a blocked card, tempted to sneak one more thing into “In Progress.” Resist. Swarm, unblock, or drop something. By week three, the board starts to tell a story: Review is our bottleneck, or Triage is eating time, or expedites are too common. We tweak limits, adjust policies, and keep measuring. And then something quiet happens: the noise drops.

After about 60 days, lead time stabilizes. Queue length (WIP) becomes predictable. We’re still interrupted (because ops), but we recover faster, and projects actually finish. The team’s stress drops because the invisible queues have become visible, limited ones. Standups stop being status theater; they become short, focused conversations about flow and blockers. Engineers feel safe to say, “I’m not pulling new work; I’m at my limit.” That’s not laziness; that’s professionalism.

We won’t need a manifesto to defend this; the metrics will do it for us. Throughput can rise 15–30% just by capping WIP and smoothing Review, and lead time tends to shrink with it. On-call will get less gnarly because fewer half-done changes are floating around. And when leadership asks why things feel faster, we can point to a board that matches reality, a few well-chosen policies, and a habit of finishing before starting. It’s not magic. It’s just flow made visible, and a team disciplined enough to keep it that way.