Ship Calmly: Make Agile Deliver in 7 Days
Practical patterns for faster flow, safer releases, and saner teams.
Stop Worshipping Ceremonies: Optimize Flow, Not Ritual
We’ve all sat through standups that felt like reciting the weather. Agile isn’t a calendar full of ceremonies—it’s the discipline to reduce the time between a good idea and a safe, running change. If we want seven-day lead time, we start by attacking waiting, not working. Flow efficiency equals hands-on time divided by elapsed time. Most teams we meet are shocked to discover their flow efficiency is under 15%; at 10%, work waits in a queue nine hours for every hour a human touches it. That delay is where quality decays, context is lost, and morale goes to die. So we measure the queues: code review waiting time, staging environment contention, test pipeline bottlenecks, change approval board delays. Then we carve them down. A ten-minute PR review SLA beats a perfect code review done tomorrow. Self-service staging environments beat calendar-Tetris. Automated tests that run on every push beat heroics on Friday nights. If you enjoy numbers (we do), keep one truth near the keyboard: shortening lead time tends to correlate with higher deployment frequency and lower change failure rate—those DORA signals aren’t magic, they’re physics. The DORA research has said this for years; it’s not a vibe, it’s data. When we deliberately reduce queues, our throughput goes up without adding headcount, and defects drop without adding more process. Agile becomes visible as a system: small batches flowing quickly through stable, boring lanes. That’s the kind of boring that makes releases exciting.
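To make the measurement concrete, here is a minimal sketch in Python of that flow-efficiency arithmetic; the WorkItem fields are hypothetical stand-ins for whatever your tracker exports:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class WorkItem:
    started: datetime      # first hands-on touch
    finished: datetime     # running safely in production
    active_hours: float    # total hands-on time, summed from the status history

def elapsed_hours(item: WorkItem) -> float:
    return (item.finished - item.started).total_seconds() / 3600

def flow_efficiency(items: list[WorkItem]) -> float:
    """Hands-on time divided by elapsed time, across a batch of items."""
    active = sum(i.active_hours for i in items)
    elapsed = sum(elapsed_hours(i) for i in items)
    return active / elapsed if elapsed else 0.0

def queue_hours(items: list[WorkItem]) -> float:
    """Everything that isn't hands-on time is a queue somewhere: review, staging, approvals."""
    return sum(elapsed_hours(i) for i in items) - sum(i.active_hours for i in items)
At 10% flow efficiency, queue_hours comes out to roughly nine times the active hours, which is exactly the waste worth attacking first.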
Slice Work to Fit the Calendar, Not the Fantasy
If we want changes moving in under a week, the work itself must fit inside a week. That means vertical slices that compile, not horizontal slices that need a committee to integrate. We aim for “one user-visible capability” per slice: a minimal path from UI or API, through business logic, to storage, protected by tests and feature flags. If it can’t go behind a flag, it’s not a thin slice; it’s a preview of next quarter’s incident. We keep acceptance criteria painfully clear and binary—no kabuki theatre around “done.” And we ruthlessly separate sequencing from coupling. Just because A logically precedes B doesn’t mean A must block B. We often spike a thin tracer through the stack (an endpoint that returns a hard-coded value, a background job that writes a no-op log), then backfill behavior incrementally. This lets us build dark, integrate early, and keep end-to-end tests alive. Feature flags give us the safety net to ship incomplete work without incomplete outcomes. We prefer short-lived flags and cleanup tasks baked into the definition of done—stale flags become unmaintained forks of production behavior. And we avoid “partial” PRs that break test expectations. If it doesn’t pass CI, it doesn’t land. You’ll find that splitting by capability rather than by component uncovers accidental complexity—often a signal to simplify interfaces. It’s not more work to slice small; it’s less rework, because you get feedback while the code is still warm.
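Here is what a tracer slice can look like in practice: a minimal sketch assuming a Flask service and a hypothetical in-memory flag store (in real life the flag lives in your feature-flag service), where the endpoint ships dark and returns a hard-coded value until later slices backfill behavior:
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical flag store; swap in your real feature-flag service.
FLAGS = {"search_suggestions": False}

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)

@app.route("/search/suggestions")
def search_suggestions():
    # Tracer: the route, its tests, and its deploy path exist from day one.
    if not flag_enabled("search_suggestions"):
        return jsonify(suggestions=[]), 200  # dark launch: safe, empty response
    # Later slices backfill real behavior here, behind the same flag.
    return jsonify(suggestions=["placeholder"]), 200
The slice is small, but it is end-to-end: it compiles, deploys, and keeps CI honest while the real logic arrives incrementally.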
Trunk-Based Development You Can Actually Live With
Trunk-based development is less religion, more housekeeping. We keep one long-lived branch—main—and keep changesets small enough to review and ship quickly. Short-lived feature/* branches are fine if they live hours, not weeks. The goal is to never be more than a few commits away from production. That means continuous integration in the literal sense: integrate early and often, resist “big bang” merges, and let automation police the gate. We standardize a fast review checklist: scope clear, tests present, flag strategy visible, rollbacks obvious. We also cap PR size; if you need a movie and a sandwich to review it, it’s too big. Conflicts melt away when we reduce branch lifetimes. Yes, you’ll still have rough edges—UI work you can’t show yet, migrations that need sequencing—but those fit behind toggles and phased rollouts better than they fit in a long-lived branch. As a practical starter kit:
# Create thin, short-lived branch
git fetch origin && git checkout -b feature/search-suggestions origin/main
# Keep current, rebase small diffs
git fetch origin && git rebase origin/main
# Push early, build runs on every push
git push -u origin feature/search-suggestions
# Merge fast-forward via CI gate, delete branch
git checkout main && git pull --ff-only && git merge --ff-only feature/search-suggestions
git push origin main && git push origin --delete feature/search-suggestions
If you want deeper rationale, the community’s collected wisdom at trunkbaseddevelopment.com is a treasure trove. In practice, this isn’t about purity; it’s about reducing inventory and surfacing integration issues while they’re still cheap.
CI/CD Paved Road: Fast, Boring, and Reproducible
Our pipeline’s job is to make agile real, not theatrical. If every push runs a reliable, fast build; if every merge deploys to a production-like environment; and if production deploys are one click (or zero), then the team can move without whispering a prayer into the keyboard. Speed comes from parallelization, caching, and avoiding unnecessary work. Safety comes from automated checks—static analysis, unit and contract tests, smoke tests, and a production gate controlled by SLO health and error budgets. A simple paved road for a service should look like this:
name: service-ci
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  build-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
      - run: npm ci
      - run: npm test -- --reporter=junit
  deploy-staging:
    needs: build-test
    if: github.ref == 'refs/heads/main' && needs.build-test.result == 'success'
    runs-on: ubuntu-latest
    steps:
      # check out the repo so ./scripts/deploy exists on the runner
      - uses: actions/checkout@v4
      - run: ./scripts/deploy staging
We keep tests speedy by isolating flaky ones and pushing slow integration tests behind tags that still run on merges. We make deploy scripts idempotent, so re-running is boring. When we need more knobs, we add them thoughtfully and document them. Boilerplate should live in a template repo, not in Slack lore. If you’re wiring this up, the GitHub Actions workflow syntax doc is a solid, concise reference.
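As a sketch of what idempotent means here, assuming a hypothetical ./scripts/current-version helper and a deploy script that accepts an environment plus a version:
import subprocess
import sys

def current_version(env: str) -> str:
    # Hypothetical helper: asks the environment what is already running.
    out = subprocess.run(["./scripts/current-version", env],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def deploy(env: str, version: str) -> None:
    # Idempotent: re-running with the same version is a no-op, so retries stay boring.
    if current_version(env) == version:
        print(f"{env} already at {version}; nothing to do")
        return
    subprocess.run(["./scripts/deploy", env, version], check=True)

if __name__ == "__main__":
    deploy(sys.argv[1], sys.argv[2])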
SLOs and Error Budgets: The Guardrails for Agile Speed
Shipping faster is only useful if users are happy. That’s where service level objectives come in. We define clear SLIs—latency, availability, correctness—and target SLOs that reflect user tolerance, not our vanity. A common pattern: 99.9% availability measured over 30 days at the request level, excluding client-induced errors. The math is simple but sobering: at 99.9%, your budget is roughly 43 minutes of allowed unavailability per month. We track consumption of that budget in near-real time; when it’s depleted, we treat feature work as negotiable and reliability work as mandatory until the budget recovers. That’s not punishment; it’s how we avoid death by a thousand paper cuts. For teams starting out, we like the approach in the SRE Workbook’s SLO chapter: begin with a few meaningful SLIs, validate them against user journeys, and iterate. We wire alerts to “page only on burn” rather than absolute thresholds, so we page humans when the budget is at risk, not simply when a metric twitches. We also link deploys to budget health—if latency SLO is red, we require a smoke test to pass in production after deploy, with a rollback trigger on failure. And we keep SLOs visible: dashboards on TVs, error-budget burn-down charts in retro, and a one-line SLO summary in each service’s README. Agile without SLOs is like a fast car without guardrails—you can go fast, once.
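The budget arithmetic fits in a few lines; here is a minimal Python sketch of the two numbers we actually watch, the budget itself and the burn rate:
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed unavailability for the window: 99.9% over 30 days is ~43.2 minutes."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction: float, slo: float) -> float:
    """How fast we're spending budget; 1.0 means on pace to spend exactly all of it."""
    return bad_fraction / (1.0 - slo)

# Example: 99.9% SLO, and 0.5% of requests failed over the last hour.
print(error_budget_minutes(0.999))   # ~43.2 minutes per 30-day window
print(burn_rate(0.005, 0.999))       # 5.0: at this pace the budget is gone in ~6 days
Paging on a sustained burn rate above some multiple, rather than on a single metric twitch, is what “page only on burn” means in practice.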
Plan With Capacity: Forecasting That Survives Reality
We love ambition; we distrust fantasy. Planning with capacity means we fit work to throughput, not the other way around. If our three-sprint average is eight completed slices per sprint with a standard deviation of two, we don’t promise 20 next sprint because a slide deck said so. We set WIP limits so we finish what we start, and we keep cycle time tight by minimizing handoffs. Instead of timeboxing everything, we use classes of service: expedite for genuine customer pain, standard for normal work, and fixed-date for externally committed items—with explicit policies for each. When stakeholders ask “when will it be done,” we reach for probabilistic forecasts. A quick-and-dirty Monte Carlo using past cycle times can answer, “there’s an 85% chance of finishing these six slices within 10 working days.” That’s way more honest than a single date written in ink. Cumulative flow diagrams show us where queues are forming; if “in review” is a mountain, we fix review, not add standups. We also plan for slack—literal, scheduled slack—to pay down small risks and clean up flags and tests. It’s not wasted time; it’s preserving future velocity. Finally, we ruthlessly cut scope to hit dates without cutting quality. If we can deliver a trimmed, coherent capability behind a rollout flag before a major event, we do that and sleep. Our sprint ritual becomes simple: pick the slices that fit capacity, slice any that don’t, then get out of the way.
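A quick-and-dirty version of that forecast is genuinely small; here is a sketch in Python that resamples past cycle times (it assumes slices finish one after another, so it is deliberately conservative for a whole team):
import random

def forecast_days(past_cycle_times: list[float], remaining_items: int,
                  trials: int = 10_000, confidence: float = 0.85) -> float:
    """Monte Carlo: resample historical cycle times (in working days) and report
    the duration we beat in `confidence` of the simulated futures."""
    totals = sorted(
        sum(random.choice(past_cycle_times) for _ in range(remaining_items))
        for _ in range(trials)
    )
    return totals[int(confidence * trials) - 1]

# Example: recent cycle times in working days, six slices left in the plan.
history = [1.5, 2.0, 0.5, 3.0, 1.0, 2.5, 1.0, 4.0, 1.5, 2.0]
print(forecast_days(history, remaining_items=6))  # "85% chance within N working days"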
Operate Calmly: Incidents Without the Adrenaline Hangover
Incidents will happen. Our job is to make them small, rare, and educational—not dramatic. We start with crisp definitions, severity levels, and a simple escalation path. We use a single incident channel, a shared timeline, and clear roles: incident commander, communications, and subject matter owners. We keep remediation humble: runbooks close to the keyboard, not buried in a wiki maze. A lightweight taxonomy helps keep heads cool:
severity:
  SEV1: "Critical user impact; widespread outage; page immediately; 24/7 response"
  SEV2: "Major functionality degraded; high impact; page during business hours"
  SEV3: "Minor impact or workaround available; track; no page"
roles:
  commander: "Single decision-maker; coordinates and delegates"
  comms: "External and internal updates; status page and stakeholder pings"
  ops: "Executes mitigations and runs diagnostics"
cadence:
  updates: "Every 15 minutes for SEV1, 30 minutes for SEV2"
After the fire’s out, we do a blameless postmortem within five business days. We document what happened, what surprised us, what we’ll change in code, tests, and process, and who owns those changes. No public shaming, no “human error” as a root cause. The SRE guidance on postmortem culture is still the gold standard: end-to-end transparency, clear actions, and learning over punishment. We track actions as normal backlog items with owners and due dates; incident follow-ups don’t get a magical “someday” tag. Agile isn’t just fast delivery—it’s fast recovery and faster learning. Quiet operations is a competitive advantage, and it’s the dividend of disciplined engineering.