Ship Faster With 7 Oddly Specific DevOps Habits
Practical playbook to cut lead time without breaking on-call sleep.
1) Start With Three Numbers That Matter
Let’s start with a confession: most teams measure the wrong things or measure the right things in the wrong places. We’ve found that picking three numbers and putting them where everyone can see them beats any 40-page dashboard. If we had to choose only three, we’d pick lead time for changes, change failure rate, and error-budget burn. That gives us a healthy mix of speed, safety, and reliability. We pair those with a weekly cadence: every Friday, we look at the trendlines and ask, “What did we try this week, and what moved?” For many teams, just seeing deploy frequency next to failure rate sparks smarter conversations—like splitting one risky weekly deploy into multiple bite-sized ones, or making a rollback script part of “done.”
We keep the numbers boring and comparable. For lead time, we clock from first commit on the main branch to production deploy. For failure rate, we count incidents that require a rollback, hotfix, or customer-facing mitigation. For error-budget burn, we track SLO miss hours. The target isn’t perfection; it’s sustainable improvement. A 10–25% improvement quarter over quarter compounds faster than heroics. If you’re looking for starter references for the why and how, the research behind the four key metrics at DORA is excellent, and the Reliability pillar in the AWS Well-Architected guide provides practical guardrails. Once you’ve chosen the three numbers, don’t bury them—pin them in chat, post them in the repo README, and review them during standup. When the scoreboard is visible, we naturally play a better game.
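If you want to keep the definitions honest, compute them from raw deploy records rather than a dashboard you can't inspect. Here's a minimal sketch, assuming you can export each change's first-commit time, deploy time, and whether it needed a rollback or hotfix; the field names are ours, not any particular tool's:

from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deploy:
    first_commit_at: datetime  # first commit on the main branch
    deployed_at: datetime      # production deploy time
    failed: bool               # needed a rollback, hotfix, or customer-facing mitigation

def lead_time_hours(deploys: list[Deploy]) -> float:
    """Median hours from first commit on main to production deploy."""
    return median((d.deployed_at - d.first_commit_at) / timedelta(hours=1) for d in deploys)

def change_failure_rate(deploys: list[Deploy]) -> float:
    """Share of deploys that needed a rollback, hotfix, or mitigation."""
    return sum(d.failed for d in deploys) / len(deploys)

def error_budget_burn(slo_miss_hours: float, budget_hours: float) -> float:
    """Fraction of this month's SLO-miss budget already spent."""
    return slo_miss_hours / budget_hours

deploys = [
    Deploy(datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 30), failed=False),
    Deploy(datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 11, 0), failed=True),
]
print(f"lead time (h):       {lead_time_hours(deploys):.1f}")
print(f"change failure rate: {change_failure_rate(deploys):.0%}")
print(f"error-budget burn:   {error_budget_burn(6, 10):.0%}")

Run it over the last 30 days of deploys each Friday and paste the three lines into chat; that's the whole scoreboard.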
2) Define “Done” in Code, Not Slides
PowerPoint can lie; your repo can’t. If “it works on my machine” is still a common refrain, we’ve left too much to human memory. We make “done” executable. Concretely, we put a Makefile (or a tiny task runner) in every repo so anyone—developer, SRE, or manager who knows just enough to be dangerous—can run the same steps locally and in CI. The pattern is simple: a single entry point to lint, test, build, and package. That becomes the contract for the pipeline. When a team asks, “What does success look like?” we point at the target list instead of a meeting note.
Here’s a minimal example we drop into new services:
# Single entry point for lint/test/build/package; CI calls these same targets.
.PHONY: all lint test build package clean

all: lint test build

lint:
	@echo "Linting..."
	flake8 src tests

test:
	@echo "Running tests..."
	pytest -q --maxfail=1 --disable-warnings

build:
	@echo "Building wheel..."
	python -m build

package:
	@echo "Creating container..."
	docker build -t $${IMAGE_NAME:?set IMAGE_NAME} .

clean:
	@echo "Cleaning..."
	rm -rf dist build .pytest_cache
We keep it small and predictable—no clever bash puzzles. Every new requirement graduates into a target: make audit for dependencies, make sbom for a software bill of materials, make migrate for database changes. Once “done” lives here, CI is just a thin wrapper, and drift between laptops and robots evaporates. Bonus: this doubles as documentation that stays current because it actually runs.
3) Build a Paved CI Path New Repos Adopt in 10 Minutes
Pipelines shouldn’t feel like bespoke furniture. We keep a single “paved path” workflow that most repos can adopt unchanged. The trick is to keep it boring, fast, and self-explanatory. Boring means a sane default: lint, test, build, and publish on main; test on pull requests; cache aggressively; and fail clearly. Fast means smart caching and parallel jobs. Self-explanatory means the pipeline tells you what to do next, not just that you did it wrong. When a team deviates, they do it consciously and document why. Most of the time, they come back to the path once they see the maintenance cost of custom tweaks.
A compact GitHub Actions starter we’ve used:
name: ci

on:
  pull_request:
  push:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [ "3.10", "3.11" ]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Cache deps
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt') }}
      - run: pip install -r requirements-dev.txt
      - run: make lint test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build
We complement this with branch protection and required checks, so the path enforces itself. For details and deeper knobs, the GitHub Actions docs are helpful. Keep the starter in a template repo; new services clone it and go. Ten minutes later, they have a green pipeline and a PR building confidence instead of anxiety.
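If you want the paved path to police itself, a small script in the template repo can confirm that main actually requires those checks. Here's a sketch against the GitHub REST API's branch-protection endpoint; the owner, repo, and expected check names are placeholders, and it assumes a token with permission to read protection settings:

import os
import requests  # third-party: pip install requests

OWNER, REPO = "your-org", "your-service"            # placeholders
EXPECTED = {"test (3.10)", "test (3.11)", "build"}  # adjust to the check names your workflow reports
TOKEN = os.environ["GITHUB_TOKEN"]                  # short-lived credential from CI, not a PAT in a file

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/main/protection",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    timeout=10,
)
resp.raise_for_status()

contexts = set((resp.json().get("required_status_checks") or {}).get("contexts") or [])
missing = EXPECTED - contexts
if missing:
    raise SystemExit(f"main is missing required checks: {sorted(missing)}")
print("branch protection matches the paved path")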
4) Keep Environments Boring With Declarative IaC
If you’ve ever SSH’d into a server to “just tweak a thing,” you’ve created a ghost. It’ll haunt you during the next incident. We aim for boring environments: no pets, no snowflakes, no sacred servers. Everything is declarative, versioned, and reviewed. When something needs to change, we change code, not machines. That discipline pays back during outages, audits, and onboarding. New folks learn by reading the repo, not guessing which wiki page is current. Our rule of thumb: if your environment can’t be recreated from scratch by the pipeline, it’s not done.
Here’s a tiny Terraform sketch that shows the spirit:
terraform {
  required_version = ">= 1.5.0"
}

provider "aws" {
  region = var.region
}

variable "region" {
  type    = string
  default = "us-east-1"
}

variable "env" {
  type    = string
  default = "dev"
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "${var.env}-artifacts-bucket"

  tags = {
    env   = var.env
    owner = "platform"
  }
}
Simple pieces, easy to read, and tagged for visibility. We pair this with drift detection and a weekly “no-code” change report. If something changes outside Terraform, we notice and fix the gap. Review IaC PRs like application code—clear naming, small diffs, and rollback plans. The effect is subtle but powerful: deploys get less exciting because the system itself is predictable. And yes, we still keep a break-glass option, but we treat it like a fire extinguisher—useful, inspected, and rarely needed.
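The drift check itself can be a ten-line job wrapped around terraform plan. A sketch, assuming terraform init has already run and the runner has read-only credentials:

import subprocess
import sys

# Exit codes for `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = the plan found changes (drift or unapplied code)
result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("No drift: live infrastructure matches the code.")
elif result.returncode == 2:
    print("Drift detected; review the plan below and open a fix-forward PR:")
    print(result.stdout)
    sys.exit(2)
else:
    print(result.stderr)
    sys.exit(1)

Schedule it weekly, post the output to chat, and the "no-code change report" writes itself.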
5) Make Observability Non-Negotiable: SLOs and Traces From Day One
A release isn’t done until we can see it breathing. We bake observability in before the first customer ever sees the service. That means three things: usable logs, metrics with labels that match our domain (not just infrastructure), and distributed traces. On top of those, we define one or two Service Level Objectives with clear SLIs—usually success rate and latency. We like budgeting in hours per month rather than percentages; it’s easier for humans to reason about. With an SLO, we can say, “We’ve burned 6 of 10 hours; we’re pausing risky deploys” instead of “We’re at 99.2%” (which nobody interprets the same way).
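The hours framing is just arithmetic, so we keep it in a tiny shared helper. A sketch, assuming a 30-day month and an availability-style SLO; the targets shown are examples:

HOURS_PER_MONTH = 30 * 24  # assume a 30-day month for round numbers

def monthly_budget_hours(slo_target: float) -> float:
    """How many SLO-miss hours a given target leaves you per month."""
    return (1 - slo_target) * HOURS_PER_MONTH

# A 98.6% availability SLO leaves roughly 10 hours a month, which is why
# "we've burned 6 of 10 hours" is a sentence anyone can act on.
print(f"{monthly_budget_hours(0.986):.1f} h/month at 98.6%")
print(f"{monthly_budget_hours(0.999):.1f} h/month at 99.9%")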
For instrumentation, we standardize on one library and one export path to keep noise down. These days, that often means OpenTelemetry, which plays nicely with most languages and backends. We make the client libraries part of our starter template, so new services emit traces and metrics without extra work. Alerts ride on SLOs and error budget, not CPU spikes. During incidents, traces show us whether a symptom is between services or inside one function. After incidents, we add one unit test and one observability check—maybe a trace span around a hairy code path or a metric for a queue depth. Over time, the telemetry mirrors the shape of the system. And when telemetry mirrors the system, debugging is a conversation with data, not a séance with logs.
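As a sketch of what "one library, one export path" looks like in practice, here is the OpenTelemetry API shape we mean in Python. Exporter wiring is assumed to come from the shared starter template (with no SDK configured, these calls fall back to no-ops, so the snippet still runs), and the service, span, and metric names are illustrative:

from opentelemetry import trace, metrics

# SDK/exporter setup is assumed to live in the shared template, not here.
tracer = trace.get_tracer("payments-service")  # illustrative service name
meter = metrics.get_meter("payments-service")

queue_depth = meter.create_up_down_counter(
    "work_queue_depth",
    description="Items waiting in the ingest queue",
)

def handle_batch(items: list[dict]) -> None:
    # Trace the hairy code path so incident reviews show where the time went.
    with tracer.start_as_current_span("handle_batch") as span:
        span.set_attribute("batch.size", len(items))
        queue_depth.add(len(items))
        try:
            for item in items:
                process(item)
        finally:
            queue_depth.add(-len(items))

def process(item: dict) -> None:
    ...  # domain logic goes here

handle_batch([{"id": 1}, {"id": 2}])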
6) Ship Safely With Supply Chain Guardrails
We don’t need a security panic; we need guardrails that everyone quietly follows. Our goal is simple: prove what we built, prove what it includes, and prove it hasn’t changed since we signed it. We integrate three low-friction steps: dependency scanning on every PR, a Software Bill of Materials (SBOM) at build time, and signed artifacts. We fail builds for known critical issues, and we keep suppressions tight and time-bound. For containers, we pin base images and rebuild them regularly so we aren’t carrying old bugs forward forever. For languages, we use lockfiles and renovate bots rather than heroic manual upgrades twice a year.
The guidance we align to is boring and clear. The SLSA levels help us reason about provenance and tamper-evidence without reinventing a process. We sign images and manifests (e.g., with cosign) and verify them before deploy. Our CI system is locked down: ephemeral runners, short-lived credentials, and no secret sprawl. We treat our pipeline definitions like production code; changes get reviews and logs. We also practice our response plan: when a hot CVE hits, we know who decides, how we patch, and what we roll back if needed. None of this turns engineers into security analysts—it just makes the secure path the easy path. And once it’s the easy path, it’s the path people take when the pager is buzzing at 2 a.m.
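To make "verify before deploy" concrete, here's a sketch of a pre-deploy gate. It assumes cosign is installed on the runner, that images were signed with the key pair referenced here, and that the image reference is a placeholder; keyless signing would swap the --key flag for identity flags:

import subprocess
import sys

IMAGE = "registry.example.com/payments-service:1.4.2"  # placeholder image reference

# Refuse to deploy anything whose signature does not verify against our key.
result = subprocess.run(
    ["cosign", "verify", "--key", "cosign.pub", IMAGE],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print(result.stderr)
    sys.exit(f"signature check failed for {IMAGE}; refusing to deploy")

print(f"{IMAGE} is signed by our key; proceeding with deploy")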
7) Run Incident Reviews That Change Code, Not Feelings
We’ve seen reviews that feel like courtroom drama and deliver nothing but stress. We prefer a pattern that treats incidents like rigorous, kind code reviews. The inputs are facts: a timeline, impact, and evidence. The outputs are changes to code, tests, or runbooks—with owners and dates. We start with a neutral summary: what failed, what customers felt, what we did. Then we ask, “What made this easy to miss?” and “What made it hard to recover?” That frames better fixes: earlier detection, safer deploys, and faster rollback. If your answer requires superhuman attention, it’s not a fix; it’s an apology to future you.
We also rewrite the first 15 minutes. Could we detect earlier? Could we route the alert to the team that can act? Could kubectl rollout undo or a feature flag have sliced the blast radius? Every review ends with a small set of actions a human can actually do in a week or two: a canary rollout for a flaky service, a pre-flight check in CI, a load-shed limit to protect a dependency. We track these in the same backlog as product work and close the loop in a retro. If a fix doesn’t land, we talk about capacity, not blame. Over months, the culture shifts: incidents become a source of engineering work we’re proud to ship, not stories we’re trying to forget.
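For flavor, here's the kind of small, shippable fix we mean by a load-shed limit: a sketch of a concurrency cap that rejects extra work instead of letting one slow dependency drag the whole service down. The cap and names are illustrative:

import threading
from contextlib import contextmanager

class LoadShedError(RuntimeError):
    """Raised when the service is over its concurrency cap."""

class LoadShedLimiter:
    """Reject new work once max_in_flight requests are already running."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        if not self._slots.acquire(blocking=False):
            raise LoadShedError("over capacity; shedding load")
        try:
            yield
        finally:
            self._slots.release()

# Illustrative cap protecting a fragile downstream dependency.
limiter = LoadShedLimiter(max_in_flight=50)

def call_dependency(payload: dict) -> None:
    with limiter.slot():
        ...  # the real call goes here; callers catch LoadShedError and return 503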