Devops That Actually Works On Mondays
Practical habits, sane tooling, and fewer “who broke prod?” moments
Start With The Boring Goal: Fewer Surprises
In devops, we’re tempted to start with tools because tools feel like progress. But the only goal that matters on a sleepy Monday morning is fewer surprises: fewer late-night pages, fewer “it worked on my machine” debates, fewer mystery deploys, fewer heroic recoveries that become a personality trait. If we can reduce surprise, we get speed and stability as side effects.
We do that by tightening the loop between change and feedback. Small changes are easier to review, easier to test, easier to roll back. Fast feedback means we catch issues when they’re still cheap. And visible work—dashboards, pull requests, release notes—means nobody has to spelunk Slack threads to understand what happened.
A useful mental model: treat every production change as an experiment. We define what “success” looks like, we ship, we observe, and we either keep it or revert it. This isn’t academic; it’s how we avoid the “we deployed three things and now we don’t know which one hurt us” situation.
If we want one metric to rally around, we can borrow from the DORA set (deployment frequency, lead time, change fail rate, time to restore). We don’t need a certification wall poster—just pick one pain point and measure it consistently. If you want the canonical reference, the DORA research is still the least hand-wavy place to start.
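Two of those DORA metrics fall straight out of a deploy log. As a minimal sketch — assuming a hypothetical log of (timestamp, caused_incident) pairs, which is not how any particular tool stores it — deployment frequency and change fail rate are a few lines each:

```python
from datetime import datetime

# Hypothetical deploy log: (timestamp, caused_incident) pairs.
deploys = [
    (datetime(2024, 6, 3, 10), False),
    (datetime(2024, 6, 4, 15), True),
    (datetime(2024, 6, 6, 9), False),
    (datetime(2024, 6, 7, 11), False),
]

def deploy_frequency_per_week(deploys):
    """Deploys per week across the observed window."""
    times = sorted(t for t, _ in deploys)
    span_days = max((times[-1] - times[0]).days, 1)
    return len(deploys) / (span_days / 7)

def change_fail_rate(deploys):
    """Fraction of deploys that caused an incident."""
    failures = sum(1 for _, failed in deploys if failed)
    return failures / len(deploys)

print(round(deploy_frequency_per_week(deploys), 1))  # 7.0
print(change_fail_rate(deploys))                     # 0.25
```

The point isn't the arithmetic; it's that the numbers come from a log you already have, so you can start measuring this week rather than after a tooling project.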
Our Pipeline Is a Product, Not a Chore
A CI/CD pipeline is the assembly line for our software. If it’s flaky, slow, or mysterious, it becomes the team’s shared resentment. So we treat it like a product: versioned, observable, and continuously improved.
The first pipeline upgrade we usually make is cutting noise. If tests fail randomly, engineers stop trusting them. Quarantine flaky tests, fix them, and keep the signal clean. Next, we reduce cycle time: parallelize tests, cache dependencies, and avoid rebuilding the world for every commit.
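The quarantine decision itself is mechanical: a test whose result varies across identical runs is flaky by definition. Here's a toy sketch of that triage — the callables stand in for real test cases, which a real CI job would rerun instead:

```python
# Sketch of flaky-test triage, assuming a callable per test that returns
# True (pass) or False (fail); real CI would rerun the actual test cases.
def detect_flaky(run_test, runs=5):
    """Classify a test by rerunning it: identical inputs should give
    identical results, so varying outcomes mean quarantine."""
    results = {run_test() for _ in range(runs)}
    if results == {True}:
        return "pass"
    if results == {False}:
        return "fail"        # consistently red: fix the code or the test
    return "quarantine"      # nondeterministic: isolate it, keep CI green

def alternating():
    # Simulates a nondeterministic test: passes every other run.
    state = {"n": 0}
    def test():
        state["n"] += 1
        return state["n"] % 2 == 0
    return test

stable, broken = (lambda: True), (lambda: False)
print(detect_flaky(stable))         # pass
print(detect_flaky(broken))         # fail
print(detect_flaky(alternating()))  # quarantine
```

Quarantined tests still run — they just can't block merges until someone makes them deterministic again.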
Here’s a minimal GitHub Actions pipeline that’s boring in the best way: lint, test, build, then deploy on main. It won’t win any awards, but it will save our weekends.
name: ci

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm run lint
      - run: npm test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - run: npm ci
      - run: npm run build

  deploy:
    if: github.ref == 'refs/heads/main'
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploy goes here"
If your org prefers something else, fine—just keep the principles. And if you’re comparing CI options, GitHub Actions and GitLab CI both cover the basics well. Our only hard rule: the pipeline should explain itself to a new hire without a 45-minute interpretive dance.
Infrastructure as Code: Make Changes Boring (Again)
If we’re clicking around in cloud consoles, we’re doing theatre, not devops. The cloud console is fine for learning, but it’s a terrible source of truth. The moment two people can “just tweak a setting,” we’ve created configuration folklore—every environment is unique, and nobody knows why.
Infrastructure as Code (IaC) is how we get repeatability: the same inputs create the same outputs, with reviews and history attached. Terraform is a common choice, and even if you don’t love it, the workflow is sound: plan, review, apply. If you want the official reference, Terraform docs are clear and practical.
A small Terraform snippet that provisions an S3 bucket with versioning illustrates the point: it’s readable, reviewable, and consistent.
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

resource "aws_s3_bucket" "app_artifacts" {
  bucket = "${var.project}-artifacts-${var.env}"
}

resource "aws_s3_bucket_versioning" "app_artifacts" {
  bucket = aws_s3_bucket.app_artifacts.id

  versioning_configuration {
    status = "Enabled"
  }
}
We keep IaC in the same repo as the service (or in a clearly related repo) so changes ship together. We enforce code review and run terraform plan in CI so nobody “applies” surprises. And we tag resources predictably—owner, environment, service—because cost and incident investigations always start with: “what is this thing and who pays for it?”
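That tagging rule is easy to enforce in CI. As a sketch — the resource dicts below are a simplified, assumed shape, not the real `terraform show -json` schema, which nests this information more deeply — a required-tags check is just set arithmetic:

```python
REQUIRED_TAGS = {"owner", "environment", "service"}

def missing_tags(resources):
    """Report resources missing any required tag. Each resource dict
    is an assumed, simplified shape: {"address": ..., "tags": {...}}."""
    problems = {}
    for res in resources:
        missing = REQUIRED_TAGS - set(res.get("tags", {}))
        if missing:
            problems[res["address"]] = sorted(missing)
    return problems

resources = [
    {"address": "aws_s3_bucket.app_artifacts",
     "tags": {"owner": "platform", "environment": "prod", "service": "web"}},
    {"address": "aws_sqs_queue.jobs", "tags": {"owner": "platform"}},
]
print(missing_tags(resources))
# {'aws_sqs_queue.jobs': ['environment', 'service']}
```

Fail the pipeline when the result is non-empty and the tag policy stops being a wiki page nobody reads.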
Containers and Kubernetes: Only as Fancy as Needed
Containers are great because they standardize runtime. They’re not magic; they just make “works on my machine” less common. We aim for small images, explicit dependencies, and immutable releases. If we can rebuild an image from scratch at any time, we’re in a good place.
Kubernetes is where devops humour goes to get its material. It’s powerful, but it’s also a complexity multiplier. Our rule of thumb: use Kubernetes when we need its scheduling, scaling, or multi-service orchestration benefits—not because it’s the default answer on the internet. If a managed container service or even a VM-based deploy meets the needs, we keep it simple.
When we do run Kubernetes, we make deployments safe by default: readiness probes, resource requests/limits, and rolling updates. Here’s a trimmed Deployment that includes the bits we actually miss when they’re not there:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: ghcr.io/acme/web:1.2.3
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
And yes, we document why these values exist. Otherwise, six months later, someone “temporarily” bumps limits and we all pretend we’ll revisit it.
Observability: Logs, Metrics, Traces, and the Truth
If we can’t see it, we can’t operate it. Observability isn’t about buying a platform and calling it done; it’s about choosing signals that help us answer: “Is it broken?” and “Why is it broken?” quickly.
We start with the basics:
– Metrics for health and performance (latency, error rate, saturation).
– Logs for context (what happened, to whom, and with what input).
– Traces for distributed systems (where time went across services).
We also standardize what “good” looks like: dashboards with SLO-relevant views, not a museum of charts. A service that’s “up” but returning errors is not up. And a dashboard that nobody checks is just wall art.
Alerting is where devops teams accidentally create their own misery. The best alerts are actionable and rare. “CPU is 70%” is usually not an incident; “p95 latency exceeded SLO for 10 minutes” might be. We use paging for user-impacting conditions and route everything else to tickets or daytime channels.
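That "sustained breach" condition is worth making precise, because it's what separates a page from a blip. A minimal sketch, assuming per-minute latency samples and an illustrative 300 ms p95 SLO (both numbers are placeholders, not recommendations):

```python
import math

SLO_P95_MS = 300      # illustrative SLO threshold
BREACH_MINUTES = 10   # how long a breach must persist before paging

def p95(samples):
    """95th percentile by the nearest-rank method."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

def should_page(per_minute_latencies):
    """Page only when p95 latency has exceeded the SLO for the last
    BREACH_MINUTES consecutive minutes; a one-minute spike is not an
    incident."""
    recent = per_minute_latencies[-BREACH_MINUTES:]
    if len(recent) < BREACH_MINUTES:
        return False
    return all(p95(minute) > SLO_P95_MS for minute in recent)

healthy = [100] * 19 + [250]        # p95 well under the SLO
breached = [100] * 10 + [400] * 10  # p95 = 400 ms, over the SLO
print(should_page([healthy] * 2 + [breached] * 10))  # True
print(should_page([healthy] * 12))                   # False
```

Real monitoring systems express this as a query over a time-series store rather than a Python loop, but the shape of the rule — threshold plus duration — is the same.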
For a practical framework, Google’s SRE material still holds up. The Google SRE books are packed with approaches that work in the real world, especially around SLOs and alerting discipline. If we implement only one concept, it’s error budgets: they force honest tradeoffs between shipping features and paying down reliability debt.
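The error-budget arithmetic is simple enough to sketch in a few lines. With a 99.9% availability SLO, 0.1% of requests may fail before the budget is spent; the numbers below are illustrative:

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent. With a 99.9% SLO,
    (1 - 0.999) of requests are allowed to fail; going past that
    returns a negative number, i.e. the budget is overdrawn."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

# 1M requests this month at a 99.9% SLO => 1,000 failures allowed.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(round(remaining, 2))  # 0.6
```

When `remaining` approaches zero, the honest move is to slow feature work and spend engineering time on reliability — that's the whole point of the budget.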
Security in Devops: Shift Left Without Shifting Pain
Security in devops isn’t a separate lane; it’s the guardrails on the road. The goal isn’t to turn engineers into security specialists—it’s to make the secure path the easy path.
We start with hygiene:
– Least-privilege IAM roles and short-lived credentials.
– Secrets in a manager, not in environment files pasted into tickets.
– Dependency scanning and container image scanning in CI.
– Signed artifacts where it matters.
We also make threat reduction part of normal delivery. That means automated checks in pull requests, not surprise audits two weeks before a launch. If a security tool produces false positives that nobody can interpret, it becomes shelfware. We tune it until it helps.
A practical win: define a baseline policy for deployments (no public buckets, encryption enabled, required tags, etc.) and enforce it automatically. You can do this with policy-as-code tools, or even basic CI checks depending on your maturity. The key is consistency.
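At the "basic CI checks" end of that maturity spectrum, a baseline policy can be a list of small rules over a resource description. A sketch, assuming a simplified resource dict (real policy-as-code tools evaluate richer inputs like Terraform plans or Kubernetes manifests):

```python
def policy_violations(resource):
    """Minimal policy-as-code sketch: each rule inspects a simplified
    resource dict (assumed shape) and returns a violation string or None."""
    rules = [
        lambda r: "bucket must not be public" if r.get("public") else None,
        lambda r: "encryption must be enabled" if not r.get("encrypted") else None,
        lambda r: "missing required tags"
        if not {"owner", "environment"} <= set(r.get("tags", {})) else None,
    ]
    return [v for v in (rule(resource) for rule in rules) if v]

bad = {"public": True, "encrypted": False, "tags": {"owner": "web"}}
print(policy_violations(bad))
# ['bucket must not be public', 'encryption must be enabled', 'missing required tags']

good = {"public": False, "encrypted": True,
        "tags": {"owner": "web", "environment": "prod"}}
print(policy_violations(good))  # []
```

A dedicated tool gives you better reporting and shared policy libraries, but the contract is identical: violations block the merge, an empty list lets it through.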
For teams needing a reasonable starting point, the OWASP Top Ten is a solid list of common application risks. It’s not devops-specific, but it keeps us grounded in what actually gets exploited instead of what looks scary on a conference slide.
Culture and Habits: The Unsexy Stuff That Pays Off
Devops falls apart when it’s treated as a team, not a practice. If “DevOps” is a department that does deployments for everyone else, we’ve reinvented the wall between dev and ops—just with fresher paint.
What works better is shared ownership:
– Teams build and run what they ship.
– Ops expertise is embedded as patterns, templates, and coaching.
– Post-incident learning is blameless and documented.
We run incident reviews that focus on contributing factors: unclear runbooks, missing dashboards, risky deploy patterns, brittle dependencies. The output isn’t shame; it’s backlog items. And we track those items like product work, because they are.
Runbooks deserve special mention. A good runbook answers: what’s the symptom, how do we confirm it, what are safe mitigations, and when do we escalate? If the only person who can fix an issue is “Dave,” Dave can’t take a holiday, and we can’t scale. (Also Dave will eventually stop answering.)
Finally, we keep our devops stack intentionally small. Every new tool adds cognitive load, integration points, and maintenance. We’d rather do a few things well than collect a zoo of half-adopted platforms. The best compliment we can get is: “Deploying here feels boring.” In operations, boring is premium.