Microservices Done Right Without Losing Our Minds
Practical DevOps patterns for speed, safety, and fewer 3 a.m. surprises.
Why We Chose Microservices (And What We Regretted)
We didn’t adopt microservices because a conference slide told us to. We did it because our product kept growing, our release cycles kept slowing, and our “one repo, one deploy” setup turned every change into a group project. Microservices gave us clearer ownership, smaller blast radii, and the ability to ship improvements without waiting for a quarterly “big bang” release.
But let’s be honest: microservices also multiplied our problems. One service became twenty. One deployment became a choreography. One log file became a scavenger hunt. And the first time a request hopped across five services and failed due to a typo in a header, we understood a timeless DevOps truth: distributed systems are just systems… distributed enough to hide the issue from you.
The biggest regret we see in teams is jumping to microservices before they’re ready operationally. If we can’t reliably deploy, observe, and roll back one application, we won’t magically do it for fifty. Microservices are an organisational choice as much as a technical one—teams need ownership boundaries, clear interfaces, and the ability to run services without shouting across the office (or Slack) all day.
So our goal isn’t “microservices everywhere.” It’s “microservices where they pay rent.” We keep the bar high: measurable benefits, clear seams, and a plan for the boring parts—monitoring, security, CI/CD, and incident response. The fun architecture diagrams don’t page us at night. The boring parts do.
Service Boundaries: Start With the Team, Not the Code
We’ve learned to treat service boundaries like fences on farmland: put them in the wrong spot and everyone spends the season arguing about who owns the stray sheep. The cleanest microservices boundaries usually align with business capabilities and team ownership. When a team can build, deploy, and operate a service independently, we get the real win: fewer coordination bottlenecks.
A practical approach is to start with a modular monolith (or a “well-separated monolith”) and extract only the modules that have clear reasons to stand alone. Good candidates tend to have independent scaling needs, different release cadence, or distinct data ownership. Bad candidates are “utility” chunks that every other service depends on. Those become shared libraries, or worse—shared runtime services that drag latency and downtime through the whole estate.
We also stop ourselves from carving services too small. If a service can’t justify its own on-call rotation (even if shared), or if it needs constant lockstep changes with two others, it’s probably not a service—it’s a function with a travel budget.
A lightweight heuristic we use:
– One service = one business capability.
– One service = one owning team.
– One service = one data store (or at least one clear data owner).
– APIs are contracts, not suggestions.
If you want a deeper framing, we’ve found Domain-Driven Design concepts like bounded contexts genuinely useful—less for theory, more for giving us shared language when we argue (politely) about where things belong.
APIs And Contracts: Make Change Boring Again
Microservices live and die by their interfaces. If our API contracts are vague, every deployment becomes a suspense novel: “Will this break production? Let’s find out together.” We aim to make change boring by being explicit about versioning, backward compatibility, and consumer expectations.
First, we document APIs as if other teams are paying customers—because functionally, they are. We use OpenAPI specs for synchronous APIs, and schemas (JSON Schema/Avro/Protobuf) for events. Second, we avoid “clever” RPC patterns that blur boundaries. We keep it simple: HTTP for request/response, events for async integration, and clear error semantics.
Here’s a minimal OpenAPI snippet that shows what “explicit” looks like—especially around error semantics. It’s not glamorous, but it saves hours of guessing:
openapi: 3.0.3
info:
  title: Orders Service API
  version: 1.4.0
paths:
  /v1/orders/{orderId}:
    get:
      summary: Get an order by ID
      parameters:
        - name: orderId
          in: path
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Order found
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Order"
        "404":
          description: Order not found
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Error"
components:
  schemas:
    Order:
      type: object
      required: [id, status, total]
      properties:
        id: { type: string }
        status: { type: string, enum: [PENDING, PAID, SHIPPED, CANCELED] }
        total: { type: number, format: float }
    Error:
      type: object
      required: [code, message, traceId]
      properties:
        code: { type: string }
        message: { type: string }
        traceId: { type: string }
For compatibility, we prefer additive changes, keep fields optional when evolving, and publish deprecation windows. And yes, we actually enforce it in CI with contract checks—because “please don’t break this” isn’t a control, it’s a wish.
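What does a contract check in CI actually do? Here’s a minimal sketch in TypeScript of the core rule we enforce (no new required fields, no removed fields). The schema shape and function names are ours for illustration—real tooling such as openapi-diff or Pact covers far more cases:

```typescript
// Minimal backward-compatibility check for a JSON-style object schema.
// Illustrative sketch only; dedicated contract-testing tools cover
// type changes, enum narrowing, nested objects, and more.

interface ObjectSchema {
  required: string[];
  properties: Record<string, unknown>;
}

// A change is safe for existing consumers if the new schema does not
// add required fields and does not remove fields they may rely on.
function checkBackwardCompatible(
  oldSchema: ObjectSchema,
  newSchema: ObjectSchema
): string[] {
  const problems: string[] = [];
  for (const field of newSchema.required) {
    if (!oldSchema.required.includes(field)) {
      problems.push(`new required field breaks old payloads: ${field}`);
    }
  }
  for (const field of Object.keys(oldSchema.properties)) {
    if (!(field in newSchema.properties)) {
      problems.push(`removed field breaks consumers: ${field}`);
    }
  }
  return problems; // empty array = additive, compatible change
}
```

Wire this into the pipeline so a non-empty result fails the build, and “please don’t break this” becomes an actual control.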
CI/CD For Microservices: Standardise The Pipeline, Not The Code
Microservices can turn CI/CD into a zoo if every team invents their own pipeline. Our rule: we standardise the pipeline shape and safety rails, while letting teams pick the language/runtime that fits their service. This avoids the “fifty snowflakes” problem and makes it possible for any engineer to understand how a service ships.
We keep a golden pipeline template: build → test → scan → package → deploy to staging → integration checks → progressive delivery to prod. A service can add steps, but it can’t skip the guardrails. We also make artifacts immutable and traceable—every deployment points to a specific image digest and a specific Git commit.
Here’s an example GitHub Actions workflow we’ve used as a baseline. It’s intentionally boring: the same boring that keeps us employed.
name: service-ci
on:
  push:
    branches: [ "main" ]
  pull_request:
permissions:
  contents: read
  packages: write  # required for GITHUB_TOKEN to push to GHCR
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm test
      - name: Build image
        run: docker build -t ghcr.io/acme/orders:${{ github.sha }} .
      - name: Trivy scan
        uses: aquasecurity/trivy-action@0.24.0
        with:
          image-ref: ghcr.io/acme/orders:${{ github.sha }}
          severity: "CRITICAL,HIGH"
          exit-code: "1"
      - name: Push image
        if: github.event_name == 'push'  # build and scan PRs, but only push from main
        run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
          docker push ghcr.io/acme/orders:${{ github.sha }}
  deploy-staging:
    if: github.event_name == 'push'
    needs: build-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy via Helm
        run: |
          helm upgrade --install orders charts/orders \
            --namespace staging \
            --set image.tag=${{ github.sha }}
We also recommend borrowing sensible defaults from Google’s SRE book—particularly around release engineering and reducing toil. Not because it’s trendy, but because it works.
Observability: Logs, Metrics, Traces, And One Truth
With microservices, observability isn’t optional—it’s the difference between “minor incident” and “all-hands war room.” Our goal is simple: when a user reports an issue, we should be able to answer “what happened?” in minutes, not hours.
We standardise three signals:
– Metrics for trends and alerts (latency, error rates, saturation).
– Logs for context (structured, not poetic).
– Traces for request journeys across services.
The biggest accelerant is correlation. Every request gets a trace ID, and every service logs it. We propagate context via headers (e.g., traceparent) and ensure our logging format includes the ID. If you’ve ever grepped logs for “maybe this is related,” you know why.
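In code, the propagation rule above is small: reuse the incoming W3C traceparent if it’s valid, mint a new trace otherwise, and stamp every log line with the trace ID. This is a hand-rolled sketch for illustration—in production we’d let OpenTelemetry’s propagators do this:

```typescript
import { randomBytes } from "crypto";

// W3C traceparent: version-traceid-spanid-flags, all lowercase hex.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;

// Reuse an incoming trace, or start a new one. Either way, this hop
// gets its own span ID for the outgoing traceparent header.
function ensureTraceContext(
  headers: Record<string, string>
): { traceId: string; traceparent: string } {
  const match = headers["traceparent"]?.match(TRACEPARENT);
  const traceId = match ? match[1] : randomBytes(16).toString("hex");
  const spanId = randomBytes(8).toString("hex");
  return { traceId, traceparent: `00-${traceId}-${spanId}-01` };
}

// Structured, not poetic: every log line carries the trace ID.
function logWithTrace(traceId: string, message: string): string {
  return JSON.stringify({ ts: new Date().toISOString(), traceId, message });
}
```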
We also define “golden signals” per service: request rate, error rate, latency, and resource saturation. That’s the bedrock for dashboards and alerts. Alerts should be actionable: “Checkout latency p95 > 800ms for 10m” beats “CPU is high” every day of the week.
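To make “p95 > 800ms for 10m” concrete, here’s the evaluation logic as a sketch. In reality this lives in Prometheus or your metrics backend, not application code—the point is that the alert requires a sustained breach, not a single spike:

```typescript
// Nearest-rank percentile over a set of latency samples (milliseconds).
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// "for 10m" means every evaluation in the window breached the
// threshold: sustained degradation pages us, a one-off blip does not.
function shouldPage(windowP95s: number[], thresholdMs: number): boolean {
  return windowP95s.length > 0 && windowP95s.every((p95) => p95 > thresholdMs);
}
```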
We’re fans of open standards to avoid tool lock-in. OpenTelemetry is worth adopting early; it keeps instrumentation consistent even if you swap backends later. Whether you use Prometheus, Grafana, ELK, or a managed suite, consistency beats perfection.
Finally, we run game days. Microservices don’t fail politely. Practising failure in daylight makes night-time incidents less dramatic. Also, it’s the only kind of drama we want in engineering.
Data In Microservices: Stop Sharing Databases
If there’s one microservices rule we enforce with the energy of a tired parent in a supermarket aisle, it’s this: don’t share databases between services. Shared databases create hidden coupling, coordinated deployments, and “why did Service B break when Service A changed a column” mysteries.
We aim for data ownership: one service owns a dataset and exposes it via an API or events. When another service needs that data, it either queries via the owning service or keeps a local copy updated through events (event-driven replication). Yes, it introduces eventual consistency. No, it’s not the end of the world—if we design for it.
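A local copy updated through events looks roughly like this. The event shape and class names are invented for the sketch; a real consumer would read from a broker (Kafka, SQS, etc.) and persist both the model and the set of processed events. The key detail is idempotency, because event delivery is usually at-least-once:

```typescript
// An invented order-status event; the eventId is the deduplication key.
interface OrderEvent {
  eventId: string;
  orderId: string;
  status: string;
}

// A denormalised read model kept in sync by consuming events.
class OrderReadModel {
  private orders = new Map<string, string>(); // orderId -> latest status
  private seen = new Set<string>();           // processed event IDs

  apply(event: OrderEvent): void {
    // At-least-once delivery means replays happen; skip them.
    if (this.seen.has(event.eventId)) return;
    this.seen.add(event.eventId);
    this.orders.set(event.orderId, event.status);
  }

  statusOf(orderId: string): string | undefined {
    return this.orders.get(orderId);
  }
}
```

The model is eventually consistent with the owning service—and that’s fine, as long as reads that need stronger guarantees go through the owner instead.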
We choose integration patterns based on business needs:
– Synchronous calls for reads that must be up-to-date (with caching and timeouts).
– Events for workflows and denormalised read models.
– Sagas for distributed transactions (and explicit compensations).
We keep transactions local and make cross-service workflows explicit. If we need to reserve inventory, charge a card, and create a shipment, we don’t wrap it in a fantasy “global transaction.” We publish events and react, with compensating actions when something fails.
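The reserve/charge/ship flow above, written as an explicit saga: each step carries a compensation, and on failure we undo completed steps in reverse order. The steps here are stubs for illustration; a real saga also persists its progress so it survives a crash mid-flow:

```typescript
// A saga step pairs an action with its compensating action.
interface SagaStep {
  name: string;
  act: () => void;        // throws on failure
  compensate: () => void; // undoes a previously successful act()
}

// Run steps in order; on failure, compensate completed steps in
// reverse (undo the charge before releasing the reservation).
function runSaga(steps: SagaStep[]): { ok: boolean; compensated: string[] } {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.act();
      done.push(step);
    } catch {
      const compensated = done.reverse().map((s) => {
        s.compensate();
        return s.name;
      });
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

No global transaction, no distributed lock—just explicit forward steps and explicit undo steps.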
A good reference when designing reliability patterns is the Circuit Breaker pattern. Add timeouts, retries with backoff, and idempotency keys, and suddenly microservices feel less like a juggling act and more like an engineered system.
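For the shape of the pattern, here’s a minimal circuit breaker sketch: after N consecutive failures the circuit opens and calls fail fast; after a cooldown it half-opens and lets one probe through. The clock is injected purely for testability, and production code would reach for a library (opossum on Node, resilience4j on the JVM) that adds retries, backoff, and metrics:

```typescript
type Clock = () => number; // milliseconds; injectable for tests

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures: number,
    private readonly cooldownMs: number,
    private readonly now: Clock = Date.now
  ) {}

  call<T>(fn: () => T): T {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.openedAt = null; // half-open: allow one probe call through
    }
    try {
      const result = fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}
```

The fail-fast path is the whole point: instead of piling timeouts on a struggling dependency, callers get an immediate error they can handle (fallback, cached value, graceful degradation).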
Security And Platform Guardrails: Make The Safe Path The Easy Path
Microservices increase the attack surface, full stop. More services means more endpoints, more identities, more secrets, and more opportunities to accidentally expose something we didn’t mean to. The good news: if we treat security as platform plumbing rather than a checklist, we can make it manageable.
We start with identity. Services should authenticate and authorise using short-lived credentials. We avoid long-lived static secrets when possible, and we rotate aggressively when we can’t. For Kubernetes, we lean on service accounts, workload identity, and network policies to limit who can talk to whom.
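Short-lived credentials imply a refresh habit: cache the token and renew it before expiry, not after the first 401. A sketch with an injected fetcher—the real source would be your IdP, cloud metadata endpoint, or Vault, and the skew value is just an assumed default:

```typescript
interface Token {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// Caches a short-lived token and proactively refreshes it a bit
// before expiry, so in-flight requests never carry a stale credential.
class TokenCache {
  private token: Token | null = null;

  constructor(
    private readonly fetchToken: () => Token,
    private readonly refreshSkewMs = 60_000, // renew a minute early (assumed default)
    private readonly now: () => number = Date.now
  ) {}

  get(): string {
    const expiring =
      !this.token || this.now() >= this.token.expiresAt - this.refreshSkewMs;
    if (expiring) {
      this.token = this.fetchToken();
    }
    return this.token!.value;
  }
}
```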
We also bake in:
– Dependency and image scanning in CI (fail builds for high severity).
– Runtime policies (least privilege, read-only file systems where possible).
– TLS everywhere (internal traffic too, not just the edge).
– Centralised secrets management.
Most importantly, we don’t ask every team to become experts in everything. We provide paved roads: templates, libraries, and defaults. Teams can deviate, but they have to do it knowingly. That’s not control-freak management; it’s how we reduce surprise.
And we keep a sense of humour: security is serious, but our job is to make “doing the right thing” less painful than “doing the fast thing.” If secure-by-default feels like friction, people will route around it—like water, or like engineers on a deadline.