Microservices Without Tears: Shipping Fast, Sleeping Better
Practical patterns we use to keep services small and on-call calmer.
Why We Choose Microservices (And When We Don’t)
We don’t adopt microservices because it’s fashionable; we do it because sometimes the shape of the business demands it. When a product grows beyond one team’s ability to change it safely, splitting along clear boundaries can reduce collisions. Teams can deploy independently, scale only what’s hot, and iterate without waiting for a quarterly “big release” window (remember those?). In practice, microservices help when we’ve got multiple domains moving at different speeds—billing, search, notifications, and so on—and we want autonomy without a constant merge-fest.
But microservices aren’t free. They add network hops, operational overhead, and a long list of “where did that request go?” moments. If we can’t name the service boundaries in plain language, we’re probably not ready. If the data model is one tightly coupled knot, we’ll spend our days fighting distributed transactions. If the team is small, a well-structured modular monolith is often the kinder choice. A monolith with clear modules, good tests, and a clean deployment pipeline can outperform a poorly designed microservices setup—both in latency and in sanity.
A good rule of thumb we use: if we can’t support on-call for it, we can’t ship it. That means thinking about logging, metrics, and deployability before we carve the system into 27 “tiny” services that each need babysitting. We aim for the smallest number of services that still gives independent change. Not “micro” for micro’s sake—just “separate” where it counts.
Carving Boundaries: Make the Org Chart Work for Us
Microservices boundaries are less about technology and more about reducing arguments. We start by mapping business capabilities, not tables or class hierarchies. If “Orders” and “Payments” are different concerns with different failure modes, they’re good candidates to separate. If “Users” is a cross-cutting dependency for everything, we’re careful—turning it into a central service can accidentally create a bottleneck and a single point of failure. A service boundary should let a team ship changes without coordinating with three other teams every time.
We lean on Domain-Driven Design as a vocabulary tool, not a religion. “Bounded contexts” help us say: this service owns its data, its rules, and its release cadence. That ownership is key: shared databases are where microservices go to die. We’ve learned to avoid “just one shared schema” even if it feels faster today. Instead, each service owns its persistence, and other services integrate through APIs or events.
We also keep a close eye on coupling signals: if two services change together more than occasionally, they might belong together. If one service needs five synchronous calls to fulfill a request, we may be slicing too thin—or we’re missing an aggregator pattern. We’ve had success starting with a modular monolith, then extracting services once the seams are visible through real change. You don’t need to guess the perfect design on day one. We’d rather be a bit conservative and split later than over-split and spend six months re-stitching.
Service-to-Service Communication: Pick Fewer Patterns
In microservices, communication is where complexity hides. We try to keep a small toolbox: synchronous HTTP/gRPC for request/response and asynchronous messaging for events. The trick is being intentional about where each fits. If a user-facing request can’t complete without data, synchronous calls are fine—but we keep the chain short. If we’re propagating state changes (“order created”, “payment captured”), events are usually better and reduce tight coupling.
We standardise on a few conventions: timeouts everywhere, retries only when safe, and idempotency as a first-class feature. We’ve all seen the “retry storm” where a small slowdown becomes a meltdown. We also use circuit breakers and bulkheads so one flaky dependency doesn’t take out an entire fleet. If you want a great deep dive on resilience patterns, Martin Fowler’s catalogue is still one of the clearest references: Circuit Breaker.
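As a sketch of those conventions (the helper names and defaults are ours, not a library’s), a bounded retry with a per-attempt timeout might look like this — safe only because the operation is idempotent:

```typescript
// Run a promise with a deadline; the timer is cleared so a finished
// call does not leave a stray rejection behind.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Bounded retries with exponential backoff. This is only safe for
// idempotent operations: the caller attaches an idempotency key so a
// request that already succeeded server-side is deduplicated, not
// applied twice.
async function withRetry<T>(
  op: () => Promise<T>,
  { attempts = 3, timeoutMs = 2000, baseDelayMs = 100 } = {}
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(op(), timeoutMs);
    } catch (err) {
      lastErr = err;
      if (attempt < attempts) {
        // Back off between attempts so a slow dependency is not
        // hammered into a full retry storm.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastErr;
}
```

The caller would wrap the actual HTTP call, e.g. `withRetry(() => fetch(url, { headers: { "Idempotency-Key": key } }))`, so the downstream service can deduplicate repeats.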
For API design, we keep contracts explicit and versioned. Backward compatibility is not optional—especially when teams deploy independently. For events, we treat schemas like APIs: version them, document them, and evolve carefully. Tools like AsyncAPI help us publish event contracts in a way humans can read and CI can validate.
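As an illustration (channel and field names are hypothetical), a minimal AsyncAPI 2.x document describing one event contract might look like:

```yaml
asyncapi: "2.6.0"
info:
  title: Orders Events          # hypothetical service contract
  version: "1.0.0"
channels:
  order.created:
    subscribe:
      message:
        name: OrderCreated
        payload:
          type: object
          required: [orderId, createdAt]
          properties:
            orderId:
              type: string
            createdAt:
              type: string
              format: date-time
```

Versioning the `info.version` and payload schema together is what lets CI flag a breaking change before a consumer finds it in production.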
Finally, we avoid synchronous fan-out in the hot path. If a single request needs to call five services, we ask whether we should cache, denormalise, or materialise a read model. Latency is a budget. Microservices spend it quickly.
Kubernetes Basics We Actually Use (YAML Included)
We like Kubernetes because it gives us consistent deployment mechanics, not because we enjoy writing YAML (we don’t). Our baseline is simple: each service gets a Deployment, a Service, sensible resource requests/limits, health probes, and a disruption budget if it’s important. The goal is boring reliability.
Here’s a trimmed example we use as a starting point:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  labels:
    app: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: ghcr.io/acme/orders:1.12.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 20
          env:
            - name: OTEL_SERVICE_NAME
              value: orders
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
```
We don’t ship without probes. Readiness prevents traffic to a half-started service; liveness restarts wedged processes. We also set resource requests because the scheduler can’t read minds, and because noisy neighbours are only funny when they’re not ours.
For rollouts, we keep it gradual and observable. Kubernetes’ own docs are excellent when we need a refresher on probe behaviour and deployment strategies: Kubernetes Probes. The real win is consistency: once every service follows the same baseline, debugging becomes pattern recognition instead of archaeology.
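For the “disruption budget if it’s important” part of the baseline, a minimal PodDisruptionBudget matching the deployment above might look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
spec:
  minAvailable: 2        # keep at least 2 of the 3 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: orders
```

This doesn’t stop node crashes; it stops planned operations (drains, upgrades) from taking down too many replicas at once.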
CI/CD That Doesn’t Turn Into a Weekly Ritual
Microservices multiply deployments, so the pipeline has to be boringly reliable. If every service needs hand-holding, we’ll spend our lives “just checking the build” like it’s a pet hamster. We standardise pipelines with templates, keep stages consistent, and push quality checks left so we fail fast.
Our typical pipeline: lint → unit tests → build image → dependency scan → integration tests (where meaningful) → deploy to staging → automated smoke tests → progressive delivery to prod. We don’t make every service run an expensive full end-to-end suite; that’s how pipelines become 45-minute endurance events. Instead, we test contracts and key flows, and keep end-to-end tests focused on the most critical paths.
Here’s a minimal GitHub Actions workflow we’d accept as a baseline:
```yaml
name: ci
on:
  push:
    branches: [ "main" ]
  pull_request:
jobs:
  build-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm test -- --ci
      - run: docker build -t ghcr.io/acme/orders:${{ github.sha }} .
      - run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - run: docker push ghcr.io/acme/orders:${{ github.sha }}
```
From there, we promote with GitOps rather than “clickops”. A declarative approach (Argo CD, Flux, etc.) keeps drift visible and rollbacks practical. The point isn’t tooling purity; it’s that the system should tell us what’s running where, and why. Also: if the pipeline can’t be explained in five minutes, it’s too clever.
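As a sketch of the GitOps side (the repository URL and paths are hypothetical), an Argo CD Application that keeps a cluster in sync with a manifests repo might look like:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/deploy-manifests   # hypothetical manifests repo
    targetRevision: main
    path: services/orders
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

With `selfHeal` on, “what’s running” and “what’s in Git” stop being two different questions.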
Observability: If We Can’t See It, We Can’t Own It
Microservices fail in creative ways. Observability is how we stop being surprised. We treat logs, metrics, and traces as part of the product, not an afterthought. Our baseline: structured logs with correlation IDs, RED/USE metrics (rate, errors, duration / utilisation, saturation, errors), and distributed tracing for cross-service requests.
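As a sketch of the structured-log baseline (field names here are our convention, not a standard), one JSON object per line with the correlation ID carried through:

```typescript
// Emit one JSON object per line so log pipelines can query fields
// instead of regexing free text. The correlation ID is minted once at
// the edge (e.g. the API gateway) and forwarded on every downstream call.
import { randomUUID } from "node:crypto";

type LogLevel = "debug" | "info" | "warn" | "error";

function logEvent(
  level: LogLevel,
  message: string,
  correlationId: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    service: "orders",        // assumed service name for the example
    correlationId,
    message,
    ...fields,
  });
  console.log(line);
  return line;
}

// At the edge: mint the ID once, then pass it via headers downstream.
const correlationId = randomUUID();
logEvent("info", "order received", correlationId, { orderId: "o-123" });
```

Once every service logs the same shape, “find all log lines for this request” is a single filtered query across the fleet.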
We lean heavily on OpenTelemetry because it reduces vendor lock-in and gives teams a shared language. A trace that shows a request hopping through gateway → orders → inventory → payments is worth more than a hundred “works on my machine” conversations. For a solid overview, the official docs are a good starting point: OpenTelemetry.
We also define what “good” looks like with SLOs. Not aspirational “five nines” posters—real targets tied to user experience. If we don’t have an error budget, we end up debating reliability emotionally instead of mathematically. And when incidents happen (they will), we do blameless postmortems focused on learning. The best time to add a dashboard is before an outage, but the second-best time is immediately after one, while the pain is still motivating.
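The error budget is just arithmetic: a 99.9% SLO over 30 days leaves roughly 43 minutes of allowed unavailability. A sketch:

```typescript
// Convert an SLO target (as a fraction, e.g. 0.999) into a monthly
// error budget expressed in minutes of allowed unavailability.
function errorBudgetMinutes(sloTarget: number, daysInWindow = 30): number {
  const totalMinutes = daysInWindow * 24 * 60; // 43,200 for a 30-day window
  return totalMinutes * (1 - sloTarget);
}

errorBudgetMinutes(0.999); // ≈ 43.2 minutes per 30 days
errorBudgetMinutes(0.99);  // ≈ 432 minutes — two nines buys a lot of slack
```

Framing reliability as “minutes left this month” turns the deploy-vs-stability argument into a number both sides can read.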
A practical trick: we publish a “golden dashboard” template for all services—latency percentiles, error rate, request volume, saturation, and dependency health. Standard dashboards make on-call rotations survivable because every service looks familiar. Microservices already have enough novelty; our dashboards don’t need to add to it.
Data in Microservices: Consistency Is a Choice
Data is where microservices get real. The biggest mindset shift is accepting that strong consistency everywhere is expensive, and sometimes unnecessary. We design with business tolerance in mind: what must be correct immediately, and what can be eventually consistent? Inventory counts might be “close enough” for a few seconds; payment capture probably isn’t.
We avoid distributed transactions across services whenever we can. Two-phase commit sounds nice until it meets real networks. Instead, we use patterns like the saga (or process manager) to coordinate multi-step workflows with compensating actions. If an order is created but payment fails, we cancel the order and release inventory. It’s not glamorous, but it’s honest engineering.
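The order/payment example above can be sketched as an orchestration-style saga (the step and compensation names are placeholders for real service calls):

```typescript
// A minimal saga runner: execute steps in order; if one fails, run the
// compensations for every step that already succeeded, in reverse order.
type SagaStep = {
  name: string;
  action: () => Promise<void>;      // forward step, e.g. capture payment
  compensate: () => Promise<void>;  // undo, e.g. refund / release inventory
};

async function runSaga(
  steps: SagaStep[]
): Promise<{ ok: boolean; failedAt?: string }> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      // Compensations must themselves be idempotent, because after a
      // crash the runner may replay them.
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}
```

If `capturePayment` fails after `createOrder` and `reserveInventory` succeeded, the runner releases the inventory and cancels the order — exactly the compensating actions described above.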
We also treat events as durable facts. An “OrderCreated” event should mean that the Orders service has committed that change. Consumers then build their own read models—sometimes denormalised—optimised for their use case. This is where CQRS can help, but we keep it pragmatic: we separate reads and writes only where it actually reduces pain.
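A consumer-side read model is just a fold over those durable facts. A sketch, with illustrative event shapes:

```typescript
// Build a denormalised read model by folding over a stream of events.
// The consumer owns this view; it can be rebuilt by replaying events.
type OrderEvent =
  | { type: "OrderCreated"; orderId: string; total: number }
  | { type: "PaymentCaptured"; orderId: string }
  | { type: "OrderCancelled"; orderId: string };

type OrderView = { orderId: string; total: number; status: string };

function project(events: OrderEvent[]): Map<string, OrderView> {
  const views = new Map<string, OrderView>();
  for (const e of events) {
    switch (e.type) {
      case "OrderCreated":
        views.set(e.orderId, { orderId: e.orderId, total: e.total, status: "created" });
        break;
      case "PaymentCaptured": {
        const v = views.get(e.orderId);
        if (v) v.status = "paid";
        break;
      }
      case "OrderCancelled": {
        const v = views.get(e.orderId);
        if (v) v.status = "cancelled";
        break;
      }
    }
  }
  return views;
}
```

Because the view is derived, a schema change on the read side means replaying events, not coordinating a cross-service migration.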
For storage, we don’t force one database to rule them all. Teams pick what fits: Postgres for relational integrity, Redis for caching, a document store for flexible schemas. But we do standardise operationally: backups, retention, encryption, access controls, and migration practices. Microservices don’t excuse messy data hygiene; they make it easier to create many small messes instead of one big one.
When we get this right, teams move faster without stepping on each other’s data. When we get it wrong, we get phantom bugs and “why is this field null?” mysteries at 2 a.m.