Microservices Without Tears: A Practical DevOps Playbook

How we keep services small, shipping fast, and sleep mostly uninterrupted

Why We Build Microservices (And When We Don’t)

We don’t build microservices because it’s trendy; we build them when we need teams to move independently without stepping on each other’s toes. If our product is growing, our deployment cadence is climbing, and different parts of the system change at different rates, microservices can be a sensible choice. The win isn’t “distributed systems” (that’s a cost); the win is organisational: smaller codebases, clearer ownership, and the ability to deploy one thing without redeploying everything.

But here’s the bit we put in bold in our internal docs: microservices amplify both good and bad engineering habits. If we don’t have decent automation, clear interfaces, and a baseline approach to observability, we’ll just end up with a collection of confusing repos that all fail in different ways. If our biggest problem is that we can’t agree on a schema, splitting into 40 services won’t make us agree faster.

We also don’t force it. If we’ve got a small team, a simple domain, or we’re still figuring out the product shape, a modular monolith is usually the kinder option. We can still have boundaries, clean APIs, and separate modules—without paying the “everything is a network call” tax.

A good checkpoint question: “Will this boundary let two teams deploy independently most weeks?” If the answer is no, it might just be a new package, not a new service.

For pragmatic guidance, we often refer people to Martin Fowler's Microservices article, because its framing of the tradeoffs is refreshingly grounded.

Right-Sizing Service Boundaries: Start With the Domain

If we had to summarise boundary design in one sentence: we split by business capability, not by database tables. Microservices work best when each service has a crisp purpose that maps to how humans talk about the business. “Orders”, “Payments”, “Shipping”, “Notifications” are concepts people recognise; “OrderLineItemServiceV2” is what happens when we start from the schema instead of the domain.

We like to begin with a basic domain exercise: list the core user journeys and identify the verbs and nouns that matter. Then we map ownership: which team is accountable for the outcomes, not just the code? That team should own the service lifecycle—build, run, on-call rotation, and backlog priorities. If nobody wants to own it, we’ve just discovered a future incident.

We also keep interfaces boring. When we split too early, we end up with chatty services that call each other 12 times just to render a page. That’s a sign we’ve cut along the wrong seam or we’ve missed an aggregate boundary. We’d rather have a service that does “too much” but is cohesive, than one that does “too little” and needs a committee meeting for every feature.

Our rule of thumb: optimise for change. Put the things that change together in the same service, and separate the things that change for different reasons. That’s the heart of good boundaries, whether you call it microservices, modules, or “please stop making 17 repos for one feature”.

If we need more structure, we borrow from Domain-Driven Design patterns (bounded contexts in particular). The terminology can be heavy, but the idea is simple: keep meanings consistent within a boundary. The DDD Reference is a solid bookmark when we’re debating semantics.

Contracts, APIs, and Versioning: Keep It Boring

In microservices, “integration” is where optimism goes to die. Our goal is to make service boundaries predictable, so teams can change things without summoning everyone to a meeting. That means we treat APIs as products: documented, versioned, and tested.

We default to REST for most synchronous calls because it’s widely understood and easy to troubleshoot. gRPC can be great when latency matters and teams are comfortable with protobuf tooling, but we don’t reach for it just to feel faster. For asynchronous events, we publish domain events that are meaningful (“OrderPlaced”, “PaymentFailed”) rather than low-level (“RowInsertedIntoOrdersTable”, which is just… a cry for help).

Versioning: we try hard not to do it often, and when we do, we do it explicitly. URL versioning (/v1/) is blunt but clear. Header-based versioning is fine too, but it’s easier to misconfigure. Whatever we choose, we write it down and enforce it.
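To make "blunt but clear" concrete, here's a minimal sketch of URL versioning as a dispatch table. The handler names and response shapes are illustrative assumptions, not a real codebase; the point is that /v1/ and /v2/ are separate, explicit routes rather than behaviour hidden behind a header:

```python
# Hypothetical sketch: explicit URL versioning for an Orders API.
# get_order_v1/v2 and the payload shapes are illustrative, not real code.

def get_order_v1(order_id: str) -> dict:
    return {"id": order_id, "total": 1999}  # v1 implied the currency

def get_order_v2(order_id: str) -> dict:
    # v2 makes the currency explicit -- a breaking change, hence a new version
    return {"id": order_id, "total": {"amount": 1999, "currency": "GBP"}}

ROUTES = {
    ("GET", "/v1/orders"): get_order_v1,
    ("GET", "/v2/orders"): get_order_v2,
}

def dispatch(method: str, path: str) -> dict:
    # e.g. "/v1/orders/123" -> prefix "/v1/orders", order_id "123"
    prefix, _, order_id = path.rpartition("/")
    handler = ROUTES.get((method, prefix))
    if handler is None:
        return {"error": "unknown version or route", "status": 404}
    return handler(order_id)
```

Both versions stay routable side by side, which is exactly what makes the eventual v1 sunset an explicit, observable event rather than a silent behaviour change.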

We also insist on contract testing. Consumer-driven contract tests reduce the “works on my machine” factor across teams. We don’t need perfection; we need early warning. If a provider change breaks a consumer, we want to know in CI, not in production at 2 a.m.
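This isn't Pact itself, but the consumer-driven idea fits in a few lines of plain Python: the consumer records the minimal response shape it actually relies on, and the provider's CI checks real responses against that record. Field names here are illustrative assumptions:

```python
# Plain-Python sketch of a consumer-driven contract (not the Pact API).
# The consumer declares only the fields it reads -- nothing else is
# part of the contract, so the provider stays free to add fields.
CONSUMER_CONTRACT = {
    "id": str,
    "status": str,
}

def satisfies(response: dict, contract: dict) -> bool:
    """Provider-side CI check: does a real response honour the contract?"""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )
```

A tool like Pact adds the plumbing (broker, verification runs, versioning of contracts), but the early-warning property comes from this shape: the provider learns in CI which consumers a change would break.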

And yes, we write OpenAPI specs. Not because we love YAML, but because it’s a shared source of truth. Tooling like Swagger UI makes it easy for humans, and generators make it easier for clients to stay consistent.

For a practical, widely used standard, we lean on OpenAPI. For contract testing, Pact is a common choice and fits well when teams move independently.
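For reference, the "shared source of truth" can be this small. A minimal, illustrative OpenAPI 3.0 spec for a hypothetical Orders service (paths and schema names are assumptions, not a real API):

```yaml
openapi: "3.0.3"
info:
  title: Orders API
  version: "1.0.0"
paths:
  /v1/orders/{orderId}:
    get:
      operationId: getOrder
      parameters:
        - name: orderId
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: The order
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Order"
        "404":
          description: Order not found
components:
  schemas:
    Order:
      type: object
      required: [id, status]
      properties:
        id:
          type: string
        status:
          type: string
          enum: [placed, paid, shipped]
```

Even a spec this size is enough to drive Swagger UI, client generation, and a CI check that the implementation hasn't drifted from the document.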

Deployments: One Service, One Pipeline (Here’s a Template)

Microservices shine when we can deploy each service independently and safely. That requires a pipeline that’s repeatable and boring. Our baseline pipeline stages look like this: lint/test, build artifact, scan, deploy to a test environment, run integration checks, then promote to production with guarded rollout.

We also like to standardise the shape of pipelines across repos. Teams can still make local decisions, but the basics should feel familiar. That’s how we reduce cognitive load (and the number of “why is this pipeline different?” messages).

Here’s a simple GitHub Actions pipeline we’ve used as a starting template for a containerised service:

name: ci
on:
  push:
    branches: [ "main" ]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # required for GITHUB_TOKEN to push to ghcr.io
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Run unit tests
        run: |
          make test

      - name: Build image
        run: |
          docker build -t ghcr.io/acme/orders:${{ github.sha }} .

      - name: Scan image (example)
        run: |
          echo "Run your scanner here (Trivy/Grype/etc.)"

      - name: Push image
        run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
          docker push ghcr.io/acme/orders:${{ github.sha }}

  deploy-staging:
    needs: build-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          ./deploy/staging.sh ghcr.io/acme/orders:${{ github.sha }}

We keep the scripts (deploy/staging.sh) in-repo so teams can version deployment logic with code changes. If we’re using Kubernetes, that script usually wraps helm upgrade or kubectl apply.
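That script can stay tiny. A hypothetical sketch of what deploy/staging.sh wraps, written as a function for clarity (the chart path, release name, and values keys are assumptions, not a real repo layout):

```shell
# Illustrative sketch of deploy/staging.sh: a thin wrapper around helm.
# HELM can be overridden (e.g. HELM=echo) for a dry run in CI or locally.
deploy_staging() {
  local image="${1:?usage: deploy_staging <image-ref>}"
  local helm_bin="${HELM:-helm}"
  "$helm_bin" upgrade --install orders ./deploy/chart \
    --namespace staging --create-namespace \
    --set image.ref="$image" \
    --wait --timeout 5m
}
```

The value of keeping it this boring is that the pipeline step stays a one-liner, and the deployment logic is versioned alongside the code it deploys.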

The key isn’t the tool. It’s consistency: same stages, same promotion flow, and an obvious place to add tests and rollouts. If we can’t explain the pipeline on a whiteboard in five minutes, it’s probably too clever.

Kubernetes and Config: The Smallest Useful Manifest

Microservices and Kubernetes often arrive together, like a pair of friends who encourage each other’s worst habits. Kubernetes is powerful, but it’s also a buffet of knobs. We try to keep manifests minimal and add complexity only when we can explain the reason in plain language.

Our minimum viable setup: a Deployment, a Service, a health check, resource requests/limits, and config via ConfigMaps/Secrets. If we skip health checks, we’re basically asking Kubernetes to guess whether our service is healthy. Kubernetes is many things, but it isn’t psychic.

Here’s a trimmed Deployment/Service example we’d be happy to ship as a baseline:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: ghcr.io/acme/orders:sha-REPLACE
          ports:
            - containerPort: 8080
          env:
            - name: LOG_LEVEL
              value: "info"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080

We also standardise endpoints: /health for “process is alive”, /ready for “can accept traffic”. It makes dashboards and runbooks reusable.

And we keep config separate from images. Build once, deploy many. If an environment change requires a rebuild, we’ve muddled responsibilities.

For a reliable baseline on probes and workload patterns, the official Kubernetes docs are still the best source—especially when we’re tempted to copy-paste from a random gist at 11 p.m.

Observability: Logs, Metrics, Traces (Pick Three, Not One)

In microservices, debugging without observability is like playing hide-and-seek in the dark. We don’t need perfection, but we do need enough signals to answer: “What’s broken, where, and why now?”

We treat observability as part of the definition of done. Every service should produce structured logs (JSON), basic metrics (request rate, error rate, latency), and traces for cross-service calls. If we can’t follow a request through the system, we’ll end up correlating timestamps and guessing. Guessing is not a strategy; it’s a hobby.

A practical pattern we like:
- Correlation IDs: propagate X-Request-Id (or the W3C traceparent header) across service calls.
- Golden signals: latency, traffic, errors, saturation.
- Actionable alerts: page on symptoms that users feel (error rates, SLO burn), not on every CPU spike.
- One dashboard per service, plus a system-level view.
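The correlation-ID rule is mostly middleware, and the whole of it fits in a few lines. A sketch, assuming the X-Request-Id header named above (the function names are ours, not a library API):

```python
# Sketch: reuse the caller's correlation ID on outbound calls, minting
# one only when the inbound request didn't carry it. Header per the text.
import uuid

HEADER = "X-Request-Id"

def correlation_id(inbound_headers: dict) -> str:
    # Reusing the inbound ID is what makes one user request traceable
    # across every service it touches.
    return inbound_headers.get(HEADER) or str(uuid.uuid4())

def outbound_headers(inbound_headers: dict) -> dict:
    return {HEADER: correlation_id(inbound_headers)}
```

The same ID then goes into every structured log line, which is what turns "correlating timestamps and guessing" into a single grep.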

For tracing, OpenTelemetry has become the sensible default. It’s not magic, but it reduces vendor lock-in and keeps instrumentation consistent across languages. We aim for sampling rules that keep costs sane while still capturing enough detail during incidents.

Useful references we point people to: OpenTelemetry for instrumentation guidance, and the Google SRE book sections on monitoring philosophy for sanity checks: Site Reliability Engineering.

If we have to choose where to start, we start with metrics and structured logs. Tracing comes next once cross-service debugging becomes a weekly sport.

Data and Consistency: Stop Sharing Databases

The fastest way to sabotage microservices is to let services share the same database schema and call it “integration”. That’s not a boundary; that’s a polite fiction. When two services share tables, any change becomes a coordination tax, and deployments stop being independent.

Our default stance: each service owns its data. Other services don’t query it directly. They call an API or consume an event. Yes, it feels slower at first. Then it saves us months later.

But what about cross-service reporting and joins? We handle that via:
- Events: publish domain events and build read models where needed.
- CQRS-lite: separate write models (authoritative service) from read models (optimised for queries).
- Data warehouse: for analytics, not operational workflows.

Consistency is the other hard bit. Distributed systems mean we often accept eventual consistency. We mitigate the pain with idempotency keys, retries with backoff, and clear user messaging (“Payment pending”) rather than pretending everything is instantly consistent.
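Idempotency keys and backoff go together: the same key is sent on every retry of one logical operation, so the provider can deduplicate even if an earlier attempt actually succeeded. A minimal sketch with hypothetical names:

```python
# Sketch: exponential backoff with a single idempotency key per logical
# operation. `send` is any callable that accepts an idempotency_key;
# all names here are illustrative assumptions.
import time
import uuid

def call_with_retries(send, payload, attempts=4, base_delay=0.1, sleep=time.sleep):
    key = str(uuid.uuid4())  # one key for the whole operation, not per attempt
    for attempt in range(attempts):
        try:
            return send(payload, idempotency_key=key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

The subtle bug this avoids: minting a fresh key per attempt, which quietly turns "retry" into "charge the card twice".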

We also use the outbox pattern when publishing events from a transactional DB—write changes and the event record in one transaction, then publish reliably. That avoids the “DB updated but message never sent” horror story.

If this sounds like extra work, it is. But it’s the work microservices require. The payoff is real independence: teams can evolve schemas and storage without breaking neighbours, and incidents stay contained.

A good read when we’re thinking about events and integration patterns is the classic set of enterprise integration patterns (still relevant, minus the SOAP nostalgia): Enterprise Integration Patterns.
