Microservices Without the Headaches: Practical Lessons We’ve Learned

What actually worked for us, and what we now avoid on purpose

Start With Boundaries, Not Dockerfiles

When we say “microservices,” it’s tempting to start with tooling: containers, CI, service meshes, shiny dashboards. We’ve done that. It’s like buying a power drill before you know whether you’re hanging a picture or building a deck. The first real decision isn’t how we run services—it’s where the boundaries go.

We’ve had the best results when we treat boundaries as a product decision, not an infrastructure decision. Ask: what needs to change independently? Which teams need to move at different speeds? Which parts of the domain are constantly in flux, and which are boring (boring is good)? If the answer is “everything is connected to everything,” that’s not a sign you need more microservices—it’s a sign you need better domain boundaries.

A simple rule we like: if two components share the same release cadence, incident blast radius, and on-call ownership, they might not be separate services. Conversely, if a change in checkout keeps breaking search, those probably shouldn’t be welded together.

If you want a structured way to think about boundaries, Domain-Driven Design is still the least-worst map we’ve found. We don’t need to go full ceremony. Just using “bounded contexts” as a way to argue politely in meetings can save months of refactoring later.

Also: start with fewer services than you think. It’s easier to split a stable service than merge five services with inconsistent APIs, duplicated logic, and seven different ideas of what “customerId” means.

APIs: Keep Them Boring and Explicit

In microservices, our most expensive bugs have often been “agreement bugs”—two services both worked fine, just not with each other. The cure isn’t more meetings; it’s clearer contracts.

We’ve learned to prefer simple, explicit APIs over clever ones. Version intentionally. Document error cases. Agree on idempotency for write endpoints. And don’t let internal models leak into public contracts unless you enjoy being permanently backward compatible with your worst ideas.

A pragmatic move: standardize response envelopes and error shapes across services. Not because it’s elegant, but because it makes troubleshooting faster and client code saner. If you’re doing HTTP, use consistent status codes and include a stable error code that isn’t a sentence.
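
As a sketch of what that standardization might look like (the field names here are our own convention, not any standard), a shared error envelope can be very small:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ApiError:
    """A shared error shape: a stable machine-readable code plus a human message."""
    code: str        # stable, grep-able identifier -- never a sentence
    message: str     # human-readable, free to change between releases
    request_id: str  # lets support tie a report back to logs and traces

def error_body(code: str, message: str, request_id: str) -> str:
    """Serialize the envelope the same way in every service."""
    return json.dumps({"error": asdict(ApiError(code, message, request_id))})

print(error_body("ORDER_ITEM_OUT_OF_STOCK", "SKU ab-123 is out of stock", "req-42"))
```

The point isn’t these exact fields—it’s that every client can parse every service’s errors with one code path.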

Here’s an example OpenAPI snippet we’d actually ship—small, explicit, and focused on behavior:

openapi: 3.0.3
info:
  title: Orders Service API
  version: 1.2.0
paths:
  /v1/orders:
    post:
      summary: Create an order (idempotent)
      parameters:
        - in: header
          name: Idempotency-Key
          required: true
          schema: { type: string, maxLength: 64 }
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [customerId, items]
              properties:
                customerId: { type: string }
                items:
                  type: array
                  items:
                    type: object
                    required: [sku, quantity]
                    properties:
                      sku: { type: string }
                      quantity: { type: integer, minimum: 1 }
      responses:
        "201":
          description: Created
        "409":
          description: Idempotency conflict (same key, different payload)
        "422":
          description: Validation error
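
The 409 case above deserves a note: “same key, different payload” means the server has to remember what each key was first used for. A minimal in-memory sketch of that check (a real service would use a shared store with a TTL):

```python
import hashlib
import json

# key -> fingerprint of the first payload seen with that key
_seen: dict[str, str] = {}

def _fingerprint(payload: dict) -> str:
    # Canonical JSON so key order doesn't change the hash
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def check_idempotency(key: str, payload: dict) -> str:
    """Classify a (key, payload) pair as 'new', 'replay', or 'conflict'."""
    fp = _fingerprint(payload)
    if key not in _seen:
        _seen[key] = fp
        return "new"       # first time: process the request, respond 201
    if _seen[key] == fp:
        return "replay"    # same key, same payload: return the original result
    return "conflict"      # same key, different payload: respond 409
```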

For async messaging, we keep payloads versioned and additive. And we avoid “event soup”—a topic where every team dumps whatever happened, with no schema discipline. If you’re heading that way, AsyncAPI is a good nudge toward sanity.
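
“Versioned and additive” in practice means consumers read new fields with defaults and never treat their absence as an error. A tolerant-reader sketch (the event and field names are invented for illustration):

```python
import json

def parse_order_created(raw: bytes) -> dict:
    """Tolerant reader for an additive event schema.

    In this hypothetical schema, v1 events carry orderId and customerId;
    v2 added an optional 'channel' field. Old consumers ignore it; new
    consumers default it rather than rejecting v1 events.
    """
    event = json.loads(raw)
    return {
        "order_id": event["orderId"],            # required since v1
        "customer_id": event["customerId"],      # required since v1
        "channel": event.get("channel", "web"),  # added in v2, safe default
    }
```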

Data: One Service, One Database (Mostly)

If there’s a microservices commandment that we’ve broken and regretted, it’s shared databases. The pain doesn’t show up immediately. It shows up later, when you need to change a column, and five services are “just reading it” (until they aren’t). Or when a hot query in one service takes down everyone else. Or when you realize nobody actually owns the data model.

So yes: “one service, one database” is a solid default. Not because it’s pure, but because it forces ownership and reduces surprise coupling.

But we also live in the real world. Sometimes we inherit a monolith database. Sometimes the data platform team says “no.” Sometimes there’s a reporting workload that needs cross-service views. When that happens, we try not to pretend it’s fine. We name the compromise, put guardrails on it, and plan an exit.

A pattern we’ve used: each service owns its operational store, and we replicate a curated set of events into an analytics store for cross-cutting queries. That way product folks can still get their dashboards, and we don’t end up joining across production databases at 2 a.m.

When we need consistency across services, we avoid distributed transactions like we avoid “quick” production changes on Fridays. Instead we lean on sagas, retries, and compensating actions. For a good overview of tradeoffs, Martin Fowler’s microservices resource guide is still worth bookmarking.
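
A saga in this sense is just a sequence of local steps, each paired with a compensating action that undoes it if a later step fails. A stripped-down sketch of that control flow (real sagas would also persist progress and retry compensations):

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # best-effort undo; production code logs and retries these
        raise
```

So if `charge_card` fails after `reserve_inventory` succeeded, the inventory reservation gets released instead of leaking—without any cross-service transaction.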

Deployments: Standardize the Path, Not the Stack

The hidden tax of microservices is operational variation. Ten services with ten different deployment patterns is how we end up with ten different ways to fail. We don’t need every team using the same programming language, but we do need a consistent path to build, test, ship, and roll back.

Our rule: standardize the delivery pipeline and runtime expectations, then let teams choose implementation details within those boundaries.

A baseline Kubernetes deployment template is one of the simplest ways to reduce chaos. Not a giant “platform” no one understands—just a sensible default. Here’s a trimmed example of what we like to see:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels: { app: orders }
  template:
    metadata:
      labels: { app: orders }
    spec:
      containers:
        - name: orders
          image: ghcr.io/acme/orders:1.2.0
          ports: [{ containerPort: 8080 }]
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
            periodSeconds: 5
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
            periodSeconds: 10
          resources:
            requests: { cpu: "100m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          env:
            - name: LOG_LEVEL
              value: "info"

We care less about the YAML and more about the discipline it encodes: health checks, resource boundaries, and safe rollout settings. If you’re using Kubernetes, the official docs on Probes are unglamorous but crucial.

Also: bake rollbacks into the process. If rollback requires a bespoke manual ritual, it won’t happen when it’s needed most.

Observability: If You Can’t Trace It, You Don’t Own It

Microservices don’t fail in isolation. They fail as a chain reaction with innocent bystanders. The only reliable way we’ve found to debug distributed systems is to invest in observability before we need it.

We aim for three basics everywhere:

  1. Structured logs with consistent fields (service, version, requestId, userId when safe).
  2. Metrics that reflect user impact (latency, error rate, saturation).
  3. Traces that follow a request across services.

If you can’t answer “what changed?” and “where did time go?” in under five minutes, incidents get expensive quickly.
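
With the standard library alone, the structured-log baseline above can be a tiny JSON formatter—the fields match the list; the service name and setup are illustrative, not prescriptive:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the fields we expect everywhere."""
    def __init__(self, service: str, version: str):
        super().__init__()
        self.service, self.version = service, version

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "version": self.version,
            "level": record.levelname,
            "message": record.getMessage(),
            # requestId is attached per-request via `extra=`; absent at startup
            "requestId": getattr(record, "request_id", None),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="orders", version="1.2.0"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"request_id": "req-42"})
```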

We’ve standardized on OpenTelemetry because it’s one of the few things in this space that’s both broadly supported and not tied to a single vendor. The OpenTelemetry project is also a good place to start when you’re picking instrumentation patterns.

One practical tip: don’t drown in cardinality. Tagging metrics with userId might feel helpful… right up until your metrics backend starts smoking. We tag with things like endpoint, status code, and dependency name. For logs, we keep high-cardinality fields but rely on sampling and good search.

And yes, we still occasionally tail logs like it’s 2012. We just do it with better context now and fewer tears.

Resilience: Assume Everything Breaks (Including Us)

If we had to summarize microservices reliability in one line: networks are unreliable, dependencies are moody, and timeouts are your friend. The good news is we can design for this. The bad news is we have to actually do it.

We standardize a few behaviors across services:

  • Timeouts everywhere, especially on outbound calls.
  • Retries with jitter, and only when the operation is safe to retry.
  • Circuit breakers or at least “fail fast” behavior when a dependency is unhealthy.
  • Graceful degradation: return partial results, cached results, or a useful error.
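
Those behaviors are easiest to enforce when they live in one shared helper. A sketch of retries with exponential backoff and full jitter (the attempt count and delays are our defaults, not universal truths, and `fn` must be safe to retry):

```python
import random
import time

def call_with_retries(fn, *, attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry a safe-to-retry operation with exponential backoff and full jitter.

    The caller is responsible for putting a per-call timeout inside `fn`
    (e.g. a 2-second client timeout) so a slow dependency fails fast.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted -- fail fast, don't pile up
            # Full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The hard cap on attempts is the retry budget; without it, every layer's "just one more try" multiplies into a storm.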

The biggest improvement we made was aligning on sane defaults: a 2-second timeout beats an infinite timeout every day of the week. Infinite timeouts don’t mean “it will succeed eventually.” They mean “it will pile up until your service keels over.”

We also watch for retry storms. If Service A retries Service B, and Service B retries Service C, congratulations: you’ve invented a distributed denial-of-service attack against yourself. We keep retry budgets and we alert on rising retry rates as an early sign of trouble.

Finally, we test failure on purpose. Not constant chaos theatre—just targeted experiments: kill a pod, break DNS, slow down a dependency. If the first time we see those failures is in production during peak traffic, that’s on us.

Team Ownership: Microservices Are a Social System

Microservices aren’t just architecture—they’re org design with error messages. The whole point is to let teams deliver independently. If we keep a centralized gatekeeper for every decision, we’ve reinvented the monolith with extra steps.

We’ve found a few team practices make a disproportionate difference:

  • Clear ownership: every service has a team, an on-call rotation, and a backlog.
  • You build it, you run it (with support): teams own operational outcomes, but they’re not abandoned.
  • Golden paths: templates, libraries, and paved roads that are easy to adopt and easy to escape.
  • Service maturity expectations: health endpoints, SLOs, dashboards, runbooks—no exceptions.

On-call is where fantasy meets reality. If teams aren’t empowered to fix their own services quickly, on-call becomes a game of telephone with higher blood pressure. We try to keep runbooks short, actionable, and linked from alerts. If an alert doesn’t tell us what to do next, it’s not an alert—it’s a complaint.

We also keep our service count honest. A service that exists “because microservices” but has no clear reason to be separate will cost us forever. Fewer, well-owned services beat a zoo of tiny services that nobody understands.
