Microservices Without Tears: DevOps Lessons We Learned
Practical ways to ship small services without big headaches.
Why We Pick Microservices (And What We Forgot)
We didn’t choose microservices because we love complexity. We chose them because our monolith was turning every “small change” into a cross-team group project with snacks and despair. The promise was simple: smaller services, clearer ownership, faster releases. And it worked—until we discovered we’d quietly signed up for a new set of problems: network hops, versioning, distributed debugging, and the joy of “it works on my laptop” now multiplied by 37.
The first lesson we learned is that microservices don’t fix messy boundaries; they expose them. If the domain is tangled, you just end up with a distributed monolith—same coupling, more latency. We started doing lightweight domain mapping workshops and forcing ourselves to write down service responsibilities in plain language. If we couldn’t describe a service in two sentences without saying “and also,” it wasn’t ready.
Second lesson: your organisation chart will show up in your architecture whether you want it to or not. We stopped pretending one central “platform” team could approve everything and instead focused on shared standards plus self-serve tooling.
Finally, we learned to measure success differently. “More services” isn’t a win. The win is: shorter lead time, safer deployments, fewer late-night pages, and predictable performance. If microservices don’t move those needles, we’re just collecting YAML like it’s a hobby.
Service Boundaries: Start Boring, Stay Clear
Our best microservices started boring. A single capability, a small API surface, and one team accountable for the whole thing—code, runtime, on-call. The worst ones started ambitious: “We’ll split the monolith into 20 services by Q2.” That’s how you end up with 20 services that all need the same database table and a meeting to change a column name.
We use a few rules of thumb. One: each service owns its data. Not “we share the database but promise to be careful.” Actual ownership. If another service needs data, it goes through an API or an event, not a sneaky SQL join. Two: optimise for change frequency. Things that change together should live together. Three: treat boundaries as product decisions, not just technical ones—because microservices are really about who can change what without negotiating with five other teams.
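To make "through an API or an event" concrete, here's a sketch in TypeScript of a versioned event the owning service might publish; consumers build their own read models from it instead of joining the orders tables. The event name and fields here are illustrative, not a real contract from our system:

```typescript
// Hypothetical event contract published by the orders service.
// Consumers store the fields they need locally rather than
// reaching into the orders database with a "careful" SQL join.
interface OrderPlaced {
  eventType: "OrderPlaced";
  eventVersion: 1;          // versioned from day one
  orderId: string;
  customerId: string;
  totalCents: number;       // integer cents avoid floating-point money bugs
  placedAt: string;         // ISO 8601 timestamp
}

// The owning service is the only writer; everyone else consumes events.
function makeOrderPlaced(
  orderId: string,
  customerId: string,
  totalCents: number
): OrderPlaced {
  return {
    eventType: "OrderPlaced",
    eventVersion: 1,
    orderId,
    customerId,
    totalCents,
    placedAt: new Date().toISOString(),
  };
}
```

Versioning the event from the first day makes the inevitable schema conversation later much cheaper.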
We also keep an eye on “chattiness.” A boundary that looks clean on a diagram can turn into ten synchronous calls in the hot path. When we see that, we either redesign the flow (aggregation, caching, denormalised read models) or admit the boundary isn’t right yet.
If you want a solid grounding for domain thinking, we’ve found Martin Fowler’s microservices primer still holds up. And for the “please don’t share databases” conversation, it’s a helpful neutral reference when emotions run high.
CI/CD for Microservices: Pipelines That Don’t Hate Us
Microservices live and die by delivery automation. If every service needs a bespoke pipeline, we’ve basically reinvented artisanal snowflakes—now with more outages. Our approach is “paved roads”: a standard pipeline template that covers 80–90% of services, with escape hatches when a team genuinely needs something different.
Key pieces we standardised:
– Build once, promote the same artifact across environments.
– Automated tests at the right layers (fast unit tests, a few contract tests, targeted integration tests).
– Security and dependency scanning baked in.
– A deployment strategy that supports quick rollback or safe roll-forward.
Here’s a trimmed GitHub Actions example we use as a baseline. Nothing fancy—just predictable:
name: service-ci

on:
  push:
    branches: [ "main" ]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # required for GITHUB_TOKEN to push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install
        run: npm ci
      - name: Unit tests
        run: npm test -- --ci
      - name: Build
        run: npm run build
      - name: Build image
        run: docker build -t ghcr.io/acme/orders:${{ github.sha }} .
      - name: Push image
        run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
          docker push ghcr.io/acme/orders:${{ github.sha }}
We keep environment deployments separate (often triggered after review/approval), and we tag releases in a way humans can understand. The goal isn’t “maximum automation points.” The goal is fewer surprises at 2 a.m.
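As one sketch of that separation, a follow-on workflow can gate on a GitHub environment configured with required reviewers, so the job pauses for approval before it runs. The deploy script and environment name here are placeholders, not our real setup:

```yaml
name: service-deploy
on:
  workflow_dispatch:
    inputs:
      sha:
        description: "Image tag (commit SHA) to promote"
        required: true

jobs:
  deploy-prod:
    runs-on: ubuntu-latest
    # Assumes a "production" environment exists with required reviewers,
    # which is what makes this job wait for human approval.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy tagged image
        run: ./scripts/deploy.sh ghcr.io/acme/orders:${{ github.event.inputs.sha }}
```

Because the artifact was built once in CI, this step only moves a known image; it never rebuilds.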
Kubernetes Config: Make It Repeatable, Not Clever
Once microservices multiply, Kubernetes can either bring order or become the world’s most expensive guessing game. We aim for “repeatable, boring manifests,” plus a small set of conventions: health checks always present, resource requests always set, and autoscaling only after we’ve measured something real.
A baseline deployment (trimmed) looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  labels:
    app: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: ghcr.io/acme/orders:__TAG__
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          envFrom:
            - secretRef:
                name: orders-secrets
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
We template the __TAG__ replacement via Helm or Kustomize, but we don’t try to be clever. Clever manifests age badly.
We also push teams to keep runtime configuration in config maps/secrets, not baked into images. And we insist on probes because “Kubernetes will restart it” is not a strategy if Kubernetes doesn’t know it’s broken.
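A minimal sketch of that split: non-secret settings live in a ConfigMap, referenced from the deployment the same way as the secret. The names and values here are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config
data:
  LOG_LEVEL: "info"
  PAYMENTS_BASE_URL: "http://payments.default.svc.cluster.local"
```

The deployment's envFrom then lists a configMapRef for orders-config next to the existing secretRef, so settings can change without rebuilding the image.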
If you’re standardising clusters, Kubernetes docs are still the least-wrong source of truth, and it pays to point people there rather than relying on tribal knowledge.
Observability: Logs, Metrics, Traces, And Fewer Ghost Stories
Microservices turn debugging into detective work across multiple scenes. Without observability, you’ll end up with what we call “ghost stories”: incidents where everyone has a theory and nobody has evidence.
We treat observability as a product feature. Every service ships with:
– Structured logs (JSON, consistent fields like trace_id, service, env, request_id).
– Golden signals metrics (latency, traffic, errors, saturation).
– Distributed tracing for cross-service requests.
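For the structured-logs item, here's a minimal dependency-free sketch in TypeScript. The field names mirror the list above and are our conventions, not a standard:

```typescript
// Minimal structured-log helper: every log line is one JSON object with
// a consistent set of fields, so it can be queried instead of grepped.
interface LogContext {
  service: string;
  env: string;
  trace_id?: string;
  request_id?: string;
}

function logEvent(
  ctx: LogContext,
  level: "info" | "warn" | "error",
  message: string,
  extra: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    message,
    ...ctx,
    ...extra,
  });
  console.log(line);
  return line;
}
```

In practice a shared telemetry library owns this, so no team invents its own field names.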
Traces are the big unlock—not because they’re glamorous, but because they turn “something’s slow” into “this call is waiting on that downstream dependency for 900ms.” We’ve had good results with OpenTelemetry, and we don’t let teams invent their own telemetry formats. Standardisation here saves massive time later.
Alerting is where we got burned early. Our first iteration alerted on everything that moved. That produced fatigue and taught folks to ignore alarms—an impressive achievement, technically. Now we focus alerts on user impact: error rate, latency SLO violations, and saturation that predicts imminent failure.
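As one concrete shape for "alert on user impact", here's a Prometheus rule sketch; the metric names, threshold, and durations are illustrative, not our production values:

```yaml
groups:
  - name: orders-slo
    rules:
      - alert: OrdersHighErrorRate
        # Page on the user-visible error ratio, not on CPU.
        expr: |
          sum(rate(http_requests_total{service="orders",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="orders"}[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "orders error rate above 1% for 10 minutes"
```

The "for: 10m" clause is doing quiet but important work: it keeps a single blip from paging anyone.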
For practical SRE-style thinking, we often point teams to Google’s SRE resources, especially around SLOs and alerting philosophy: Google SRE Book. It’s not perfect for every org, but it’s a solid antidote to “alert on CPU > 70% because someone once said so.”
Resilience: Timeouts, Retries, And The Courage To Say “No”
Networks fail. Dependencies get slow. DNS decides to have a personal day. In microservices, resilience isn’t optional; it’s table stakes.
We bake in a few defaults:
– Timeouts everywhere. No timeout means “infinite sadness.”
– Retries only when safe, and never without jitter/backoff.
– Circuit breakers for flaky dependencies.
– Bulkheads (limit concurrency) to stop one slow downstream from taking the service hostage.
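The retry default above can be sketched in a few lines of TypeScript: capped exponential backoff with "full jitter" (a random delay between zero and the capped exponential value). The numbers are illustrative defaults, not tuned values:

```typescript
// Capped exponential backoff with full jitter.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // exponential, capped
  return Math.random() * exp;                         // full jitter
}

// Retry wrapper: only for operations known to be safe to repeat.
async function withRetries<T>(
  op: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastErr;
}
```

The jitter is the part teams most often skip, and it's the part that stops every client from retrying in lockstep and re-creating the spike that caused the failure.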
The subtle part is retries. If you retry a request that isn’t idempotent, you can double-charge a customer and then spend your afternoon “reconciling.” We mark idempotent operations clearly and use idempotency keys where needed.
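A toy sketch of the idempotency-key idea: the server remembers outcomes by key, so a retried "charge" replays the original result instead of charging twice. A real implementation would persist this map with a TTL; everything here is illustrative:

```typescript
// In-memory stand-in for a persistent idempotency store.
const processed = new Map<string, { chargeId: string }>();

function chargeOnce(
  idempotencyKey: string,
  amountCents: number // would be passed to the real payment call below
): { chargeId: string; duplicate: boolean } {
  const prior = processed.get(idempotencyKey);
  if (prior) {
    return { ...prior, duplicate: true }; // replay: no second charge
  }
  const result = { chargeId: `ch_${processed.size + 1}` };
  // ... call the payment provider with amountCents here ...
  processed.set(idempotencyKey, result);
  return { ...result, duplicate: false };
}
```

The client generates the key once per logical operation and reuses it across retries; that's what makes the retry safe.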
We also embrace graceful degradation. When a non-critical dependency fails, we’d rather return a partial response than fail the whole request. That’s not always possible, but it’s a great default posture.
And we design for “shedding load” rather than collapsing. If the service is overloaded, returning a fast 429 with a clear message is kinder than queuing requests until everything times out.
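A minimal sketch of that load-shedding posture: count in-flight requests and reject fast once past a limit, instead of queueing. The limit here is a placeholder; a real one comes from measuring the service:

```typescript
// Illustrative concurrency limit; derive the real value from load tests.
const MAX_IN_FLIGHT = 100;
let inFlight = 0;

// Called at the start of request handling.
function admit(): { ok: boolean; status: number } {
  if (inFlight >= MAX_IN_FLIGHT) {
    return { ok: false, status: 429 }; // shed load early and cheaply
  }
  inFlight++;
  return { ok: true, status: 200 };
}

// Called when a request finishes, success or failure.
function release(): void {
  inFlight = Math.max(0, inFlight - 1);
}
```

Pairing the 429 with a Retry-After header gives well-behaved clients a hint instead of a wall.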
A good conceptual model for failure modes is still AWS’s Well-Architected Framework. Even if you’re not on AWS, the reliability guidance is portable—and it gives you vocabulary for trade-offs that aren’t just “because DevOps said so.”
Governance That Scales: Standards, Not Gatekeeping
Microservices can’t survive on tribal rules shouted across Slack. But they also can’t survive with a central committee approving every deployment like it’s a bank loan. We’ve learned to separate standards from approvals.
Our “minimum standards” are a short checklist:
– Service owns its data and has a clear API.
– CI pipeline includes tests and scanning.
– Kubernetes manifests include probes and resources.
– Observability is in place (logs/metrics/traces).
– Runbook exists (how to deploy, rollback, common failures).
– On-call ownership is explicit.
If a team meets the standards, they ship. If they don’t, we help them meet the standards. The only time we block releases is when there’s an immediate risk to customers or the platform.
We keep standards in versioned docs, and we provide templates so teams don’t start from scratch. The platform team’s job becomes building and maintaining the paved roads: base container images, pipeline templates, helm charts, shared libraries for telemetry, and sensible defaults.
This is also where internal developer portals can help, but we’re careful not to turn it into a shiny project that nobody uses. It has to reduce friction on day one: bootstrap a service, get logs, deploy safely, and find ownership info quickly. If it can’t do that, it’s just another website to forget your password for.