Stubbornly Reliable GitOps: Seven Moves That Cut MTTR 32%

Make Git the control plane without blowing up production.

Move 1 — Draw the GitOps Line in the Sand

Let’s start by agreeing on what we’re actually doing. In GitOps, the desired state of our systems lives in Git, and a controller in the cluster (or clusters) continuously reconciles reality with that state. That’s it. We’re not “kind of” using GitOps if a CI job runs kubectl apply on Tuesdays and we hope it works. We’re doing GitOps when every change flows through a pull request, gets reviewed, and is pulled into the cluster by an agent. Drift isn’t a crime—it’s a signal. The controller notices, shows us what changed, and corrects it or, at the very least, makes the diff obvious.

We should also decide early on whether we’re using a pure pull model (recommended) or mixing in a push for bootstrapping. There’s nothing wrong with a one-time bootstrap script, but if our steady state relies on CI pushing manifests directly, we’ve lost the audit trail that makes GitOps worth the calories. Controllers like Argo CD or Flux do the heavy lifting: fetch from Git, compare, apply, repeat. They don’t get bored or distracted. They don’t forget to run on Fridays.
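
A minimal sketch of that pull loop with Flux (the repo URL, names, and intervals are illustrative, not prescriptive): a GitRepository the source controller polls, and a Kustomization that reconciles a path from it into the cluster.

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m                 # how often the controller fetches from Git
  url: https://github.com/example/platform-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-staging
  namespace: flux-system
spec:
  interval: 10m                # reconcile on a schedule even if nothing changed
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./envs/staging
  prune: true                  # delete cluster resources that were removed from Git

The same shape works with Argo CD; the point is that the agent pulls and reconciles, and CI never needs cluster credentials.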

The contract has a few non-negotiables: declarative config, a single source of truth, an immutable deployment history, and automation that reconciles continuously. If we want to argue any of those away, we’re no longer debating GitOps—we’re inventing a different system. The CNCF GitOps Principles lay this out cleanly. Print them, highlight them, put them on the fridge. The fewer exceptions we carve out now, the fewer incidents we’ll file later.

Move 2 — Design Repos for Humans and Robots

Good repo design pays rent every time we ship. Our goal: make it obvious where to put a change, easy to code-review, and simple for the controller to reconcile. We’ve had luck separating “application source code” from “environment manifests” so that app teams move at their own velocity, while platform changes roll through controlled environments. Whether you go mono-repo or multi-repo depends on scale and ownership boundaries, but either way, keep the shape predictable.

A simple pattern is per-environment overlays with Kustomize. The base defines the app; each overlay tweaks replicas, endpoints, or limits. Robots (controllers) love this because the directory structure tells them what to sync, and humans love it because diffs are tight and readable.

Example layout:

envs/
  dev/
    kustomization.yaml
    patches/
      resources.yaml
  staging/
    kustomization.yaml
  prod/
    kustomization.yaml
apps/
  payments/
    base/
      deployment.yaml
      service.yaml
      kustomization.yaml

And a tiny kustomization:

# envs/staging/kustomization.yaml
resources:
  - ../../apps/payments/base
patches:
  - target:
      kind: Deployment
      name: payments
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3

With Argo CD, we point an Application at envs/staging and let it reconcile. Clean diffs, quick rollbacks, and no guessing where the prod knobs live. The Argo CD docs have good examples for single-app and app-of-apps setups—use them as a map, not a maze.
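
For reference, an Application pointing at that overlay might look roughly like this (repo URL, names, and namespaces are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config
    targetRevision: main
    path: envs/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # correct drift back toward the repo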

Move 3 — Make Promotion Boring and Visible

Promotion should be a PR, not a ritual. If we have to copy-paste manifests across environments or hunt for the “right” YAML, we’ll eventually get it wrong. Instead, we promote by changing a reference that everyone can see: a Git ref, a Helm chart version, or a container image tag that’s already been verified upstream. The change merges, the controller notices, and the deployment moves forward—same way, every time.

We can do this with Kustomize by pinning image tags in overlays. Some teams prefer Helm with values files per environment. Others adopt automated tag updates via controllers like Flux Image Automation or Argo CD Image Updater. The trick is limiting who can bump “prod.” You’ll never regret adding a CODEOWNERS rule that requires a second set of eyes on prod refs.
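
In the Kustomize flavor, a promotion PR can be a one-line change: the image tag pinned in the environment overlay. A sketch, assuming the image name used in the base (registry path and tag are illustrative):

# envs/prod/kustomization.yaml
resources:
  - ../../apps/payments/base
images:
  - name: registry.example.com/payments
    newTag: "1.42.0"   # promotion = bumping this value in a reviewed PR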

We also like to “promote forward,” not “rebuild for prod.” The artifact that ran in staging should be the artifact that runs in prod. This makes our diff tiny and our confidence large. If we need a freeze window, we represent it in Git—disable the sync or lock the branch. If we need to pause a rollout, we use the deployment controller, not an ad hoc pipeline toggle.
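
As one way to represent a freeze in Git: with Flux, pausing reconciliation is itself a declarative field, so the freeze and the thaw both land in the audit trail. A sketch, reusing the illustrative names from Move 1:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-prod
  namespace: flux-system
spec:
  suspend: true                # freeze window: reconciliation stops until this commit is reverted
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./envs/prod
  prune: true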

Finally, make promotion status visible. A commit that changes staging should link to the dashboard or the controller’s health checks. We’ve used commit templates that include links to the Application’s health page in Argo CD. It’s amazing how many “is it done yet?” messages disappear when everyone can self-serve the answer.

Move 4 — Treat Secrets Like Radioactive Material

Secrets don’t belong in Git, but references to encrypted secrets do. We want the auditability of GitOps without the oops of plaintext. Tools like SOPS integrate with controllers so we can keep encrypted blobs in the repo and decrypt them only inside the cluster using KMS keys, cloud providers’ key services, or age keys. Flux’s SOPS integration is mature and friendly, and Argo CD users often wire SOPS decryption into the manifest-rendering step or lean on external secret stores with controllers.

A minimal SOPS-encrypted secret might look like this:

apiVersion: v1
kind: Secret
metadata:
  name: payments-db
  namespace: payments
type: Opaque
data:
  url: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
  password: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
sops:
  kms:
    - arn: arn:aws:kms:us-east-1:123456789012:key/abcd-efgh
  encrypted_regex: '^(data|stringData)$'
  version: 3.7.3

Flux decrypts at reconcile time, and the secret never sits in plaintext in Git or CI logs. If we’re already using a secret store like AWS Secrets Manager, GCP Secret Manager, or Vault, a secret store controller can map a “Secret claim” to the real thing, and our manifests just include a reference. The key is consistency: pick one model, document it, and enforce it with policy. The Flux SOPS guide is a great place to start, and we’ve found it easier to standardize early than to retrofit later after someone checks in a base64’d password “temporarily.”
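
If we go the secret-store route, the “Secret claim” can be a small manifest that references the external entry by name. Here is a sketch using the External Secrets Operator (store name and key paths are illustrative); the operator creates and refreshes the real Secret in-cluster, and nothing sensitive touches the repo:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager   # a store configured separately by the platform team
  target:
    name: payments-db           # the Kubernetes Secret the operator will create
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db   # path in the external store (illustrative)
        property: password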

Move 5 — Guardrails With Policy-as-Code, Not Wishful Thinking

Git reviews catch typos; policies catch landmines. We want guardrails that run in CI and in-cluster, so the same rules block a bad change before it merges and prevent bad drift if it sneaks in. Tools like Kyverno and Gatekeeper let us write policies that match our operational sensibilities—restrict hostPath volumes, require resource limits, force TLS, or ban the mythical :latest tag.

Here’s a tiny Kyverno policy that bans :latest:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: no-latest-tags
spec:
  validationFailureAction: enforce
  rules:
    - name: disallow-latest
      match:
        any:
          - resources:
              kinds:
                # Kyverno auto-generates the matching rule for Deployments and
                # other Pod controllers, so matching Pod alone covers them.
                - Pod
      validate:
        message: "Do not use :latest image tags."
        pattern:
          spec:
            containers:
              - image: "!*:latest"

Tie this into CI by running a policy test against the rendered manifests. That means Helm charts get rendered into plain YAML, Kustomize overlays get built, and the result is scanned. In-cluster, the admission controller enforces the same thing. Zero policy drift, maximum peace of mind. Start with a handful of high-signal rules and ratchet up over time. If we go from zero to a thousand rules overnight, we’ll generate more tickets than safety. The Kyverno docs include a library of common policies we can adapt. Make them ours, version them alongside the manifests, and resist the temptation to add exceptions “just this once.”

Move 6 — Progressive Delivery and Fast, Clean Rollbacks

If GitOps is the how, progressive delivery is the how-much-how-fast. We can push a new version to 5% of traffic, watch metrics, then ramp to 25%, and so on. If things wobble, we stop and roll back. This isn’t fancy; it’s cautious. Argo Rollouts and Flagger both make this practical without tying our deployments in knots.

With canaries, we define steps, analysis windows, and success thresholds. Metrics should be close to user impact: HTTP success rates, p95 latency, error budgets, even business events if we’ve instrumented them. If the canary fails, the controller aborts the rollout and leaves us with a clean paper trail in Git and the rollout status. No midnight kubectl flailing.
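
A canary spelled out in Argo Rollouts might look like the sketch below (image, names, and step timings are illustrative; analysis templates and traffic routing are omitted for brevity):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
  namespace: payments
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.42.0
  strategy:
    canary:
      steps:
        - setWeight: 5            # shift roughly 5% of traffic to the new version
        - pause: {duration: 10m}  # hold while metrics are evaluated
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}               # wait indefinitely for a promotion decision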

Rollbacks are just reverts in Git. We revert the commit that bumped the tag or the version reference, the controller reconciles back, and we’re done. No side-channel manual tweaks, no emergency post-its. Because the system is declarative and reconciled continuously, the rollback is safe, fast, and repeatable. If we need a faster lever in a true emergency, we can gate rollouts behind a feature flag for instant traffic cuts while the deployment rolls back behind the scenes. The Argo Rollouts documentation covers blue-green, canary, and analysis templates—worth a careful read before we name our canaries after our favorite pets.

Move 7 — Close the Loop: Telemetry, Audits, and Drills

We don’t get more reliable by hoping. We get more reliable by measuring, auditing, and practicing. Git gives us an audit trail of “who changed what, when,” but we also need to see “what happened in the cluster” and “how users felt it.” That means logs, metrics, traces, and alerts that tie back to deployments. Tag our telemetry with the git commit SHA and deployment name so dashboards and runbooks tell the same story the repo does. When someone says “what changed?”, our graphs should literally annotate the answer.
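
One lightweight way to do that tagging: stamp the version and commit into the workload labels and annotations that the promotion PR already touches, so log agents and metric relabeling can carry them into every signal. A sketch with illustrative label keys and values:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  labels:
    app.kubernetes.io/name: payments
    app.kubernetes.io/version: "1.42.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: payments
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payments
        app.kubernetes.io/version: "1.42.0"
      annotations:
        example.com/git-commit: "4f2a9c1"   # stamped by the promotion PR (illustrative)
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.42.0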

Alerting should point to both the controller (sync health) and the service (SLOs). If an Application goes out of sync, we want a human to know. If error rates spike during a canary step, we want automation to stop the rollout. After each incident, we add one concrete improvement to our GitOps flow—maybe a new policy, a missing alert, or a cleaner rollback checklist. Don’t try to fix everything; fix something.
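
To make “a human should know” concrete, here is a sketch of an out-of-sync alert, assuming Argo CD metrics are scraped by a Prometheus Operator setup (the expression follows Argo CD’s exported argocd_app_info series; thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitops-sync-alerts
  namespace: monitoring
spec:
  groups:
    - name: gitops
      rules:
        - alert: ApplicationOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m                  # ignore brief drift the controller fixes on its own
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been out of sync for 15 minutes"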

Finally, drill. Pick a low-risk service, plan a game day, and rehearse: cut a PR, watch the controller, observe logs, simulate a failed canary, revert, verify. Time it. We’ve seen teams shave MTTR by a third just by practicing the motions and removing silly friction (like missing permissions or a mystery dashboard nobody could find). Document the happy path and the unhappy one. Celebrate the boring deploys. Boring is the point.

For teams getting started, the combo of the CNCF GitOps Principles, the Argo CD docs, the Flux SOPS guide, and the Kyverno docs is plenty. Take the first move, make it stick, and grow from there. The easy days build confidence. The hard days prove the system.
