GitOps In Practice: Shipping Changes Without Drama

How we keep clusters predictable, auditable, and pleasantly boring.

GitOps, But With Adult Supervision

We’ve all seen it: the “quick fix” in a cluster that somehow becomes the permanent architecture. Someone kubectl edits a Deployment in production, things look fine, and then three days later a node rolls, the Pod reschedules, and—surprise—your fix vanishes like a snack left unattended in the office kitchen.

That’s the mess GitOps is meant to prevent. In plain terms, we keep the desired state of our systems in Git, and we use an automated reconciler to make reality match what’s in the repo. Git becomes the source of truth, not tribal knowledge, not “that one terminal session,” and definitely not a sticky note.

The real win isn’t that GitOps is trendy. It’s that it forces a discipline we already know we need: version control, reviews, rollbacks, and traceability. When everything goes through pull requests, we can answer “who changed what, when, and why” without launching an archaeological dig through Slack.

GitOps also gives us a consistent operating model across teams. Application deploys, platform tweaks, config changes—same workflow. The platform team doesn’t have to be the human API for every namespace tweak, and app teams don’t need to learn five different deployment rituals depending on which cluster they’re in.

If you’ve ever wanted production to be calmer, more repeatable, and less dependent on heroics, GitOps is how we get there—by making the boring path the easiest path.

The Core Loop: Desired State Meets Reality

At the heart of GitOps is a loop: define desired state in Git, and let an agent continuously reconcile the running environment toward that state. “Continuously” matters. CI can apply changes once, but reconciliation keeps applying pressure over time. If someone changes something manually (or a controller mutates an object), the reconciler notices drift and corrects it—or at least tells us loudly.

We usually break the loop into three parts:

1) Source: a Git repo with Kubernetes manifests, Helm charts, or Kustomize overlays.
2) Reconciler: a controller in the cluster (commonly Argo CD or Flux) that watches Git and applies changes.
3) Feedback: status, alerts, and dashboards so we know whether the cluster matches Git.

This loop works best when we keep boundaries clear. CI builds artifacts (container images, Helm chart packages), and GitOps deploys artifacts by changing references in Git (image tags, chart versions, values). That separation makes deployments reproducible: if we can rebuild an image, we can redeploy the exact version by pointing to the same digest.
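For example, pointing a Deployment at an image digest (rather than a mutable tag) makes the deploy reproducible. A sketch, with a hypothetical image name and a placeholder digest:

```yaml
# Excerpt from a Deployment manifest in the env repo.
spec:
  template:
    spec:
      containers:
        - name: orders-api
          # CI publishes the image; the GitOps change is just this
          # reference moving to a new digest (placeholder value below).
          image: ghcr.io/acme/orders-api@sha256:6f4e2c0d9b8a7e5f1c3d2b4a6e8f0c1d3e5a7b9c1d3e5f7a9b1c3d5e7f9a0b2c
```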

It’s also where GitOps pays off in incidents. When things are on fire, we don’t need to remember the correct incantation to roll back. We revert a commit (or pick a previous tag), and the reconciler does the rest. A PR revert is not just “nice process”—it’s a reliable emergency brake.

For deeper reading on the model, we often point newcomers to the OpenGitOps principles, which are refreshingly practical for a “principles” page.

Repo Design That Doesn’t Make Us Cry Later

Repo structure is where GitOps lives or dies. If it’s confusing, people bypass it. If it’s too clever, it breaks at 2 a.m. We aim for “obvious” over “elegant.”

A pattern that works well is separating app source repos from environment config repos:

  • App repo: code, Dockerfile, Helm chart templates (optional), tests.
  • Env repo: cluster and environment definitions (namespaces, policies, releases, values).

This keeps change velocity sane. App teams can ship frequently without accidentally rewriting platform settings. Platform teams can evolve cluster baselines without digging through every microservice repo. If you’re small, you can combine them, but we still recommend keeping clear folders per environment.

A simple layout might look like:

  • clusters/prod/ (cluster-specific bootstrap and shared add-ons)
  • clusters/stage/
  • apps/<app-name>/base/
  • apps/<app-name>/overlays/prod/
  • apps/<app-name>/overlays/stage/
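To make the layout concrete, a prod overlay can stay thin: point at the base and patch only what differs per environment. A sketch with hypothetical names and values:

```yaml
# apps/orders-api/overlays/prod/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: orders
resources:
  - ../../base
patches:
  # Environment-specific tweaks live in the overlay, not the base.
  - target:
      kind: Deployment
      name: orders-api
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```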

We also choose early whether we’re “Helm people” or “Kustomize people” for most services. Mixing both is possible, but it increases the cognitive load. It’s fine to support both; just don’t make every team reinvent the wheel.

And yes, naming matters. If “prod” is called “live-ish-final2” in one folder and “production” elsewhere, we will eventually deploy the wrong thing. Let’s keep it boring: dev, stage, prod.

For reference implementations and good defaults, the Flux team’s docs are a solid compass: Flux documentation.

GitOps With Argo CD: A Minimal, Working Example

Let’s make this concrete. Here’s a basic Argo CD Application that syncs a folder of manifests from a Git repo into a namespace. This is the kind of thing we can paste into a bootstrap repo and iterate from there.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: apps/orders-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

A couple of choices here are doing real work:

  • automated.prune: true means if we remove a manifest from Git, Argo removes it from the cluster. That’s how we avoid “ghost resources.”
  • selfHeal: true is drift correction. If someone edits a Deployment manually, it gets reset to the Git version.
  • CreateNamespace=true reduces bootstrap friction, but we still typically manage namespaces with policies and quotas.

Argo CD itself is well-documented and easy to demo, which helps adoption: Argo CD docs. We also like Argo’s UI for day-to-day visibility—especially when we’re onboarding teams who want to “see” what’s happening before they trust it.

One caution: automated sync is powerful. We usually start with auto-sync in lower environments first, then graduate to prod once the workflows and guardrails are proven.
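One way to stage that graduation is to drop the automated block for prod, so changes surface as OutOfSync and wait for an explicit sync. A sketch reusing the hypothetical repo from the example above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: apps/orders-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  # No syncPolicy.automated: Argo CD still reports drift and
  # OutOfSync status, but a human triggers the sync (CLI or UI).
```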

Flux Example: Image Updates Without Human Tag-Poking

Flux shines when we want a tight loop between image registries and Git, without asking humans to manually bump tags in values files all day. The trick: Flux observes new images and, subject to policies, commits updates back to Git itself, so Git remains the source of truth.

Here’s an example using Flux image automation. First, we define an ImageRepository and an ImagePolicy:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: orders-api
  namespace: flux-system
spec:
  image: ghcr.io/acme/orders-api
  interval: 1m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: orders-api
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: orders-api
  policy:
    semver:
      range: ">=1.6.0 <2.0.0"

Then we connect that policy to an ImageUpdateAutomation that commits changes:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: orders-api
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: platform-gitops
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: flux-bot
        email: flux-bot@acme.local
      messageTemplate: "chore(images): update orders-api to {{range .Updated.Images}}{{.NewTag}}{{end}}"
    push:
      branch: main
  update:
    path: ./apps/orders-api/overlays/prod
    strategy: Setters

This gives us a controlled conveyor belt: registry updates → policy selects versions → bot commits → reconciler applies. If we want approvals, we can push to a branch and require PR review. If we want speed in dev, we can auto-merge.
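Note that the Setters strategy only rewrites fields carrying a setter marker: a trailing comment naming the ImagePolicy. A sketch of the marker in the target manifest (tag value illustrative):

```yaml
# apps/orders-api/overlays/prod/deployment.yaml (excerpt)
spec:
  template:
    spec:
      containers:
        - name: orders-api
          image: ghcr.io/acme/orders-api:1.6.2 # {"$imagepolicy": "flux-system:orders-api"}
```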

Flux’s image automation docs are worth bookmarking: Flux image automation.

Security And Compliance: GitOps Is Great, Not Magic

GitOps improves security posture, but it’s not a force field. We still need to design for blast radius, secrets handling, and supply chain integrity—otherwise we’ve just made our mistakes more reproducible.

First, access control: we try hard to ensure nobody needs direct admin rights in prod clusters for routine work. If the workflow is PR-based, we can use Git permissions, CODEOWNERS, and required reviews as enforcement. Cluster credentials should be tightly scoped to the reconciler, not sprinkled across laptops.

Second, secrets: storing raw secrets in Git is a “no.” Instead, we encrypt secrets in Git using tools like SOPS or use an external secrets manager. The key idea is that Git can still hold the desired state, but not in plaintext. For Kubernetes, common approaches include SOPS-encrypted Secret manifests or controllers that sync from Vault/Cloud secret stores.
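With SOPS, for instance, a small config at the repo root can ensure only the sensitive fields of Secret manifests are encrypted, so the rest stays diffable in review. A sketch with a placeholder age recipient:

```yaml
# .sops.yaml (the age public key below is a placeholder)
creation_rules:
  - path_regex: .*/secrets/.*\.yaml$
    # Encrypt only the data/stringData values; keep metadata readable.
    encrypted_regex: ^(data|stringData)$
    age: age1qqexampleplaceholderrecipientkey
```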

Third, policy: we treat policy as code. Admission control with Gatekeeper or Kyverno helps prevent unsafe manifests (privileged containers, hostPath mounts, wild-west RBAC) from ever applying. Gitops doesn’t replace admission; it complements it.
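As a taste of policy-as-code, a trimmed-down Kyverno ClusterPolicy rejecting privileged containers might look like this (adapted from Kyverno’s common examples; production policies usually also cover initContainers and ephemeralContainers):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() makes securityContext optional; if present,
              # privileged must be false (or unset).
              - =(securityContext):
                  =(privileged): "false"
```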

Finally, supply chain: pin images by digest where practical, sign artifacts if you can, and make sure CI is producing what you think it is. If “latest” sneaks into prod, GitOps will faithfully deliver that chaos on schedule.

GitOps gives us the paper trail and the workflow controls; we still have to choose good controls.

Day-2 Operations: Drift, Rollbacks, And The Human Factor

Once GitOps is running, day-2 operations are mostly about keeping humans happy and outages boring. That means visibility, sensible alerts, and an agreed way to handle exceptions.

Drift is the obvious one. We usually set a hard rule: manual changes in prod are for emergencies only, and if we do them, we follow up with a Git commit to make the fix permanent. Otherwise, self-heal will “fix” the fix, and we’ll have an awkward conversation with ourselves.

Rollbacks become pleasantly mechanical. If a deployment goes sideways, we revert the Git commit (or reset the release version) and let the reconciler converge. The only real gotcha is data migrations: rolling back an app is easy; rolling back a schema is… character-building. We bake that into release practices (backward-compatible migrations, feature flags).

We also watch for “PR bottlenecks.” GitOps can slow teams down if every change needs three approvals from people who are asleep in a different timezone. Our compromise is risk-based review: prod changes get stricter controls than dev changes. Namespace-scoped app changes shouldn’t require a platform committee meeting.

And we document the escape hatch. If the reconciler is down or Git is unavailable, what’s our procedure? Who can apply a hotfix? How do we record it and reconcile later? The process should exist before the incident, not as interpretive dance during it.

GitOps works best when it’s not a religion—just the normal way we operate.

A Sensible Adoption Plan (So We Don’t Boil The Ocean)

Rolling out GitOps is less about tooling and more about sequencing. We’ve learned to start small, prove value, then expand—because nothing kills momentum like a six-month “platform rebuild” with no visible payoff.

A practical plan:

1) Pick one cluster and one team. Preferably non-prod, but real enough to matter.
2) Bootstrap the reconciler (Argo CD or Flux) with a tiny repo and a few apps.
3) Standardise templates for namespaces, RBAC, ingress, and app release patterns.
4) Add guardrails: policy checks, required reviews, and secrets strategy.
5) Graduate to prod once the loop is reliable and the rollback story is rehearsed.

We also recommend setting a clear contract: what belongs in GitOps repos vs. what stays in Terraform/CloudFormation. As a rough guide, we manage cloud infrastructure (VPCs, clusters, databases) with IaC, and manage in-cluster configuration (namespaces, Deployments, Services, controllers) with GitOps. There’s overlap, but clarity prevents tool turf wars.

Finally, measure success with boring metrics: fewer manual prod changes, faster recovery via revert, fewer “works on my cluster” mysteries, and cleaner audits. If we can show those, teams adopt GitOps because it helps them, not because we wrote a policy about it.

If we keep the scope tight and the workflow friendly, GitOps becomes the default—and the cluster stops being a magical snow globe.
