Kubernetes Habits That Keep Clusters Calm


Practical routines that stop small issues becoming 3 a.m. dramas.

Start With A Boring, Repeatable Cluster Baseline

If we want Kubernetes to behave, we’ve got to stop treating clusters like artisanal snowflakes. The goal is simple: every cluster starts from the same baseline, and any change is deliberate, reviewed, and reversible. “Boring” is a compliment here. We standardise core add-ons (CNI, CoreDNS, ingress, metrics, logging) and we keep versions aligned with a clear upgrade window. If one environment runs “whatever worked last summer,” we’re basically doing archaeology when incidents hit.

A baseline also means naming conventions, labels, and namespaces that make sense. We want to be able to answer: “What is this workload? Who owns it? How critical is it?” without opening a treasure chest of Slack threads. Use labels for owner, environment, tier, and cost centre. Use namespaces for isolation and RBAC boundaries, not as a replacement for planning.
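One way to encode that ownership metadata is directly on the namespace. A minimal sketch, assuming hypothetical label keys (`team`, `cost-centre`) — the exact names matter less than picking one convention and applying it everywhere:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod
  labels:
    team: payments          # who owns workloads in this namespace (illustrative key)
    environment: prod
    tier: critical          # how urgently we care when it breaks
    cost-centre: cc-1234    # illustrative key for chargeback/cost reporting
```

With labels like these, “who owns this?” becomes a `kubectl get ns --show-labels` away instead of a Slack excavation.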

We also document a small set of supported patterns: how apps expose HTTP, how they publish metrics, where they write logs, how secrets are handled, and how they do health checks. Most operational pain comes from one-offs.

Finally, we keep a “break glass” plan: who can access what, how credentials are rotated, and how we recover if our GitOps toolchain is down. The best time to design a fallback is not during an outage. For reference, the upstream Kubernetes docs are refreshingly practical: Kubernetes Docs.

Make Scheduling Predictable With Requests, Limits, And PDBs

Kubernetes is a scheduler first and a miracle worker never. If we don’t tell it what workloads need, it guesses. Guessing leads to noisy neighbours, eviction parties, and nodes that look fine until they don’t. Our habit: every production workload ships with sensible resources and a PodDisruptionBudget (PDB) when it matters.

Requests tell the scheduler what to reserve; limits prevent one process from eating the node when it has a bad day. We start with measured values (even rough ones), then tune based on actual usage. If we can’t measure, we start conservative and iterate. We also keep an eye on QoS classes: too many BestEffort pods are basically an invitation for chaos to move in.

PDBs are the other half of the bargain. If we do node upgrades or autoscaler consolidations, we need to know how many pods can be disrupted without breaking service. That’s not “high availability,” that’s “not accidentally taking ourselves down on a Tuesday.”

Here’s a minimal deployment pattern we consider non-negotiable:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: example/api:1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

For disruption mechanics and how they interact with nodes and drain operations, the upstream guidance is solid: Pod Disruption Budgets.

Treat Health Checks As Contract Tests, Not Decorations

We’ve all seen them: liveness probes that always return 200, readiness probes that do nothing, and startup probes missing entirely. Then we act surprised when rolling updates stall or traffic goes to pods that aren’t ready. Our rule: probes are a contract with the platform, not a box to tick.

Readiness means “can accept traffic now.” It should fail if the app can’t serve (e.g., migrations running, dependency unavailable, cache warming). Liveness means “is the process wedged.” It should be stricter and less frequent, and it shouldn’t restart pods for temporary downstream blips. Startup probes are for slow boots; they stop liveness from killing the app before it’s had breakfast.

We also align probes with timeouts and retries that match reality. If an endpoint needs 2 seconds at peak, a 1-second timeout turns your cluster into a self-inflicted DDoS. And when probes fail, we want evidence: logs, metrics, and a clear error message.

A good habit is to version your health endpoints and keep them cheap. Don’t run a full dependency check on every probe if that adds load; do targeted checks that reflect actual readiness.

Example probe setup:

containers:
- name: api
  image: example/api:1.2.3
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    periodSeconds: 5
    timeoutSeconds: 2
    failureThreshold: 3
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    periodSeconds: 20
    timeoutSeconds: 2
    failureThreshold: 3
  startupProbe:
    httpGet:
      path: /health/startup
      port: 8080
    periodSeconds: 5
    failureThreshold: 30

If we do this consistently, rollouts become boring. And boring rollouts are our favourite kind.

Observability: Pick A Few Signals And Actually Use Them

Kubernetes gives us a lot of data, which is a polite way of saying it can drown us. The trick is to decide what we’ll routinely look at, and what we’ll alert on, and then practice using it. We focus on a small set of signals: node pressure, pod restarts, saturation (CPU/memory), request error rate, latency, and queue/backlog where relevant.

For cluster-level metrics, we want the basics: kube-state-metrics, node exporter (or equivalent), and a metrics pipeline we trust. If we’re on managed Kubernetes, we still validate what the provider gives us and what it doesn’t. Managed doesn’t mean magical; it means someone else gets paged for some of the problems.

Logs should be structured where possible. If apps spit out random strings, searching becomes guesswork. We don’t need perfection, but we do need consistent fields: timestamp, level, request ID, user/tenant, and error codes. Traces are optional until they’re not—if we run microservices, distributed tracing quickly pays for itself in time saved during incidents.

One habit that works: every alert must link to a runbook, and every runbook must start with “how to confirm this is real.” Otherwise, we build an alerting system that trains us to ignore it. If we need a reference for keeping alerting sane, Google’s SRE material is still one of the clearest takes: Google SRE Book.
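The “every alert links to a runbook” habit can be baked into the alert itself. A sketch assuming a Prometheus-style alerting pipeline fed by kube-state-metrics; the runbook URL and thresholds are illustrative, not recommendations:

```yaml
groups:
- name: cluster-basics
  rules:
  - alert: PodRestartingFrequently
    # kube-state-metrics counter: restarts per container over the last 30 minutes
    expr: increase(kube_pod_container_status_restarts_total[30m]) > 3
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      # hypothetical URL; the runbook starts with "how to confirm this is real"
      runbook_url: "https://runbooks.example.com/pod-restarts"
```

If an alert can’t name its runbook, that’s a strong hint it shouldn’t page anyone yet.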

And yes, we do a monthly “dashboard tour.” Not because it’s fun—because it stops surprises.

GitOps And Change Control: Fewer Clicks, Fewer Mysteries

If we change Kubernetes by clicking around in a web console, we’ve created a mystery novel. The problem isn’t that clicks are evil; it’s that they’re hard to audit, hard to reproduce, and easy to forget. Our habit is to push everything through Git: manifests, Helm values, Kustomize overlays, policy, and even cluster add-ons.

GitOps tools like Argo CD or Flux make this practical. They continuously reconcile the desired state, which means drift becomes visible and reversible. More importantly, we can review changes like adults: pull requests, diffs, approvals, and a paper trail.
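As a sketch of what “drift becomes visible and reversible” looks like in practice, here is an Argo CD Application with automated sync and self-heal enabled (the repo URL and path are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # hypothetical repo
    targetRevision: main
    path: apps/api/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true     # delete cluster resources that were removed from Git
      selfHeal: true  # revert manual kubectl edits back to the Git state
```

With `selfHeal` on, a well-meaning console tweak gets reverted automatically, which is exactly the nudge that keeps changes flowing through pull requests.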

We keep environments separated (dev/stage/prod) with clear promotion rules. We also keep secrets out of plain Git. That might mean sealed secrets, external secret operators, or a secret manager. The exact tool matters less than the discipline: secrets are rotated, scoped, and audited.
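For the “secrets out of plain Git” discipline, one common pattern is the External Secrets Operator, which syncs from a secret manager into a Kubernetes Secret. A sketch, assuming a hypothetical `SecretStore` named `vault-backend` and a hypothetical path in the backing store:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: prod
spec:
  refreshInterval: 1h          # re-sync periodically so rotations propagate
  secretStoreRef:
    name: vault-backend        # hypothetical store pointing at your secret manager
    kind: SecretStore
  target:
    name: api-db-credentials   # the Kubernetes Secret this creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/api/db         # hypothetical path in the secret manager
      property: password
```

Only this reference lives in Git; the actual credential stays in the manager, where it can be rotated, scoped, and audited.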

Another habit: every change includes a rollback plan. That can be “revert the commit” or “roll back Helm release,” but it must be explicit. Rollbacks shouldn’t require heroics.

Finally, we minimise the number of ways to deploy. If half the company uses Helm, a quarter uses raw YAML, and the rest uses “that one script,” we’ll spend more time debating tooling than shipping. Pick a path, document it, and move on.

Security And Policy: Guardrails That Don’t Ruin Everyone’s Day

Security in Kubernetes is often presented as a choice between “wide open” and “no one can deploy anything.” We aim for guardrails that prevent the worst mistakes without turning developers into petitioners.

We start with the basics: RBAC that matches responsibilities, separate namespaces for isolation, and least privilege for service accounts. Default service accounts shouldn’t have permissions to do anything interesting. Network policies are next: if everything can talk to everything, we’ve built a flat network with extra steps.
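Un-flattening that network starts with a default-deny policy per namespace, then explicit allows on top. A minimal sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress         # no ingress rules listed, so all inbound traffic is denied
```

From here, each workload gets a small, explicit allow policy, which doubles as documentation of who actually talks to whom.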

Admission policies are where we get real leverage. Using something like OPA Gatekeeper or Kyverno, we can enforce rules like: no privileged pods, no hostPath mounts, required resource requests, allowed registries, and mandatory labels. The key is to start with “audit” mode, show teams what would fail, then enforce gradually. Otherwise, we’ll learn about our own policies via angry messages.
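The “start in audit, enforce gradually” approach looks roughly like this with Kyverno — a sketch of a policy requiring resource requests, not a complete ruleset:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests
spec:
  validationFailureAction: Audit   # report violations only; flip to Enforce later
  rules:
  - name: check-resource-requests
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "CPU and memory requests are required."
      pattern:
        spec:
          containers:
          - resources:
              requests:
                cpu: "?*"       # any non-empty value
                memory: "?*"
```

Running in `Audit` first gives teams a violation report to act on before the same rule starts rejecting their deploys.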

We also keep an eye on supply chain basics: pin images by digest where feasible, scan images, and run minimal base images. And we don’t allow random containers from the internet because someone needed curl at 4 p.m. on a Friday. (We’ve all been there. We can still do better.)

If you want a solid grounding in pod security concepts, the Kubernetes docs on Pod Security Standards are a good compass: Pod Security Standards.
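The Pod Security Standards are applied per namespace via labels. A sketch that enforces `baseline` while warning and auditing against the stricter `restricted` profile, so teams see what a future tightening would break:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: baseline    # reject pods that violate baseline
    pod-security.kubernetes.io/warn: restricted     # warn on anything restricted would reject
    pod-security.kubernetes.io/audit: restricted    # record those violations in audit logs
```

This is the same audit-then-enforce rhythm as admission policies: visibility first, enforcement once the noise is gone.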

Upgrades And Capacity: Practice Change Before Change Practices Us

Clusters decay if we don’t upgrade. APIs deprecate, nodes age out, and add-ons drift. Our habit is to treat upgrades as a normal maintenance activity, not a once-a-year festival of fear. We keep a cadence: patch releases regularly, minor versions on a planned schedule, and add-on compatibility checked in advance.

Before upgrading, we run a quick inventory: which APIs are deprecated, which controllers need updates, and whether our admission policies will block new behaviour. Tools help, but discipline matters more. We test upgrades in a non-prod cluster that mirrors prod as closely as possible (same add-ons, similar node types, similar policies). If non-prod is “close enough,” it won’t be close enough when it matters.

Capacity planning is the other side. Autoscaling is helpful, but it’s not an excuse to ignore fundamentals. We watch headroom: if the cluster routinely sits at 85–90% CPU, we’re one traffic spike away from scheduling failures. We also check for skew: one node pool saturated while others are idle usually means constraints or requests are off.
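Headroom targets can be encoded in the autoscaler itself. A sketch of a HorizontalPodAutoscaler that scales on CPU utilisation well below the danger zone (the target value is illustrative and assumes the `api` Deployment from earlier):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out well before the 85-90% danger zone
```

Note that utilisation here is measured against requests, which is one more reason requests need to reflect reality.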

Finally, we practise node drains and failure scenarios. Not constantly—just enough that it’s familiar. A planned disruption should not be the first time we discover that a critical workload runs as a single replica with no PDB. That’s not “bad luck,” that’s “we didn’t look.”
