Kubernetes Without Tears: Practical Ops Patterns That Stick


We’ll keep it simple, repeatable, and friendly to on-call sleep.

Why We Run Kubernetes (And What We Refuse To Do)

We don’t run Kubernetes because it’s trendy. We run it because, when it’s set up with sane defaults, it gives us consistent deployments, self-healing for the boring failures, and a decent way to scale without hand-editing servers at 2 a.m. The trick is deciding what we won’t do: we won’t treat the cluster like a pet, we won’t cram every workload into one “mega-cluster” just to brag, and we won’t accept mystery-meat YAML no one understands.

A good mental model: Kubernetes is a platform for declaring intent. We tell it “run this many replicas, expose it this way, limit resources like that,” and it tries to keep reality aligned with that intent. But Kubernetes can’t rescue us from unclear ownership, weak release processes, or “we’ll add monitoring later” (spoiler: later never comes).

So we anchor on a few principles:

  • Make it boring: choose a standard Ingress, a standard metrics stack, and a standard deployment method.
  • Fail predictably: health checks, resource requests/limits, and safe rollouts.
  • Reduce surprise: namespacing, quotas, and policies so one team can’t accidentally torch the cluster for everyone.
  • Automate the dull stuff: GitOps, sealed secrets (or equivalent), and repeatable cluster bootstraps.

If you want official background reading, the upstream docs are still the source of truth: Kubernetes Concepts. We’ll stay practical from here.

Cluster Basics We Standardise From Day One

Before we deploy “real” apps, we set the table. Most Kubernetes pain comes from clusters that grew organically: three Ingress controllers, five log shippers, and a security model that amounts to “don’t touch anything.” Our approach is to standardise early so teams can ship without reinventing plumbing.

Here’s our baseline checklist:

  1. Namespaces with purpose: platform, observability, ingress, and then per-team or per-product namespaces. No dumping ground called misc (we’ve all tried; it becomes a landfill).
  2. Network boundaries: even a simple default-deny network policy is better than “everything talks to everything.” If you’re using Calico or Cilium, great—just don’t skip this because it’s “hard.”
  3. Ingress and certificates: pick one Ingress controller and stick with it (NGINX, HAProxy, or cloud-native). For certs, cert-manager is our usual go-to because it turns TLS into a solved problem rather than a calendar reminder.
  4. Storage class sanity: define what “default” storage means and when people should choose alternatives (fast vs cheap, zonal vs regional, etc.).
  5. Node pools: separate general workloads from “special” ones (GPU, high-memory, system). Taints/tolerations are easier than unpicking a mess later.
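The default-deny policy from item 2 is smaller than people expect. A minimal sketch, applied once per namespace (the namespace name here is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments   # illustrative; repeat per namespace
spec:
  podSelector: {}       # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress

From there, teams add explicit allow policies for the traffic they actually need, which is exactly the point: the exceptions become documentation.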

We also keep cluster creation itself repeatable. If you’re on managed offerings, that means infrastructure as code and a documented “golden” cluster config. If you’re self-managing, we still want a one-button (or one-pipeline) cluster build. Reproducibility is our emergency exit when something goes sideways.

Deployments That Don’t Surprise Us (With YAML We Can Read)

Our deployment standard aims for two outcomes: rollouts we can predict and manifests we can understand. That means we don’t ship workloads without resource settings, probes, and a strategy that won’t take down production because someone pushed a bad build.

Here’s a trimmed example we’re happy to see in reviews:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: ghcr.io/acme/payments-api:1.7.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 10
            failureThreshold: 3

A few opinions baked in:

  • maxUnavailable: 0 for user-facing services keeps capacity steady during rollouts.
  • Requests and limits are non-negotiable. Without them, scheduling becomes “best effort,” which is another way to spell “incident later.”
  • Readiness vs liveness: readiness gates traffic; liveness restarts stuck processes. Mixing them up is a classic self-own.

We also standardise how teams template and ship these manifests—Helm, Kustomize, or plain YAML is fine, but we pick one primary path and document it. The official docs on Deployments are worth a skim when you’re tuning rollout behaviour.
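If Kustomize is the primary path, the per-environment surface area stays small. A sketch of a production overlay (paths, namespace, and image name are illustrative):

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments
resources:
  - ../../base          # the shared Deployment/Service manifests
images:
  - name: ghcr.io/acme/payments-api
    newTag: "1.7.3"     # the only thing a release usually changes

The appeal is that a release diff is one line, which makes “what changed?” trivially answerable in review.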

Traffic In, Traffic Out: Ingress, Services, And DNS Hygiene

Networking in Kubernetes is where good intentions go to get complicated. So we keep it boring: one Ingress controller, clear service types, and a single pattern for exposing apps. We avoid hand-crafted snowflake configs because they tend to break at the exact moment leadership is watching a demo.

Our usual flow:

  • Service (ClusterIP) for stable internal addressing
  • Ingress for HTTP/HTTPS routing
  • ExternalDNS (optional) to keep DNS records synced to Ingress resources
  • cert-manager to handle TLS automatically

A minimal Ingress that doesn’t make us sad:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api
  namespace: payments
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - payments.example.com
      secretName: payments-api-tls
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-api
                port:
                  number: 80

We’ll typically pair that with a Service exposing port 80 to the pod’s 8080. The key is consistency: the same annotations, the same TLS approach, the same host naming rules. It makes troubleshooting dramatically faster because we can say “Ingress looks fine, check the Service endpoints” without doing interpretive dance through custom annotations.
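That pairing is a plain ClusterIP Service whose names and labels line up with the Deployment example above:

apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: payments
spec:
  type: ClusterIP
  selector:
    app: payments-api   # must match the pod template labels
  ports:
    - port: 80          # what the Ingress backend targets
      targetPort: 8080  # the containerPort from the Deployment

If the selector and the pod labels drift apart, the Service has no endpoints and the Ingress returns errors while looking perfectly healthy itself, which is why “check the Service endpoints” is step two in our runbook.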

If you’re choosing components, we like to start with upstream-friendly bits: Ingress docs, plus a controller you can support operationally. Remember: feature count matters less than “who’s on call for it.”

Resource Management: Requests, Limits, And The Art Of Not Melting Nodes

Most clusters don’t die from one huge failure; they die from many small “it’ll be fine” decisions. Resource management is where we avoid the slow-motion pile-up: pods without requests, noisy neighbours, and nodes that spend their final moments swapping like it’s 2009.

We start with three practical rules:

  1. Every container has requests (CPU and memory). If someone doesn’t know what to set, we help them measure—but we don’t let “TBD” reach production.
  2. Limits are used intentionally. CPU limits can cause throttling; memory limits can cause OOM kills. Both might be acceptable, but only if the app behaves well under pressure.
  3. Autoscaling is not a substitute for sizing. HPA and cluster autoscalers are great, but they can’t fix an app that spikes memory because of a leak.
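When autoscaling is warranted, we keep it conservative. A sketch of an HPA for the earlier Deployment, using the autoscaling/v2 API (the utilisation target and replica bounds are illustrative, not recommendations):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before saturation, not at it

Note that CPU utilisation here is measured against the container’s requests, which is one more reason rule 1 is non-negotiable: without requests, the HPA has nothing to compute against.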

Then we enforce fairness with namespaces:

  • ResourceQuotas to stop one namespace from claiming the cluster
  • LimitRanges to provide defaults and minimums
  • PriorityClasses for truly critical workloads (used sparingly)
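A starting point for the first two, per namespace (all numbers are illustrative and should come from measurement, not this post):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits limits
        cpu: 500m
        memory: 512Mi

The LimitRange defaults are a safety net, not a licence to skip sizing: a container that silently inherits 128Mi will still OOM if it genuinely needs more.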

We also pay attention to scheduling: separate node pools for system vs apps, and use taints/tolerations to keep “special” nodes special. If you want a reference for the raw mechanics, the upstream Resource Management for Pods and Containers page is solid.

This isn’t glamorous work, but it’s the difference between “we had a small blip” and “why did everything restart at once?”

Observability That Helps During Incidents (Not Just Dashboards For Screenshots)

We’re not aiming for pretty charts; we’re aiming for answers at 3 a.m. Observability in Kubernetes needs to cover four basics: metrics, logs, traces, and events—plus a map of how requests flow. If any one of those is missing, debugging becomes guesswork and guesswork becomes downtime.

Our minimal stack usually looks like:

  • Metrics: Prometheus + Alertmanager (often via the Prometheus Operator)
  • Dashboards: Grafana
  • Logs: a cluster-wide agent (Fluent Bit/Vector) shipping to a central store
  • Traces: OpenTelemetry where it matters (not everywhere just because we can)
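With the Prometheus Operator in place, wiring an app into metrics is one small resource. A sketch, assuming the operator’s CRDs are installed and the app’s Service has a named port serving /metrics (the release label must match whatever your Prometheus instance selects on; it’s illustrative here):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: payments
  labels:
    release: prometheus    # illustrative; must match serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: http           # a named Service port assumed to expose /metrics
      path: /metrics
      interval: 30s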

The key is standard labelling and a consistent way to slice by namespace, app, and version. If you can’t answer “what changed?” in under a minute, you’re going to have a long night.

Alerting is where we stay disciplined. We alert on symptoms users feel (error rate, latency, saturation) and on clear capacity risks (disk filling, node pressure). We avoid alerting on “pod restarted once” unless it’s part of a pattern. Noise trains people to ignore pages, and then the real alert shows up and gets swiped away like a spam call.

Also: don’t ignore Kubernetes Events. They often tell you exactly what went wrong (“FailedScheduling,” “ImagePullBackOff”) before your dashboards catch up. A good incident workflow includes kubectl describe, not just Grafana.

Security And Policy: Guardrails That Don’t Block Delivery

We like security controls that are boring, automated, and hard to bypass accidentally. In Kubernetes, that usually means making the safe path the easy path: base templates that run as non-root, namespaces with least-privilege, and policies that prevent the nastiest foot-guns.

Our baseline guardrails:

  • RBAC by team/namespace, not everyone as cluster-admin (we’ve all seen that movie)
  • Pod Security Standards (or equivalent enforcement) so privileged pods are the exception
  • Network policies to reduce lateral movement
  • Image provenance: signed images if you can, at least locked-down registries and scanning
  • Secrets handling: avoid raw secrets in Git; use Sealed Secrets, External Secrets, or a cloud secret manager
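Pod Security Standards are enforced with nothing more than namespace labels, which makes them a cheap default. A sketch (the namespace is illustrative; “restricted” is the strictest built-in level):

apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted     # record in audit log

Rolling out warn/audit first and enforcing later is a gentle migration path for namespaces with legacy workloads.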

We also try to keep “policy” as code and in version control. Whether you use Gatekeeper (OPA) or Kyverno, the goal is simple: prevent insecure configs before they hit the cluster. Policies should come with actionable error messages, not riddles. If a developer needs a decoder ring to deploy, the policy will get bypassed or disabled.
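As a sketch of what “actionable error messages” looks like in practice, here is a Kyverno policy blocking privileged containers (assumes the Kyverno CRDs are installed; the exception process mentioned in the message is your own):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed. Remove securityContext.privileged, or request an exception from the platform team."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"

The message is the point: a developer who hits this knows what to change and who to talk to, so the policy survives contact with deadlines.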

A good rule: if a control blocks releases, we pair it with automation and clear docs. Security that only exists in a wiki is just security fan fiction.

For a solid upstream starting point, the kubernetes docs on Security are broad but useful.

GitOps And Change Control: Fewer Clicks, More Confidence

We’ve learned (the hard way) that most “Kubernetes problems” are really “change problems.” If changes happen through ad-hoc kubectl apply sessions, you’ll eventually ship something unreviewed, untracked, and unrepeatable. Then you’ll spend your weekend trying to recreate the exact sequence of commands that “definitely worked yesterday.”

GitOps is our antidote: the desired state lives in Git, and a controller reconciles the cluster to match it. We like Argo CD or Flux; both work. The important part is the workflow:

  • Pull requests for changes
  • Automated checks (linting, policy validation, template rendering)
  • Controlled promotions between environments
  • Easy rollbacks by reverting Git
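With Argo CD, that workflow is anchored by an Application resource pointing at the Git path to reconcile. A sketch (the repo URL and path are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-config   # illustrative
    targetRevision: main
    path: apps/payments-api
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to Git state

With prune and selfHeal on, a Git revert is a rollback, and manual kubectl edits quietly disappear—which is exactly the behaviour we want outside of break-glass situations.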

We also separate concerns: platform repos for cluster components (Ingress, cert-manager, monitoring), and app repos (or app directories) for workloads. That keeps the platform team from being the bottleneck for every service tweak, while still maintaining guardrails.

Our change rule is simple: if it’s worth doing, it’s worth being reproducible. That includes config changes, not just code. When an incident happens, we want to answer: “what changed, who changed it, and how do we revert?” GitOps gives us those answers without forensic archaeology.

And yes, we still allow “break glass” access for emergencies—but we treat it like a fire extinguisher: visible, logged, and hopefully dusty.
