Seven Surprisingly Costly Kubernetes Mistakes We Finally Fixed
Practical patterns, configs, and trade-offs we wish we’d known sooner.
H2: Start With SLOs, Not Shiny Abstractions
If we could rewind, we’d start by writing down what “good” looks like before touching Kubernetes YAML. The cluster will run anything we throw at it; that’s both empowering and dangerous. Without service-level objectives (SLOs) for latency, availability, and error rate, every debate becomes theoretical. We’ve seen teams chase perfect deployments while customers wait on slow requests. SLOs pull the conversation back to “what’s the impact?” and “what are we willing to spend to fix it?” We like to define two or three crisp SLOs per service, tie alerts to error budgets, and mark clear “shed load here” strategies. For example, an API SLO might be median latency under 50 ms and 99th percentile under 300 ms during business hours. If we breach the 99th-percentile target three days in a row, we pause feature deployments and prioritize the ugly parts.
SLOs also steer cluster-level decisions that otherwise get hand-wavy. Do we need multi-zone nodes for this workload? If yes, which services justify the cost? What’s the acceptable blast radius when we reboot nodes for kernel patches? And how strict do our PodDisruptionBudgets need to be? Even capacity planning gets simpler: we plan for headroom based on actual error budgets rather than gut feel. No, this isn’t a “process seminar.” It’s the difference between choosing a 3-node control plane because the Internet said so and choosing it because our target downtime can’t eat a whole error budget after one control-plane update. Kubernetes gives us lots of knobs; SLOs are what tell us which ones to turn and how far.
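To make that concrete, here’s a minimal alert sketch, assuming Prometheus and a latency histogram named http_request_duration_seconds; the metric name, labels, and thresholds are placeholders, not a prescription:
groups:
- name: api-slo
  rules:
  # Page when the observed p99 latency stays above the 300 ms target for 10 minutes.
  - alert: ApiP99LatencyHigh
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
      ) > 0.3
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "api p99 latency is above the 300 ms SLO target"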
H2: Namespaces, Guardrails, and RBAC That Actually Stick
Our second expensive mistake was “sharing is caring” multi-tenancy without guardrails. A single default namespace where everyone dumped Deployments, Jobs, and ConfigMaps led to naming collisions, surprise deletes, and the occasional “why is our staging in prod?” moment. Namespaces are cheap isolation. We create one per team or per product domain, and then we add hard limits so the cluster doesn’t become a buffet. Start with ResourceQuotas and LimitRanges to prevent a runaway Job from hoovering all the CPU at 3 a.m.
Here’s a minimal scaffold we use:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 80Gi
    limits.cpu: "40"
    limits.memory: 160Gi
    pods: "300"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "200m"
      memory: 256Mi
Then we lock access with scoped Roles and RoleBindings. If we want accountants nearby but not driving, we give them read-only roles, and we keep cluster-wide privileges to a small ops group. The RBAC docs are straightforward; don’t skip them. We also add admission checks (like a required owner label and a sensible imagePullPolicy) so mistakes get caught at submit time, not at pager time. A simple policy keeps us from arguing later about whether a container with no requests and limits is “just for a minute.” People move fast. Guardrails should be faster.
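Back to the “nearby but not driving” crowd: a namespaced read-only scaffold looks roughly like this, with the group name and resource list as examples to trim or extend:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-viewer
  namespace: team-a
rules:
# Read-only verbs over the everyday workload objects.
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "pods/log", "services", "configmaps", "deployments", "jobs"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-viewer-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a-readonly          # example group; map it to whatever your IdP provides
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-viewer
  apiGroup: rbac.authorization.k8s.io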
H2: Right-Size Pods: Requests, Limits, And The HPA
We used to treat requests and limits like vitamins: nice to have, mostly decorative. Then we met real scheduling. The kube-scheduler is honest but unforgiving; it only places a Pod if a node has the requested resources. If we fat-finger requests, we end up with pending Pods and idle nodes. If we under-request, we cause noisy neighbors and throttling. Our rule: set requests to the 95th percentile of real usage during peak, and limits to 1.5–2x requests unless the workload is latency-sensitive (then we sometimes skip CPU limits entirely to avoid throttling). We also add a Horizontal Pod Autoscaler (HPA) to follow traffic without burning money overnight.
Here’s a slice we actually ship:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: team-a
spec:
  replicas: 4
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      containers:
      - name: api
        image: ghcr.io/acme/api:1.23.4
        resources:
          requests:
            cpu: "300m"
            memory: "512Mi"
          limits:
            cpu: "600m"
            memory: "1Gi"
        ports:
        - containerPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Read the resource management guide if any of this feels mystical. It’s all math and histograms. We profile with kubectl top and application metrics, trim requests weekly during peak seasons, and add per-endpoint timeouts to avoid slow bleed. The result: fewer 3 a.m. “why is everything throttled?” moments and a 22% drop in node spend.
H2: Networking Without Tears: DNS, CNI, And Policies
We burned weeks chasing “Kubernetes networking is haunted” bugs that came down to three things: DNS timeouts, leaky egress, and ambiguous Services. Start with clarity. Cluster-first DNS is a gift; don’t make it crawl. We make sure Services have predictable names, and we keep headless Services scoped to StatefulSets that actually need stable identities. For egress, decide who can talk to the world, and deny by default inside the namespace. The happy path should be explicit, not implied. NetworkPolicies are the simplest way to fence traffic, and they’re surprisingly readable once you start.
Here’s a tiny but effective example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - from:
    - podSelector:
        matchLabels: {app: frontend}
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: kube-system}
    ports:
    - protocol: UDP
      port: 53
This says: only Pods labeled app=frontend may hit the API, and DNS egress is allowed. That’s it. Tidy. For clusters running eBPF CNIs, we’ll add HTTP-aware policies later, but we always start with the basics. If you’ve never dabbled, the NetworkPolicy docs are refreshingly approachable. One final note: avoid mystery Services that select half the cluster with app: web. Specificity costs nothing and saves afternoons.
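For contrast, here’s a sketch of a Service that selects exactly one workload; the second label is illustrative (it has to exist on the Pods too), but the point stands that two specific labels beat one vague one:
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: team-a
spec:
  selector:
    app: api          # matches only the api Deployment's Pods
    tier: backend     # hypothetical second label; narrow beats broad
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8080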
H2: Stateful Doesn’t Mean Sad: Storage That Behaves
Stateful workloads in Kubernetes aren’t a trap, but they are picky. Our rookie mistake was treating them like Deployments with persistent volume claims stapled on. Databases and queues want stable identities, graceful shutdown, and anti-affinity. That’s what StatefulSets give us: ordered rollout, stable Pod names, and volume templates tied to those names. We set podManagementPolicy: Parallel when it’s safe, but default to OrderedReady for anything that needs sequencing. For storage, we pick a CSI driver that supports volume expansion, snapshots, and ReadWriteOncePod where possible to avoid messy fencing. Also, we test failover like we mean it: reboot nodes mid-write, simulate AZ loss, and delete a PVC just to confirm the backup plan is more than a slide.
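Here’s a trimmed sketch of that shape, assuming a CSI StorageClass named fast-ssd; the image, sizes, and names are placeholders:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg
  namespace: team-a
spec:
  serviceName: pg                 # headless Service that provides the stable DNS identities
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels: {app: pg}
  template:
    metadata:
      labels: {app: pg}
    spec:
      terminationGracePeriodSeconds: 60   # give the database time to shut down cleanly
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels: {app: pg}
            topologyKey: kubernetes.io/hostname   # never co-locate two replicas on one node
      containers:
      - name: pg
        image: postgres:16        # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOncePod"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi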
Security-wise, don’t skip fsGroup and proper runAsUser settings, so containers don’t need root just to touch disks. Use readOnlyRootFilesystem: true for services that can handle it, and separate data from logs. We also learned to right-size IOPS the same way we right-size CPU: watch real metrics and keep latency headroom. For backups, we like tools that snapshot at the storage layer and stream to object storage on a schedule, then we do an actual restore drill. If the first time you restore is during an incident, you’re beta-testing under stress. StatefulSets need attention, not heroics. Once we aligned replicas, PDBs, and storage policies, our “DB fell over” alerts dropped to almost zero.
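A minimal sketch of the security settings above; the UID/GID values are examples and need to match what the image expects:
spec:
  template:
    spec:
      securityContext:
        runAsUser: 10001      # non-root UID the image is built to run as (example value)
        runAsGroup: 10001
        runAsNonRoot: true
        fsGroup: 10001        # volumes are group-owned by this GID, so writing needs no root
      containers:
      - name: api
        image: ghcr.io/acme/api:1.23.4
        securityContext:
          readOnlyRootFilesystem: true       # writable paths live on mounted volumes instead
          allowPrivilegeEscalation: false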
H2: Resilience You Can Prove: PDBs, Drains, And Spreads
Our uptime didn’t really improve until we stopped hoping and started proving. Draining a node should be a boring non-event. If kubectl drain triggers a cascade of timeouts, we missed PodDisruptionBudgets (PDBs), readiness gates, or topology spread. PDBs set guardrails for voluntary disruptions: upgrades, drains, autoscaler moves. We add them for anything that matters, especially APIs and critical workers. Here’s a minimal PDB that actually does its job:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: team-a
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: api
We pair PDBs with readiness that reflects truth. If the app can accept traffic only after warming caches, say so; don’t gate on “container started.” On the placement side, topologySpreadConstraints prevent all replicas from hugging the same node or zone:
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api
This spreads Pods evenly across zones and turns a zonal hiccup into a shrug. We test it. We run kubectl cordon and kubectl drain --ignore-daemonsets --grace-period=30 --timeout=5m during business hours and watch SLOs. If charts wobble, we tune PDBs and warm-up behavior. The PDB guide covers edge cases we wish we’d read earlier (spoiler: HPAs and PDBs have feelings about minReplicas).
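On “readiness that reflects truth,” here’s a sketch assuming the service exposes a /ready endpoint that only returns 200 after warm-up; the path, timings, and preStop delay are placeholders:
spec:
  template:
    spec:
      containers:
      - name: api
        image: ghcr.io/acme/api:1.23.4
        readinessProbe:
          httpGet:
            path: /ready        # 200 only once caches are warm, not merely "process started"
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 10"]   # let endpoints drain before SIGTERM lands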
H2: Shipping Config Like Grown-Ups: Secrets, Rollouts, And Drift
We once played config whack-a-mole across namespaces and clusters. The fix was simple but disciplined: treat config as code, keep images immutable, and make rollouts deliberate. ConfigMaps and Secrets get versioned in Git, templated with a sane tool, and mounted in predictable paths. We avoid kubectl edit like a hot stove; drift hides in those “quick fixes.” For Secrets, we either integrate a KMS or use a sealed mechanism that keeps plaintext out of repos and CI logs. Whichever path we pick, we audit Secret sprawl quarterly and rotate keys on a schedule that doesn’t depend on your bravest engineer’s memory.
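A sketch of what “predictable paths” means in practice; the names are illustrative, and the Secret is produced by the KMS or sealed pipeline rather than committed to Git:
spec:
  template:
    spec:
      containers:
      - name: api
        image: ghcr.io/acme/api:1.23.4
        envFrom:
        - secretRef:
            name: api-credentials      # created by the sealed/KMS pipeline, never committed
        volumeMounts:
        - name: config
          mountPath: /etc/api          # same path in every environment
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: api-config             # versioned in Git and templated per environment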
For rollouts, spec.strategy.rollingUpdate is our friend, but we don’t trust defaults blindly. Latency-sensitive apps get a smaller maxUnavailable, and we test with synthetic traffic. If services need config flips without restarts, we design for it; otherwise we plan predictable restarts with readiness gates and preStop hooks. Progressive delivery doesn’t require a space shuttle; start with simple canaries. Spin up a second Deployment with a small replica count, route a sliver of traffic, bake for an hour, and promote. We also track “what’s running where” with a single cluster registry document that maps commit SHA to image, Helm chart version, and applied values. When things get weird at 2 a.m., that map ends the debate. Did we ship that feature? Exactly which config is live? The calm answer is worth more than coffee.
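The rollout knobs live on the Deployment itself. A sketch for a latency-sensitive service; the exact numbers depend on replica count and error budget:
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never dip below current capacity during a rollout
      maxSurge: 1         # add one extra Pod at a time instead of a big bang
  minReadySeconds: 30     # a new Pod must stay Ready this long before the rollout proceeds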
H2: Lower Bills, Happier On-Call: Autoscaling And Spot Math
Our spend graph used to look like a mountain range. Then we trimmed requests, enabled autoscaling, and got honest about spot instances. First, cluster-level autoscaling. If we run on a cloud provider, we adopt the native node group autoscaler or the well-trodden Cluster Autoscaler. We keep node pools simple: a balanced “general purpose” pool for most workloads, a memory-heavy one for big caches, and a small “do not preempt” pool for control-plane adjacencies and critical services. For spot/preemptible nodes, we quarantine them to tolerant workloads and keep a small on-demand buffer so we can absorb a spot eviction without paging anyone.
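The quarantine is just taints and tolerations: we taint the spot pool and only tolerant workloads opt in. The taint key and node label below are our own convention, not something Kubernetes defines:
# On the spot node pool (applied by the cloud provider or kubectl taint):
#   node-pool=spot:NoSchedule
spec:
  template:
    spec:
      tolerations:
      - key: "node-pool"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      nodeSelector:
        node-pool: spot        # our own node label; critical services never set it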
Second, we right-size Pods as mentioned earlier and encourage scale-out. HPAs shine when Pods scale horizontally; vertical-only scaling saves some cash but risks single-Pod throttling. We also stagger CronJobs so they don’t all fire at the top of the hour and cause a temporary need for an extra node. As we tuned, we found 10–15% of our spend was “just-in-case” headroom sitting idle all weekend. Once we trusted SLOs and PDBs, we let the autoscaler reclaim it. The result was a measured 31% month-over-month cost reduction, less paging during night-traffic valleys, and a happier finance meeting. The bonus: fewer drop-everything incidents, because scaling decisions are automatic instead of someone clicking buttons in a console while whispering “please work.”
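Staggering is as simple as not scheduling everything at minute zero. A sketch, with the schedule and job details as placeholders:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: team-a
spec:
  schedule: "23 2 * * *"        # 02:23, not 02:00, so jobs don't pile up at the top of the hour
  concurrencyPolicy: Forbid     # skip a run rather than stack a slow one on top of itself
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: ghcr.io/acme/report:1.0.0   # placeholder image
            resources:
              requests:
                cpu: "200m"
                memory: 256Mi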