Terraform Without Tears: Shipping Infra Changes Safely
Practical habits for calmer plans, fewer surprises, and better sleep.
Why We Keep Coming Back To Terraform
We’ve all tried the “just click it in the console” phase. It feels fast until the next person asks, “Who changed this security group?” and we all stare at the ceiling like it holds answers. That’s why we keep coming back to Terraform: it gives us a repeatable way to define infrastructure, review changes, and roll them out with fewer mystery knobs.
At its best, Terraform is a collaboration tool disguised as a provisioning tool. A plan is basically a diff for infrastructure. A state file is the memory of what we believe exists. And modules are how we stop copy‑pasting the same VPC layout like we’re paid by the duplicated line.
But Terraform also has sharp edges. “It worked on my laptop” turns into “why is prod recreating the database?” if we ignore the boring details: state management, provider pinning, and drift detection. The goal isn’t “infrastructure as code” as a slogan. The goal is: changes are predictable, reviewable, and reversible enough that we’re not scared to touch them.
If you’re getting started, the official docs are solid, especially for core concepts like state and configuration language: Terraform documentation. We’ll keep this post practical: how we structure repos, how we avoid surprise diffs, how we use modules without creating a module museum, and how we make CI do the worrying for us.
Repo Layouts We Can Actually Live With
A Terraform repo can look tidy on day one and become a drawer of tangled cables by month six. The trick is to pick a structure that matches how we deploy. We generally see three options:
1) Single repo, multiple environments (e.g., envs/dev, envs/stage, envs/prod).
2) One repo per environment (clean isolation, more duplication).
3) Monorepo with stacks (common modules + per-stack configs).
We tend to prefer a monorepo with clear “stacks” because it scales: shared modules are centralized, and each stack has its own state and lifecycle. A simple shape:
- modules/ for reusable building blocks (VPC, IAM roles, Kubernetes node groups)
- stacks/ for deployable units (networking, cluster, app platform)
- stacks/<name>/<env>/ for environment-specific config
The point is to make blast radius obvious. “Network stack” changes shouldn’t be accidentally coupled to “app stack” changes. If a developer updates a module, we want the resulting plan to be run explicitly in the stacks that consume it—not via some magical recursion that updates everything everywhere.
We also keep environments boring. Avoid clever naming, avoid nested workspaces unless you’re truly disciplined, and keep each deployment target anchored to a distinct state backend key. If you’re using Terraform Cloud, that often means one workspace per stack+env. If you’re rolling your own backend, it’s usually an object storage key per stack+env.
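One way to keep those per-stack-and-environment backend keys honest is partial backend configuration: leave the backend block empty in code and supply the environment-specific settings at init time. A sketch, assuming hypothetical file names (the *.tfbackend suffix is the documented convention, but the bucket and table names here are made up):

```
# stacks/network/main.tf: leave the backend partially configured
terraform {
  backend "s3" {}
}

# Then select the environment when initializing, e.g.:
#   terraform init -backend-config=env/prod.s3.tfbackend
# where env/prod.s3.tfbackend holds the bucket, key, region,
# lock table, and encrypt settings for that stack+env.
```

This keeps one copy of the stack code while making it impossible to forget that dev and prod point at different state keys.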
If your team needs a reference for recommended patterns and anti-patterns, it’s worth skimming HashiCorp’s guidance on modules and composition in their docs: Terraform modules overview.
State: The One File We Don’t Joke About
Let’s be honest: Terraform state is where most “exciting” incidents start. State is not just a cache. It’s Terraform’s record of what it manages and how resources map to real cloud objects. Lose it, corrupt it, or share it carelessly, and your plan will read like a disaster novel.
Our baseline rules:
- Remote backend, always. Local state is fine for learning, not for teams.
- Locking enabled. Two applies at once is how you get surprise recreation.
- Encrypted at rest and in transit. It’s often full of identifiers and sometimes secrets (even when we try hard not to).
- Small, scoped states. One massive global state makes every change risky and slow.
A typical remote backend with locking (example using AWS S3 + DynamoDB) looks like this:
terraform {
  required_version = "~> 1.7"

  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "stacks/network/prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}
We keep the backend config consistent and predictable. The key should make it obvious what stack and environment we’re talking about. Also: if someone suggests storing state in a shared folder “because it’s easier,” we gently escort them to a whiteboard and talk about locking.
If you want the deep dive on what state contains and why it matters, the official docs explain it clearly: Terraform state. Worth reading once; worth bookmarking forever.
Modules That Help Instead Of Haunt Us
Modules are either a gift to our future selves or a curse we pass down like an heirloom nobody asked for. The difference is whether we treat modules as products with a clear interface—or as a junk drawer for “stuff we didn’t want in main.”
A module should do one job. A networking module might create VPC, subnets, and route tables. It shouldn’t also create databases, DNS zones, and your team’s Slack reminders (tempting though that is).
Here’s a small, sane module interface:
# modules/vpc/variables.tf

variable "name" {
  type = string
}

variable "cidr" {
  type = string
}

variable "azs" {
  type = list(string)
}

variable "public_subnet_cidrs" {
  type = list(string)
}

variable "private_subnet_cidrs" {
  type = list(string)
}
And an environment stack that consumes it:
# stacks/network/prod/main.tf

module "vpc" {
  source = "../../../modules/vpc"

  name = "core-prod"
  cidr = "10.20.0.0/16"
  azs  = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  public_subnet_cidrs  = ["10.20.0.0/24", "10.20.1.0/24", "10.20.2.0/24"]
  private_subnet_cidrs = ["10.20.10.0/24", "10.20.11.0/24", "10.20.12.0/24"]
}
We also version modules. Even if modules live in the same repo, we treat changes carefully and roll them out stack by stack. If modules are in a separate repo, we pin versions via git tags. “Always pull main” is how you get surprise diffs on a Tuesday.
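For modules in a separate repo, pinning via a git tag looks like this; the organisation name and tag below are made up for illustration:

```
module "vpc" {
  # Pin to an immutable tag, never a branch; bumping the ref
  # becomes an explicit, reviewable diff instead of a Tuesday surprise.
  source = "git::https://github.com/example-org/terraform-modules.git//vpc?ref=v1.4.2"

  name = "core-prod"
  cidr = "10.20.0.0/16"
}
```

The // in the source selects a subdirectory of the repo, and ?ref= pins the revision.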
For module style and publishing norms, HashiCorp’s registry and module guidelines are a helpful sanity check: Terraform Registry. We don’t have to publish publicly, but we can still adopt the conventions.
Planning And Applying Without Heartburn
The Terraform workflow is simple on paper: init, plan, apply. The reality is that “simple” becomes “oops” if we don’t add guardrails. Our favourite guardrail is: every apply must be backed by a reviewed plan output.
A few habits that save us repeatedly:
- Run terraform fmt and terraform validate in CI.
- Pin provider versions so the same config doesn’t behave differently next week.
- Store the plan artifact and apply that exact plan.
- Use -refresh-only (or just separate drift checks) to surface surprises early.
- Keep apply permissions tighter than plan permissions.
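The “apply the exact plan you reviewed” habit maps onto three commands; a sketch, assuming your CI system carries the tfplan artifact between the plan and apply stages:

```
terraform plan -input=false -out=tfplan   # write a binary plan artifact
terraform show -no-color tfplan           # render it for human review
terraform apply -input=false tfplan       # apply exactly what was reviewed
```

Applying the saved plan means the apply fails fast if reality has drifted since the review, rather than silently doing something different.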
Also, we don’t pretend humans are perfect diff readers. We standardise output: show the plan summary, resource counts, and any replacements. If we see “forces replacement” on something that looks stateful (databases, load balancers with static IPs, clusters), we stop and ask why.
One practical tip: use lifecycle sparingly but intentionally. prevent_destroy can be a seatbelt for critical resources—just don’t turn your whole stack into an immovable object.
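As a concrete example of the seatbelt, a hedged sketch of guarding a database (the resource name and arguments are illustrative, and the required arguments are elided):

```
resource "aws_db_instance" "main" {
  identifier     = "core-prod"
  engine         = "postgres"
  instance_class = "db.r6g.large"
  # ... other required arguments elided ...

  lifecycle {
    # Any plan that would destroy this resource now fails loudly
    # instead of quietly scheduling a replacement.
    prevent_destroy = true
  }
}
```

Note that prevent_destroy blocks intentional teardown too, so removing it becomes a deliberate, reviewed step.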
And if you’re using AWS, it’s worth knowing how Terraform models changes against the provider schema, and how that interacts with eventual consistency. Sometimes a plan is “right,” and the API is “not ready yet.” Building retries and reasonable timeouts into modules helps reduce flakiness.
If we’re ever unsure what Terraform will do, we don’t guess—we run the plan in a safe environment first. It’s slower than guessing, but faster than incidents.
CI/CD: Let The Pipeline Be The Grumpy Reviewer
We’re big fans of letting CI be the person who never gets tired and never says “looks fine” when it’s not. The pipeline should run the boring checks, standardise how Terraform is executed, and reduce the number of ways we can accidentally do something creative in production.
A lightweight GitHub Actions workflow that we’ve used as a starting point:
name: terraform

on:
  pull_request:
    paths:
      - "stacks/**"
      - "modules/**"
  push:
    branches: ["main"]
    paths:
      - "stacks/**"
      - "modules/**"

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - name: Terraform fmt
        run: terraform fmt -check -recursive
      - name: Plan (example stack)
        working-directory: stacks/network/dev
        env:
          AWS_REGION: eu-west-1
        run: |
          terraform init -input=false
          terraform validate
          terraform plan -input=false -no-color
In a real setup, we’d matrix this across stacks and environments, and we’d avoid running production plans on untrusted PRs unless we’re confident about credential handling. If we’re using Terraform Cloud, we often let it run the plan and apply with policy checks, while CI focuses on formatting, static checks, and module tests.
If you want to go further, policy-as-code can help enforce rules like “no public S3 buckets” or “no security groups open to 0.0.0.0/0.” Terraform Cloud/Enterprise uses Sentinel, and Open Policy Agent is another popular option in the ecosystem. We keep policies small and focused; nobody wants a 600-line policy that blocks everything except a full moon deployment.
Drift, Imports, And The “Someone Clicked It” Reality
Even when we’re disciplined, drift happens. Someone with console access “just tweaks one setting.” An autoscaler changes counts. A managed service updates a property under the hood. Then Terraform shows a plan we weren’t expecting, and we have to decide: accept, ignore, or reconcile.
Our approach:
1) Detect drift regularly. Nightly “plan-only” runs per stack are a cheap early warning.
2) Decide ownership. If Terraform manages it, changes should flow through code. If not, we stop trying to manage it with Terraform. Split resources into different stacks if necessary.
3) Use import thoughtfully. Importing is great for bringing existing resources under management, but it’s not magic—your config must match reality after import.
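Nightly drift checks fit naturally into the same CI setup. A sketch of a scheduled plan-only job using -detailed-exitcode, which exits 2 when the plan is non-empty so the job fails and alerts us (the stack path and schedule are illustrative):

```
name: drift-check

on:
  schedule:
    - cron: "0 4 * * *" # nightly, at a quiet hour

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Plan only
        working-directory: stacks/network/prod
        run: |
          terraform init -input=false
          # Exit code 2 means "changes present": the job fails, we investigate.
          terraform plan -detailed-exitcode -input=false -no-color
```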
We also keep an eye on resources that are notorious for drift: security group rules edited by hand, IAM policy documents managed in multiple places, DNS records changed during incidents “temporarily,” and Kubernetes resources managed both by Terraform and Helm (double-management is a special kind of chaos).
Sometimes the right move is to add ignore_changes for a field that is expected to change outside Terraform (like desired_count for an ECS service managed by autoscaling). But we treat ignore_changes like hot sauce: a little can improve the meal, a lot ruins it, and someone will regret it later.
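A sketch of that ECS case, assuming the surrounding cluster and task definition resources exist elsewhere in the stack:

```
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2 # only the initial value; autoscaling owns it afterwards

  lifecycle {
    # Autoscaling adjusts desired_count at runtime; don't fight it in plans.
    ignore_changes = [desired_count]
  }
}
```

Scope the list to the single attribute that has a legitimate outside owner, nothing more.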
When importing, we keep the work isolated to a branch and stack, and we run a plan immediately after import to ensure Terraform isn’t about to “correct” the imported resource into something else.
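On recent Terraform (1.5+), that isolated workflow can use an import block instead of the imperative CLI command, so the import itself shows up in the plan for review; a sketch with a made-up bucket:

```
# Declare the resource we want Terraform to own...
resource "aws_s3_bucket" "assets" {
  bucket = "company-assets-prod"
}

# ...and tell Terraform which existing real-world object it maps to.
import {
  to = aws_s3_bucket.assets
  id = "company-assets-prod"
}
```

A plan then shows the import plus any changes Terraform would make afterwards; if that diff isn’t empty, we adjust the config until it is before applying.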
Our Terraform Checklist For Saner Infrastructure
By now, the pattern is pretty clear: Terraform rewards teams that are consistent and punishes teams that improvise. So we keep a checklist that’s short enough to follow and strict enough to prevent the classics.
What we expect in every repo/stack:
- Remote state with locking and encryption
- Providers and Terraform versions pinned
- Clear stack boundaries (small states, minimal coupling)
- Modules with small interfaces and versioning discipline
- CI running fmt/validate/plan on every change
- A reviewed plan before apply
- Scheduled drift detection
- Documented “break glass” process (and audited access)
And culturally:
- If you had to click it in the console, you open a ticket to codify it after.
- If a plan wants to replace something stateful, we stop and investigate.
- If a change feels scary, we test it in a lower environment first.
- If we can’t explain a diff, we don’t apply it.
The payoff is real: fewer production surprises, faster onboarding, and the ability to make infrastructure changes without feeling like we’re defusing a bomb with oven mitts.
For ongoing reference, we keep the official docs nearby and lean on community patterns when we need them: Terraform docs, plus the Terraform Registry for module conventions and provider versions.


