terraform Without Tears: Practical Patterns We Reuse

How we keep infrastructure changes boring, reviewable, and reversible.

Start With One Rule: Keep Changes Small and Visible

We’ve all seen the “tiny tweak” that turned into a surprise outage because someone edited a dozen resources in one go. Our default posture with terraform is simple: make changes small enough that a human can actually review them. That means we slice work into short-lived branches, keep PRs focused, and treat terraform plan output like a contract we’re about to sign.

A few habits help a lot. First, we don’t mix refactors with functional changes. If we’re renaming resources, moving modules, or cleaning up variables, we do that in a dedicated PR. Second, we keep environments separated enough that a dev plan doesn’t accidentally touch prod (more on that later). Third, we insist on “plan in CI, apply by approval” so nobody’s laptop becomes the change-control system.
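One concrete way to make "plan in CI, apply by approval" stick is to apply a saved plan file rather than re-planning at apply time. A minimal sketch of the command sequence (flags as in the terraform CLI; the approval gate itself lives in your CI system):

```shell
terraform init -input=false
terraform plan -input=false -out=tfplan     # this exact plan is what gets reviewed
terraform show -no-color tfplan > plan.txt  # human-readable copy for the PR / approver

# ...approval gate happens here...

terraform apply -input=false tfplan         # applies only what was reviewed
```

The useful property: if the state has changed between plan and apply, terraform refuses to apply the stale plan file, so the "contract we're about to sign" can't quietly change underneath the reviewer.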

This also nudges us toward better module boundaries. When a change is too big to review, it’s usually telling us the module is doing too much or that resources aren’t grouped by lifecycle. We want the plan to read like a story: “add load balancer,” “update security group,” “roll new task definition.” Not “refresh 482 resources, replace 17, and by the way we’re deleting DNS.”

If we’re onboarding someone new, we point them to the official docs early because the language details matter, especially around expressions and functions: Terraform Language. We also keep a short internal checklist: expected diff, rollback path, blast radius, and “what’s the one thing that could bite us?”

State: Where Dreams Go to Be Serialized

Terraform state is where reality gets written down, and we treat it like production data—because it is. The quickest way to have a bad week is to store state locally, share it by Slack (please don’t), or let multiple people apply against the same state file. Our baseline is remote state + locking, always.

Most teams land on an S3 backend with DynamoDB locking for AWS, or equivalent in other clouds. We also turn on encryption and keep state in a dedicated bucket with tight IAM. State can contain sensitive values, and while terraform tries to redact output in places, the state file still deserves proper protection.

Here’s a typical backend setup we reuse (kept intentionally boring):

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "network/prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "acme-terraform-locks"
    encrypt        = true
  }

  required_version = ">= 1.6.0"
}

A couple of practical notes. We don’t hardcode backend config everywhere; we often inject parts via -backend-config in CI for different environments. We also separate state by component (network, cluster, app) because a single mega-state turns every plan into a novel.
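Partial backend configuration is what makes the per-environment injection work: the backend block in code stays mostly empty, and CI supplies the environment-specific values at init time. A sketch (file names and values are illustrative):

```hcl
# versions.tf -- the backend block is deliberately partial; CI fills in the rest.
terraform {
  backend "s3" {}
}
```

```shell
# CI passes per-environment settings, either as a file...
terraform init -backend-config=backends/prod.s3.tfbackend

# ...or as individual key/value pairs:
terraform init \
  -backend-config="bucket=acme-terraform-state" \
  -backend-config="key=network/prod/terraform.tfstate" \
  -backend-config="dynamodb_table=acme-terraform-locks"
```

This keeps one set of code serving every environment while the state separation stays explicit in the pipeline, not hidden in copy-pasted backend blocks.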

When state gets weird—because it will—we reach for official guidance rather than creative improvisation: State documentation. And we remind ourselves: if someone suggests “just delete the state file,” we ask them to step away from the keyboard and have some water.
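Most "weird state" situations are solved by the state subcommands, not by editing or deleting the file. A sketch of the usual sequence (resource addresses are illustrative):

```shell
# Inspect before touching anything:
terraform state list                      # what addresses does this state know about?
terraform state show aws_s3_bucket.logs   # recorded attributes of one resource

# Renames and module moves are state operations, not rebuilds:
terraform state mv aws_s3_bucket.logs module.logging.aws_s3_bucket.this

# If terraform must forget a resource without destroying it:
terraform state rm aws_s3_bucket.legacy   # the real bucket survives; terraform stops managing it
```

On Terraform 1.1+, a `moved` block in the configuration is often the better option for renames, since the move is reviewed in the PR and applied like any other change instead of being a one-off CLI operation.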

Modules: Reuse the Boring Bits, Not the Weird Bits

We like modules. We also like not building a module for everything that moves. The trick is to standardise the repeatable plumbing (VPC patterns, IAM roles, logging, tagging) while leaving room for teams to ship. When modules become mini-platforms with a hundred toggles, they stop being helpful and start being a choose-your-own-adventure book with no happy endings.

Our module rules of thumb:
– One module, one clear job. If it provisions both the VPC and the app, it’s two modules.
– Inputs are small and opinionated. If we’re passing 40 variables, we’ve built a framework.
– Outputs are for composition, not curiosity. If nobody needs the output, we don’t export it.
– Versions are pinned. “Latest” is not a strategy; it’s a surprise subscription.
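What "pinned" looks like in practice, using a registry module as an example (the module name is real; the version and inputs shown are illustrative, pin whatever you have actually audited):

```hcl
# An exact version pin: upgrades arrive as deliberate PRs, not surprises.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1"

  name = "acme-prod"
  cidr = "10.20.0.0/16"
}
```

A range constraint like `version = "~> 5.8"` is a reasonable middle ground for internal modules you trust, but for third-party code we lean toward exact pins plus a scheduled upgrade pass.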

We also keep modules testable. That can be as simple as a minimal example configuration in examples/ and CI that runs terraform init and terraform validate. For deeper testing, we sometimes add a lightweight integration run in an ephemeral environment, but only for modules that are likely to break lots of downstream users.

And yes, we document modules like grown-ups. Not essays—just inputs, outputs, and a sane example. Teams should be able to adopt a module without reading its source like it’s an ancient prophecy.

When we need registry modules, we use them, but we still audit what we’re pulling in. HashiCorp’s module registry is handy: Terraform Registry. We pin versions, read changelogs, and keep our own wrapper modules when we want to enforce company defaults (tags, naming, logging) consistently.

Environments and Workspaces: Separate Real Risk From Practice

We want dev and prod to be different enough that accidents don’t cross the streams. The simplest win is separate accounts/subscriptions/projects per environment. After that, separate state and separate pipelines. We’ve learned the hard way that “same account, different naming convention” is a fragile defence when someone fat-fingers an apply.

Terraform workspaces can help, but we use them carefully. Workspaces are fine when the infrastructure is identical except for a few variables (say, multiple identical ephemeral environments). They’re less great when environments have different shapes, policies, or dependencies. For “real” environments (dev/stage/prod), we often prefer separate directories with explicit backend keys and variable files so the separation is visible in the repo.

We also keep environment config predictable:
– envs/dev.tfvars, envs/prod.tfvars
– per-env backend key paths
– per-env provider config where needed (like different regions or accounts)

A common pattern we use is a simple folder layout:

infra/
  network/
  platform/
  app/
envs/
  dev.tfvars
  staging.tfvars
  prod.tfvars

Then CI calls terraform with the right state key and vars for that environment. It’s not fancy, but it’s readable at 2 a.m., which is the real benchmark.
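Concretely, "the right state key and vars" is just two flags, which is why the layout stays readable. A sketch of what CI runs for one component in one environment (paths are illustrative):

```shell
cd infra/network
terraform init -backend-config="key=network/dev/terraform.tfstate"
terraform plan -var-file=../../envs/dev.tfvars -out=tfplan
```

The environment appears twice, explicitly, in the same command block. If someone pastes the prod backend key with the dev tfvars, the plan diff makes the mismatch obvious before anything is applied.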

For teams mixing multiple providers (AWS + Datadog + GitHub, for example), the environment boundary matters even more. A “small” apply can silently tweak monitoring, DNS, permissions, and cloud resources all at once. Keeping environments clearly separated makes it much harder to accidentally redecorate production when you thought you were painting dev.

CI/CD: Plan in Pull Requests, Apply With Guardrails

If terraform runs only on laptops, it will eventually run on the wrong laptop at the wrong time. We centralise terraform execution in CI/CD so we get consistent versions, consistent credentials, and an audit trail that doesn’t rely on someone remembering what they did last Tuesday.

Our baseline pipeline is:
1. terraform fmt -check
2. terraform init
3. terraform validate
4. terraform plan (comment back to the PR)
5. Apply only after approval, ideally with a manual gate

Here’s an example GitHub Actions workflow snippet we’ve used (trimmed for sanity). The key is that plan happens on PRs and apply happens only on main with an explicit environment approval:

name: terraform

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6
      - run: terraform fmt -check
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -no-color

  apply:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6
      - run: terraform init
      - run: terraform apply -auto-approve

We don’t always use -auto-approve in real production. Often we want explicit approval, or we run apply through a release pipeline with change tickets. But the pattern holds: repeatable, reviewed, and logged.

If you’re looking for broader CI patterns and policy ideas, Terraform Cloud/Enterprise docs are worth a skim even if you don’t use the product: Terraform Cloud.

Drift, Imports, and “Why Is It Changing That?”

Terraform is happiest when it owns the world. Real life is messier. People click buttons, scripts run, and “temporary” hotfixes become permanent. Drift happens, and ignoring it just means terraform will eventually surprise us with a plan that wants to change half the stack.

We schedule regular “drift checks” by running terraform plan on a cadence (nightly or weekly) against key stacks. We don’t auto-apply drift fixes; we want humans to see what changed and why. A drift report is also a great way to find undocumented manual changes and tighten permissions.
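The `-detailed-exitcode` flag makes scheduled drift checks scriptable: exit code 0 means no changes, 1 means an error, and 2 means the plan is non-empty. A sketch of a nightly check (the `notify_team` command is hypothetical glue for whatever alerting you use):

```shell
terraform plan -detailed-exitcode -input=false -no-color > drift.txt
case $? in
  0) echo "no drift" ;;
  2) echo "drift detected -- see drift.txt"
     notify_team drift.txt ;;   # hypothetical: post to Slack, open a ticket, etc.
  *) echo "plan failed"; exit 1 ;;
esac
```

Note that under `set -e` the exit code 2 would abort the script before the `case`, so drift jobs usually run without it or capture the code explicitly.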

When resources exist already, we prefer importing them rather than recreating them. Terraform’s import story has improved, but it still requires care: you need the right address in your config, the right IDs, and patience. After import, we run a plan to see what terraform thinks it should change, and we iterate until the diff is zero (or at least explainable).
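On Terraform 1.5+, `import` blocks make this reviewable: the adoption shows up in a plan before anything is written to state, instead of being a one-shot `terraform import` CLI command. A sketch (resource names and IDs are illustrative):

```hcl
import {
  to = aws_s3_bucket.audit_logs
  id = "acme-audit-logs"
}

# Write the matching config by hand, or draft it with
# `terraform plan -generate-config-out=generated.tf`, then iterate
# until the post-import plan diff is zero (or at least explainable).
resource "aws_s3_bucket" "audit_logs" {
  bucket = "acme-audit-logs"
}
```

Because the import is part of the plan, the PR reviewer sees both the adoption and any attribute changes terraform intends to make afterward.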

We also keep an eye on replacement triggers. Seemingly innocent changes—like tweaking a name, changing a subnet, or moving from one resource type to another—can cause “forces replacement.” Those lines in the plan deserve a slow, careful read.

Two small practices save us pain:
– We add lifecycle blocks only when we mean it. ignore_changes can be useful, but it can also hide real issues.
– We don’t fight terraform with manual edits. If it’s managed, it’s managed.
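When we do reach for a lifecycle block, we keep it narrow and explain it inline, because every ignored attribute is a spot where config and reality are allowed to disagree. A sketch (resource and attributes are illustrative):

```hcl
resource "aws_autoscaling_group" "app" {
  min_size = 2
  max_size = 10

  lifecycle {
    prevent_destroy = true                # this ASG should never be casually replaced
    ignore_changes  = [desired_capacity]  # the autoscaler owns this value, not terraform
  }
}
```

The comment next to each entry is the important part: six months later, nobody should have to archaeology-dig through git blame to learn why terraform is pretending an attribute doesn't change.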

For a solid reference on the plan/apply lifecycle and troubleshooting weird diffs, the upstream docs are still the best source of truth: Terraform CLI.

Security and Policy: Least Privilege Beats Heroic Cleanup

Terraform needs credentials, and credentials are the shortest path from “we’re deploying” to “we’re on the news.” We aim for least privilege for CI roles, scoped per environment and per stack where practical. The pipeline that applies networking shouldn’t be able to delete production databases, and the app pipeline shouldn’t be able to rewrite IAM policies across the account.

We also keep secrets out of terraform variables where possible. If a value is truly secret, we prefer pulling it from a secret manager at runtime (or letting the target platform handle it). When we must pass sensitive values, we mark variables as sensitive = true, and we ensure state is locked down and encrypted. “Sensitive” is not magic invisibility—it just reduces accidental exposure in outputs.
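Both halves of that in HCL, as a sketch (the secret name is illustrative; the data source is the stock AWS provider one):

```hcl
# Preferred: read the secret at plan/apply time instead of passing it in.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/app/db-password"
}

# When a variable must carry something sensitive, mark it. This redacts it
# from plan output -- but the value still lands in state, so the state
# bucket's encryption and IAM remain the real protection.
variable "api_token" {
  type      = string
  sensitive = true
}
```

Note that values read via a data source end up in state too, which is another reason the "state is production data" rule from earlier is non-negotiable.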

Policy-wise, we try to put guardrails where teams won’t trip over them daily. If you block every instance type, every region, and every resource name, folks will route around the rules. Instead, we focus on a small number of high-impact checks: encryption required, public exposure reviewed, tags present, no wildcard IAM, and known-bad patterns disallowed.

There are lots of ways to do policy checks: OPA/Conftest, cloud-native policy engines, or Terraform’s own approaches. The tool matters less than the habit: policies should run in CI, fail fast, and provide a clear reason. Nothing motivates “creative compliance” like a policy error that reads like a riddle.

And finally: we keep provider versions pinned and updated on purpose. Supply chain risk isn’t theoretical, and neither are breaking changes. A monthly “dependency gardening” session is cheaper than a quarterly outage.
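The shape of that pinning, as a sketch consistent with the backend example earlier:

```hcl
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"   # any 5.x from 5.40 up; blocks a surprise jump to 6.0
    }
  }
}
```

We also commit `.terraform.lock.hcl`, which records the exact provider builds and checksums in use, so "dependency gardening" is a reviewed lockfile diff rather than whatever `terraform init` happened to download that day.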
