Ansible That Actually Works In Production


Practical habits we use to keep playbooks boringly reliable

Why We Reach For Ansible (And When We Don’t)

We like tools that make repeatable change easy and auditable. That’s the whole reason Ansible stays in our kit: it’s readable, agentless, and it plays nicely with the way most teams already work (SSH, Git, and a mild fear of Friday deploys). In practice, Ansible shines when we’re configuring fleets, enforcing baselines, rolling out apps, or stitching together multi-step operational workflows that would otherwise live in someone’s shell history.

But we don’t use Ansible for everything. If we need real-time event handling or continuous convergence, we’ll lean toward systems designed for that. If we’re building immutable images, we might push more into Packer pipelines and keep Ansible as a provisioning helper, not the main act. And if the target is Kubernetes, we’ll often manage Kubernetes resources with GitOps tooling and use Ansible for surrounding concerns (nodes, dependencies, credentials flows) rather than trying to “Ansible the whole cluster” forever.

One more honest note: Ansible can become a junk drawer. When every change is “just add another task,” playbooks grow into a spaghetti novella. Our goal is to keep it boring: small roles, clear inventories, tight variables, consistent checks, and predictable runs.

If you’re getting started or need a refresher, the upstream docs remain the best canonical reference: Ansible documentation. For community roles and examples, Ansible Galaxy is a useful starting point—just treat it like a public buffet: look at dates, ingredients, and who’s been touching the serving spoon.

Inventory: The Place Where Good Ideas Go To Live

Inventory design is where many Ansible setups either become elegant—or become a haunted house of hostnames. We aim for a structure that answers two questions quickly: “Who are we targeting?” and “What’s different about them?” Everything else is noise.

We generally separate environments (dev/stage/prod) and then group by function (web, api, db, batch). That gives us natural scoping for variables and lets us roll changes in sensible waves. We also keep host variables minimal. If a value is shared by more than one host, it belongs in a group. If it’s sensitive, it belongs in Vault. If it’s derived, it shouldn’t be a variable at all—compute it in a template or task.
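As a sketch of that layout (all group and host names here are illustrative), an environment-scoped inventory file might look like:

```yaml
# inventories/prod/hosts.yml — illustrative names, not a prescription
all:
  children:
    web_prod:
      hosts:
        web-prod-01:
        web-prod-02:
    db_prod:
      hosts:
        db-prod-01:
```

Values shared by the web tier then live in inventories/prod/group_vars/web_prod.yml rather than being copied into host_vars for each machine.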

Dynamic inventory is great, but we don’t let it become magic. Whether we pull from AWS, VMware, or something else, we still pin down group names and metadata so playbooks remain stable. For AWS in particular, the official guidance is solid: Working with dynamic inventory.
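The “pin down group names” habit translates into the inventory plugin config. A sketch for the amazon.aws.aws_ec2 plugin (the region, tag names, and prefix are assumptions to adapt to your own tagging scheme):

```yaml
# inventories/prod/aws_ec2.yml — a sketch; adjust filters to your tags
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1
filters:
  # Only instances tagged env=prod (illustrative tag key/value)
  tag:env: prod
keyed_groups:
  # Builds stable group names like "role_web" from each instance's
  # "role" tag, so playbooks can target them without knowing hostnames
  - key: tags.role
    prefix: role
```

Because the group names come from keyed_groups rather than whatever the cloud API emits by default, playbooks keep working even as instances churn.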

The other habit we swear by: always have a “canary” group. It’s a tiny subset of prod (often one host per tier) that we can target first. When the canary is happy, we proceed. When it’s not, we stop and pretend we meant to do that.

Also, name groups like you mean it. web_prod_eu tells us something. group42 tells us we should take a walk and rethink our choices.
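A canary group is just a small, explicit group alongside the real tiers; a sketch (hostnames are illustrative):

```yaml
# One host per tier, grouped so plays can target it first
prod_canary:
  hosts:
    web-prod-01:
    api-prod-01:
```

We then run the same playbook with a limit (e.g. ansible-playbook -i inventories/prod/hosts.yml playbooks/site.yml -l prod_canary) before running it unrestricted.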

Roles, Layout, And The Art Of Not Making A Mess

We organize Ansible code like we organize infrastructure: predictable structure, small components, and no surprise behaviour. Roles are the unit of reuse, and we treat them like code we’ll have to maintain at 2 a.m. (because we will).

A typical layout we like looks like this:

ansible/
  inventories/
    prod/
      hosts.yml
      group_vars/
      host_vars/
    stage/
      hosts.yml
  roles/
    base/
    nginx/
    app/
  playbooks/
    site.yml
    canary.yml
  collections/
  ansible.cfg

A few rules keep this sane:

  • Roles do one job. “Install nginx” is a job. “Install nginx and configure TLS and deploy the app and also tune sysctl” is a cry for help.
  • Defaults are safe and minimal. If a variable needs to be set for prod, it shouldn’t default to something dangerous.
  • Templates are for rendering config, not for hiding logic. If Jinja starts looking like a programming language dissertation, it’s time to move logic into tasks or simplify the config.
  • Handlers are for actual change events. If we’re restarting services constantly, we’re training everyone to ignore restarts—which is how “a quick change” becomes “why is latency on fire?”

When we do reuse from the community, we pin versions and review the role code. Galaxy is convenient, but production is not the place for surprises. The best long-term move is to graduate critical roles into internal ownership—even if they started life as someone else’s work.
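Pinning looks like an ordinary requirements file; a sketch (the role and versions shown are illustrative — pin exactly what you reviewed):

```yaml
# requirements.yml — exact versions keep runs reproducible
roles:
  - name: geerlingguy.nginx   # example community role
    version: "3.1.4"          # illustrative; pin the version you audited
collections:
  - name: community.general
    version: "8.5.0"          # illustrative
```

Installing with ansible-galaxy install -r requirements.yml then fetches the same code every time, instead of “whatever Galaxy serves today.”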

Idempotence Or It Didn’t Happen

If a playbook run changes things every time, we don’t call that automation; we call that a very patient intern. Idempotence is the property that lets us run Ansible repeatedly without creating drift or chaos. It’s also what makes failure recovery boring (the good kind of boring).

Our checklist is simple:

  • Prefer built-in modules over shell commands.
  • When we must use command/shell, we add creates, removes, or explicit checks.
  • We avoid “append to file” hacks unless we’re using modules designed for it (lineinfile, blockinfile).
  • We treat “changed” as a signal, not a decoration. If a task reports “changed” on every run, we fix it.
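When a raw command really is unavoidable, a guard keeps it idempotent. A sketch (the binary and paths are illustrative):

```yaml
# Guarded command: runs only while the marker file is absent,
# so repeat runs report "ok" instead of "changed"
- name: Initialize the database once
  ansible.builtin.command: /usr/local/bin/initdb --data-dir /var/lib/app
  args:
    creates: /var/lib/app/PG_VERSION
```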

Here’s what the idempotent approach looks like in practice:

- name: Install packages
  ansible.builtin.package:
    name:
      - nginx
      - curl
    state: present

- name: Drop nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: "0644"
  notify: Reload nginx

- name: Ensure nginx is enabled and running
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

handlers:
  - name: Reload nginx
    ansible.builtin.service:
      name: nginx
      state: reloaded

This stays quiet on subsequent runs unless something actually changes. That’s what we want. For module behaviour and return values, the module reference is worth bookmarking: Ansible module index.

Idempotence also improves security: fewer ad-hoc commands mean fewer weird edge cases and fewer “it worked on my terminal” moments.

Secrets: Vault, Variables, And Not Emailing Passwords

Secrets management in Ansible is one of those topics where the “easy” path is also the “we’ll regret this” path. Our baseline rule is: secrets don’t live in plaintext in Git, and they don’t get passed around in chat messages “just for today.” (Today has a habit of lasting three years.)

We usually start with Ansible Vault for smaller setups or for teams that want minimal moving parts. Vault isn’t a full secrets platform, but it’s a practical step up from “passwords.yml” sitting in a repo. The official docs are clear and actionable: Ansible Vault.

A simple pattern:

  • Put secret vars in group_vars/prod/vault.yml (encrypted).
  • Keep non-secret configuration in normal group_vars.
  • Use separate Vault IDs/keys per environment so dev access doesn’t imply prod access.
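A common companion pattern (variable names here are illustrative) is to reference vault-prefixed variables from a plaintext file, so reviewers can grep for where a secret is used without decrypting anything:

```yaml
# group_vars/prod/vars.yml — plaintext, safe to read in review
db_password: "{{ vault_db_password }}"

# group_vars/prod/vault.yml — encrypted with `ansible-vault encrypt`
vault_db_password: "example-only"
```

The indirection costs one extra variable per secret and buys a repo where every plaintext file stays readable.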

We also watch for accidental leaks:

  • Don’t debug variables that might contain secrets.
  • Mark tasks with no_log: true where secret values might be printed.
  • Be careful with registered variables from commands—they sometimes include sensitive output.
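The no_log guard in practice looks like this; a sketch (the module and variable names are illustrative):

```yaml
- name: Create the application database user
  community.postgresql.postgresql_user:   # illustrative module choice
    name: app
    password: "{{ vault_db_password }}"
  no_log: true   # keeps the password out of logs and -v output
```

The trade-off is that failures on a no_log task are harder to debug, so we keep the guard on tasks that genuinely handle secrets rather than sprinkling it everywhere.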

If the organization already has a secrets manager (HashiCorp Vault, AWS Secrets Manager, etc.), we integrate rather than reinvent. The main point is: secrets should be fetched just-in-time, with least privilege, and never stored on disk longer than needed. (Yes, this is where we all sigh and then do the right thing anyway.)

CI For Playbooks: Trust, But Verify

We treat Ansible like code because it is code. That means pull requests, reviews, and automated checks. Our favourite playbooks are the ones that fail in CI and never get a chance to fail in production.

At minimum, we run:

  • ansible-lint for style and common mistakes.
  • Syntax checks (ansible-playbook --syntax-check).
  • Optional: molecule tests for roles, if we can justify the setup time.

Here’s an example GitHub Actions workflow we use as a starting point:

name: ansible-ci

on:
  pull_request:
  push:
    branches: [ main ]

jobs:
  lint-and-syntax:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install tooling
        run: |
          pip install ansible ansible-lint

      - name: Lint
        run: ansible-lint

      - name: Syntax check
        run: ansible-playbook -i inventories/stage/hosts.yml playbooks/site.yml --syntax-check

This won’t catch everything, but it catches the most common foot-guns early. And yes, we still do peer review. CI can’t tell you if a role is a good idea; it can only tell you if it’s broken in obvious ways.

When we want to go further, we add “can we reach hosts?” checks, and we test roles with Molecule. The ansible-lint docs are a good guide for rules and configuration.

Deploy Strategy: Canary Runs, Tags, And A Calm Exit Plan

Production rollouts should feel like steering a ship, not juggling knives. With Ansible, we get a lot of control, but only if we design for it.

Three tactics we rely on:

1) Canary first
We run against a tiny group first (prod_canary). If it fails, we fix the issue before we touch the rest. It’s not fancy, but it’s effective.

2) Tags for targeted changes
Tags let us run only the parts we intend. We don’t tag everything (that becomes its own maintenance burden), but we do tag major areas like packages, config, deploy, restart.

3) Serial and health checks
For services behind a load balancer, we roll a few hosts at a time with serial. We also add explicit health checks when possible, so Ansible isn’t just “done,” it’s “done and working.”

A small playbook sketch:

- name: Roll out app safely
  hosts: web_prod
  serial: 2
  become: true

  pre_tasks:
    - name: Verify we can reach the host
      ansible.builtin.ping:

  roles:
    - role: base
      tags: [base]
    - role: app
      tags: [deploy]

  post_tasks:
    - name: Check HTTP health endpoint
      ansible.builtin.uri:
        url: "http://localhost:8080/health"
        status_code: 200
      register: health
      retries: 10
      delay: 3
      until: health.status == 200

Finally, we plan for a calm exit: if a deploy fails mid-roll, we want to stop safely, not keep marching. Sometimes that means max_fail_percentage: 0. Sometimes it means a manual gate after canary. Either way, the goal is to make “stop” a normal operation, not a panic button.
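The fail-fast settings are ordinary play keywords; a sketch of what the hard-stop variant looks like (group and role names are illustrative):

```yaml
# Stop the rolling update as soon as any batch has a failure
- name: Roll out app with a hard stop
  hosts: web_prod
  serial: 2
  max_fail_percentage: 0   # any failed host in a batch aborts later batches
  any_errors_fatal: true   # one host failing fails the whole play
  roles:
    - app
```

We’d rather explain a half-finished rollout that stopped cleanly than a fully finished one that shouldn’t have continued.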
