Ansible In Real Life: Less Drama, More Deploys

How we keep servers predictable without turning ops into theatre.

Why We Still Reach For Ansible

We’ve all got that one server that “only works if you don’t look at it.” It’s running a mission-critical thing, it was hand-tweaked at 2 a.m., and nobody wants to touch it because the last person who did is now “between opportunities.” That’s the moment we stop pretending our infrastructure is a collection of artisanal snowflakes and start writing things down as code. For us, Ansible is the quickest path from “tribal knowledge” to “repeatable reality.”

What we like is that it’s boring in the best way. It uses SSH, it doesn’t demand agents everywhere, and a playbook is readable by humans who don’t want to learn a new programming language just to install nginx. That readability matters when we’re debugging under pressure. Also, it’s easy to start small: one play, one role, one host group. The value shows up before we’ve even finished arguing about naming conventions.

Ansible also fits nicely into the way most teams actually work. We can keep the “source of truth” in Git, review changes in pull requests, and run playbooks in CI/CD without making everything a giant platform project. If we want a GUI later, we can adopt something like AWX (the upstream for Automation Controller) without rewriting our automation.

Most importantly, Ansible helps us answer three questions fast:
1) What changed? (Git)
2) Who changed it? (Git + reviews)
3) Can we reproduce it? (playbooks/roles)

If we can do those three, incidents get shorter and sleep gets longer.

Inventory: Where “It Ran On My Laptop” Goes To Die

Inventory is where Ansible becomes a team sport instead of a local science experiment. Early on, we learned the hard way that “a pile of IPs in a file” works… until it doesn’t. As soon as you have dev/stage/prod, different SSH users, a couple of oddball ports, and hosts that come and go, you need a structure that scales without turning into a spreadsheet with feelings.

We typically use YAML inventory because it’s easy to read and extend. The key is to model the world the way you operate it: environments, roles, and shared variables. Keep sensitive values out of inventory (more on that later), and aim for “least surprise.” If a host is in prod, it should inherit prod defaults. If a host is a database, it should inherit database settings. It shouldn’t be a scavenger hunt across six files.

Here’s a simple pattern we use:

# inventory.yml
all:
  children:
    dev:
      hosts:
        dev-app-01:
          ansible_host: 10.10.1.11
    prod:
      children:
        app:
          hosts:
            prod-app-01:
              ansible_host: 10.20.1.21
            prod-app-02:
              ansible_host: 10.20.1.22
        db:
          hosts:
            prod-db-01:
              ansible_host: 10.20.2.31

  vars:
    ansible_user: ubuntu
    ansible_ssh_common_args: "-o StrictHostKeyChecking=no"  # fine for labs; pin host keys in prod

We’ll usually complement this with group_vars/ and host_vars/ so inventory stays focused on “who/where,” while behaviour lives elsewhere. When we need dynamic inventory (cloud autoscaling groups, ephemeral environments), we lean on Ansible’s inventory plugins rather than inventing our own duct-taped scripts. The official docs are worth bookmarking: Ansible Inventory.
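As a sketch of that split, with hypothetical file paths and values: environment-wide behaviour lives in group vars, and the rare per-host exception lives in host vars.

```yaml
# group_vars/prod/main.yml -- defaults every prod host inherits
app_env: production
nginx_worker_processes: auto

# group_vars/db/main.yml -- settings every database host inherits
postgres_max_connections: 200

# host_vars/prod-db-01/main.yml -- a deliberate per-host exception
postgres_max_connections: 400
```

Inventory then stays a clean map of “who/where,” and a reviewer can see at a glance whether a change affects one host or a whole environment.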

The goal: make it easy to target “all prod app servers” without anyone needing to remember IP addresses like it’s 2009.

Playbooks That Don’t Age Like Milk

Playbooks can either be a joy to maintain or a cursed scroll no one wants to open. The difference usually comes down to a few habits: stay declarative, keep tasks small, and avoid “shell: do_the_thing.sh” unless we genuinely have no better option.

We treat playbooks as orchestration: they should describe what we want and where, while the how lives in roles. That separation makes it easier to reuse components, test changes, and avoid massive playbooks that try to do everything in one file.

A sane, readable playbook might look like this:

# site.yml
- name: Configure web tier
  hosts: app
  become: true
  vars:
    nginx_worker_processes: auto
  roles:
    - common
    - nginx
    - app_deploy

- name: Configure database tier
  hosts: db
  become: true
  roles:
    - common
    - postgres

A few rules we try to follow:

  • Prefer built-in modules over shell commands. Modules are designed to be idempotent; shell scripts are “trust me, bro.”
  • Use handlers for restarts. Restarting services every run is a great way to cause “mysterious” blips.
  • Keep variables in the right place: defaults in role defaults, environment overrides in group vars.
  • Limit inline logic. A couple of when: conditions are fine; a playbook that reads like a programming puzzle is not.

When we do need shell, we’re explicit: set creates: or changed_when: to keep runs clean and predictable.

For module discovery and examples, we rely on the official module index: Ansible Collections and Modules. It saves us from reinventing a package manager inside a bash heredoc (which, admittedly, we’ve all done once).

Roles: The Only Way We Stay Sane

Once your automation grows past “install vim on a couple of boxes,” roles are where Ansible starts paying rent. We use roles to package repeatable behaviour: baseline hardening, logging agents, web servers, app deployments, database setup, and so on. The magic isn’t the folder structure—it’s the discipline to keep each role focused and composable.

A typical role layout we stick to:

  • defaults/main.yml for safe defaults
  • vars/main.yml for role-internal constants
  • tasks/main.yml as the entry point
  • handlers/main.yml for service restarts/reloads
  • templates/ and files/ for configuration artifacts
  • meta/main.yml for dependencies when needed
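For an nginx role, the defaults file might start as small as this (values illustrative):

```yaml
# roles/nginx/defaults/main.yml -- safe, overridable defaults
nginx_worker_processes: auto
nginx_keepalive_timeout: 65
```

Everything here can be overridden from group vars, which keeps the role reusable across environments without edits.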

Here’s a tiny example from an nginx role that keeps runs idempotent and restarts only when config changes:

# roles/nginx/tasks/main.yml
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: true

- name: Deploy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: "0644"
  notify: Reload nginx

- name: Ensure nginx is running
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

# roles/nginx/handlers/main.yml
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded

This is dull code, and that’s the point. The best automation is the kind we don’t have to think about.

For sharing or bootstrapping, we sometimes pull in roles via Ansible Galaxy, but we’re picky. If we don’t understand what a community role does, we don’t run it in prod. “It had a lot of stars” is not a security policy.

Secrets: Vault, SOPS, Or “Please Don’t Put Passwords In Git”

Every team hits the same fork in the road: either we manage secrets intentionally, or we accidentally manage them via screenshots and Slack messages. We choose intentional, mostly because we like boring compliance conversations.

Ansible gives us Ansible Vault, which is perfectly serviceable if we keep it simple: encrypt files that contain secrets, decrypt at runtime in CI/CD, and avoid mixing secret and non-secret data in the same variable files. Vault isn’t a full secret manager, but it’s a practical tool for many setups, especially when we’re not ready to introduce another system.

A workflow we’ve used successfully:

  • Store group_vars/prod/vault.yml encrypted
  • Keep group_vars/prod/main.yml unencrypted for non-sensitive overrides
  • Inject the vault password via CI secrets or a local password manager (not a shared text file named vault_pass.txt on someone’s desktop… we’ve all seen it)
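A sketch of that split, with hypothetical variable names. The unencrypted file references the encrypted one via indirection, so `grep` still finds the variable even though the value is vaulted:

```yaml
# group_vars/prod/main.yml (plaintext, committed as-is)
db_user: app
db_password: "{{ vault_db_password }}"   # indirection into the vaulted file

# group_vars/prod/vault.yml (encrypted with `ansible-vault encrypt`)
vault_db_password: "example-not-a-real-secret"
```
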

We also keep an eye on integrations: if we’re already on a secret manager like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault, we’ll often fetch secrets dynamically rather than storing them in the repo. That can reduce rotation pain and limit blast radius. But it’s not “free”—it adds dependencies and failure modes. If the secret backend is down, our deploy is down too.
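As one hedged example: with the amazon.aws collection installed, a lookup can pull the value at runtime instead of from the repo (secret name hypothetical):

```yaml
- name: Read DB password from AWS Secrets Manager
  ansible.builtin.set_fact:
    db_password: "{{ lookup('amazon.aws.aws_secret', 'prod/app/db_password') }}"
  no_log: true   # keep the fetched value out of task output and logs
```

The `no_log: true` matters regardless of backend—lookup results otherwise land in run output.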

For teams that want Git-friendly encrypted blobs, Mozilla SOPS is another approach, though it’s not Ansible-native. We only adopt it when there’s a clear operational win.

Regardless of tool, the rules don’t change:
– Don’t log secrets.
– Don’t copy secrets into templates unless you must.
– Rotate periodically and after staff changes.
– Review who can decrypt and where.

Security isn’t about paranoia; it’s about reducing “oops” moments.

Testing And CI: Because Production Is A Terrible Unit Test

Running Ansible by hand from a laptop works until it doesn’t. Eventually, someone runs the wrong playbook against the wrong inventory, and we get an exciting afternoon. Our fix is to treat infrastructure code like application code: lint it, test it, and run it via pipelines with guardrails.

At minimum, we like:
– ansible-lint for style and best practices
– Syntax checks on every pull request
– A non-prod environment where we can run playbooks end-to-end

For roles, we’ve had good results with Molecule when we need deeper testing. It can spin up ephemeral instances (containers or VMs, depending on the driver) and validate that the role converges cleanly. Yes, it’s extra work. No, we don’t do it for every single role. We do it for roles that are foundational or risky (SSH hardening, base OS config, etc.).
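A minimal molecule.yml sketch, assuming the Docker driver plugin is installed (image and instance name are placeholders):

```yaml
# molecule/default/molecule.yml -- a sketch, not a drop-in config
driver:
  name: docker
platforms:
  - name: instance
    image: ubuntu:22.04
provisioner:
  name: ansible
verifier:
  name: ansible
```

`molecule test` then creates the instance, applies the role, checks idempotence (a second run should report zero changes), runs the verifier, and destroys the instance.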

We also use --check mode carefully. It’s useful, but it’s not a perfect “dry run,” especially for tasks that depend on runtime state. Same story with diffs: --diff is great for templates and file changes, but it can be noisy.

In CI, we typically:
– Validate YAML
– Run ansible-playbook --syntax-check
– Run ansible-lint
– Optionally run Molecule on changed roles
– Require approvals for production runs
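Sketched as a GitLab CI fragment (job names and inventory path hypothetical; any CI system expresses the same gate):

```yaml
# .gitlab-ci.yml fragment -- a sketch, not a drop-in config
lint:
  stage: test
  script:
    - yamllint .
    - ansible-lint
    - ansible-playbook site.yml -i inventory.yml --syntax-check

deploy_prod:
  stage: deploy
  when: manual          # a human approves before anything touches prod
  script:
    - ansible-playbook site.yml -i inventory.yml -l prod
```

The `when: manual` gate is the cheap version of “require approvals for production runs.”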

When we want a more controlled execution environment, we’ll run automation via AWX/Controller so credentials, inventories, and job templates are centralized. The upstream project is here: AWX on GitHub. Even if we don’t deploy it, reading its model helps us think about separation of duties.

The punchline: the less we rely on “Dave’s laptop state,” the fewer surprises we get at 3 a.m.

Operations Patterns We’ve Learned The Hard Way

Ansible is straightforward to start, but operating it well takes a few lessons—usually learned right after something breaks. Here are patterns we keep coming back to.

1) Idempotency is non-negotiable.
If we can’t run a playbook twice safely, it’s not automation; it’s a scripted gamble. We avoid shell and command unless necessary, and we use creates, removes, changed_when, and proper modules to keep runs stable.

2) Limit blast radius.
We use serial for rolling changes and max_fail_percentage to avoid “took out the whole fleet” moments. For example, deploy to 10% at a time, validate health, then proceed.
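Sketched as a play, reusing the app_deploy role from earlier (the exact percentages are ours to tune):

```yaml
- name: Roll out app tier in batches
  hosts: app
  become: true
  serial: "10%"              # one batch of hosts at a time
  max_fail_percentage: 0     # any batch failure aborts the rest of the rollout
  roles:
    - app_deploy
```

With two hundred hosts, a bad release stops after the first twenty instead of the whole fleet.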

3) Tag ruthlessly.
Tags let us run only what we need: --tags nginx or --tags deploy. Without tags, every run becomes “hope nothing else changes.”
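Tasks opt in like this (the artifact variable is hypothetical):

```yaml
- name: Deploy application release
  ansible.builtin.copy:
    src: "{{ app_artifact }}"     # hypothetical variable
    dest: /opt/app/current
  tags: [deploy]

- name: Tune nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  tags: [nginx]
  notify: Reload nginx
```

Then `ansible-playbook site.yml --tags deploy` runs just that slice, and `--skip-tags deploy` does the inverse.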

4) Keep templates simple.
Jinja2 is powerful, but we don’t want config generation to become a logic engine. If the template needs loops inside loops inside conditionals, we reconsider the design.
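A template that stays boring, reusing the nginx_worker_processes variable from the playbook above (the default filter is the escape hatch for optional values):

```jinja
# templates/nginx.conf.j2 -- reads almost like the final config
worker_processes {{ nginx_worker_processes }};
keepalive_timeout {{ nginx_keepalive_timeout | default(65) }};
```

If a template needs more logic than a variable and the occasional default, we usually move the decision into vars instead.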

5) Document the contract.
Every role should state: supported OS versions, required variables, and what it changes. Future us will thank present us. Present us rarely listens, but we try.

6) Know when not to use Ansible.
If we need continuous reconciliation and drift correction at massive scale, other tools might fit better. We still use Ansible a lot, but we don’t force it into jobs it’s not great at. Sometimes the right solution is a golden image pipeline, sometimes it’s Kubernetes, sometimes it’s “stop mutating servers.”

If we stick to these patterns, Ansible stays a calm tool—not a chaotic one. And calm is the whole point.