Ansible In Real Life: Calm, Repeatable Ops Wins
Less hero work, fewer surprises, and more sleep for everyone.
Why We Keep Coming Back To Ansible
We’ve all been there: a “quick change” turns into a late-night archaeology dig through shell history, half-remembered commands, and a Slack thread that reads like a crime scene. The appeal of Ansible is that it nudges us toward doing the same things the same way—without forcing us to adopt a whole new religion.
At its core, Ansible is just automation that’s readable. YAML playbooks are approachable, and the execution model is simple enough that most of our team can follow what’s happening after one solid walkthrough. That matters in real operations: if only one person can understand the automation, we haven’t reduced risk—we’ve just moved it.
We also like that Ansible is agentless. SSH in, do the work, leave the machine alone. That’s not always the deciding factor, but it reduces the “what’s running on my servers?” debate. Plus, the ecosystem is mature: modules for packages, services, users, cloud resources, network gear, and the weird corner cases that show up when you inherit systems built in 2014 “as a temporary fix.”
If you want the official north star, the Ansible documentation is still the best reference, and Ansible Galaxy is where we grab vetted roles when we’d rather not reinvent the wheel.
The big takeaway: Ansible is not magic. It’s a way to write down intent, run it safely, and stop relying on tribal knowledge and muscle memory. Which, frankly, is the nicest gift we can give Future Us.
Getting Our Inventory And Variables Under Control
Most Ansible pain doesn’t come from tasks—it comes from messy inventories and variables. If we don’t know what we’re targeting and which settings apply, everything else is just fancy guessing.
We like to start with an inventory that reflects reality: environments, regions, and roles. Even if we later move to a dynamic inventory, the mental model stays the same. Group structure should help us answer: “What servers are these, and why do they exist?”
Here’s a simple, readable inventory.ini that scales surprisingly far:
[web]
web-01 ansible_host=10.10.1.11
web-02 ansible_host=10.10.1.12
[db]
db-01 ansible_host=10.10.2.21
[prod:children]
web
db
[all:vars]
ansible_user=deploy
ansible_ssh_common_args='-o StrictHostKeyChecking=accept-new'
Then we add variables where they belong. Our rule of thumb: defaults in roles, shared values in group_vars, and host-specific overrides only when absolutely necessary (because they’re a maintenance tax).
Example structure:
– group_vars/prod.yml for production settings
– group_vars/web.yml for web tier settings
– host_vars/web-01.yml only if it’s truly special (rare, but it happens)
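To make that concrete, here is a sketch of what a tier-level file like group_vars/web.yml might contain. The variable names below are illustrative examples, not settings from any real system:

```yaml
# group_vars/web.yml — shared settings for every host in the [web] group.
# All names and values are hypothetical.
nginx_worker_processes: auto
app_port: 8080
tls_enabled: true
```

Per Ansible’s variable precedence, a host_vars/ file would override any of these for a single machine, which is exactly why we keep such overrides rare.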
We also keep secrets out of plain YAML. If we need encryption, we use Ansible Vault. It’s not glamorous, but it beats “passwords.yml (final_final_really).”
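As a sketch, a single value can be encrypted with ansible-vault encrypt_string and pasted into an otherwise plain vars file, so the rest of the file stays reviewable. The variable name and ciphertext below are placeholders:

```yaml
# Generated with: ansible-vault encrypt_string 'supersecret' --name 'db_password'
db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  6231336561383439...   # truncated placeholder ciphertext
```

Ansible decrypts the value at runtime when given the vault password, so playbooks can reference db_password like any other variable.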
Good inventories make runs predictable, reviews easier, and incidents shorter. Bad inventories turn Ansible into a roulette wheel. We prefer fewer casino vibes in our deployments.
Writing Playbooks That Don’t Scare Future Us
A playbook should read like a checklist, not like an escape room. When we review Ansible changes, we ask: “Can someone new to the team tell what this does in five minutes?” If the answer is no, we refactor.
We keep playbooks thin and push logic into roles. That gives us reuse, cleaner diffs, and less temptation to cram everything into one file. We also lean on idempotency: if we run it twice, it shouldn’t “do stuff” the second time unless something changed.
Here’s a small playbook that shows our usual style—clear names, minimal variables, roles doing the heavy lifting:
---
- name: Configure web tier
  hosts: web
  become: true
  vars:
    app_user: myapp
  roles:
    - common
    - nginx
    - myapp

- name: Configure database tier
  hosts: db
  become: true
  roles:
    - common
    - postgres
A few habits we stick to:
- Name everything: tasks, plays, handlers. When Ansible prints output, those names become your debug log.
- Prefer modules over shell: ansible.builtin.apt beats shell: apt-get .... Modules are idempotent and safer.
- Use handlers for restarts: don’t bounce services unless config changed.
- Fail early when needed: assert and fail tasks save time when prerequisites aren’t met.
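A hypothetical tasks fragment pulling these habits together — the package, variable, and template names are illustrative:

```yaml
# tasks/main.yml (sketch)
- name: Assert prerequisites before changing anything   # fail early
  ansible.builtin.assert:
    that:
      - app_user is defined
    fail_msg: "app_user must be set in group_vars"

- name: Ensure nginx is installed                       # module, not shell
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Deploy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Restart nginx    # handler fires only if the file actually changed
```

The notify line is the idempotency payoff: on a no-change run, the handler never triggers and the service is left alone.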
And yes, we still use shell sometimes. But we treat it like hot sauce: useful, but we don’t want it on every dish.
If you’re building playbooks at scale, roles plus disciplined conventions beat cleverness. Cleverness is fun until it’s 2 a.m. and production is paging us with a “service down” alert and a smug playbook that won’t explain itself.
Roles, Collections, And Reuse Without The Chaos
Once we have more than a couple playbooks, roles become the difference between “manageable” and “spaghetti with YAML garnish.” Roles let us separate concerns: install packages, write config, manage services, and expose variables in a consistent way.
A role we trust usually has:
– defaults/main.yml for safe, documented defaults
– tasks/main.yml for straightforward steps
– templates/ and files/ used intentionally
– handlers/main.yml for restarts/reloads
– meta/main.yml if we want dependencies
We also standardize role interfaces. If our nginx role expects nginx_sites and nginx_extra_conf, we keep those names stable. Changing variable contracts casually is how you break three environments with one “tiny cleanup.”
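For example, the defaults file for such an nginx role might look like this — treating it as the role’s documented, stable interface (values are illustrative):

```yaml
# roles/nginx/defaults/main.yml — the role's public variable contract.
nginx_sites: []          # list of site definitions; safe empty default
nginx_extra_conf: ""     # raw config appended verbatim; empty by default
```

Because these names are the contract, renaming them is a breaking change and should be treated like one, with a deprecation period rather than a “tiny cleanup.”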
Collections are the newer packaging story—modules, plugins, and roles bundled together. We use them when we need a maintained set of modules (cloud providers, network automation, etc.). The upside is versioning and repeatability. The downside is you need to pin versions and test upgrades instead of living on the edge and hoping for the best.
For external dependencies, we use Galaxy, but we treat third-party roles like any other dependency: review them, pin versions, and don’t assume they’re safe just because they’re popular. Galaxy is great, but it’s not a magical safety certification agency.
Helpful references we keep close:
– Ansible Galaxy for roles and collections
– Using collections for structure and install patterns
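Pinning usually lives in a requirements.yml. A sketch — the names and version numbers below are examples, not recommendations:

```yaml
# requirements.yml — pin everything so installs are repeatable
collections:
  - name: community.general
    version: "8.5.0"            # hypothetical pinned version
roles:
  - name: geerlingguy.nginx     # example of a popular Galaxy role
    version: "3.1.4"            # hypothetical pinned version
```

Then ansible-galaxy collection install -r requirements.yml and ansible-galaxy role install -r requirements.yml fetch exactly those versions, in CI and on laptops alike.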
Reuse is great. Uncontrolled reuse is how we end up with five slightly different “common” roles that all fight each other like toddlers sharing one toy.
Safer Runs: Check Mode, Diff, Tags, And Guardrails
Running Ansible against production shouldn’t feel like dropping a bowling ball off a roof and hoping it lands on a pillow. We want guardrails. The good news: Ansible gives us a bunch of practical ones—if we actually use them.
First, we lean on check mode and diff whenever we can:
– --check shows what would change
– --diff shows how files/templates would change
Not every module fully supports check mode, but it’s still valuable as a first pass. It catches “oops, wrong group” mistakes before they become “why is the database server running nginx?”
Second, we use tags to limit blast radius. If we’re only updating TLS settings, we don’t need to re-run user creation, package updates, and unrelated config. Tags also make incident response faster: we can surgically reapply the fix without pulling every lever in the room.
Third, we reduce surprises with serial and max_fail_percentage. Rolling changes are nicer than all-at-once changes—especially for web tiers behind a load balancer.
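A rolling play using both knobs might look like this (the play and role names are illustrative):

```yaml
- name: Roll out web tier changes gradually
  hosts: web
  become: true
  serial: 2                  # update two hosts at a time
  max_fail_percentage: 25    # abort the run if more than a quarter of hosts fail
  roles:
    - nginx
```

Combined with --check --diff on a first pass and --tags to scope the run, this keeps the blast radius of any one apply small.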
Finally, we’re strict about human process:
– Require a PR for changes
– Have a basic review checklist (inventory scope, idempotency, handlers, vault usage)
– Run a staging apply before production
We also keep linting in the loop. ansible-lint catches a lot of “works on my laptop” sins early, and it’s much nicer to be scolded by tooling than by a pager.
The goal isn’t perfection. It’s to make the default path safe, repeatable, and boring. Boring is good. Boring means we’re not improvising in production like it’s open mic night.
Secrets, SSH, And Not Leaking The Crown Jewels
Ansible makes it easy to automate… and equally easy to accidentally commit something spicy to git. So we treat secrets handling as a first-class part of the workflow, not an afterthought.
Our baseline: no cleartext secrets in repositories. If a variable looks like a password, token, private key, or connection string, it belongs in a secret store or in Vault-encrypted files. Ansible Vault isn’t perfect, but it’s simple and it works.
We keep Vault usage straightforward:
– one vault password source per environment (or a secure vault-id approach)
– avoid giant “everything.yml” encrypted blobs
– encrypt only what must be secret, so reviews are still useful
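In practice that can look like one small encrypted file per environment, with the encryption command kept nearby for reference (paths and variable names are illustrative):

```yaml
# group_vars/prod/vault.yml — encrypted with a prod-specific vault id, e.g.:
#   ansible-vault encrypt --vault-id prod@prompt group_vars/prod/vault.yml
# Only the truly secret values live here; plain settings stay in vars.yml
# so code review still sees most of the diff.
vault_db_password: "placeholder"   # this file is encrypted at rest
```

Keeping the encrypted file small means a reviewer can still reason about 95% of a change without touching the vault password at all.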
We also think about SSH hygiene. Agent forwarding is convenient until it isn’t. We prefer dedicated deploy users, tight sudo rules, and explicit SSH settings in inventory or ansible.cfg. If we need bastions, we configure them deliberately rather than relying on someone’s shell config.
Speaking of config: we keep ansible.cfg in the repo so runs are consistent across laptops and CI. Defaults matter. A lot. One teammate’s “helpful” global config can change behaviour in ways that are hard to spot.
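A minimal repo-level ansible.cfg sketch — the specific values are examples to tune, not prescriptions:

```ini
; ansible.cfg kept in the repo root so laptops and CI share the same defaults
[defaults]
inventory = inventory.ini
roles_path = roles
stdout_callback = yaml   ; more readable task output
forks = 10

[ssh_connection]
pipelining = True        ; fewer SSH round-trips per task
```

Because Ansible picks up the ansible.cfg in the current directory, running from the repo root gives everyone identical behaviour regardless of their personal global config.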
And when we need to integrate with a broader secret strategy, we look at external secret managers and plugins rather than forcing Vault to be something it’s not. But even then, the guiding principle stays the same: make it hard to leak secrets by accident and easy to do the right thing on purpose.
Reference worth bookmarking: Ansible Vault guide. It’s the boring documentation that prevents exciting incidents.
CI/CD With Ansible: Tests, Dry Runs, And Promotion
We don’t want Ansible to be “run from a laptop and hope.” That’s not a process; that’s a tradition. Instead, we wire it into CI so every change gets the same basic scrutiny before it goes anywhere near production.
Our typical pipeline looks like this:
- Lint: ansible-lint
- Syntax check: ansible-playbook --syntax-check
- Optional: --check runs against a sandbox or ephemeral environment
- Merge gates: require approvals for prod-affecting paths
- Promote: staging first, then production
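The first steps of that pipeline can be sketched as a CI job. This example assumes GitHub Actions; adapt it to whatever runner you use, and note that file names like site.yml are placeholders:

```yaml
# .github/workflows/ansible-ci.yml (sketch)
name: ansible-ci
on: [pull_request]
jobs:
  lint-and-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install tooling
        run: pip install ansible ansible-lint
      - name: Lint
        run: ansible-lint
      - name: Syntax check
        run: ansible-playbook --syntax-check site.yml   # playbook name is illustrative
```

Promotion to staging and production then hangs off merge, behind the approval gates described above, rather than off anyone’s laptop.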
Where possible, we add lightweight tests. Not everything needs a full integration environment, but we should at least validate that templates render, roles resolve, and inventories are sane. If we manage images or containers, we sometimes test roles by building a disposable VM/container and applying the role. When that’s too heavy, we still do linting plus a targeted check mode run.
We also keep environments separated by inventory and credentials, not by vibes. CI jobs should make it very hard to “accidentally” deploy to prod because the wrong variable file got included.
If you’re running AWX or Ansible Automation Platform, you get orchestration, RBAC, scheduling, and audit trails, which can be a big deal for larger teams. Even without it, a basic CI runner plus good conventions gets us 80% of the safety.
Bottom line: Ansible shines when it’s part of a repeatable delivery path. The more we standardize how it’s run, the less we rely on whoever remembers the exact command flags from last quarter.



