Ansible Without Tears: Practical Automation We’ll Actually Use
Less yak-shaving, more predictable changes across our servers.
Why We Keep Coming Back To Ansible
We’ve all met “automation” that promised to save time and instead gave us a new hobby: debugging someone else’s cleverness. Ansible tends to stick around in our toolbelt for one simple reason: it’s readable. We can hand a playbook to a teammate, go grab a coffee, and reasonably expect they won’t set the datacenter on fire while trying to understand it.
At its core, Ansible is just SSH plus a structured way to describe desired state. That “desired state” bit matters. We’re not writing step-by-step shell scripts that drift over time; we’re declaring what “done” looks like. When it works well, we get repeatable changes, fast rollbacks (by re-applying the right state), and fewer “it worked on that one box” moments.
Another reason we like it: low friction to start. There’s no agent to roll out everywhere, which is handy when we’re inheriting a mixed fleet or dealing with locked-down environments. And the ecosystem is mature enough that we can usually avoid reinventing wheels—collections, roles, and modules cover a ton of ground. The official docs are also refreshingly usable when we inevitably forget syntax at 4:57 PM. (Ansible documentation)
But: Ansible is not magic. It can be slow at scale if we’re careless, and it will happily let us write spaghetti YAML. The goal of this post is to keep the good parts—clarity, repeatability, safety—without turning our repo into a haunted house of half-finished playbooks.
A Sensible Starting Layout For Our Repo
If we don’t decide on structure early, we’ll end up with “playbook_final_v7_reallyfinal.yml” and a collective sense of shame. A small amount of convention goes a long way. What we want is a repo that’s easy to navigate, easy to test, and hard to misuse.
A practical baseline is: inventories separated by environment, group vars and host vars kept tidy, and roles for anything that’s more than a handful of tasks. Playbooks should read like “intent,” not like a transcript of terminal commands. Also: we should keep secrets out of Git. Always. Even “temporary” ones. (Yes, we’ve all said that.)
Here’s a layout we’ve used repeatedly:
ansible/
  ansible.cfg
  inventories/
    dev/
      hosts.yml
      group_vars/
        all.yml
    prod/
      hosts.yml
      group_vars/
        all.yml
  playbooks/
    site.yml
    web.yml
    db.yml
  roles/
    common/
      tasks/main.yml
      handlers/main.yml
      templates/
      files/
      defaults/main.yml
  collections/
  vault/
    prod.vault.yml
A few opinions we’ve learned the hard way:
– Keep site.yml as the entrypoint that includes other playbooks. It becomes our “table of contents.”
– Put environment-specific values in inventory group_vars, not sprinkled throughout tasks.
– Use roles once a pattern repeats. If we copy/paste tasks twice, that’s a role begging to exist.
– Commit ansible.cfg so everyone runs with the same defaults (paths, forks, stdout formatting).
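A minimal shared ansible.cfg in that spirit might look like the sketch below; the specific values (default inventory, fork count) are illustrative, not recommendations:

```ini
# ansible.cfg — shared defaults so every run behaves the same (illustrative values)
[defaults]
inventory = inventories/dev/hosts.yml
roles_path = roles
collections_path = collections
forks = 20
interpreter_python = auto_silent

[ssh_connection]
# Reuse SSH connections for multiple tasks; noticeably faster on real fleets
pipelining = True
```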
For guidance on recommended patterns, the docs and community examples are solid (Best practices). We’re aiming for boring, predictable structure—because boring infrastructure is the best kind.
Inventory And Variables: Where Things Usually Go Sideways
Most Ansible pain isn’t in modules; it’s in variables. Specifically: “Where is this value coming from?” and “Why did prod just get dev’s settings?” If we want calm deployments, we need a variable strategy we can explain on a whiteboard without crying.
First, pick an inventory format and stick to it. YAML inventory is readable and plays nicely with nested groups. Example:
# inventories/prod/hosts.yml
all:
  children:
    web:
      hosts:
        web-01.prod.example.com:
        web-02.prod.example.com:
    db:
      hosts:
        db-01.prod.example.com:
Then define the values that apply to everyone:
# inventories/prod/group_vars/all.yml
app_name: "catalog"
app_env: "prod"
nginx_worker_processes: 4
Now the key: we keep “facts” separate from “choices.” Hostnames, IPs, and membership go in inventory. Configuration choices go in group_vars. Secrets go in Vault (we’ll get to that). When we need per-host overrides, we use host_vars/hostname.yml, but sparingly—per-host snowflakes are where standardization goes to die.
Also, we treat precedence like a loaded nail gun. Ansible has clear precedence rules, but “clear” doesn’t mean “memorable.” When something’s odd, we use ansible-inventory --graph and ansible-inventory --host <host> to see what’s happening. Variable debugging early saves hours later. (Inventory intro)
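Concretely, against the prod inventory above, those two checks look like this (the host name comes from the example inventory):

```shell
# Show the group/host tree the inventory actually produces
ansible-inventory -i inventories/prod/hosts.yml --graph

# Show every variable that ends up applying to one host, after all merging
ansible-inventory -i inventories/prod/hosts.yml --host web-01.prod.example.com
```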
Finally, we name variables with intent: nginx_worker_processes beats workers. And we avoid reusing generic names like port unless we enjoy guessing games.
Writing Playbooks That Don’t Age Like Milk
A playbook should tell a story: target hosts, become or not, what roles/tasks apply, and what’s safe to change. If our playbook reads like “run these 47 commands,” we’re just doing shell scripting with extra indentation.
We try to keep playbooks thin and push complexity into roles. Here’s a simple playbook that deploys a web tier and keeps things readable:
# playbooks/web.yml
- name: Configure web servers
  hosts: web
  become: true
  vars:
    nginx_sites:
      - server_name: "catalog.example.com"
        proxy_pass: "http://127.0.0.1:8080"
  roles:
    - common
    - nginx
    - app
In roles, we aim for idempotency. If a task changes something every run, we’ve got drift hiding in plain sight. We prefer modules over shell/command because modules understand state. When we must use shell, we add creates: or changed_when: so runs stay honest.
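As a sketch of that habit (the command and paths here are hypothetical, just to show the shape):

```yaml
# One-time initialization: "creates" keeps reruns honest
- name: Initialize app database schema
  ansible.builtin.command: /opt/app/bin/init-db
  args:
    creates: /var/lib/app/.schema-initialized

# Read-only commands should say so, or every run reports "changed"
- name: Check current schema version
  ansible.builtin.command: /opt/app/bin/schema-version
  register: schema_version
  changed_when: false
```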
Handlers are our best friends for service restarts. Restarting Nginx on every run is a fun way to create “mystery blips” in monitoring.
We also keep tasks small and labeled. name: isn’t decoration; it’s future-us’s debugging handle. And we tag things. Tags let us run just what we need during development and incidents:
ansible-playbook playbooks/web.yml --tags nginx --skip-tags slow
One more habit: use check_mode during review and in CI where possible. It doesn’t catch everything, but it stops a lot of “oops.” (Playbooks intro)
Roles, Templates, And Handlers: The Grown-Up Stuff
Roles are where Ansible starts paying rent. They let us encapsulate logic, defaults, templates, handlers, and files into a unit we can reuse across teams and environments. If we’re doing the same config pattern in three places, a role makes it consistent and reviewable.
A typical role flow:
– Defaults define safe baseline values.
– Tasks install packages, render templates, and manage services.
– Handlers restart/reload services when templates change.
– Templates (Jinja2) turn variables into config files.
Example: a minimal Nginx role snippet.
# roles/nginx/tasks/main.yml
- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present

- name: Deploy nginx site config
  ansible.builtin.template:
    src: site.conf.j2
    dest: /etc/nginx/conf.d/{{ app_name }}.conf
    mode: "0644"
  notify: Reload nginx

- name: Ensure nginx is enabled and running
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
And the handler:
# roles/nginx/handlers/main.yml
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
Templates should be boring. If our Jinja looks like a programming contest entry, we’re doing it wrong. Keep logic in vars, not in templates, and prefer simple conditionals.
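As an example of “boring,” a site template for the nginx role above can be little more than variable substitution. The app_port variable below is an assumption we’d define in group_vars; everything else comes from the role snippet:

```jinja
# roles/nginx/templates/site.conf.j2
server {
    listen 80;
    server_name {{ app_name }}.example.com;

    location / {
        proxy_pass http://127.0.0.1:{{ app_port | default(8080) }};
    }
}
```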
Also: roles are easier to share when they’re opinionated but configurable—sensible defaults, with clear overrides in inventory. If we want reusable community roles, Ansible Galaxy is still a decent starting point, with the usual “read the code before you trust it” disclaimer. (Ansible Galaxy)
Secrets And Safety: Vault, Least Surprise, And Guard Rails
If we’ve ever found a password in a Git history from 2019, we know the feeling: a mix of dread and “how did that get there?” Ansible gives us Ansible Vault to encrypt sensitive variables at rest. It’s not the only option (SOPS, external secret managers), but it’s built in and works reliably.
Our baseline approach:
– Keep secrets in separate vault/*.vault.yml files.
– Reference them via vars_files.
– Use different vault passwords/IDs per environment.
– Never decrypt secrets into logs.
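Creating and editing those files with per-environment vault IDs looks like this; the `prod` ID label is our own convention, not anything built in:

```shell
# Create a new encrypted vars file under the "prod" vault ID
ansible-vault create --vault-id prod@prompt vault/prod.vault.yml

# Edit it later; only the "prod" password can open it
ansible-vault edit --vault-id prod@prompt vault/prod.vault.yml
```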
Example playbook pattern:
- name: Deploy app with secrets
  hosts: web
  become: true
  vars_files:
    - "../vault/prod.vault.yml"
  roles:
    - app
We keep non-secret config in group_vars, and secrets only in Vault. That way we can review changes without accidentally leaking credentials in a pull request. For teams, vault IDs help avoid the “who has the password?” scramble. (Using Vault)
Now, safety isn’t just secrets. It’s also guard rails:
– Limit blast radius with --limit during changes.
– Use serial: for rolling updates.
– Use max_fail_percentage: to stop if things go wrong.
– Prefer strategy: free only when we understand the side effects.
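Put together, a rolling update for the web tier can carry those guard rails right in the play header (the batch size and threshold below are illustrative):

```yaml
- name: Roll out web tier in small batches
  hosts: web
  become: true
  serial: 2                 # update two hosts at a time
  max_fail_percentage: 25   # abort if more than a quarter of a batch fails
  roles:
    - app
```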
And our favourite: require --check and a human review for high-risk playbooks. Computers are fast; regrets are slower.
CI, Linting, And Running Ansible Like We Mean It
Running playbooks manually from laptops works… until it doesn’t. The moment more than one person touches automation, we need consistent checks. We don’t need a huge pipeline, but we do need enough automation to catch silly mistakes before they hit prod.
At minimum, we run:
– ansible-lint for style and risky patterns
– yamllint for syntax and formatting
– ansible-playbook --syntax-check
– Optional: Molecule tests for roles (especially shared roles)
We also pin versions. “It worked yesterday” is often “we upgraded ansible-core without noticing.” Use a requirements.yml for collections and a pinned Python environment (pip-tools/poetry/venv—pick your poison).
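Pinning collections can be as simple as the file below; the collection names and versions are examples, not recommendations:

```yaml
# requirements.yml
collections:
  - name: community.general
    version: "8.5.0"
  - name: ansible.posix
    version: "1.5.4"
```

Installing them is then one command everyone (and CI) runs the same way: ansible-galaxy collection install -r requirements.yml.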
A lightweight GitHub Actions job might:
– Install dependencies
– Lint
– Syntax-check key playbooks
– Optionally run check mode against a test inventory
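A sketch of such a job; action versions and the playbook path are assumptions to adapt to the repo:

```yaml
# .github/workflows/ansible-ci.yml
name: ansible-ci
on: [pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install tooling
        run: pip install ansible-core ansible-lint yamllint
      - name: Lint
        run: |
          yamllint .
          ansible-lint
      - name: Syntax-check entrypoint
        run: ansible-playbook --syntax-check playbooks/site.yml
```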
When we talk about “running Ansible like we mean it,” we also mean:
– Centralized execution (runner in CI, or a controlled jump host)
– Consistent credentials (SSH keys, vault IDs, privilege escalation rules)
– Logs we can actually read during incidents
If we want an execution layer with audit trails and RBAC, AWX/Automation Controller is an option, but we don’t need it on day one. (AWX project)
The goal is simple: every merge should make our automation more reliable, not more exciting.
Scaling Without Regret: Performance, Idempotency, And Drift
Ansible can handle big fleets, but we need to be deliberate. The two classic scaling killers are: doing too much per host, and doing it in the slowest way possible. We’ve all seen playbooks that gather every fact, run five shell commands per task, and restart services like they’re getting paid per restart.
Here’s what we do instead:
– Disable fact gathering when we don’t need it: gather_facts: false
– Use package modules and service modules, not shell loops
– Use async and poll for long-running tasks
– Increase forks carefully in ansible.cfg to match our environment
– Cache facts if we need them repeatedly (but only if it’s actually helpful)
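A play that applies a couple of these at once might look like the following; the long-running script is hypothetical:

```yaml
- name: Fast-path config push
  hosts: web
  become: true
  gather_facts: false        # we don't use facts here, so skip the round trip
  tasks:
    - name: Kick off a long mirror sync without holding up the run
      ansible.builtin.command: /usr/local/bin/sync-mirrors   # hypothetical script
      async: 600             # allow up to 10 minutes
      poll: 0                # fire and forget; check later with async_status if needed
      changed_when: false
```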
Idempotency is the other half of scaling. If playbooks report changes every run, we can’t tell real changes from noise. That ruins trust and makes “drift” invisible. We want a run to be mostly green and quiet unless we intentionally changed something.
We also keep drift in check by reapplying baseline roles regularly (common hardening, users, time sync, logging) and by using immutable-ish approaches where appropriate (golden images, containers) so Ansible config doesn’t carry the whole world on its back.
Finally, we measure. If a playbook takes 45 minutes, we profile: which tasks are slow, which hosts are laggards, which modules are chatty. Boring performance work beats heroic incident response every time.
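One low-effort way to start measuring: the ansible.posix collection ships a profile_tasks callback that prints per-task timing at the end of every run (this assumes the collection is installed):

```ini
# ansible.cfg — enable per-task timing output
[defaults]
callbacks_enabled = ansible.posix.profile_tasks
```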


