Phase 10 — Ansible (Post-MAAS Bootstrap & Day-2 Ops)

The first nine phases of the platform built the running cluster: provisioning, k3s, Tailscale, MetalLB, Longhorn, Ingress, Harbor, monitoring, podinfo. Each one needed a small, manual node-level tweak (an apt install, an iscsid enable, a config file dropped into /etc/). Those tweaks worked, but they live as prose in CLAUDE.md rather than as code on disk. If a node's SSD dies tomorrow, every one of them is a chance to forget something.

Phase 10 codifies the post-MAAS, pre-k3s node bootstrap as Ansible roles. The proof that the codification is correct is ansible-playbook --check --diff: when it returns changed=0 against the live cluster, our roles match reality. When it doesn't, we've found drift — which is exactly what happened on this run.

Ansible is not used to install k3s itself. The cluster is live and stateful (etcd, Longhorn replicas, Harbor data, Prometheus TSDB); rerunning a curl-pipe-sh installer against a healthy node is risk without reward. The roles cover only the OS-level prerequisites that need to be re-applied if a node is reimaged.

A second playbook (upgrade.yml) demonstrates rolling Day-2 maintenance — kubectl drain → apt upgrade → reboot if needed → wait for Ready → kubectl uncordon, one node at a time.


Architecture

       Controller (this machine)
       ┌─────────────────────────┐
       │ ansible-core 2.20       │
       │ pipx-installed          │
       │ ssh keys to ubuntu@     │
       │ 10.0.0.{2,4,7}          │
       └────────────┬────────────┘
                    │ SSH (key auth, NOPASSWD sudo)
┌───────────────────┼────────────────────┐
▼                   ▼                    ▼
set-hog        fast-skunk           fast-heron
(control plane) (worker)             (worker)

roles applied to every node:
  ▸ common             base utilities
  ▸ longhorn-prereq    open-iscsi + iscsid enabled
  ▸ k3s-registries     /etc/rancher/k3s/registries.yaml
  ▸ network            /etc/netplan/99-default-gateway.yaml

Decisions

| Decision | Choice | Why |
|---|---|---|
| Install method | pipx install --include-deps ansible (ansible-core 2.20) | apt install ansible ships an older 9.x release; pipx isolates ansible's Python deps from system Python and gives us the modern release |
| Repo location | ~/minicloud-ktaylorganisation/ansible/ | Sibling of the chart values files; not a separate git repo (yet) |
| Role boundary | One role per concern: common, longhorn-prereq, k3s-registries, network | Each manual config from Phases 0–7 becomes its own role; clean isolation |
| No k3s install task | Documented in Phase 1; not in any playbook | Cluster is live and stateful; running a curl-pipe-sh installer adds risk without value. New nodes get the install command run once manually, then site.yml. |
| First verification | --check --diff against live nodes | Standard Ansible idempotency proof. The receipt is changed=0 on a second run. |
| Day-2 ops | serial: 1 rolling upgrade with kubectl drain / uncordon delegated to localhost | Real production pattern; ensures workloads relocate before any node is upgraded |
| Secrets | None in scope | OS config only: no API keys, no passwords. Ansible Vault arrives in Phase 15. |

Pre-flight

# Install ansible-core via pipx (no sudo for the ansible binary itself)
sudo apt install -y pipx python3-venv
pipx ensurepath
pipx install --include-deps ansible
ansible --version | head -1
# → ansible [core 2.20.x]

# SSH key auth must already work — Phase 0 set this up
ssh ubuntu@10.0.0.2 echo ok # set-hog
ssh ubuntu@10.0.0.4 echo ok # fast-skunk
ssh ubuntu@10.0.0.7 echo ok # fast-heron
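
Once the inventory below is in place, an ad-hoc ping plus a sudo check makes a quick end-to-end sanity test (run from the ansible/ directory so ansible.cfg is picked up):

ansible all -m ping            # expect "pong" from all three nodes
ansible all -b -a "whoami"     # expect "root": proves NOPASSWD sudo works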

Repo layout

ansible/
├── ansible.cfg          # inventory path, callback format, ssh pipelining
├── inventory.yml        # 3 nodes grouped: control_plane / workers / cluster
├── README.md
├── playbooks/
│   ├── site.yml         # bootstrap: common + longhorn-prereq + k3s-registries + network
│   └── upgrade.yml      # Day-2: rolling apt upgrade with drain/uncordon
└── roles/
    ├── common/          # tasks/main.yml + defaults/main.yml
    ├── longhorn-prereq/ # tasks/main.yml
    ├── k3s-registries/  # tasks/main.yml + files/registries.yaml
    └── network/         # tasks/main.yml + files/99-default-gateway.yaml + handlers/main.yml

ansible.cfg

[defaults]
inventory = inventory.yml
host_key_checking = False
retry_files_enabled = False
callback_result_format = yaml
deprecation_warnings = False
roles_path = roles
forks = 5

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ForwardAgent=no

Note: in older docs you'll see stdout_callback = yaml. That callback was removed in community.general 12. The modern equivalent is callback_result_format = yaml on the built-in default callback.
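
A quick way to confirm which settings actually took effect, handy when debugging callback options (run from the ansible/ directory):

ansible-config dump --only-changed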

inventory.yml

all:
  vars:
    ansible_user: ubuntu
    ansible_python_interpreter: /usr/bin/python3
  children:
    control_plane:
      hosts:
        set-hog: {ansible_host: 10.0.0.2}
    workers:
      hosts:
        fast-skunk: {ansible_host: 10.0.0.4}
        fast-heron: {ansible_host: 10.0.0.7}
    cluster:
      children:
        control_plane:
        workers:
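
ansible-inventory renders the parsed groups, which is a cheap check that the YAML nesting is what you meant:

ansible-inventory --graph      # each node should appear under its own
                               # group and again under @cluster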

Roles

common — base utilities

Codifies a single declarative list of utilities every node should have.

roles/common/defaults/main.yml:

common_packages:
  - htop
  - vim
  - curl
  - jq
  - net-tools
  - traceroute
  - rsync

roles/common/tasks/main.yml:

- name: Install base utilities
  ansible.builtin.apt:
    name: "{{ common_packages }}"
    state: present
    update_cache: true
    cache_valid_time: 3600
  become: true
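
common_packages is a role default, so a group can extend it without touching the role. A sketch using a hypothetical group_vars/workers.yml (inventory group_vars outrank role defaults; the full list must be restated, because Ansible replaces lists rather than merging them):

# group_vars/workers.yml (hypothetical)
common_packages:
  - htop
  - vim
  - curl
  - jq
  - net-tools
  - traceroute
  - rsync
  - nvme-cli      # illustrative extra utility for worker nodes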

longhorn-prereq — iSCSI

roles/longhorn-prereq/tasks/main.yml:

- name: Install open-iscsi (provides iscsid + iscsi_tcp module)
  ansible.builtin.apt:
    name: open-iscsi
    state: present
  become: true

- name: Ensure iscsid service is enabled and started
  ansible.builtin.systemd:
    name: iscsid
    enabled: true
    state: started
  become: true
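
A plain-SSH spot check that the role's end state holds on a node:

ssh ubuntu@10.0.0.4 'systemctl is-enabled iscsid; systemctl is-active iscsid'
# → enabled / active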

k3s-registries — Harbor mirror config

roles/k3s-registries/files/registries.yaml:

configs:
  "harbor.10.0.0.200.nip.io":
    tls:
      insecure_skip_verify: true
mirrors:
  "harbor.10.0.0.200.nip.io":
    endpoint:
      - "http://harbor.10.0.0.200.nip.io"

roles/k3s-registries/tasks/main.yml:

- name: Ensure /etc/rancher/k3s directory exists
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    owner: root
    group: root
    mode: "0755"
  become: true

- name: Install /etc/rancher/k3s/registries.yaml
  ansible.builtin.copy:
    src: registries.yaml
    dest: /etc/rancher/k3s/registries.yaml
    owner: root
    group: root
    mode: "0644"
  become: true

The k3s /v2-suffix mirror issue documented in Phase 7 is not fixed by this role; the role only preserves the current state. The proper fix arrives in Phase 15 with TLS.
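
One operational note: k3s reads registries.yaml only at service startup, so changing the file on a live node needs a restart before containerd sees it. A sketch of how to confirm the mirror was rendered, hedged because the rendered paths vary across k3s versions:

# On a worker the agent unit is k3s-agent; the server unit is plain k3s.
sudo systemctl restart k3s-agent
sudo grep -r "harbor.10.0.0.200.nip.io" /var/lib/rancher/k3s/agent/etc/containerd/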

network — explicit default gateway

roles/network/files/99-default-gateway.yaml:

network:
  version: 2
  ethernets:
    enp0s31f6:
      routes:
        - to: default
          via: 10.0.0.1

roles/network/tasks/main.yml:

- name: Install /etc/netplan/99-default-gateway.yaml
  ansible.builtin.copy:
    src: 99-default-gateway.yaml
    dest: /etc/netplan/99-default-gateway.yaml
    owner: root
    group: root
    # netplan refuses world-readable files (since 24.04) — must be 0600
    mode: "0600"
  become: true
  notify: Apply netplan

roles/network/handlers/main.yml:

# `netplan generate` validates the YAML and produces the systemd-networkd
# config without applying it — a cheap safety check before `netplan apply`
# tears down interfaces. If `generate` fails, `apply` will not be invoked
# and the node keeps its current networking.
- name: Apply netplan
  ansible.builtin.shell:
    cmd: |
      set -e
      netplan generate
      netplan apply
  become: true
  changed_when: true
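
For hand edits outside Ansible, netplan try is a useful complement: it applies the config and rolls back automatically unless confirmed within the timeout, which protects remote access to a node you can only reach over the network.

sudo netplan try --timeout 120   # reverts after 120 s unless confirmed with Enter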

The receipt — --check --diff against live nodes

cd ansible/
ansible-playbook playbooks/site.yml --check --diff

On this cluster, the first check revealed real drift in common:

TASK [common : Install base utilities] *********
The following NEW packages will be installed:
net-tools traceroute
changed: [set-hog] ← would install net-tools + traceroute
The following NEW packages will be installed:
traceroute
changed: [fast-heron] ← would install traceroute only
The following NEW packages will be installed:
net-tools traceroute
changed: [fast-skunk] ← would install net-tools + traceroute

PLAY RECAP
fast-heron : ok=7 changed=1
fast-skunk : ok=7 changed=1
set-hog : ok=7 changed=1

Every other task (longhorn-prereq, k3s-registries, network) reported ok — the codified state of those three concerns matched reality on every node. The drift was confined to base utilities: net-tools had been installed on fast-heron but not on the other two; traceroute was missing everywhere.

This is a more interesting outcome than "no drift at all." It shows that Ansible finds drift you didn't know existed, before it bites you in an incident.


Apply + idempotency

Apply the playbook:

ansible-playbook playbooks/site.yml --diff
TASK [common : Install base utilities] *********
changed: [set-hog] (installed: net-tools traceroute)
changed: [fast-skunk] (installed: net-tools traceroute)
changed: [fast-heron] (installed: traceroute)

PLAY RECAP — all hosts: ok=7 changed=1

Run it a second time, immediately:

PLAY RECAP — all hosts: ok=7 changed=0

changed=0 on every host. The playbook is idempotent. That's the receipt.
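
An optional ad-hoc spot check that the previously drifted packages are now present everywhere:

ansible cluster -m shell -a "dpkg -s net-tools traceroute >/dev/null && echo present"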


Day-2: rolling apt upgrade

playbooks/upgrade.yml does a serial: 1 rolling upgrade, draining workloads before each node is touched and waiting for Ready before moving on.

- name: Rolling apt upgrade across the cluster
  hosts: cluster
  serial: 1
  become: true
  pre_tasks:
    - name: Cordon and drain the node from the controller
      delegate_to: localhost
      become: false
      ansible.builtin.command:
        cmd: >
          kubectl drain {{ inventory_hostname }}
          --ignore-daemonsets
          --delete-emptydir-data
          --timeout=5m
  tasks:
    - name: apt update
      ansible.builtin.apt:
        update_cache: true
    - name: apt upgrade (safe upgrade — never removes packages)
      ansible.builtin.apt:
        upgrade: safe
    - name: Check whether a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required
    - name: Reboot the node if required
      ansible.builtin.reboot:
        reboot_timeout: 600
        post_reboot_delay: 30
      when: reboot_required.stat.exists
    - name: Wait for kubelet to mark the node Ready
      delegate_to: localhost
      become: false
      ansible.builtin.command:
        cmd: kubectl wait --for=condition=Ready node/{{ inventory_hostname }} --timeout=5m
  post_tasks:
    - name: Uncordon the node
      delegate_to: localhost
      become: false
      ansible.builtin.command:
        cmd: kubectl uncordon {{ inventory_hostname }}

Pre-flight before running for real:

  • Every workload should have ≥ 2 replicas with anti-affinity across workers (Phase 9's podinfo HPA satisfies this).
  • Longhorn volumes should have ≥ 2 healthy replicas; the drain fails safe if volumes cannot be relocated (a quick health sweep is sketched after the commands below).
  • Always --check and --syntax-check first.
ansible-playbook playbooks/upgrade.yml --syntax-check
ansible-playbook playbooks/upgrade.yml --list-tasks
ansible-playbook playbooks/upgrade.yml --limit fast-heron # one node first
ansible-playbook playbooks/upgrade.yml # all 3, rolling
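
A minimal pre-drain health sweep, assuming kubectl on the controller points at the cluster. The resource name comes from Longhorn's CRD (volumes.longhorn.io); the printed robustness column may differ between Longhorn releases.

kubectl get nodes                                    # every node Ready before starting
kubectl get volumes.longhorn.io -n longhorn-system   # robustness should read "healthy"
kubectl get pods -A --field-selector=status.phase!=Running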

Troubleshooting

community.general.yaml callback plugin has been removed

stdout_callback = yaml was removed in community.general 12. Replace with callback_result_format = yaml on the built-in default callback.

--check reports drift you didn't expect

Either the role doesn't match reality (rewrite the task) or a node has drifted from the others (apply and reconcile, or update the role to match what's on the node). The --diff output tells you which.

Netplan task ran on every node despite the file already existing

Probably a permission-bit mismatch: netplan refuses world-readable files on 24.04+, so the file must be 0600. The mode: parameter on the copy task pins the permissions on every run; without it, copy creates the file with a default mode (typically 0644), so netplan rejects the file and the task keeps reporting changed once the mode is corrected.

sudo: a password is required

The ubuntu user on the target node lacks passwordless sudo. Fix the node (/etc/sudoers.d/90-ubuntu with NOPASSWD:ALL) or run with --ask-become-pass.
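
For reference, the standard one-line drop-in (validate with visudo -c before closing your session):

# /etc/sudoers.d/90-ubuntu
ubuntu ALL=(ALL) NOPASSWD:ALL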


Done When

✔ ansible-core 2.18+ installed via pipx, ansible and ansible-playbook on PATH
✔ ansible -m ping all returns "pong" from set-hog, fast-skunk, fast-heron
✔ ansible-playbook playbooks/site.yml --check --diff has been run and any
drift it found is documented (drift on first run is expected and useful)
✔ Apply + second run shows changed=0 (idempotency proven)
✔ playbooks/upgrade.yml passes --syntax-check and --list-tasks
✔ README.md committed
✔ Cluster is still healthy (kubectl get nodes, podinfo, grafana all reachable)

Real-world skills demonstrated

| Skill | Where it applies in industry |
|---|---|
| Codification of "tribal knowledge" | Every team has a mental list of "things you have to remember to do on a fresh node." Turning that list into roles is what separates a one-engineer hobby project from a real platform. |
| --check --diff as drift audit | The same workflow real teams use to detect config drift before it causes an incident; Puppet's --noop, Chef's --why-run, and Terraform's plan are all the same idea. |
| Idempotency as a first-class quality | The "second run reports zero changes" rule is non-negotiable for production playbooks. Anything else is a foot-gun. |
| Risk-aware scope | Choosing not to install k3s through Ansible (codifying prerequisites but leaving the cluster bootstrap manual) is exactly the call senior engineers make on live, stateful systems. |
| Rolling upgrades with drain/uncordon | Standard production-maintenance pattern. Same shape on EKS, GKE, AKS, OpenShift, and bare metal. |
| delegate_to: localhost for control-plane tasks | Useful any time you need to coordinate Kubernetes operations alongside SSH-based maintenance; kubectl drain runs from where the kubeconfig lives, not from inside the node being drained. |
| pipx + venv hygiene | The modern way to install Python CLIs on PEP 668 systems (Ubuntu 24.04+, Debian 12+) without polluting system Python. |