# Velero: Cluster Backup & Disaster Recovery
Velero captures Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, RBAC, Ingresses) plus persistent-volume data (file-system level, via the node-agent DaemonSet) into an S3-compatible bucket. Combined with the hourly k3s SQLite snapshots on the same controller (see Phase 14: Etcd / SQLite snapshots), this is the platform's full backup-and-DR layer.

The original plan called for a cluster-internal MinIO. We deliberately pivoted to running MinIO on the MAAS controller, outside the cluster, because backups stored inside the system being backed up are useless when the system fails. This page documents the controller-side architecture in full.
## Architecture
```
Cluster (3 nodes)               Controller (separate machine)
┌──────────────────┐            ┌───────────────────────────────┐
│ k3s control      │            │ MinIO (Docker container)      │
│ plane (set-hog)  │            │   binds 10.0.0.1:9000 (S3)    │
│                  │            │         10.0.0.1:9001 (UI)    │
│ Velero pod ──────┼─ S3 ──────▶│ bucket: velero                │
│ + 3 node-agent   │  protocol  │ data:   /srv/backups/minio    │
│ pods (PV bkp)    │            │ config: /srv/config/minio     │
│                  │            │                               │
│                  │            │ systemd: minio.service        │
│                  │            │   restarts on failure         │
│                  │            │   bind-mounts data + config   │
└──────────────────┘            └───────────────────────────────┘
```
The cluster pods reach MinIO via the cluster subnet 10.0.0.0/24: no public exposure, no Ingress required. The controller is a separate physical machine, so a complete cluster wipe leaves backups intact.
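Once MinIO is up (installed below), that reachability is easy to confirm from inside the cluster. A minimal sketch using a throwaway curl pod (pod name and image tag are illustrative):

```bash
# One-off pod that hits MinIO's health endpoint from the cluster network.
kubectl run minio-reach --rm -it --restart=Never \
  --image=curlimages/curl:8.9.1 --command -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.1:9000/minio/health/live
# Expect: 200
```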
## Decisions
| Decision | Choice | Why |
|---|---|---|
| Backup target location | MinIO on the MAAS controller, not in-cluster | Survives cluster failure (the canonical reason: backup storage living inside the thing it's backing up is circular) |
| MinIO install method | Docker + systemd, bind-mounted volumes | Standard "managed-on-host" pattern; survives image upgrades; doesn't consume cluster resources |
| Network binding | 10.0.0.1:9000 and 10.0.0.1:9001 only | Port 9000 already taken on 127.0.0.1 by another controller-local service (likely MAAS internal) |
| MinIO admin password | Generated to ~/.minio-admin (mode 600), bind-mounted into the container as /run/secrets/minio_admin | Same out-of-band pattern as Harbor/Grafana/ArgoCD; password file never lives in git or in process args |
| Velero install | Helm chart vmware-tanzu/velero v12.0.1 (App v1.18.0) | Standard install path; configurable values; matches the rest of the platform |
| Velero plugin | velero-plugin-for-aws:v1.13.0 | Talks to any S3-compatible (MinIO included) |
| Storage credentials | Out-of-band Secret velero-credentials in the velero namespace, created via kubectl create secret generic from a local credentials INI | Standard pattern; chart reads via existingSecret; rotated by re-creating the Secret |
| Volume backups | `deployNodeAgent: true` + `defaultVolumesToFsBackup: true` | Without this, only k8s objects are captured: no PV data, no actual app state. The node-agent uses Kopia to file-system-copy PV contents. |
| Snapshot location | Disabled (MinIO doesn't do CSI VolumeSnapshots) | Phase 15 will revisit if Longhorn CSI snapshots become useful; not needed for file-system-level backup |
| Schedule | Daily full-cluster at 03:00 UTC, 7-day TTL | Daily granularity, weekly retention is the common baseline for non-critical homelab workloads |
| Excluded namespaces | velero, kube-system, kube-public, kube-node-lease, backup-test | velero excluded so it doesn't back up itself; kube-* excluded because k3s recreates them on bootstrap; backup-test is the per-restore-test scratch namespace |
## Pre-flight
- Controller has Docker (`docker --version` ≥ 24)
- Controller user is in the `docker` group (so `docker run` doesn't need sudo)
- 75+ GiB free at `/srv` on the controller
- Cluster is healthy (regression check passes)
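A quick sanity script for the first three items; a sketch, assuming the paths and thresholds above:

```bash
#!/usr/bin/env bash
# Controller pre-flight checks (sketch; adjust paths/thresholds as needed).
set -u
docker --version || echo "FAIL: docker not installed"
id -nG | grep -qw docker && echo "OK: in docker group" || echo "FAIL: not in docker group"
free_gib=$(df -BG --output=avail /srv | tail -1 | tr -dc '0-9')
[ "$free_gib" -ge 75 ] && echo "OK: ${free_gib} GiB free at /srv" \
                       || echo "FAIL: only ${free_gib} GiB free at /srv"
```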
## Install MinIO on the controller
### 1. Generate admin password (mode 600)
```bash
openssl rand -base64 24 > ~/.minio-admin
chmod 600 ~/.minio-admin
```
### 2. Create the host-side directories (one-time sudo)
```bash
sudo mkdir -p /srv/backups/minio /srv/config/minio
sudo chown -R "$USER:$USER" /srv/backups /srv/config
sudo chmod 700 /srv/backups/minio /srv/config/minio
```
### 3. systemd unit at `/etc/systemd/system/minio.service`
```ini
[Unit]
Description=MinIO S3-compatible object storage (Phase 14 backup target)
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStartPre=-/usr/bin/docker stop minio
ExecStartPre=-/usr/bin/docker rm minio
# Bind explicitly to 10.0.0.1 only (port 9000 is taken on 127.0.0.1).
ExecStart=/usr/bin/docker run \
  --name minio --rm \
  -p 10.0.0.1:9000:9000 \
  -p 10.0.0.1:9001:9001 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD_FILE=/run/secrets/minio_admin \
  -v /srv/backups/minio:/data \
  -v /srv/config/minio:/root/.minio \
  -v /home/<user>/.minio-admin:/run/secrets/minio_admin:ro \
  quay.io/minio/minio:latest \
  server /data --console-address ":9001" --address ":9000"
ExecStop=/usr/bin/docker stop minio

[Install]
WantedBy=multi-user.target
```
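Before the first start, it's worth letting systemd lint the unit (standard tooling; nothing assumed beyond the path above):

```bash
# Catches directive typos before enable/start.
systemd-analyze verify /etc/systemd/system/minio.service
```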
### 4. Start the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable minio.service
sudo systemctl start minio.service

# Verify
curl -sf http://10.0.0.1:9000/minio/health/live -o /dev/null -w "S3 API: %{http_code}\n"
curl -sf http://10.0.0.1:9001/ -o /dev/null -w "Console: %{http_code}\n"
# Both should return 200
```
### 5. Create the `velero` bucket via `mc`
```bash
# Install MinIO client (no root needed)
curl -sLO https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
mkdir -p ~/.local/bin && mv mc ~/.local/bin/

# Configure alias
mc alias set minilocal http://10.0.0.1:9000 admin "$(cat ~/.minio-admin)"

# Create bucket
mc mb minilocal/velero
```
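Velero below authenticates as the MinIO root user, which is acceptable for a single-purpose instance. If you'd rather scope it down, a hedged sketch of a dedicated access key (the `velero` username and the built-in `readwrite` policy are illustrative; `readwrite` still spans all buckets, so a custom bucket-scoped policy would be tighter):

```bash
# Hypothetical hardening: dedicated MinIO user for Velero instead of root.
openssl rand -base64 24 > ~/.minio-velero && chmod 600 ~/.minio-velero
mc admin user add minilocal velero "$(cat ~/.minio-velero)"
mc admin policy attach minilocal readwrite --user velero
# Then use velero / ~/.minio-velero in the credentials INI below.
```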
## Install Velero in the cluster
### 1. Install the Velero CLI on the controller
```bash
curl -sLO https://github.com/vmware-tanzu/velero/releases/download/v1.18.0/velero-v1.18.0-linux-amd64.tar.gz
tar -xzf velero-v1.18.0-linux-amd64.tar.gz velero-v1.18.0-linux-amd64/velero
mv velero-v1.18.0-linux-amd64/velero ~/.local/bin/velero
chmod +x ~/.local/bin/velero
velero version --client-only
```
### 2. Create namespace + credentials secret (out of band)
```bash
kubectl create namespace velero

cat > /tmp/cloud-credentials <<EOF
[default]
aws_access_key_id=admin
aws_secret_access_key=$(cat ~/.minio-admin)
EOF
chmod 600 /tmp/cloud-credentials

kubectl create secret generic velero-credentials \
  -n velero --from-file=cloud=/tmp/cloud-credentials
rm /tmp/cloud-credentials
```
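Rotation (per the Decisions table) is just re-running this block with a fresh password. A minimal sketch, assuming the names used on this page:

```bash
# 1. New root password on the controller; MinIO re-reads the file on restart.
openssl rand -base64 24 > ~/.minio-admin && chmod 600 ~/.minio-admin
sudo systemctl restart minio.service
mc alias set minilocal http://10.0.0.1:9000 admin "$(cat ~/.minio-admin)"  # refresh mc too

# 2. Re-create the Secret, then bounce the Velero pods so they pick it up.
kubectl -n velero delete secret velero-credentials
# ...re-run the create-secret block above...
kubectl -n velero rollout restart deployment/velero daemonset/node-agent
```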
### 3. `velero-values.yaml`
```yaml
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.13.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      default: true
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://10.0.0.1:9000
        publicUrl: http://10.0.0.1:9000
  volumeSnapshotLocation: []          # MinIO doesn't do CSI snapshots
  defaultBackupStorageLocation: default
  defaultVolumesToFsBackup: true

credentials:
  useSecret: true
  existingSecret: velero-credentials

deployNodeAgent: true                 # the DaemonSet that does file-system PV backup

nodeAgent:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 1000m, memory: 1Gi }

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 1000m, memory: 1Gi }

schedules:
  daily-full:
    disabled: false
    schedule: "0 3 * * *"
    template:
      ttl: "168h"                     # 7 days
      includedNamespaces: ["*"]
      excludedNamespaces:
        - velero
        - kube-system
        - kube-public
        - kube-node-lease
        - backup-test

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
```
### 4. Helm install
```bash
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update vmware-tanzu
helm install velero vmware-tanzu/velero \
  -n velero \
  -f velero-values.yaml \
  --wait --timeout 5m
```
### 5. Verify
```
$ kubectl get pods -n velero
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-cl6xg          1/1     Running   0          51s
node-agent-dp6fx          1/1     Running   0          51s
node-agent-pvq2w          1/1     Running   0          51s
velero-597b886f5b-cnkqf   1/1     Running   0          51s

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED   ACCESS MODE   DEFAULT
default   aws        velero          Available   …                ReadWrite     true

$ velero schedule get
NAME                STATUS    SCHEDULE    BACKUP TTL   LAST BACKUP
velero-daily-full   Enabled   0 3 * * *   168h0m0s     n/a
```
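There's no need to wait for 03:00 UTC to prove the schedule works; the Velero CLI can run it once immediately:

```bash
# Trigger one ad-hoc run of the daily schedule and wait for completion.
velero backup create --from-schedule velero-daily-full --wait
velero backup get   # the new velero-daily-full-<timestamp> backup should be Completed
```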
## End-to-end restore test
Create a throwaway namespace, back it up, destroy it, restore it, verify the data is identical.
```bash
# 1. Create test workload
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: {name: backup-test}
---
apiVersion: v1
kind: ConfigMap
metadata: {name: tiny-config, namespace: backup-test}
data: {hello.txt: "Hello from before the backup!"}
---
apiVersion: apps/v1
kind: Deployment
metadata: {name: tiny-app, namespace: backup-test}
spec:
  replicas: 1
  selector: {matchLabels: {app: tiny-app}}
  template:
    metadata: {labels: {app: tiny-app}}
    spec:
      containers:
        - name: tiny
          image: ghcr.io/stefanprodan/podinfo:6.11.2
          ports: [{containerPort: 9898}]
EOF

# 2. Take a backup (synchronous)
velero backup create test-backup --include-namespaces backup-test --wait

# 3. Bundle on disk (15 KiB total, 9 objects: Velero metadata + the
#    tar.gz of all manifests)
mc ls -r minilocal/velero/backups/test-backup/

# 4. Destroy the namespace
kubectl delete namespace backup-test --wait

# 5. Restore
velero restore create test-restore --from-backup test-backup --wait

# 6. Verify
kubectl get all,configmap -n backup-test
kubectl get cm tiny-config -n backup-test -o jsonpath='{.data.hello\.txt}'
# → "Hello from before the backup!"

# 7. Cleanup
kubectl delete namespace backup-test
velero backup delete test-backup --confirm
```
The restore succeeds (same Pod name, same ConfigMap data) within ~5 seconds. Without PVs this is a metadata-only round trip; for workloads with PVs, the node-agent reconstructs file-system contents from the Kopia file-system backup.
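A PV-backed variant of the same loop would exercise that path too. A sketch, assuming the cluster's default StorageClass provisions the PVC (names, sizes, and image are illustrative):

```bash
# Hypothetical PV round trip: write to a PVC, back up, wipe, restore, re-read.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: {name: backup-test}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata: {name: tiny-data, namespace: backup-test}
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 100Mi}}
---
apiVersion: v1
kind: Pod
metadata: {name: tiny-writer, namespace: backup-test}
spec:
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "echo 'PV data from before the backup' > /data/state.txt && sleep 3600"]
      volumeMounts: [{name: data, mountPath: /data}]
  volumes:
    - name: data
      persistentVolumeClaim: {claimName: tiny-data}
EOF

velero backup create pv-test --include-namespaces backup-test --wait
kubectl delete namespace backup-test --wait
velero restore create pv-restore --from-backup pv-test --wait

# Wait for tiny-writer to be Running again, then confirm the file survived:
kubectl exec -n backup-test tiny-writer -- cat /data/state.txt
# → "PV data from before the backup"
```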
## ArgoCD interaction warning
If the namespace being restored is managed by ArgoCD (e.g. our `homer`, `whoami`, `platform-demo`), ArgoCD's `selfHeal: true` may interpret the restored manifests as drift from the GitOps repo and try to revert them mid-restore. Always:
- Pause sync on affected Applications first:

  ```bash
  kubectl patch app -n argocd <app-name> --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
  ```

- Run the restore.
- Verify it landed.
- Re-enable sync once verified:

  ```bash
  kubectl patch app -n argocd <app-name> --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
  ```
This is a real "two declarative systems on the same cluster" conflict, documented in the Velero issue tracker. The pause-then-restore pattern is standard practice in any GitOps + Velero shop.
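When a restore spans several Applications, the same patches are easy to loop. A sketch (the app list is illustrative):

```bash
# Hypothetical helper: pause selfHeal on a set of apps, restore, verify, resume.
APPS="homer whoami platform-demo"
for app in $APPS; do
  kubectl patch app -n argocd "$app" --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
done

velero restore create --from-backup <backup-name> --wait

# ...verify the restore landed before re-enabling sync...
for app in $APPS; do
  kubectl patch app -n argocd "$app" --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done
```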
## Disaster Recovery Runbook
### Scenario A: a single namespace is corrupted
```bash
# Pause ArgoCD if applicable
kubectl patch app -n argocd <name> --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Find the latest backup
velero backup get | grep <namespace>

# Restore
velero restore create --from-backup <backup-name> --include-namespaces <namespace>

# Verify; re-enable ArgoCD
```
RTO: ~5 min for metadata-only namespaces, longer for workloads with PVs.
### Scenario B: full cluster wipe, set-hog SSD dies
- Reprovision set-hog via MAAS (Phase 0 procedures).
- Install the k3s control plane (Phase 1).
- Re-join `fast-skunk` and `fast-heron` (Phase 1).
- Copy `/srv/backups/k3s/state.db.<latest>` from the controller to set-hog only if rebuilding the cluster from scratch isn't possible; see Phase 14 (Etcd / SQLite snapshots) for the SQLite restore path.
- Re-install Velero (`helm install velero vmware-tanzu/velero -n velero -f velero-values.yaml`): same chart, same values, same MinIO endpoint.
- `velero restore create --from-backup velero-daily-full-<latest>` pulls all namespaces + PV data from the controller's MinIO.
- Verify each namespace.
RTO: 45–60 min, dominated by MAAS reimage time.
### Scenario C: controller dies (MinIO + snapshots gone)
This is the uncovered failure mode. We have one off-cluster backup target; if the controller's SSD dies we lose backups. Phase 15+ will add an off-site copy (e.g., to Backblaze B2 or an external USB drive on a weekly cron). For homelab portfolio purposes this is documented as a known limitation, not a TODO.
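The eventual shape is probably a periodic `mc mirror` of the bucket to a second target. A sketch, assuming a Backblaze B2 bucket reached via its S3-compatible endpoint (alias, endpoint, bucket, and keys are all hypothetical):

```bash
# Hypothetical Phase 15 off-site copy, run weekly from the controller's cron.
mc alias set b2 https://s3.us-west-004.backblazeb2.com <keyID> <applicationKey>
mc mirror --overwrite minilocal/velero b2/homelab-velero-offsite
```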
## Done When
- ✅ `minio.service` active on the controller, container has `Up` status
- ✅ `http://10.0.0.1:9000/minio/health/live` returns 200
- ✅ Velero pods (1 server + 3 node-agent) all `Running` in the cluster
- ✅ `velero backup-location get` shows `default` Available
- ✅ `velero schedule get` shows `velero-daily-full` Enabled
- ✅ One end-to-end test (backup → delete → restore) verified
- ✅ Bundle visible at `/srv/backups/minio/velero/backups/<name>/`
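A sketch that mechanically checks most of this list from the controller (assumes `kubectl`, `velero`, and `mc` are configured as above):

```bash
#!/usr/bin/env bash
# Done-when checks, run from the controller (sketch).
set -u
systemctl is-active --quiet minio.service && echo "OK: minio.service active" || echo "FAIL: minio.service"
curl -sf http://10.0.0.1:9000/minio/health/live >/dev/null && echo "OK: S3 health" || echo "FAIL: S3 health"
kubectl get pods -n velero --no-headers | awk '$3 != "Running" {bad=1} END {exit bad}' \
  && echo "OK: velero pods Running" || echo "FAIL: velero pods"
velero backup-location get default | grep -q Available && echo "OK: BSL Available" || echo "FAIL: BSL"
velero schedule get | grep -q "velero-daily-full.*Enabled" && echo "OK: schedule Enabled" || echo "FAIL: schedule"
```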
## Real-world skills demonstrated
| Skill | Industry context |
|---|---|
| Decoupling backup storage from the source system | The single most important rule of DR; same as "always store backups off-site" in traditional sysadmin |
| MinIO as a self-hosted S3 substitute | Standard pattern in air-gapped, on-prem, and homelab Kubernetes. Same shape as production teams running MinIO Operator on dedicated nodes. |
| systemd + Docker for host services | Canonical way to run a third-party container on a bare metal host (vs. running it as a k8s pod when k8s is the thing being backed up) |
| Velero with file-system PV backup (Kopia) | The default for any cluster without CSI VolumeSnapshot support β covers Longhorn, NFS, hostPath, etc. |
| Out-of-band credential injection | The `~/.minio-admin` → bind-mount into container → environment variable pattern. Same shape as Vault Agent injection. |
| ArgoCD/Velero coexistence pattern | Real production challenge β solved by pausing ArgoCD selfHeal during restore. Documented runbook is the actual deliverable. |
| Honest failure-mode documentation | The "controller dies → backups lost" gap is real. A portfolio that names the gap is more credible than one that pretends everything is recoverable. |