
Velero — Cluster Backup & Disaster Recovery

Velero captures Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, RBAC, Ingresses) plus persistent-volume data (at the file-system level, via the node-agent DaemonSet) into an S3-compatible bucket. Combined with the hourly k3s SQLite snapshots on the same controller (see Phase 14 — Etcd / SQLite snapshots), this forms the platform's full backup-and-DR layer.

The original plan called for a cluster-internal MinIO. We deliberately pivoted to running MinIO on the MAAS controller — outside the cluster — because backups stored inside the system being backed up are useless when the system fails. This page documents the controller-side architecture in full.


Architecture

Cluster (3 nodes)                   Controller (separate machine)
┌────────────────┐                  ┌───────────────────────────┐
│ k3s control    │                  │ MinIO (Docker container)  │
│ plane (set-hog)│                  │   binds 10.0.0.1:9000 (S3)│
│                │                  │         10.0.0.1:9001 (UI)│
│ Velero pod ────┼── S3 protocol ──▶│ bucket: velero            │
│ + 3 node-agent │                  │ data:   /srv/backups/minio│
│ pods (PV bkp)  │                  │ config: /srv/config/minio │
│                │                  │                           │
│                │                  │ systemd: minio.service    │
│                │                  │  restarts on failure      │
│                │                  │  bind-mounts data + config│
└────────────────┘                  └───────────────────────────┘

The cluster pods reach MinIO via the cluster subnet 10.0.0.0/24 — no public exposure, no Ingress required. The controller is a separate physical machine, so a complete cluster wipe leaves backups intact.


Decisions

| Decision | Choice | Why |
|---|---|---|
| Backup target location | MinIO on the MAAS controller, not in-cluster | Survives cluster failure (the canonical reason — backup storage living inside the thing it's backing up is circular) |
| MinIO install method | Docker + systemd, bind-mounted volumes | Standard "managed-on-host" pattern; survives image upgrades; doesn't consume cluster resources |
| Network binding | 10.0.0.1:9000 and 10.0.0.1:9001 only | Port 9000 is already taken on 127.0.0.1 by another controller-local service (likely MAAS internal) |
| MinIO admin password | Generated to ~/.minio-admin (mode 600), bind-mounted into the container as /run/secrets/minio_admin | Same out-of-band pattern as Harbor/Grafana/ArgoCD; the password file never lives in git or in process args |
| Velero install | Helm chart vmware-tanzu/velero v12.0.1 (App v1.18.0) | Standard install path; configurable values; matches the rest of the platform |
| Velero plugin | velero-plugin-for-aws:v1.13.0 | Talks to any S3-compatible store (MinIO included) |
| Storage credentials | Out-of-band Secret velero-credentials in the velero namespace, created via kubectl create secret generic from a local credentials INI | Standard pattern; the chart reads it via existingSecret; rotated by re-creating the Secret |
| Volume backups | deployNodeAgent: true + defaultVolumesToFsBackup: true | Without this, only Kubernetes objects are captured — no PV data, no actual app state. The node-agent uses Kopia to file-system-copy PV contents |
| Snapshot location | Disabled (MinIO doesn't do CSI VolumeSnapshots) | Phase 15 will revisit if Longhorn CSI snapshots become useful; not needed for file-system-level backup |
| Schedule | Daily full-cluster backup at 03:00 UTC, 7-day TTL | Daily granularity with weekly retention is the common baseline for non-critical homelab workloads |
| Excluded namespaces | velero, kube-system, kube-public, kube-node-lease, backup-test | velero so it doesn't back up itself; kube-* because k3s recreates them on bootstrap; backup-test is the per-restore-test scratch namespace |
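The credential-rotation path noted above ("rotated by re-creating the Secret") can be sketched end to end. This is a hedged sketch using the file and Secret names from this page; it assumes MinIO picks up a changed root password from the bind-mounted file on restart, which is worth verifying against the MinIO release you run:

```shell
# Rotate the MinIO admin password and re-point Velero at it
rotate_minio_creds() {
  # New password, same out-of-band file (mode 600)
  openssl rand -base64 24 > ~/.minio-admin
  chmod 600 ~/.minio-admin
  # The container re-reads /run/secrets/minio_admin on restart
  sudo systemctl restart minio.service
  # Re-create the cluster-side Secret from the new password
  printf '[default]\naws_access_key_id=admin\naws_secret_access_key=%s\n' \
    "$(cat ~/.minio-admin)" > /tmp/cloud-credentials
  chmod 600 /tmp/cloud-credentials
  kubectl -n velero delete secret velero-credentials
  kubectl -n velero create secret generic velero-credentials \
    --from-file=cloud=/tmp/cloud-credentials
  rm /tmp/cloud-credentials
  # Bounce the server Deployment so it reads the new Secret
  kubectl -n velero rollout restart deployment/velero
}
# rotate_minio_creds
```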

Pre-flight

- Controller has Docker (docker --version ≥ 24)
- Controller user is in the docker group (so docker run doesn't need sudo)
- 75+ GiB free at /srv on the controller
- Cluster is healthy (regression check passes)

Install MinIO on the controller

1. Generate admin password (mode 600)

openssl rand -base64 24 > ~/.minio-admin
chmod 600 ~/.minio-admin

2. Create the host-side directories (one-time sudo)

sudo mkdir -p /srv/backups/minio /srv/config/minio
sudo chown -R "$USER:$USER" /srv/backups /srv/config
sudo chmod 700 /srv/backups/minio /srv/config/minio

3. systemd unit at /etc/systemd/system/minio.service

[Unit]
Description=MinIO S3-compatible object storage (Phase 14 backup target)
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=simple
Restart=on-failure
RestartSec=10s

ExecStartPre=-/usr/bin/docker stop minio
ExecStartPre=-/usr/bin/docker rm minio

# Bind explicitly to 10.0.0.1 only (port 9000 is taken on 127.0.0.1).
ExecStart=/usr/bin/docker run \
  --name minio --rm \
  -p 10.0.0.1:9000:9000 \
  -p 10.0.0.1:9001:9001 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD_FILE=/run/secrets/minio_admin \
  -v /srv/backups/minio:/data \
  -v /srv/config/minio:/root/.minio \
  -v /home/<user>/.minio-admin:/run/secrets/minio_admin:ro \
  quay.io/minio/minio:latest \
  server /data --console-address ":9001" --address ":9000"

ExecStop=/usr/bin/docker stop minio

[Install]
WantedBy=multi-user.target

4. Start the service

sudo systemctl daemon-reload
sudo systemctl enable minio.service
sudo systemctl start minio.service

# Verify
curl -sf http://10.0.0.1:9000/minio/health/live -o /dev/null -w "S3 API: %{http_code}\n"
curl -sf http://10.0.0.1:9001/ -o /dev/null -w "Console: %{http_code}\n"
# Both should return 200
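Those curl checks run on the controller itself. It is worth repeating the health check from inside the cluster, since the cluster-subnet path is the one Velero will actually use. A sketch; the pod name and curl image are arbitrary choices, not something this setup depends on:

```shell
# One-shot pod that curls MinIO's health endpoint over the cluster subnet
check_s3_from_cluster() {
  kubectl run s3-reach --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sf -o /dev/null -w '%{http_code}\n' \
    http://10.0.0.1:9000/minio/health/live
}
# check_s3_from_cluster   # 200 = reachable from inside the cluster
```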

5. Create the velero bucket via mc

# Install MinIO client (no root needed)
curl -sLO https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc && mkdir -p ~/.local/bin && mv mc ~/.local/bin/

# Configure alias
mc alias set minilocal http://10.0.0.1:9000 admin "$(cat ~/.minio-admin)"

# Create bucket
mc mb minilocal/velero
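Before pointing Velero at the bucket, an optional round-trip of a tiny object confirms both write and read paths. A sketch using the minilocal alias configured above; the object name is arbitrary:

```shell
# Copy a file in, read it back, delete it
s3_roundtrip() {
  local f
  f="$(mktemp)"
  echo "ping" > "$f"
  mc cp "$f" minilocal/velero/smoke-test.txt
  mc cat minilocal/velero/smoke-test.txt   # prints the object back: "ping"
  mc rm minilocal/velero/smoke-test.txt
  rm -f "$f"
}
# s3_roundtrip
```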

Install Velero in the cluster

1. Install the Velero CLI on the controller

curl -sLO https://github.com/vmware-tanzu/velero/releases/download/v1.18.0/velero-v1.18.0-linux-amd64.tar.gz
tar -xzf velero-v1.18.0-linux-amd64.tar.gz velero-v1.18.0-linux-amd64/velero
mv velero-v1.18.0-linux-amd64/velero ~/.local/bin/velero
chmod +x ~/.local/bin/velero
velero version --client-only

2. Create namespace + credentials secret (out of band)

kubectl create namespace velero

cat > /tmp/cloud-credentials <<EOF
[default]
aws_access_key_id=admin
aws_secret_access_key=$(cat ~/.minio-admin)
EOF
chmod 600 /tmp/cloud-credentials

kubectl create secret generic velero-credentials \
-n velero --from-file=cloud=/tmp/cloud-credentials
rm /tmp/cloud-credentials

3. velero-values.yaml

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.13.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      default: true
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://10.0.0.1:9000
        publicUrl: http://10.0.0.1:9000

  volumeSnapshotLocation: []   # MinIO doesn't do CSI snapshots
  defaultBackupStorageLocation: default
  defaultVolumesToFsBackup: true

credentials:
  useSecret: true
  existingSecret: velero-credentials

deployNodeAgent: true   # the DaemonSet that does file-system PV backup
nodeAgent:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 1000m, memory: 1Gi }

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 1000m, memory: 1Gi }

schedules:
  daily-full:
    disabled: false
    schedule: "0 3 * * *"
    template:
      ttl: "168h"   # 7 days
      includedNamespaces: ["*"]
      excludedNamespaces:
        - velero
        - kube-system
        - kube-public
        - kube-node-lease
        - backup-test

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack

4. Helm install

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update vmware-tanzu

helm install velero vmware-tanzu/velero \
-n velero \
-f velero-values.yaml \
--wait --timeout 5m

5. Verify

$ kubectl get pods -n velero
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-cl6xg          1/1     Running   0          51s
node-agent-dp6fx          1/1     Running   0          51s
node-agent-pvq2w          1/1     Running   0          51s
velero-597b886f5b-cnkqf   1/1     Running   0          51s

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED   ACCESS MODE   DEFAULT
default   aws        velero          Available   …                ReadWrite     true

$ velero schedule get
NAME                STATUS    SCHEDULE    BACKUP TTL   LAST BACKUP
velero-daily-full   Enabled   0 3 * * *   168h0m0s     n/a

End-to-end restore test

Create a throwaway namespace, back it up, destroy it, restore it, verify the data is identical.

# 1. Create test workload
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: {name: backup-test}
---
apiVersion: v1
kind: ConfigMap
metadata: {name: tiny-config, namespace: backup-test}
data: {hello.txt: "Hello from before the backup!"}
---
apiVersion: apps/v1
kind: Deployment
metadata: {name: tiny-app, namespace: backup-test}
spec:
  replicas: 1
  selector: {matchLabels: {app: tiny-app}}
  template:
    metadata: {labels: {app: tiny-app}}
    spec:
      containers:
        - name: tiny
          image: ghcr.io/stefanprodan/podinfo:6.11.2
          ports: [{containerPort: 9898}]
EOF

# 2. Take a backup (synchronous)
velero backup create test-backup --include-namespaces backup-test --wait

# 3. Bundle on disk (15 KiB total, 9 objects — Velero metadata + the
#    tar.gz of all manifests)
mc ls -r minilocal/velero/backups/test-backup/

# 4. Destroy the namespace
kubectl delete namespace backup-test --wait

# 5. Restore
velero restore create test-restore --from-backup test-backup --wait

# 6. Verify
kubectl get all,configmap -n backup-test
kubectl get cm tiny-config -n backup-test -o jsonpath='{.data.hello\.txt}'
# → "Hello from before the backup!"

# 7. Cleanup
kubectl delete namespace backup-test
velero backup delete test-backup --confirm

The restore succeeds — same Pod name, same ConfigMap data — within ~5 seconds. Without PVs this is a metadata-only round trip; for workloads with PVs, the node-agent reconstructs file-system contents from the Kopia file-system backup.
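For a workload that does have PVs, it is worth confirming a backup captured volume data and not just manifests. A sketch; the "Pod Volume Backups" section header is what recent Velero CLI versions print, but check it against your version:

```shell
# List the per-pod volume backups recorded inside a Velero backup
check_pv_backup() {
  # --details expands the resource and pod-volume listings; nothing under
  # "Pod Volume Backups" means only Kubernetes objects were captured
  velero backup describe "$1" --details | grep -A5 'Pod Volume Backups'
}
# check_pv_backup test-backup
```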


ArgoCD interaction warning

If the namespace being restored is managed by ArgoCD (e.g. our homer, whoami, platform-demo), ArgoCD's selfHeal: true may interpret the restored manifests as drift from the GitOps repo and try to revert them mid-restore. Always:

  1. Pause sync on affected Applications first:
    kubectl patch app -n argocd <app-name> \
    --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
  2. Run the restore.
  3. Verify it landed.
  4. Re-enable sync once verified:
    kubectl patch app -n argocd <app-name> \
    --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'

This is a real "two declarative systems on the same cluster" conflict, documented in the Velero issue tracker. The pause-then-restore pattern is standard practice in any GitOps + Velero shop.
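When several Applications cover the namespaces being restored, patching them one by one gets tedious. The four steps above can be wrapped in a small helper pair; this is a sketch using the same patch payloads as the steps, and the app names in the usage comment are just examples from this page:

```shell
# Disable automated sync for each named Application
pause_sync() {
  for app in "$@"; do
    kubectl patch app -n argocd "$app" --type merge \
      -p '{"spec":{"syncPolicy":{"automated":null}}}'
  done
}

# Restore automated sync (prune + selfHeal) for each named Application
resume_sync() {
  for app in "$@"; do
    kubectl patch app -n argocd "$app" --type merge \
      -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
  done
}

# Usage around a restore:
#   pause_sync homer whoami platform-demo
#   velero restore create --from-backup <backup-name> --wait
#   resume_sync homer whoami platform-demo
```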


Disaster Recovery Runbook

Scenario A: a single namespace is corrupted

# Pause ArgoCD if applicable
kubectl patch app -n argocd <name> --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Find the latest backup
velero backup get | grep <namespace>

# Restore
velero restore create --from-backup <backup-name> --include-namespaces <namespace>

# Verify; re-enable ArgoCD

RTO: ~5 min for metadata-only namespaces, longer for workloads with PVs.

Scenario B: full cluster wipe, set-hog SSD dies

  1. Reprovision set-hog via MAAS (Phase 0 procedures).
  2. Install k3s control plane (Phase 1).
  3. Re-join fast-skunk and fast-heron (Phase 1).
  4. Copy /srv/backups/k3s/state.db.<latest> from controller to set-hog only if rebuilding cluster from scratch isn't possible — see Phase 14 — Etcd / SQLite snapshots for the SQLite restore path.
  5. Re-install Velero (helm install velero vmware-tanzu/velero -n velero -f velero-values.yaml) — same chart, same values, same MinIO endpoint.
  6. velero restore create --from-backup velero-daily-full-<latest> — pulls all namespaces + PV data from the controller's MinIO.
  7. Verify each namespace.

RTO: 45–60 min, dominated by MAAS reimage time.

Scenario C: controller dies (MinIO + snapshots gone)

This is the uncovered failure mode. We have one off-cluster backup target; if the controller's SSD dies we lose backups. Phase 15+ will add an off-site copy (e.g., to Backblaze B2 or an external USB drive on a weekly cron). For homelab portfolio purposes this is documented as a known limitation, not a TODO.
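When that off-site copy lands, it will likely be little more than mc mirror on a timer. A sketch under the assumption of a second mc alias named offsite (hypothetical; a Backblaze B2 bucket or a USB-disk MinIO both fit the shape):

```shell
# Weekly off-site sync of the velero bucket (sketch, not yet deployed)
mkdir -p ~/bin
cat > ~/bin/minio-offsite-sync.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# --remove keeps the off-site copy in lockstep with expired backups;
# drop it if the off-site side should only ever grow
mc mirror --remove minilocal/velero offsite/velero-dr
EOF
chmod +x ~/bin/minio-offsite-sync.sh

# Cron entry (Sunday 04:00, well after the 03:00 daily backup window):
# 0 4 * * 0  $HOME/bin/minio-offsite-sync.sh
```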


Done When

✔ minio.service active on the controller, container has Up status
✔ http://10.0.0.1:9000/minio/health/live returns 200
✔ Velero pods (1 server + 3 node-agent) all Running in the cluster
✔ velero backup-location get shows default Available
✔ velero schedule get shows velero-daily-full Enabled
✔ One end-to-end test (backup → delete → restore) verified
✔ Bundle visible at /srv/backups/minio/velero/backups/<name>/
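The checklist can be folded into one controller-side smoke script. A sketch; each check mirrors a checklist item, and the helper reports pass/fail without stopping at the first miss:

```shell
# Report each "Done When" item as ok/FAIL without aborting early
check() { "$@" >/dev/null 2>&1 && echo "ok:   $*" || echo "FAIL: $*"; }

check systemctl is-active --quiet minio.service
check curl -sf --max-time 5 http://10.0.0.1:9000/minio/health/live
check kubectl -n velero get deployment/velero
check kubectl -n velero get daemonset/node-agent
check velero backup-location get
check velero schedule get
check test -d /srv/backups/minio/velero/backups
```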

Real-world skills demonstrated

| Skill | Industry context |
|---|---|
| Decoupling backup storage from the source system | The single most important rule of DR; same as "always store backups off-site" in traditional sysadmin |
| MinIO as a self-hosted S3 substitute | Standard pattern in air-gapped, on-prem, and homelab Kubernetes; same shape as production teams running the MinIO Operator on dedicated nodes |
| systemd + Docker for host services | Canonical way to run a third-party container on a bare-metal host (vs. running it as a k8s pod when k8s is the thing being backed up) |
| Velero with file-system PV backup (Kopia) | The default for any cluster without CSI VolumeSnapshot support — covers Longhorn, NFS, hostPath, etc. |
| Out-of-band credential injection | The ~/.minio-admin → bind mount into container → environment variable pattern; same shape as Vault Agent injection |
| ArgoCD/Velero coexistence pattern | Real production challenge — solved by pausing ArgoCD selfHeal during restore; the documented runbook is the actual deliverable |
| Honest failure-mode documentation | The "controller dies → backups lost" gap is real; a portfolio that names the gap is more credible than one that pretends everything is recoverable |