# Velero: Cluster Backup & Disaster Recovery
Velero captures Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, RBAC, Ingresses) plus persistent-volume data (file-system level, via the node-agent DaemonSet) into an S3-compatible bucket. Combined with the hourly k3s SQLite snapshots on the same controller (see Phase 14: Etcd / SQLite snapshots), this is the platform's full backup-and-DR layer.

The original plan called for a cluster-internal MinIO. We deliberately pivoted to running MinIO on the MAAS controller, outside the cluster, because backups stored inside the system being backed up are useless when the system fails. This page documents the controller-side architecture in full.
## Architecture
```
Cluster (3 nodes)               Controller (separate machine)
┌──────────────────┐            ┌───────────────────────────────┐
│ k3s control      │            │ MinIO (Docker container)      │
│ plane (set-hog)  │            │   binds 10.0.0.1:9000 (S3)    │
│                  │            │         10.0.0.1:9001 (UI)    │
│ Velero pod ──────┼─ S3 ──────▶│ bucket: velero                │
│ + 3 node-agent   │  protocol  │ data:   /srv/backups/minio    │
│ pods (PV bkp)    │            │ config: /srv/config/minio     │
│                  │            │                               │
│                  │            │ systemd: minio.service        │
│                  │            │   restarts on failure         │
│                  │            │   bind-mounts data + config   │
└──────────────────┘            └───────────────────────────────┘
```
The cluster pods reach MinIO via the cluster subnet 10.0.0.0/24: no public exposure, no Ingress required. The controller is a separate physical machine, so a complete cluster wipe leaves backups intact.
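Once MinIO is up (installed below), that reachability is easy to confirm from inside the cluster. A minimal sketch using a throwaway curl pod (pod name and image tag are illustrative):

```bash
# One-off pod that hits MinIO's health endpoint from the cluster network.
kubectl run minio-reach --rm -it --restart=Never \
  --image=curlimages/curl:8.9.1 --command -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.1:9000/minio/health/live
# Expect: 200
```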
## Decisions
| Decision | Choice | Why |
|---|---|---|
| Backup target location | MinIO on the MAAS controller, not in-cluster | Survives cluster failure (the canonical reason: backup storage living inside the thing it's backing up is circular) |
| MinIO install method | Docker + systemd, bind-mounted volumes | Standard "managed-on-host" pattern; survives image upgrades; doesn't consume cluster resources |
| Network binding | 10.0.0.1:9000 and 10.0.0.1:9001 only | Port 9000 already taken on 127.0.0.1 by another controller-local service (likely MAAS internal) |
| MinIO admin password | Generated to ~/.minio-admin (mode 600), bind-mounted into the container as /run/secrets/minio_admin | Same out-of-band pattern as Harbor/Grafana/ArgoCD; password file never lives in git or in process args |
| Velero install | Helm chart vmware-tanzu/velero v12.0.1 (App v1.18.0) | Standard install path; configurable values; matches the rest of the platform |
| Velero plugin | velero-plugin-for-aws:v1.13.0 | Talks to any S3-compatible (MinIO included) |
| Storage credentials | Out-of-band Secret velero-credentials in the velero namespace, created via kubectl create secret generic from a local credentials INI | Standard pattern; chart reads via existingSecret; rotated by re-creating the Secret |
| Volume backups | `deployNodeAgent: true` + `defaultVolumesToFsBackup: true` | Without this, only k8s objects are captured: no PV data, no actual app state. The node-agent uses Kopia to file-system-copy PV contents. |
| Snapshot location | Disabled (MinIO doesn't do CSI VolumeSnapshots) | Phase 15 will revisit if Longhorn CSI snapshots become useful; not needed for file-system-level backup |
| Schedule | Daily full-cluster at 03:00 UTC, 7-day TTL | Daily granularity, weekly retention is the common baseline for non-critical homelab workloads |
| Excluded namespaces | velero, kube-system, kube-public, kube-node-lease, backup-test | velero excluded so it doesn't back up itself; kube-* excluded because k3s recreates them on bootstrap; backup-test is the per-restore-test scratch namespace |
## Pre-flight
- Controller has Docker (`docker --version` ≥ 24)
- Controller user is in the `docker` group (so `docker run` doesn't need sudo)
- 75+ GiB free at `/srv` on the controller
- Cluster is healthy (regression check passes)
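A quick sanity script for the first three items; a sketch, assuming the paths and thresholds above:

```bash
#!/usr/bin/env bash
# Controller pre-flight checks (sketch; adjust paths/thresholds as needed).
set -u
docker --version || echo "FAIL: docker not installed"
id -nG | grep -qw docker && echo "OK: in docker group" || echo "FAIL: not in docker group"
free_gib=$(df -BG --output=avail /srv | tail -1 | tr -dc '0-9')
[ "$free_gib" -ge 75 ] && echo "OK: ${free_gib} GiB free at /srv" \
                       || echo "FAIL: only ${free_gib} GiB free at /srv"
```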
## Install MinIO on the controller
### 1. Generate admin password (mode 600)
```bash
openssl rand -base64 24 > ~/.minio-admin
chmod 600 ~/.minio-admin
```
### 2. Create the host-side directories (one-time sudo)
```bash
sudo mkdir -p /srv/backups/minio /srv/config/minio
sudo chown -R "$USER:$USER" /srv/backups /srv/config
sudo chmod 700 /srv/backups/minio /srv/config/minio
```
### 3. systemd unit at `/etc/systemd/system/minio.service`
```ini
[Unit]
Description=MinIO S3-compatible object storage (Phase 14 backup target)
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStartPre=-/usr/bin/docker stop minio
ExecStartPre=-/usr/bin/docker rm minio
# Bind explicitly to 10.0.0.1 only (port 9000 is taken on 127.0.0.1).
ExecStart=/usr/bin/docker run \
  --name minio --rm \
  -p 10.0.0.1:9000:9000 \
  -p 10.0.0.1:9001:9001 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD_FILE=/run/secrets/minio_admin \
  -v /srv/backups/minio:/data \
  -v /srv/config/minio:/root/.minio \
  -v /home/<user>/.minio-admin:/run/secrets/minio_admin:ro \
  quay.io/minio/minio:latest \
  server /data --console-address ":9001" --address ":9000"
ExecStop=/usr/bin/docker stop minio

[Install]
WantedBy=multi-user.target
```
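Before the first start, it's worth letting systemd lint the unit (standard tooling; nothing assumed beyond the path above):

```bash
# Catches directive typos before enable/start.
systemd-analyze verify /etc/systemd/system/minio.service
```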
### 4. Start the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable minio.service
sudo systemctl start minio.service

# Verify
curl -sf http://10.0.0.1:9000/minio/health/live -o /dev/null -w "S3 API: %{http_code}\n"
curl -sf http://10.0.0.1:9001/ -o /dev/null -w "Console: %{http_code}\n"
# Both should return 200
```
### 5. Create the `velero` bucket via `mc`
```bash
# Install MinIO client (no root needed)
curl -sLO https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
mkdir -p ~/.local/bin && mv mc ~/.local/bin/

# Configure alias
mc alias set minilocal http://10.0.0.1:9000 admin "$(cat ~/.minio-admin)"

# Create bucket
mc mb minilocal/velero
```
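Velero below authenticates as the MinIO root user, which is acceptable for a single-purpose instance. If you'd rather scope it down, a hedged sketch of a dedicated access key (the `velero` username and the built-in `readwrite` policy are illustrative; `readwrite` still spans all buckets, so a custom bucket-scoped policy would be tighter):

```bash
# Hypothetical hardening: dedicated MinIO user for Velero instead of root.
openssl rand -base64 24 > ~/.minio-velero && chmod 600 ~/.minio-velero
mc admin user add minilocal velero "$(cat ~/.minio-velero)"
mc admin policy attach minilocal readwrite --user velero
# Then use velero / ~/.minio-velero in the credentials INI below.
```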
## Install Velero in the cluster
### 1. Install the Velero CLI on the controller
```bash
curl -sLO https://github.com/vmware-tanzu/velero/releases/download/v1.18.0/velero-v1.18.0-linux-amd64.tar.gz
tar -xzf velero-v1.18.0-linux-amd64.tar.gz velero-v1.18.0-linux-amd64/velero
mv velero-v1.18.0-linux-amd64/velero ~/.local/bin/velero
chmod +x ~/.local/bin/velero
velero version --client-only
```
### 2. Create namespace + credentials secret (out of band)
```bash
kubectl create namespace velero

cat > /tmp/cloud-credentials <<EOF
[default]
aws_access_key_id=admin
aws_secret_access_key=$(cat ~/.minio-admin)
EOF
chmod 600 /tmp/cloud-credentials

kubectl create secret generic velero-credentials \
  -n velero --from-file=cloud=/tmp/cloud-credentials
rm /tmp/cloud-credentials
```
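Rotation (per the Decisions table) is just re-running this block with a fresh password. A minimal sketch, assuming the names used on this page:

```bash
# 1. New root password on the controller; MinIO re-reads the file on restart.
openssl rand -base64 24 > ~/.minio-admin && chmod 600 ~/.minio-admin
sudo systemctl restart minio.service
mc alias set minilocal http://10.0.0.1:9000 admin "$(cat ~/.minio-admin)"  # refresh mc too

# 2. Re-create the Secret, then bounce the Velero pods so they pick it up.
kubectl -n velero delete secret velero-credentials
# ...re-run the create-secret block above...
kubectl -n velero rollout restart deployment/velero daemonset/node-agent
```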
### 3. `velero-values.yaml`
```yaml
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.13.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      default: true
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://10.0.0.1:9000
        publicUrl: http://10.0.0.1:9000
  volumeSnapshotLocation: []          # MinIO doesn't do CSI snapshots
  defaultBackupStorageLocation: default
  defaultVolumesToFsBackup: true

credentials:
  useSecret: true
  existingSecret: velero-credentials

deployNodeAgent: true                 # the DaemonSet that does file-system PV backup

nodeAgent:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 1000m, memory: 1Gi }

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 1000m, memory: 1Gi }

schedules:
  daily-full:
    disabled: false
    schedule: "0 3 * * *"
    template:
      ttl: "168h"                     # 7 days
      includedNamespaces: ["*"]
      excludedNamespaces:
        - velero
        - kube-system
        - kube-public
        - kube-node-lease
        - backup-test

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus-stack
```
### 4. Helm install
```bash
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update vmware-tanzu
helm install velero vmware-tanzu/velero \
  -n velero \
  -f velero-values.yaml \
  --wait --timeout 5m
```
### 5. Verify
```
$ kubectl get pods -n velero
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-cl6xg          1/1     Running   0          51s
node-agent-dp6fx          1/1     Running   0          51s
node-agent-pvq2w          1/1     Running   0          51s
velero-597b886f5b-cnkqf   1/1     Running   0          51s

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED   ACCESS MODE   DEFAULT
default   aws        velero          Available   …                ReadWrite     true

$ velero schedule get
NAME                STATUS    SCHEDULE    BACKUP TTL   LAST BACKUP
velero-daily-full   Enabled   0 3 * * *   168h0m0s     n/a
```
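There's no need to wait for 03:00 UTC to prove the schedule works; the Velero CLI can run it once immediately:

```bash
# Trigger one ad-hoc run of the daily schedule and wait for completion.
velero backup create --from-schedule velero-daily-full --wait
velero backup get   # the new velero-daily-full-<timestamp> backup should be Completed
```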
## End-to-end restore test
Create a throwaway namespace, back it up, destroy it, restore it, verify the data is identical.
```bash
# 1. Create test workload
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: {name: backup-test}
---
apiVersion: v1
kind: ConfigMap
metadata: {name: tiny-config, namespace: backup-test}
data: {hello.txt: "Hello from before the backup!"}
---
apiVersion: apps/v1
kind: Deployment
metadata: {name: tiny-app, namespace: backup-test}
spec:
  replicas: 1
  selector: {matchLabels: {app: tiny-app}}
  template:
    metadata: {labels: {app: tiny-app}}
    spec:
      containers:
        - name: tiny
          image: ghcr.io/stefanprodan/podinfo:6.11.2
          ports: [{containerPort: 9898}]
EOF

# 2. Take a backup (synchronous)
velero backup create test-backup --include-namespaces backup-test --wait

# 3. Bundle on disk (15 KiB total, 9 objects: Velero metadata + the
#    tar.gz of all manifests)
mc ls -r minilocal/velero/backups/test-backup/

# 4. Destroy the namespace
kubectl delete namespace backup-test --wait

# 5. Restore
velero restore create test-restore --from-backup test-backup --wait

# 6. Verify
kubectl get all,configmap -n backup-test
kubectl get cm tiny-config -n backup-test -o jsonpath='{.data.hello\.txt}'
# → "Hello from before the backup!"

# 7. Cleanup
kubectl delete namespace backup-test
velero backup delete test-backup --confirm
```
The restore succeeds (same Pod name, same ConfigMap data) within ~5 seconds. Without PVs this is a metadata-only round trip; for workloads with PVs, the node-agent reconstructs file-system contents from the Kopia file-system backup.
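A PV-backed variant of the same loop would exercise that path too. A sketch, assuming the cluster's default StorageClass provisions the PVC (names, sizes, and image are illustrative):

```bash
# Hypothetical PV round trip: write to a PVC, back up, wipe, restore, re-read.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: {name: backup-test}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata: {name: tiny-data, namespace: backup-test}
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 100Mi}}
---
apiVersion: v1
kind: Pod
metadata: {name: tiny-writer, namespace: backup-test}
spec:
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "echo 'PV data from before the backup' > /data/state.txt && sleep 3600"]
      volumeMounts: [{name: data, mountPath: /data}]
  volumes:
    - name: data
      persistentVolumeClaim: {claimName: tiny-data}
EOF

velero backup create pv-test --include-namespaces backup-test --wait
kubectl delete namespace backup-test --wait
velero restore create pv-restore --from-backup pv-test --wait

# Wait for tiny-writer to be Running again, then confirm the file survived:
kubectl exec -n backup-test tiny-writer -- cat /data/state.txt
# → "PV data from before the backup"
```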
## ArgoCD interaction warning
If the namespace being restored is managed by ArgoCD (e.g. our `homer`, `whoami`, `platform-demo`), ArgoCD's `selfHeal: true` may interpret the restored manifests as drift from the GitOps repo and try to revert them mid-restore. Always:
- Pause sync on affected Applications first:

  ```bash
  kubectl patch app -n argocd <app-name> --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
  ```

- Run the restore.
- Verify it landed.
- Re-enable sync once verified:

  ```bash
  kubectl patch app -n argocd <app-name> --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
  ```
This is a real "two declarative systems on the same cluster" conflict, documented in the Velero issue tracker. The pause-then-restore pattern is standard practice in any GitOps + Velero shop.
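When a restore spans several Applications, the same patches are easy to loop. A sketch (the app list is illustrative):

```bash
# Hypothetical helper: pause selfHeal on a set of apps, restore, verify, resume.
APPS="homer whoami platform-demo"
for app in $APPS; do
  kubectl patch app -n argocd "$app" --type merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
done

velero restore create --from-backup <backup-name> --wait

# ...verify the restore landed before re-enabling sync...
for app in $APPS; do
  kubectl patch app -n argocd "$app" --type merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done
```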
## Disaster Recovery Runbook
### Scenario A: a single namespace is corrupted
```bash
# Pause ArgoCD if applicable
kubectl patch app -n argocd <name> --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Find the latest backup
velero backup get | grep <namespace>

# Restore
velero restore create --from-backup <backup-name> --include-namespaces <namespace>

# Verify; re-enable ArgoCD
```
RTO: ~5 min for metadata-only namespaces, longer for workloads with PVs.
### Scenario B: full cluster wipe, set-hog SSD dies
- Reprovision set-hog via MAAS (Phase 0 procedures).
- Install the k3s control plane (Phase 1).
- Re-join `fast-skunk` and `fast-heron` (Phase 1).
- Copy `/srv/backups/k3s/state.db.<latest>` from the controller to set-hog only if rebuilding the cluster from scratch isn't possible; see Phase 14 (Etcd / SQLite snapshots) for the SQLite restore path.
- Re-install Velero (`helm install velero vmware-tanzu/velero -n velero -f velero-values.yaml`): same chart, same values, same MinIO endpoint.
- `velero restore create --from-backup velero-daily-full-<latest>` pulls all namespaces + PV data from the controller's MinIO.
- Verify each namespace.
RTO: 45–60 min, dominated by MAAS reimage time.
### Scenario C: controller dies (MinIO + snapshots gone)
This is the uncovered failure mode. We have one off-cluster backup target; if the controller's SSD dies we lose backups. Phase 15+ will add an off-site copy (e.g., to Backblaze B2 or an external USB drive on a weekly cron). For homelab portfolio purposes this is documented as a known limitation, not a TODO.
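The eventual shape is probably a periodic `mc mirror` of the bucket to a second target. A sketch, assuming a Backblaze B2 bucket reached via its S3-compatible endpoint (alias, endpoint, bucket, and keys are all hypothetical):

```bash
# Hypothetical Phase 15 off-site copy, run weekly from the controller's cron.
mc alias set b2 https://s3.us-west-004.backblazeb2.com <keyID> <applicationKey>
mc mirror --overwrite minilocal/velero b2/homelab-velero-offsite
```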
## Done When
- ✅ `minio.service` active on the controller, container has `Up` status
- ✅ `http://10.0.0.1:9000/minio/health/live` returns 200
- ✅ Velero pods (1 server + 3 node-agent) all `Running` in the cluster
- ✅ `velero backup-location get` shows `default` Available
- ✅ `velero schedule get` shows `velero-daily-full` Enabled
- ✅ One end-to-end test (backup → delete → restore) verified
- ✅ Bundle visible at `/srv/backups/minio/velero/backups/<name>/`
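A sketch that mechanically checks most of this list from the controller (assumes `kubectl`, `velero`, and `mc` are configured as above):

```bash
#!/usr/bin/env bash
# Done-when checks, run from the controller (sketch).
set -u
systemctl is-active --quiet minio.service && echo "OK: minio.service active" || echo "FAIL: minio.service"
curl -sf http://10.0.0.1:9000/minio/health/live >/dev/null && echo "OK: S3 health" || echo "FAIL: S3 health"
kubectl get pods -n velero --no-headers | awk '$3 != "Running" {bad=1} END {exit bad}' \
  && echo "OK: velero pods Running" || echo "FAIL: velero pods"
velero backup-location get default | grep -q Available && echo "OK: BSL Available" || echo "FAIL: BSL"
velero schedule get | grep -q "velero-daily-full.*Enabled" && echo "OK: schedule Enabled" || echo "FAIL: schedule"
```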
## Real-world skills demonstrated
| Skill | Industry context |
|---|---|
| Decoupling backup storage from the source system | The single most important rule of DR; same as "always store backups off-site" in traditional sysadmin |
| MinIO as a self-hosted S3 substitute | Standard pattern in air-gapped, on-prem, and homelab Kubernetes. Same shape as production teams running MinIO Operator on dedicated nodes. |
| systemd + Docker for host services | Canonical way to run a third-party container on a bare metal host (vs. running it as a k8s pod when k8s is the thing being backed up) |
| Velero with file-system PV backup (Kopia) | The default for any cluster without CSI VolumeSnapshot support β covers Longhorn, NFS, hostPath, etc. |
| Out-of-band credential injection | The `~/.minio-admin` → bind-mount into container → environment variable pattern. Same shape as Vault Agent injection. |
| ArgoCD/Velero coexistence pattern | Real production challenge β solved by pausing ArgoCD selfHeal during restore. Documented runbook is the actual deliverable. |
| Honest failure-mode documentation | The "controller dies → backups lost" gap is real. A portfolio that names the gap is more credible than one that pretends everything is recoverable. |