k3s Control-Plane Snapshots: Pivoted from etcd to SQLite
The original Phase 14 plan called for k3s's built-in etcd snapshots via
--etcd-snapshot-schedule-cron. That assumed embedded etcd as the
datastore. Single-server k3s defaults to SQLite via kine, not embedded etcd, so the etcd-snapshot config is
silently ignored on our cluster.
This page documents the pivot: instead of etcd snapshots, we take
hourly online SQLite backups of k3s's state.db via sqlite3 .backup,
pulled to the MAAS controller by a cron job. Functionally equivalent:
we get a consistent point-in-time snapshot of the entire cluster control
plane, just via a different mechanism.
The other half of Phase 14 (Velero for application + PV data) is documented in Phase 14: Velero.
Why SQLite (and not etcd)
Single-server k3s without --cluster-init uses kine over SQLite as
the datastore. The state.db file at
/var/lib/rancher/k3s/server/db/state.db contains the entire Kubernetes
API state: every Deployment, Service, ConfigMap, Secret, RBAC rule,
custom resource. Backing up that one file backs up the cluster's "memory."
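You can see this directly by querying a snapshot. In current kine versions the state lives in a single table named kine, keyed by etcd-style /registry/... paths (the timestamped filename below is illustrative):

```shell
# Read-only peek inside a pulled snapshot; kine keeps one row per object
# revision under /registry/... keys. Filename is an example.
sqlite3 -readonly /srv/backups/k3s/state.db.20250101T000000Z \
  "SELECT name FROM kine WHERE name LIKE '/registry/deployments/%' LIMIT 5;"
```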
Migrating a live single-server cluster to embedded etcd:
- Requires --cluster-init and a one-time data migration
- No automatic migration tooling; would need re-bootstrap + redeploy
- High risk of disruption for an already-working cluster
For a single-control-plane homelab, SQLite is fine. Production HA clusters use embedded etcd because etcd supports the multi-server raft consensus k3s needs to replicate state. We're not HA today; if a future phase pursues HA control plane, the migration would happen there.
How sqlite3 .backup works
SQLite's online-backup API takes a consistent snapshot of a live
database while it's being written to. The .backup command in the
sqlite3 CLI is a thin wrapper:
sqlite3 /var/lib/rancher/k3s/server/db/state.db '.backup /tmp/snapshot.db'
The result is a self-contained SQLite database file with no -wal or
-shm sidecars to worry about. Cleanly copyable.
Without SQLite's online-backup API, naïvely running cp state.db ... while k3s is
writing produces a corrupt or stale snapshot: the -wal sidecar holds
committed-but-not-yet-checkpointed transactions that the copied .db file alone doesn't reflect.
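The pattern is easy to verify on a throwaway database (paths are illustrative):

```shell
# Create a tiny live WAL-mode database, snapshot it with the online-backup
# API, and confirm the copy is a complete, valid database on its own.
sqlite3 /tmp/demo.db 'PRAGMA journal_mode=WAL; CREATE TABLE t(x); INSERT INTO t VALUES(42);'
sqlite3 /tmp/demo.db '.backup /tmp/demo-snap.db'
sqlite3 /tmp/demo-snap.db 'SELECT x FROM t;'          # → 42
sqlite3 /tmp/demo-snap.db 'PRAGMA integrity_check;'   # → ok
```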
Architecture
┌──────────────────┐                  ┌──────────────────────────────┐
│ set-hog          │                  │ Controller                   │
│ (control plane)  │                  │                              │
│                  │ pull-mode rsync  │ /srv/backups/k3s/            │
│ /var/lib/rancher │ ◄──── SSH ────── │   state.db.<UTC>             │
│ /k3s/server/db/  │                  │                              │
│   state.db       │                  │ /etc/cron.d/k3s-state-       │
│                  │                  │   snapshot (hourly, HH:17)   │
│ + sqlite3 .backup│                  │                              │
│   to /tmp first  │                  │ /usr/local/bin/k3s-state-    │
│                  │                  │   snapshot (the script)      │
└──────────────────┘                  └──────────────────────────────┘
Pull mode (controller initiates) keeps the security boundary clean:
the cluster nodes don't need credentials to write to the controller. The
controller's user (ktayl) has SSH key access to ubuntu@10.0.0.2 with
passwordless sudo (set up in Phase 0); we use those existing creds.
Decisions
| Decision | Choice | Why |
|---|---|---|
| Snapshot mechanism | sqlite3 .backup (online, no k3s downtime) | Takes a consistent copy of a live SQLite database with zero disruption |
| Snapshot direction | Pull from controller, not push from cluster | Keeps the security boundary one-way; cluster never touches controller filesystem |
| Cadence | Hourly at HH:17 | Hourly granularity covers most "yesterday's looking-good state" recovery; HH:17 to avoid clashing with on-the-hour cron rushes |
| Retention | 720 snapshots ≈ 30 days | One month at hourly is ~28 GB worst case; controller has 75+ GiB free |
| Storage location | /srv/backups/k3s/state.db.<UTC> (mode 700, owned by ktayl) | Same /srv/backups/ root as MinIO's data β single backup tree to monitor |
| Logging | logger -t k3s-state-snapshot → syslog | Visible via journalctl -t k3s-state-snapshot; easy to find during an incident |
Pre-flight
Install sqlite3 on set-hog (one-time):
ssh ubuntu@10.0.0.2 'sudo apt-get install -y sqlite3'
Note: this is also added to the Phase 10 Ansible
common role baseline so a fresh node provision picks it up automatically.
The pull script
/usr/local/bin/k3s-state-snapshot (owned by root, mode 0755):
#!/usr/bin/env bash
# k3s-state-snapshot - pulls a consistent online SQLite backup of set-hog's
# k3s control-plane state to the MAAS controller. Cron'd hourly.
set -euo pipefail
NODE="ubuntu@10.0.0.2"
DB="/var/lib/rancher/k3s/server/db/state.db"
DEST="/srv/backups/k3s"
RETAIN_COUNT=720 # 720 hourly snapshots = 30 days
TS=$(date -u +%Y%m%dT%H%M%SZ)
TARGET="${DEST}/state.db.${TS}"
mkdir -p "$DEST"
# Online backup on the source node, then rsync it back. The EXIT trap removes
# the /tmp copy even if rsync fails (with set -e, a plain trailing rm would
# never run on failure). chown/chmod keep the snapshot readable only by the
# ubuntu user rsync connects as -- it contains every Secret in the cluster.
trap 'ssh -o BatchMode=yes "$NODE" "sudo rm -f /tmp/k3s-state-pull.db" || true' EXIT
ssh -o BatchMode=yes "$NODE" "sudo sqlite3 ${DB} '.backup /tmp/k3s-state-pull.db' && sudo chown ubuntu:ubuntu /tmp/k3s-state-pull.db && sudo chmod 600 /tmp/k3s-state-pull.db"
rsync -az --partial "${NODE}:/tmp/k3s-state-pull.db" "$TARGET"
# Rotate
cd "$DEST"
ls -1t state.db.* 2>/dev/null | tail -n +$((RETAIN_COUNT + 1)) | xargs -r rm --
# Log
SIZE=$(stat -c %s "$TARGET" 2>/dev/null || echo 0)
COUNT=$(ls -1 state.db.* 2>/dev/null | wc -l)
logger -t k3s-state-snapshot "captured ${TS} (${SIZE} bytes), retaining ${COUNT} snapshots"
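The rotation one-liner is terse; here is how it behaves on dummy files, using RETAIN_COUNT=3 for the demo in a scratch directory (not the real /srv/backups/k3s):

```shell
# Five fake snapshots with increasing mtimes; list newest-first, skip the
# 3 we keep (tail -n +$((3 + 1))), delete the rest.
cd "$(mktemp -d)"
for i in 1 2 3 4 5; do touch -d "2025-01-0${i}" "state.db.${i}"; done
ls -1t state.db.* | tail -n +4 | xargs -r rm --
ls -1 state.db.*
# → state.db.3
# → state.db.4
# → state.db.5
```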
The cron entry
/etc/cron.d/k3s-state-snapshot:
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
17 * * * * ktayl /usr/local/bin/k3s-state-snapshot
(Runs as the ktayl user, which has SSH key access to ubuntu@10.0.0.2.)
Verification
# Trigger one manually to confirm
/usr/local/bin/k3s-state-snapshot
# Verify a fresh snapshot landed
ls -la /srv/backups/k3s/
# Confirm it's a valid SQLite file (not a half-copied state)
file /srv/backups/k3s/state.db.<UTC>
# → SQLite 3.x database, last written using SQLite version 3045001, …
# Watch syslog for cron-triggered runs
journalctl -t k3s-state-snapshot -f
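file only checks the header; for a stronger guarantee that a snapshot is fully readable, run SQLite's own checker against it (read-only, so the check can't modify the snapshot):

```shell
# Full structural validation of the newest snapshot; prints "ok" when sound.
LATEST=$(ls -1t /srv/backups/k3s/state.db.* | head -1)
sqlite3 -readonly "$LATEST" 'PRAGMA integrity_check;'
```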
A typical snapshot is ~38-40 MB for our small cluster (handful of
namespaces, no large stateful workloads in the API state). At 720
snapshots × 40 MB ≈ 28 GB worst case, well within the 75+ GiB free on
/srv.
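The math in one line, assuming the ~40 MB per-snapshot figure holds:

```shell
# 720 hourly snapshots at ~40 MB each, in GiB (integer math is close enough):
echo "$(( 720 * 40 / 1024 )) GiB"   # → 28 GiB
```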
Restore
Scenario: set-hog SSD dies, control plane fully lost
The SQLite snapshots can't be used for a partial restore; they're an all-or-nothing copy of the entire control-plane state. The procedure is:
- Reprovision set-hog via MAAS (Phase 0 procedures).
- Install k3s (Phase 1) but don't start it yet; we want to inject our snapshot first.
- Stop k3s if it auto-started:
sudo systemctl stop k3s
- Copy the latest snapshot from controller to the new set-hog:
# On controller
LATEST=$(ls -1t /srv/backups/k3s/state.db.* | head -1)
scp "$LATEST" ubuntu@10.0.0.2:/tmp/k3s-restore.db
- Replace the live state.db on set-hog:
ssh ubuntu@10.0.0.2
sudo systemctl stop k3s
sudo mv /var/lib/rancher/k3s/server/db/state.db{,.dead}
sudo mv /var/lib/rancher/k3s/server/db/state.db-{wal,shm} /tmp/ 2>/dev/null || true
sudo cp /tmp/k3s-restore.db /var/lib/rancher/k3s/server/db/state.db
sudo chown root:root /var/lib/rancher/k3s/server/db/state.db
sudo chmod 644 /var/lib/rancher/k3s/server/db/state.db
sudo systemctl start k3s
- Wait for the API to come up:
kubectl get nodes
The other two workers will rejoin automatically.
- Verify all namespaces and workloads are present:
kubectl get all --all-namespaces
If application data (PV contents) is also missing, follow up with a Velero restore: Velero handles the data, this snapshot handles the manifests/state.
Estimated RTO: 30-45 min total.
What this does NOT cover
- PV data. The SQLite snapshot holds only Kubernetes API state, not the contents of Longhorn / NFS volumes. Velero's node-agent + Kopia covers that.
- Container images. Harbor's data lives on Longhorn; Velero's PV backup covers it. Pulled images on each node are caches; the registry is the source of truth.
- Off-site replication. All snapshots live on the controller. If the controller dies, both backup layers are gone. Phase 15+ will add an off-site copy.
Done When
✅ sqlite3 installed on set-hog
✅ /usr/local/bin/k3s-state-snapshot present and runnable
✅ /etc/cron.d/k3s-state-snapshot installed
✅ One manual run produced a valid SQLite file in /srv/backups/k3s/
✅ Restore procedure documented
Real-world skills demonstrated
| Skill | Industry context |
|---|---|
| Recognizing the SQLite-vs-etcd distinction | Single-server k3s, single-node k0s, and k3s-based tools like Rancher Desktop all default to SQLite via kine. Knowing which datastore is in play changes every backup decision. |
| sqlite3 .backup for online-consistent snapshots | Same pattern works for any embedded SQLite database (Plex, Home Assistant, Open WebUI, Sonarr, etc.). Zero downtime, no WAL drama. |
| Pull-mode backups across security boundaries | Cluster nodes shouldn't have write access to the backup target. Pull mode keeps the directionality of trust correct. |
| Risk-aware "no migration" decision | Choosing not to migrate a live cluster from SQLite to embedded etcd, even though etcd snapshots were the original plan, is a senior call. The right time to use embedded etcd is when you're building HA from day 1. |
| Cadence and retention math | 720 hourly snapshots × 40 MB ≈ 28 GB. Always do the math before setting retention. |
| logger -t for cron observability | Without this, cron-driven failures are silent until something else surfaces them. journalctl -t <tag> is the lightweight observability path before metrics-server. |