# Phase 20 – Chaos Mesh: validation engineering
Phase 20 isn't really "install Chaos Mesh." It's validation engineering: using deliberate fault injection to prove that the resilience claims made in earlier phases actually hold under failure.
The cluster turned into a scientific instrument:
- Phase 9 said: *"podinfo runs 2 replicas with soft podAntiAffinity + HPA, so it stays available under pod loss."* → Phase 20 Experiment 1 measured 0 ms downtime under continuous pod kills (5 kill events over ~4 min, 252 probes, 100.0000% HTTP 200).
- Phase 8 said: *"we have observability: anomalies will be visible in Grafana."* → Phase 20 Experiment 2 measured the 11× latency spike during injected network delay and verified clean restoration after cleanup.
- Phase 5 + Kubernetes resource limits said: *"per-container memory limits contain failures to the cgroup, not the node."* → Phase 20 Experiment 3 OOM-killed a stressed container while the 30+ node-mate pods on the same node had zero new restarts.
These aren't hypothetical claims anymore. They're measured properties. That's what chaos engineering is for.
## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│ chaos-mesh namespace                                           │
│                                                                │
│  ┌──────────────────────────┐   ┌──────────────────────────┐   │
│  │ chaos-controller-manager │   │ chaos-dashboard          │   │
│  │ (3 replicas, HA)         │   │ (cluster-internal only)  │   │
│  │ reconciles CRDs          │   │ port-forward access      │   │
│  └────────────┬─────────────┘   └──────────────────────────┘   │
│               │ instructs                                      │
│               ▼                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ chaos-daemon (DaemonSet, 1 per node, privileged)         │  │
│  │   set-hog · fast-skunk · fast-heron                      │  │
│  │   uses nsexec to enter pod netns, applies tc/netem,      │  │
│  │   mounts cgroups for OOM/stress, talks to containerd     │  │
│  │   socket /run/k3s/containerd/containerd.sock             │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
```
The daemon is the actual fault-injection engine. When you create a NetworkChaos CR, the controller-manager schedules it onto the chaos-daemon on the node where the target pod is running. The daemon calls `/usr/local/bin/nsexec -n /proc/<pid>/ns/net -- tc qdisc add ...`: it enters the pod's network namespace and applies a real Linux Traffic Control rule.

This is not a userspace "pretend" simulator. The pod really does see 200 ms of latency on its eth0, applied by the kernel.
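You can check this from the node itself. A verification sketch, assuming SSH access to the node and the `crictl` bundled with k3s (the `--name podinfo` filter and the expected output line are illustrative):

```bash
# Resolve the podinfo container's PID, then list qdiscs inside its network namespace.
SOCK=unix:///run/k3s/containerd/containerd.sock
CID=$(sudo crictl --runtime-endpoint "$SOCK" ps --name podinfo -q | head -1)
PID=$(sudo crictl --runtime-endpoint "$SOCK" inspect -o go-template \
      --template '{{.info.pid}}' "$CID")
sudo nsenter -t "$PID" -n tc qdisc show dev eth0
# While a NetworkChaos delay is active, expect a line of this shape:
#   qdisc netem 1: root ... delay 200ms  50ms 25%
```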
## Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Chart | `chaos-mesh/chaos-mesh` v2.8.2 | Standard install. App version = chart version. |
| Runtime hook | `chaosDaemon.runtime: containerd` + `socketPath: /run/k3s/containerd/containerd.sock` | The single most important override. k3s embeds containerd at a non-default path; without this the daemon can't see running containers and every experiment fails with `container runtime not found`. See the socket check after this table. |
| Dashboard exposure | Cluster-internal only (port-forward) | The chaos dashboard is a cluster-wide kill switch. Anyone reaching it can delete pods, partition networks, OOM-kill nodes. Not the kind of UI you put behind a public Ingress, even with TLS. |
| Dashboard auth | RBAC token-based (chart default `securityMode: true`) | Token at `~/.chaos-dashboard-token` (mode 600); ServiceAccount `chaos-dashboard-operator`; ClusterRole `chaos-mesh-cluster-manager`. |
| DNSChaos sidecar | Disabled (`dnsServer.create: false`) | Optional component, only needed for DNSChaos experiments, which aren't in Phase 20's scope. |
| Image source | All chaos-mesh images pulled through the Phase 16 Harbor ghcr proxy cache | Validates the sovereign-registry pattern again: 5 distinct images all flowed through Harbor on first pull. |
| Targets | All 3 experiments target podinfo only | podinfo (Phase 9) is the disposable test workload: 2 replicas, HPA, ServiceMonitor, custom Grafana dashboard, soft podAntiAffinity. Exactly the resilience pattern we want to validate. |
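The socket check referenced in the table is trivial to run before installing; a sketch over SSH on any k3s node:

```bash
# Confirm the k3s-embedded containerd socket exists and answers.
ls -l /run/k3s/containerd/containerd.sock
sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps | head -5
```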
## What's deliberately deferred
| Component | Why deferred | Future home |
|---|---|---|
| Public Ingress for the dashboard | Kill switch: never expose publicly without strong auth | When Keycloak SSO + RBAC gating land in a future security phase |
| NodeChaos (full node shutdown) | Single control-plane cluster: killing set-hog kills the cluster. NodeChaos against worker nodes is OK but adds blast-radius risk for the portfolio demo. | If/when an HA control-plane phase lands |
| Automated GameDays via cron | "Chaos while you sleep" on a single-operator portfolio cluster is a bad idea. The discipline is "I press the button while I'm watching." | When a real on-call rotation exists |
| Chaos against ArgoCD / Harbor / cert-manager | These are cluster-control-plane workloads. Killing them mid-experiment can leave the cluster in a bad state, and recovery requires bypassing GitOps. | Out of scope until an HA control plane |
| Chaos Mesh Workflow CRDs (sequential experiment chains) | Single-experiment-per-session is enough to demonstrate the discipline | When real GameDays start |
This is the same scope-reduction pattern as every prior phase (Crossplane in Phase 11, GitLab in Phase 13, Vault in Phase 15, n8n/Temporal/Airflow in Phase 16, plugins in Phase 18, MLflow/Kubeflow in Phase 19).
## Install

`chaos-mesh-values.yaml`:
```yaml
chaosDaemon:
  runtime: containerd
  socketPath: /run/k3s/containerd/containerd.sock
  hostNetwork: false

dashboard:
  create: true
  serviceType: ClusterIP
  securityMode: true          # require RBAC token
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }

controllerManager:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }

dnsServer:
  create: false               # optional; not used in Phase 20

webhook:
  certManager:
    enabled: false            # built-in self-signed is fine for an internal webhook
```
```bash
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update chaos-mesh
kubectl create namespace chaos-mesh

helm install chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-mesh \
  -f chaos-mesh-values.yaml \
  --version 2.8.2 \
  --wait --timeout 5m
```
Expected after install:

```
$ kubectl get pods -n chaos-mesh
NAME                              READY   STATUS    RESTARTS
chaos-controller-manager-...-1    1/1     Running
chaos-controller-manager-...-2    1/1     Running    ← 3 replicas
chaos-controller-manager-...-3    1/1     Running      (HA across nodes)
chaos-daemon-...                  1/1     Running    ← DaemonSet,
chaos-daemon-...                  1/1     Running      1 per node
chaos-daemon-...                  1/1     Running
chaos-dashboard-...               1/1     Running

$ kubectl get crd | grep -c chaos-mesh.org
23
```
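Worth checking immediately: with a wrong `socketPath` the daemon pods still come up Running, and the failure only surfaces when the first experiment fires. Skim the daemon's startup logs for the runtime connection (same label selector used later in Experiment 2):

```bash
# The chaos-daemon logs which runtime endpoint it connected to at startup.
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon \
  --tail=50 | grep -i containerd
```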
## Dashboard RBAC + token
```yaml
# chaos-mesh/dashboard-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-dashboard-operator
  namespace: chaos-mesh
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-mesh-cluster-manager
rules:
  - apiGroups: [""]
    resources: [pods, namespaces]
    verbs: [get, watch, list]
  - apiGroups: [chaos-mesh.org]
    resources: ["*"]
    verbs: [get, list, watch, create, delete, patch, update]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-dashboard-operator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-mesh-cluster-manager
subjects:
  - kind: ServiceAccount
    name: chaos-dashboard-operator
    namespace: chaos-mesh
---
apiVersion: v1
kind: Secret
metadata:
  name: chaos-dashboard-operator-token
  namespace: chaos-mesh
  annotations:
    kubernetes.io/service-account.name: chaos-dashboard-operator
type: kubernetes.io/service-account-token
```
```bash
kubectl apply -f chaos-mesh/dashboard-rbac.yaml

# Capture the token (one-time)
kubectl -n chaos-mesh get secret chaos-dashboard-operator-token \
  -o jsonpath='{.data.token}' | base64 -d > ~/.chaos-dashboard-token
chmod 600 ~/.chaos-dashboard-token

# Access the dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333 and paste the token from ~/.chaos-dashboard-token
```
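On Kubernetes 1.24+ there's an alternative to the long-lived Secret-backed token: mint a short-lived one on demand via the TokenRequest API (a sketch; pick the duration to taste):

```bash
# Short-lived dashboard token bound to the same ServiceAccount.
kubectl create token chaos-dashboard-operator -n chaos-mesh --duration=8h \
  > ~/.chaos-dashboard-token
chmod 600 ~/.chaos-dashboard-token
```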
## Steady-state hypothesis discipline
Each experiment follows the classic chaos engineering loop (from Netflix's Chaos Engineering book):
1. Define the steady state: what "healthy" looks like (probe + threshold)
2. Hypothesize that injection X won't break the steady state
3. Inject X (the chaos)
4. Observe: does the steady state hold?
5. Either the hypothesis is confirmed, or you found a real gap to fix

We measure with a synchronous availability probe running at 1 Hz in a parallel shell, so the experiment captures actual user-facing behavior, not just Kubernetes-internal events.
## Experiment 1: PodChaos pod-kill
Hypothesis: podinfo (2 replicas, soft podAntiAffinity, HPA) stays available under continuous pod kills.
Manifest (`chaos-mesh/exp1-podkill.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: podinfo-kill
  namespace: chaos-mesh
spec:
  schedule: "*/30 * * * * *"   # 6-field cron with seconds
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [podinfo]
      labelSelectors:
        "app.kubernetes.io/name": "podinfo"
    gracePeriod: 0
```
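Optionally, watch the churn from a third shell while the Schedule fires:

```bash
# Each kill shows one pod Terminating and a fresh replacement racing to Ready.
kubectl get pods -n podinfo -w
```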
Probe loop (run in a separate shell, fires once per second):
```bash
PROBE_LOG=/tmp/exp1-probe.log
> "$PROBE_LOG"
(
  for i in $(seq 1 360); do
    out=$(curl -sf --cacert ~/minicloud-ca.crt -m 2 -o /dev/null \
          -w "%{http_code} %{time_total}" \
          https://podinfo.10.0.0.200.nip.io/version 2>/dev/null)
    rc=$?
    ts=$(date +%s.%N)
    echo "$ts $rc $out" >> "$PROBE_LOG"
    sleep 1
  done
) > /dev/null 2>&1 &
```
Then `kubectl apply -f exp1-podkill.yaml`, let it run for ~4-5 minutes, then `kubectl delete schedule -n chaos-mesh podinfo-kill`.
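To turn the raw log into the numbers in the table below, a post-processing sketch (column layout matches the probe loop above: `ts rc http_code time_total`; percentiles use the nearest-rank method):

```bash
# Availability
awk '{ n++ } $3 == 200 { ok++ } END { printf "availability %.4f%% (%d/%d)\n", 100*ok/n, ok, n }' \
  /tmp/exp1-probe.log

# Latency p50 / p99 / max in milliseconds (successful probes only)
awk '$3 == 200 { print $4 * 1000 }' /tmp/exp1-probe.log | sort -n > /tmp/exp1-lat.ms
for q in 50 99; do
  awk -v q="$q" '{ a[NR] = $1 }
       END { i = int(NR * q / 100); if (i < 1) i = 1; printf "p%d %.0f ms\n", q, a[i] }' \
    /tmp/exp1-lat.ms
done
tail -1 /tmp/exp1-lat.ms | awk '{ printf "max %.0f ms\n", $1 }'
```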
Result (measured live):
| Metric | Value |
|---|---|
| Probes (1 Hz, ~4 min) | 252 |
| HTTP 200 responses | 252 |
| Failures (any non-200, timeout, conn refused) | 0 |
| Availability | 100.0000% |
| Maximum unreachable duration | 0 ms |
| Latency p50 | 26 ms |
| Latency p99 | 61 ms |
| Latency max (single outlier during a replacement) | 303 ms |
| Total pod replacements observed | ~13 |
| PodChaos events fired | 5 |
Why this works: with `replicas=2` split across nodes, killing any one pod still leaves one Ready endpoint in the ingress-nginx EndpointSlice. The replacement pod is Scheduled within ~1 s and Ready within ~3 s. The 303 ms max latency is the worst case: a probe arrived during the brief window when one pod was terminating, and the request landed on the surviving pod (still a 200, just a slower TCP handshake or cold caches).
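The mechanism is directly observable: while the Schedule runs, the service's EndpointSlice briefly drops to one ready endpoint but never to zero (a spot-check, not part of the measured run):

```bash
# Ready endpoints for podinfo, refreshed as pods die and return.
kubectl get endpointslices -n podinfo \
  -l kubernetes.io/service-name=podinfo -w
```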
## Experiment 2: NetworkChaos delay
Hypothesis: Phase 8 observability (Prometheus + Grafana) sees a 200 ms latency injection in real time, and the chaos engine cleanly restores network state on cleanup.
Manifest (`chaos-mesh/exp2-netdelay.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: podinfo-latency
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces: [podinfo]
    labelSelectors:
      "app.kubernetes.io/name": "podinfo"
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  duration: "3m"
```
What the chaos-daemon actually runs (visible in `kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon`):

```
nsexec -n /proc/<pid>/ns/net -- tc qdisc add dev eth0 root \
    handle 1: netem delay 200ms 50ms 25.000000
```

This is straight Linux Traffic Control: real kernel-level latency injection inside the pod's network namespace, not a userspace fake.
Result:

| Phase | Mean latency | Multiple of baseline |
|---|---|---|
| Pre-chaos baseline | ~26 ms | 1× |
| Under chaos | 299 ms | ~11× |
| Post-cleanup | 25 ms | 1× (clean recovery) |
The 11× ratio matches expectations: the 200 ms ± 50 ms delay on the pod's outbound packets adds roughly 250-300 ms per request on top of the ~26 ms baseline, putting the expected total at ~276-326 ms. The measured 299 ms lands dead-center.
After the NetworkChaos resource is deleted, the daemon issues `tc qdisc del dev eth0 root` and latency snaps back to baseline within seconds.
## Experiment 3: StressChaos memory (the safety-critical one)
Hypothesis: the 256 MiB cgroup memory limit on the podinfo container contains a memory blowup at the container level โ node-mate pods are unaffected and the node stays Ready.
Critical safety check before running this experiment: verify the target has resource limits set:

```bash
kubectl get deploy -n podinfo podinfo \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
# → 256Mi
```

If the limit is missing, the stressor can eat the entire node's RAM, the kernel OOM-killer fires at the node level, and you can lose SSH access to the box. Never run StressChaos against an unbounded container on a production-equivalent node.

Right-sizing the stressor: the original draft used 4 GiB, but that's overkill on a 256 MiB-limited container (the OOM-kill happens in microseconds, and 4 GiB would only matter if the limit were removed by accident). 512 MB is 2× the limit: enough to guarantee an OOM-kill while keeping the failure clearly contained.
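That check is easy to automate as a guard in front of the apply; a hypothetical pre-flight wrapper:

```bash
# Refuse to launch the stressor unless the target container has a memory limit.
LIMIT=$(kubectl get deploy -n podinfo podinfo \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}')
if [ -z "$LIMIT" ]; then
  echo "ABORT: no memory limit on target container; the stressor could take the node down" >&2
  exit 1
fi
echo "memory limit is $LIMIT; the OOM-kill will be contained to the cgroup"
kubectl apply -f chaos-mesh/exp3-memstress.yaml
```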
Manifest (`chaos-mesh/exp3-memstress.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: podinfo-memory-stress
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces: [podinfo]
    labelSelectors:
      "app.kubernetes.io/name": "podinfo"
  stressors:
    memory:
      workers: 1
      size: "512MB"
  duration: "2m"
```
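While it runs, you can optionally watch the working set climb toward the ceiling (assumes metrics-server, which k3s ships by default; expect coarse ~15 s resolution):

```bash
# The target's memory usage spikes toward 256 Mi, then the restart resets it.
watch -n 2 kubectl top pod -n podinfo
```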
Result:
| Pod / workload | Restart count change | Outcome |
|---|---|---|
| podinfo (target pod) | 0 → 1, reason OOMKilled | Container OOM-killed at the cgroup, restarted, back to Running |
| podinfo (other replica, different node) | 0 | Unaffected |
| All ~30 other pods on the target node (Backstage, Harbor-Trivy, NATS, Prometheus, Longhorn-Manager, MetalLB-Speaker, etc.) | 0 new restarts | Node-mate isolation held |
| Service availability (podinfo.10.0.0.200.nip.io/version) | n/a | Probes returned HTTP 200 throughout (other replica served traffic) |
This is exactly the failure mode the pre-flight check guards against: without limits, a 512 MiB stressor on a 16 GiB node would still get OOM-killed eventually, but possibly only after the kernel had already started killing other innocent processes to reclaim memory. With limits set, the kill is scoped to the cgroup and happens before the node notices any pressure.
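One more confirmation worth running: the restart reason recorded in pod status should be `OOMKilled` (a cgroup kill), not an eviction. A sketch:

```bash
# Last termination reason per podinfo pod; expect OOMKilled on the stressed replica only.
kubectl get pods -n podinfo -l app.kubernetes.io/name=podinfo -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```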
## What we learned
- Resilience claims are testable, not philosophical. Phase 9 said "stays available under pod loss." Phase 20 measured exactly that: 0 ms downtime over 5 kill events. That's the difference between portfolio-grade and pretend.
- The k3s socket override matters. Without `chaosDaemon.socketPath: /run/k3s/containerd/containerd.sock`, the daemon comes up but every experiment fails with `container runtime not found`. The chart README mentions it; the official docs don't call it out loudly. Future-you will appreciate it being in CLAUDE.md.
- Always verify resource limits before running stressors. `kubectl get deploy <target> -o jsonpath=...resources.limits` is the difference between a safe, contained OOM-kill and "why can't I SSH into this node anymore."
- Linux Traffic Control is the kernel-level reality of network chaos. Reading `chaosdaemon/tc_server.go` shows `tc qdisc add dev eth0 root handle 1: netem delay 200ms 50ms`: the same `tc`/`netem` command line you'd use by hand on a bare-metal Linux box. Chaos Mesh is just a sane management layer over a real kernel facility.
- Auto-cleanup is part of the contract. Every chaos experiment has a `duration:` field. After expiry, the daemon issues the inverse operation (`tc qdisc del`, kill the stressor process) and the system returns to its prior state. We verified this for NetworkChaos by measuring post-cleanup latency = baseline.
## Done When
- ✅ 7 chaos-mesh pods Running (3 controller-manager, 3 daemon, 1 dashboard)
- ✅ 23 chaos-mesh.org CRDs registered
- ✅ Dashboard RBAC token at ~/.chaos-dashboard-token, mode 600
- ✅ kubectl port-forward to dashboard works, token logs in
- ✅ Experiment 1: podinfo 100.0000% available across 5 pod-kills
- ✅ Experiment 2: latency went 26 ms → 299 ms → 25 ms (auto-cleanup verified)
- ✅ Experiment 3: contained OOM-kill, no node-mate pods restarted
- ✅ All chaos resources deleted; cluster back to steady state
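The checklist condenses into a one-shot verification; a sketch using only numbers from this phase:

```bash
# Done-when assertions for Phase 20.
kubectl get pods -n chaos-mesh --no-headers \
  | awk '$3 != "Running" { bad++ } END { if (bad > 0) exit 1 }' && echo "pods: OK"
[ "$(kubectl get crd -o name | grep -c chaos-mesh.org)" -eq 23 ] && echo "CRDs: OK"
[ "$(stat -c %a ~/.chaos-dashboard-token)" = "600" ] && echo "token perms: OK"
[ -z "$(kubectl get schedule,networkchaos,stresschaos -A --no-headers 2>/dev/null)" ] \
  && echo "steady state: no chaos resources left"
```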
## Real-world skills demonstrated
| Skill | Industry context |
|---|---|
| Validation engineering | Different from "build it." Senior infra interviews ask "how do you know it works under failure?" Phase 20 is the answer. |
| Steady-state-hypothesis discipline | The Netflix Chaos Engineering book formalism: define the steady state, predict the outcome, inject, observe, learn. This is the difference between chaos engineering and chaos. |
| Linux Traffic Control / netem mental model | Knowing that `tc qdisc add dev eth0 root netem delay 200ms` is the actual mechanism behind NetworkChaos, and being able to read the daemon's logs to verify it, separates "operates the tool" from "understands the tool." |
| cgroup memory limits as isolation boundaries | The instinct to ask "is the limit set before I run a stressor?" is the difference between a chaos test and a node-killing accident. Real production chaos engineers are paranoid about this. |
| Synchronous availability probing | Polling at 1 Hz from a separate shell while injecting failure is the canonical methodology. Same shape as canary analysis, blue/green smoke tests, k6/Locust load tests. |
| Kill-switch tooling → security posture | Recognizing that the chaos dashboard is a kill switch and refusing to give it a public Ingress is the same instinct as "don't expose Redis to the internet" — production-correct security thinking. |
| Reading controller-runtime CRDs in practice | PodChaos/NetworkChaos/StressChaos all follow the operator pattern. Reading their `.spec` and `.status` fluently is practice for every k8s controller you'll meet on the job. |
| Senior scope reduction | NodeChaos / automated cron GameDays / dashboard Ingress all deliberately deferred. Same skill as every prior deferral. |