
Platform Roadmap — Complete Blueprint

From current bare-metal infra → full private cloud platform (production-grade).


Final Target Architecture

            ┌─────────────────────────────┐
            │          Developer          │
            │    kubectl / Git / CI/CD    │
            └──────────────┬──────────────┘
                           │
                  ┌────────▼────────┐
                  │  Git Platform   │
                  │ (GitLab/Gitea)  │
                  └────────┬────────┘
                           │
                  ┌────────▼────────┐
                  │  GitOps Layer   │
                  │     ArgoCD      │
                  └────────┬────────┘
                           │
             ┌─────────────▼─────────────┐
             │     Kubernetes (k3s)      │
             │       set-hog (CP)        │
             │    fast-skunk (worker)    │
             │    fast-heron (worker)    │
             └─────────────┬─────────────┘
                           │
    ┌──────────────────────▼──────────────────────┐
    │      Monitoring / Observability Stack       │
    │         Prometheus + Grafana + Loki         │
    └─────────────────────────────────────────────┘

Infra Base:
MAAS + 10.0.0.0/24 isolated network
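
As a quick sanity check on the addressing plan, the sketch below tests whether a service hostname resolves inside the isolated 10.0.0.0/24 network. It relies on the nip.io convention that a name such as chat.10.0.0.200.nip.io (the Open WebUI hostname listed under phase 19) resolves to the IPv4 address embedded in it. The helper name `nip_io_address` is hypothetical and exists only for this illustration; it is not part of the platform's tooling.

```python
import ipaddress
import re

# Isolated infra network from the "Infra Base" note.
INFRA_NET = ipaddress.ip_network("10.0.0.0/24")


def nip_io_address(hostname: str) -> ipaddress.IPv4Address:
    """Return the IPv4 address embedded in a nip.io hostname (hypothetical helper)."""
    match = re.search(r"(\d{1,3}(?:\.\d{1,3}){3})\.nip\.io$", hostname)
    if match is None:
        raise ValueError(f"no embedded IPv4 address in {hostname!r}")
    # IPv4Address raises ValueError if any octet is out of range.
    return ipaddress.IPv4Address(match.group(1))


addr = nip_io_address("chat.10.0.0.200.nip.io")
print(addr, addr in INFRA_NET)
```

The same membership test could be applied to any address handed out by the load balancer, assuming its pool is meant to stay inside this subnet.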

Final Stack

| Layer | Technology |
|---|---|
| Infra provisioning | MAAS |
| Cluster | k3s (Kubernetes) |
| Container runtime | containerd |
| GitOps | ArgoCD |
| CI/CD | GitLab / Gitea |
| Metrics | Prometheus |
| Dashboards | Grafana |
| Logs | Loki |
| Automation | Ansible |

Phase Status

| Phase | Description | Status |
|---|---|---|
| 0 | MAAS + 3-node provisioning | ✅ Complete |
| 1 | Kubernetes (k3s) | ✅ Complete |
| 2 | kubectl local access | ✅ Complete |
| 3 | Remote access (Tailscale + Homer) | ✅ Complete |
| 4 | MetalLB load balancer | ✅ Complete |
| 5 | Persistent storage (Longhorn + NFS) | ✅ Complete |
| 6 | Ingress controller (F5 NGINX) | ✅ Complete |
| 7 | Harbor + Trivy (private registry) | ✅ Complete |
| 8 | Monitoring stack (Prometheus + Grafana) | ✅ Complete |
| 9 | First workload (podinfo) | ✅ Complete |
| 10 | Ansible (post-MAAS bootstrap + Day-2 ops) | ✅ Complete |
| 11 | OpenTofu (IaC, MAAS provider); Crossplane deferred | ✅ Complete |
| 12 | ArgoCD (GitOps, App-of-Apps); Homer + whoami managed | ✅ Complete |
| 13 | CI/CD pipeline (GitHub Actions + ghcr.io + ArgoCD image promotion); GitLab/Gitea deferred | ✅ Complete |
| 14 | Backup/DR (Velero → MinIO on controller + hourly k3s SQLite snapshot); etcd-snapshot pivoted to SQLite | ✅ Complete |
| 15 | TLS / cert-manager (self-signed root CA + chained ClusterIssuer); Vault + RBAC deferred | ✅ Complete |
| 16 | Harbor as Sovereign Registry (4 proxy-cache projects + mirror config). Original n8n/Temporal/Airflow plan deferred. | ✅ Complete |
| 17 | KEDA + NATS event-driven autoscaling (3-replica HA + JetStream + scale-to-zero verified) | ✅ Complete |
| 18 | Backstage minimal IDP (catalog-only, 5 Components + 1 System); Vault/plugins/templates deferred | ✅ Complete |
| 19 | Self-hosted AI: Ollama (llama3.2:3b CPU inference, ~13 TPS) + Open WebUI chat at chat.10.0.0.200.nip.io. MLflow + Kubeflow deferred. | ✅ Complete |
| 20 | Chaos Mesh: 3 validation experiments on podinfo. PodChaos: 0 ms downtime / 100% availability over 5 kill events. NetworkChaos: 200 ms latency injection (~11× baseline) + clean cleanup. StressChaos: contained cgroup OOM, 0 node-mate restarts. NodeChaos + dashboard Ingress + automated GameDays deferred. | ✅ Complete |
| 21 | Loki single-binary + Promtail DaemonSet + Alertmanager 3-tier routing tree + in-cluster webhook receiver + custom PodinfoAvailabilityLost rule (with the user-mandated for: 2m anti-flap window). End-to-end alert validated via Chaos Mesh kill-both-replicas: chaos→webhook fire in 174 s. Day-2 housekeeping: open-webui PVC expanded 1→5 GiB online, node-exporter limit 100m→250m, explicit NTP servers on all 3 nodes. Jaeger deferred: no multi-service topology. | ✅ Complete |
| 22 | Cilium + Hubble eBPF networking: runbook authored, execution deferred to fresh-cluster rebuild. cilium CLI v0.19.2 on controller; complete migration procedure in docs/networking/02-cilium.md (pre-flight Velero snapshot + 7 steps + rollback). Senior scope-reduction call: replacing the CNI on a 22-phase live cluster = high blast radius for marginal benefit at our scale. | ✅ Complete |
| 12 | Container strategy | 🔜 |
| 13 | Ansible automation | 🔜 |
| 14 | Security hardening | 🔜 |
| 15 | Advanced observability | 🔜 |
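
The phase 21 alert timing can be sanity-checked with simple arithmetic. The sketch below subtracts the `for: 2m` anti-flap window from the measured 174 s chaos-to-webhook time; whatever remains is the time spent outside the pending window. Attributing that remainder to specific stages (scrape, rule evaluation, Alertmanager grouping, webhook delivery) is an assumption and is not broken down in the roadmap. The function name `pipeline_overhead_s` is hypothetical.

```python
FOR_WINDOW_S = 2 * 60        # "for: 2m" anti-flap window (phase 21)
CHAOS_TO_WEBHOOK_S = 174     # measured chaos -> webhook time (phase 21)


def pipeline_overhead_s(total_s: int, for_window_s: int) -> int:
    """Seconds of end-to-end alert latency not explained by the for: window."""
    if total_s < for_window_s:
        # An alert cannot fire before its pending window has elapsed.
        raise ValueError("total latency is shorter than the for: window")
    return total_s - for_window_s


overhead = pipeline_overhead_s(CHAOS_TO_WEBHOOK_S, FOR_WINDOW_S)
print(f"latency outside the for: window: {overhead} s")
```

Under these figures, 54 s of the 174 s falls outside the anti-flap window, so the window itself accounts for most of the end-to-end delay.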