Mini Cloud Platform: Bare-Metal Infrastructure
Private datacenter-equivalent infrastructure, built from scratch with MAAS, k3s, ArgoCD, and GitOps.
What This Project Is
This documentation covers a complete bare-metal infrastructure built locally using MAAS (Metal as a Service), provisioning a 3-node cluster ready for Kubernetes and production workloads.
Equivalent to: AWS EC2 + VPC + Auto Provisioning, but local.
Infrastructure at a Glance
| Node | IP | Role | Hardware |
|---|---|---|---|
| set-hog | 10.0.0.2 | Control Plane | ThinkPad T15 Gen 1 |
| fast-skunk | 10.0.0.4 | Worker | ThinkPad T490 |
| fast-heron | 10.0.0.7 | Worker | ThinkPad T490 |
MAAS Controller: Ubuntu + dual NIC (WiFi → internet, Ethernet → 10.0.0.1)
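The addressing above can be sanity-checked in a few lines. The Python sketch below assumes the cluster network is 10.0.0.0/24 (the subnet named in the CV summary) and that the controller holds 10.0.0.1 on its Ethernet NIC. The `NODES` dict and the `validate_inventory` helper are illustrative names, not part of the project.

```python
import ipaddress

# Assumption: the cluster subnet is 10.0.0.0/24 and the MAAS controller
# owns 10.0.0.1 on its Ethernet NIC, as described in this document.
CLUSTER_NET = ipaddress.ip_network("10.0.0.0/24")
CONTROLLER_IP = ipaddress.ip_address("10.0.0.1")

# Hostnames, IPs, and roles copied from the node table.
NODES = {
    "set-hog": ("10.0.0.2", "control-plane"),
    "fast-skunk": ("10.0.0.4", "worker"),
    "fast-heron": ("10.0.0.7", "worker"),
}


def validate_inventory(nodes, network, controller_ip):
    """Return a list of problems found in the inventory (empty if none)."""
    problems = []
    seen = {}  # maps each IP to the first node that claimed it
    for name, (ip_text, _role) in nodes.items():
        ip = ipaddress.ip_address(ip_text)
        if ip not in network:
            problems.append(f"{name}: {ip} is outside {network}")
        if ip == controller_ip:
            problems.append(f"{name}: {ip} collides with the controller")
        if ip in seen:
            problems.append(f"{name}: {ip} already used by {seen[ip]}")
        seen.setdefault(ip, name)
    return problems


if __name__ == "__main__":
    issues = validate_inventory(NODES, CLUSTER_NET, CONTROLLER_IP)
    print("inventory OK" if not issues else "\n".join(issues))
```

A check like this is cheap to run before handing the same addresses to DHCP reservations or an automation inventory, where an overlap is harder to spot.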
Complete Roadmap
Each phase builds directly on the previous one; nothing requires something that hasn't been set up yet.
| Phase | Topic | Key Technology | Status |
|---|---|---|---|
| 0 | MAAS + 3-node provisioning | MAAS, PXE, cloud-init | Done |
| 1 | Kubernetes cluster | k3s | Done |
| 2 | kubectl local access | kubeconfig | Done |
| 3 | Remote access from anywhere | Tailscale, Cloudflare Tunnel, Homer | Done |
| 4 | Load balancer IPs on bare-metal | MetalLB | Done |
| 5 | Persistent storage | Longhorn, NFS | Done |
| 6 | Expose apps to the network | F5 NGINX Ingress | Done |
| 7 | Private container registry | Harbor + Trivy | Done |
| 8 | Cluster monitoring | Prometheus, Grafana | Done |
| 9 | First real workload | podinfo, HPA, ServiceMonitor | Done |
| 10 | Infrastructure automation | Ansible | Done |
| 11 | Infrastructure as Code | OpenTofu (MAAS); Crossplane deferred | Done |
| 12 | GitOps deployment | ArgoCD (App-of-Apps) | Done |
| 13 | CI/CD pipelines | GitHub Actions + ghcr.io + ArgoCD image promotion | Done |
| 14 | Backup & disaster recovery | Velero + MinIO on controller + hourly k3s SQLite pull | Done |
| 15 | TLS / cert-manager (internal PKI); Vault + RBAC deferred | cert-manager, self-signed root CA | Done |
| 16 | Harbor as Sovereign Registry: 4 proxy-cache projects, mirror+fallback, supply-chain control point. Original n8n/Temporal/Airflow plan deferred. | Harbor proxy cache | Done |
| 17 | Event-driven autoscaling: KEDA + NATS JetStream HA, scale-to-zero verified end-to-end | KEDA, NATS | Done |
| 18 | Backstage minimal IDP: catalog-only, off-the-shelf image; Vault/plugins/templates deferred | Backstage | Done |
| 19 | Self-hosted AI: Ollama (CPU, llama3.2:3b, ~13 TPS) + Open WebUI chat. MLflow + Kubeflow deferred. | Ollama, Open WebUI | Done |
| 20 | Reliability & chaos engineering, 3 validation experiments on podinfo: PodChaos (0 ms downtime under 5 pod kills), NetworkChaos (200 ms latency injection + clean recovery), StressChaos (contained cgroup OOM, 0 node-mate restarts). NodeChaos / dashboard Ingress / automated GameDays deferred. | Chaos Mesh | Done |
| 21 | Logs (Loki single-binary, Promtail DaemonSet, Grafana datasource) + Alertmanager 3-tier routing tree + in-cluster webhook receiver + custom PodinfoAvailabilityLost rule. End-to-end alert validated via Chaos Mesh kill-both-replicas → webhook receives FIRING JSON. Jaeger / distributed tracing deferred: no multi-service topology to trace. | Loki, Promtail, Alertmanager | Done |
| 22 | eBPF networking: migration runbook authored, execution deferred to fresh-cluster rebuild. cilium CLI installed on controller; dry-run helm values captured. Senior scope-reduction call: with 111 live pods and 22 phases of validated infrastructure running on Flannel, a hot CNI swap is not worth the risk at this cluster scale. | Cilium, Hubble | Done |
| - | Data Layer | Kafka/Redpanda, ClickHouse, dbt, Superset, OpenMetadata | Planned |
| - | Security Layer | Keycloak, OPA/Gatekeeper, Falco, Cosign+SBOM, kube-bench | Planned |
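Phase 21 validates alerting by having Alertmanager deliver a firing alert to an in-cluster webhook receiver. As a rough illustration of the core step such a receiver performs, the Python sketch below pulls the names of firing alerts out of a body shaped like Alertmanager's webhook JSON (a top-level `alerts` list whose entries carry `status` and `labels`). The `firing_alert_names` helper and the sample payload are hypothetical; the project's actual receiver is not shown in this document.

```python
import json


def firing_alert_names(body: bytes) -> list[str]:
    """Return the alertname label of every alert whose status is 'firing'."""
    payload = json.loads(body)
    names = []
    for alert in payload.get("alerts", []):
        if alert.get("status") == "firing":
            # Fall back to a placeholder if the alert carries no alertname.
            names.append(alert.get("labels", {}).get("alertname", "<unnamed>"))
    return names


# Hypothetical sample body, trimmed to the fields this sketch reads.
# "SomeOtherAlert" is a made-up name used only to show filtering.
SAMPLE = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "PodinfoAvailabilityLost"}},
        {"status": "resolved", "labels": {"alertname": "SomeOtherAlert"}},
    ],
}).encode()

if __name__ == "__main__":
    print(firing_alert_names(SAMPLE))
```

Keeping the parsing in a pure function like this makes it easy to test without a running cluster; an HTTP handler would only need to read the request body and pass it in.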
Final Stack (When Complete)
── INFRASTRUCTURE ──
MAAS → bare-metal provisioning
k3s → Kubernetes cluster
MetalLB → load balancer IPs
Longhorn → distributed storage
Harbor → private container registry
── AUTOMATION & DELIVERY ──
Ansible → infrastructure automation
Terraform → infrastructure as code
Crossplane → Kubernetes-native IaC
ArgoCD → GitOps
GitLab → CI/CD
── PLATFORM SERVICES ──
Velero → backup & disaster recovery
Vault → secrets management
n8n → visual workflow automation
Temporal → code-based workflow orchestration
Airflow → data pipeline scheduling
KEDA → event-driven autoscaling
NATS → message broker
Backstage → developer portal
── OBSERVABILITY ──
Prometheus → metrics
Grafana → dashboards
Loki → logs
Jaeger → traces
Chaos Mesh → reliability testing
Cilium → eBPF networking + Hubble
Tailscale → remote access VPN
── DATA LAYER ──
Redpanda → event streaming (Kafka-compatible)
Debezium → change data capture (CDC)
ClickHouse → columnar analytics warehouse
dbt → SQL transformation layer
Superset → self-hosted BI dashboards
OpenMetadata → data catalog, lineage, governance
── SECURITY LAYER ──
Keycloak → SSO / OIDC identity provider
OPA/Gatekeeper → admission control (policy as code)
Falco → runtime threat detection (eBPF)
Cosign → image signing + SBOM supply chain
kube-bench → CIS compliance scoring
── AI / ML ──
Ollama → local LLMs (Mistral, LLaMA 3)
MLflow → ML experiment tracking
Kubeflow → ML pipelines + distributed training
CV / LinkedIn Summary
- Designed and deployed a 3-node bare-metal infrastructure using MAAS
- Implemented automated OS provisioning via PXE network boot
- Built isolated cluster network (10.0.0.0/24) with DHCP/DNS management
- Resolved complex networking issues (IPv6 conflicts, DHCP overlap, alias interfaces)
- Deployed full Kubernetes platform: k3s, ArgoCD, Prometheus, Harbor, Vault
- Built private AI platform with local LLM serving (Ollama) and ML pipelines (Kubeflow)
- Implemented remote access via Tailscale VPN and Cloudflare Tunnel
- Applied chaos engineering with Chaos Mesh to validate cluster resilience
- Built end-to-end data platform: Kafka/Redpanda → ClickHouse → dbt → Superset with OpenMetadata governance
- Implemented enterprise security: Keycloak SSO, OPA/Gatekeeper admission control, Falco runtime detection, Cosign supply chain signing