
Phase 19 – Self-Hosted AI: Ollama + Open WebUI

The minicloud cluster gets its own LLM endpoint. Ollama runs the inference engine (Llama 3.2 3B on CPU); Open WebUI wraps it with a ChatGPT-style frontend. Together they deliver a tangible, useful AI service, accessible at https://chat.10.0.0.200.nip.io, that runs entirely on your bare-metal cluster with zero external API calls.

The original Phase 19 plan was "Ollama + MLflow + Kubeflow." We deliberately scoped down to Ollama only, the same pattern as Phase 11 (OpenTofu shipped, Crossplane deferred), Phase 13 (GitHub Actions shipped, GitLab deferred), Phase 16 (Harbor proxy cache shipped, n8n/Temporal/Airflow deferred), and Phase 18 (Backstage catalog shipped, plugins/templates/SSO deferred). MLflow and Kubeflow have no operational use case on a single-operator cluster with no active ML pipelines; deferring them keeps the cluster usable for actual workloads.


Architecture: why two components

                    ┌───────────────────────────────────┐
 Browser            │ ai namespace                      │
   │ HTTPS          │                                   │
   │ + first-       │  ┌──────────────────────────┐     │
   │   user-        │  │ Open WebUI (StatefulSet) │     │
   │   becomes-     │  │ Python web server        │     │
   │   admin        │  │ ~825 MiB steady-state    │     │
   ▼                │  │ 1 GiB Longhorn PVC       │     │
 cert-manager TLS   │  │ (SQLite + RAG            │     │
 + NGINX Ingress ──▶│  │  embeddings)             │     │
                    │  └────────────┬─────────────┘     │
                    │               │ HTTP              │
                    │               │ http://ollama:    │
                    │               │ 11434             │
                    │               ▼                   │
                    │  ┌──────────────────────────┐     │
                    │  │ Ollama (Deployment)      │     │
                    │  │ inference engine         │     │
                    │  │ ~2.6 GiB RAM (model      │     │
                    │  │  hot in memory)          │     │
                    │  │ 10 GiB Longhorn PVC      │     │
                    │  │ (model weights)          │     │
                    │  │ port 11434 (cluster-     │     │
                    │  │  internal only)          │     │
                    │  └──────────────────────────┘     │
                    └───────────────────────────────────┘

Ollama is the engine; Open WebUI is the steering wheel. Without Open WebUI, all you have is a curl-able API: the architecturally correct thing for backends to consume, but useless for human chat. Open WebUI provides chat history, markdown rendering, multi-model switching, and built-in RAG document upload.
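
For a backend consumer, the raw API is all you need. A minimal sketch of a chat call against the cluster-internal Service, run as a throwaway pod in the ai namespace (the pod name and prompt are illustrative):

kubectl run api-demo --rm -i --restart=Never -n ai --quiet \
  --image=curlimages/curl:latest -- \
  curl -s http://ollama:11434/api/chat \
    -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Say hello in five words."}],"stream":false}'
# Returns JSON with .message.content plus the eval_* timing fields used for
# the tokens/sec measurement later in this phase.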

Why Ollama isn't exposed via Ingress: unauthenticated LLM endpoints are a real abuse vector; anyone who can reach them can spam your CPU with prompts. Open WebUI sits in front with auth, and Ollama stays cluster-internal.
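
A quick way to confirm that boundary from kubectl once the phase is installed (output shapes are indicative):

kubectl get svc ollama -n ai
# TYPE ClusterIP, no EXTERNAL-IP: reachable only from inside the cluster
kubectl get ingress -n ai
# only the "chat" Ingress (Open WebUI) is listed; nothing routes to Ollama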


Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Ollama install | Helm chart otwld/ollama v1.56.0 (app 0.23.2) | Standard install, configurable via Helm values |
| GPU | None (CPU-only) | No NVIDIA GPUs on the ThinkPads. CPU works for 3B–7B models at usable speeds. |
| Model | llama3.2:3b (~2 GiB on disk + RAM) | Sweet spot for CPU on ThinkPads. Larger models (7B+) are 2–3x slower for marginal quality gain at this size. |
| Inference TPS | ~13 tokens/sec sustained on CPU | Measured in the cold-start test. Roughly 10 words/sec, about twice human reading speed. |
| Ollama API exposure | Cluster-internal only (http://ollama.ai.svc:11434) | Unauthenticated LLM = abuse vector if exposed. Open WebUI is the gatekeeper. |
| Ollama persistence | 10 GiB Longhorn PVC at /root/.ollama | Model survives pod restarts; room for a 7B follow-up if added later |
| Open WebUI install | Helm chart open-webui/open-webui v14.4.0 (app 0.9.4) | Mature chart with sane defaults |
| Open WebUI database | SQLite on Longhorn (1 GiB) | Postgres is overkill for a single-user demo. SQLite handles RAG embeddings, chat history, and user accounts fine. |
| Open WebUI auth | First-user-becomes-admin (Open WebUI default) | Acceptable: TLS Ingress is internal-only via a private nip.io hostname |
| Open WebUI memory | 1.5 GiB limit (initial 512 MiB OOMKilled the pod) | Open WebUI bundles sentence-transformers + embedding models for RAG; startup needs 700–900 MiB before any traffic |
| TLS | cert-manager chat-tls Certificate, Phase 15 root CA | Same pattern as every other Ingress |
| Image source | Both ollama/ollama:latest and ghcr.io/open-webui/open-webui:0.9.4 pulled through the Phase 16 Harbor proxy cache | Validates the Sovereign Registry pattern again |

What's deliberately deferred

Same scope-reduction pattern as every prior phase:

| Component | Why deferred | Future home |
| --- | --- | --- |
| MLflow | No active ML training workload to track | Future ML-pipeline phase when there's real model training |
| Kubeflow | 8–12 GiB RAM + Istio service-mesh dependency + days of setup; no pipelines to run | Future phase only if/when scale or use case justifies it |
| GPU support | Hardware constraint (no NVIDIA on ThinkPads) | If/when a GPU node joins the cluster |
| External API exposure for Ollama | Security: unauthenticated LLM = abuse vector | Future "API gateway + auth" phase if a legitimate external consumer arrives |
| Multiple models loaded simultaneously | RAM budget says one model at a time | Pull on demand |
| SSO for Open WebUI | First-user-admin works for a single operator | Future Keycloak phase |

Pre-flight

helm repo add otwld https://helm.otwld.com
helm repo add open-webui https://helm.openwebui.com
helm repo update
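
Optionally confirm both charts are visible before installing (the versions shown are the ones this phase pinned; newer ones may appear):

helm search repo otwld/ollama
# NAME           CHART VERSION   APP VERSION
# otwld/ollama   1.56.0          0.23.2
helm search repo open-webui/open-webui
# NAME                    CHART VERSION   APP VERSION
# open-webui/open-webui   14.4.0          0.9.4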

Install Ollama

ollama-values.yaml:

image:
  registry: docker.io
  repository: ollama/ollama
  tag: latest

ollama:
  port: 11434
  gpu:
    enabled: false   # CPU-only on ThinkPads

service:
  type: ClusterIP
  port: 11434

resources:
  requests: { cpu: "1", memory: 2Gi }
  limits: { cpu: "4", memory: 8Gi }

persistentVolume:
  enabled: true
  storageClass: longhorn
  size: 10Gi
  accessModes: [ReadWriteOnce]

kubectl create namespace ai
helm install ollama otwld/ollama -n ai -f ollama-values.yaml --wait --timeout 5m

# Pull the model (~2 GiB; 30-90 s, pulled direct from registry.ollama.ai; see the note below)
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2:3b

# Verify
kubectl exec -n ai deploy/ollama -- ollama list
# NAME           ID              SIZE      MODIFIED
# llama3.2:3b    a80c4f17acd5    2.0 GB    Less than a minute ago

Note on the Ollama model registry: Ollama pulls models from registry.ollama.ai, not a standard OCI registry. The Phase 16 Harbor proxy cache doesn't intercept this β€” model pulls go direct. This is acceptable: model pulls are infrequent (once per model) and the model weights are persisted to Longhorn after the first pull.
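
To see the persistence working, restart the pod and confirm the model is still there without a second download (the pod label below comes from the Helm chart's standard labels; verify with kubectl get pods --show-labels if it differs):

kubectl delete pod -n ai -l app.kubernetes.io/name=ollama
kubectl rollout status deploy/ollama -n ai --timeout=3m
kubectl exec -n ai deploy/ollama -- ollama list
# llama3.2:3b still listed: served from the Longhorn PVC, no re-pull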


Test Ollama API + measure TPS

kubectl run ollama-test --rm -i --restart=Never -n ai --quiet \
  --image=curlimages/curl:latest --image-pull-policy=IfNotPresent \
  -- curl -sf http://ollama:11434/api/generate \
       -d '{"model":"llama3.2:3b","prompt":"What is Kubernetes in one sentence?","stream":false}' \
  | jq '{
      eval_count,
      eval_duration_s: (.eval_duration / 1e9),
      tokens_per_second: (.eval_count / (.eval_duration / 1e9))
    }'

Expected on cold-start (first prompt after pod starts):

| Metric | Value |
| --- | --- |
| Wall-clock total | ~12 s |
| load_duration_s (model into RAM) | 1.9 s |
| prompt_eval_duration_s | 1.0 s |
| eval_duration_s | 1.7 s |
| Tokens/sec (sustained) | ~13 |

Subsequent prompts within the keep_alive window (5 min default) skip the load step. 13 TPS = roughly 10 words/sec (about 2x human reading speed). Genuinely usable.

After model load, Ollama pod memory stabilizes around 2.6 GiB.
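
To check whether the model is currently resident and when the keep_alive window will evict it, ollama ps lists loaded models (the values below are indicative; the in-memory size is larger than the on-disk size):

kubectl exec -n ai deploy/ollama -- ollama ps
# NAME           ID              SIZE      PROCESSOR   UNTIL
# llama3.2:3b    a80c4f17acd5    3.4 GB    100% CPU    4 minutes from now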


Install Open WebUI

open-webui-values.yaml:

# Disable the chart's bundled Ollama subchart; we have our own.
ollama:
  enabled: false

# Point at the existing Ollama Service via cluster DNS.
ollamaUrls:
  - http://ollama.ai.svc:11434

# Disable plugins/sidecars not needed for Phase 19.
pipelines: { enabled: false }
tika: { enabled: false }
websocket: { enabled: false }
redis-cluster: { enabled: false }

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 1000m
    # 512 MiB OOMKills the pod during init (Open WebUI loads
    # sentence-transformers + embedding models for RAG before serving
    # any traffic). 1.5 GiB is the right floor.
    memory: 1500Mi

persistence:
  enabled: true
  size: 1Gi
  storageClass: longhorn
  accessModes: [ReadWriteOnce]

ingress:
  enabled: false   # we add our own with cert-manager TLS

service:
  type: ClusterIP
  port: 80
  containerPort: 8080

extraEnvVars:
  - { name: WEBUI_NAME, value: "minicloud chat" }
  - { name: WEBUI_URL, value: "https://chat.10.0.0.200.nip.io" }

helm install open-webui open-webui/open-webui -n ai \
-f open-webui-values.yaml --wait --timeout 5m

kubectl get pods -n ai
# ollama-... 1/1 Running
# open-webui-0 1/1 Running

kubectl get pvc -n ai
# ollama 10Gi
# open-webui 1Gi

TLS Ingress

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: chat-tls
  namespace: ai
spec:
  secretName: chat-tls
  issuerRef:
    name: minicloud-ca
    kind: ClusterIssuer
  dnsNames: [chat.10.0.0.200.nip.io]
  duration: 2160h
  renewBefore: 720h
  privateKey: { algorithm: ECDSA, size: 256 }
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chat
  namespace: ai
  annotations:
    nginx.org/redirect-to-https: "true"
    # Generous body size for RAG document uploads
    nginx.org/client-max-body-size: "10m"
    # NB: do NOT add nginx.org/websocket-services here (see the gotchas below)
spec:
  ingressClassName: nginx
  tls:
    - hosts: [chat.10.0.0.200.nip.io]
      secretName: chat-tls
  rules:
    - host: chat.10.0.0.200.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui
                port: { number: 80 }
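
Apply both manifests and wait for cert-manager to issue the certificate before testing; the filename below is just wherever you saved the YAML above:

kubectl apply -f chat-ingress.yaml
kubectl wait certificate chat-tls -n ai --for=condition=Ready --timeout=2m
kubectl get ingress chat -n ai
# HOSTS chat.10.0.0.200.nip.io; ADDRESS populates once the controller syncs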

Two real install gotchas

1. Open WebUI OOMKilled at 512 MiB

Open WebUI bundles sentence-transformers plus embedding models for RAG. An initial memory budget of 512 MiB seemed generous for a "small Python web server," but the bundled ML runtime needs 700–900 MiB before it serves any traffic. The pod silently went into CrashLoopBackOff with exitCode: 137, reason: OOMKilled on its very first init. Bumping the limit to 1.5 GiB fixes it; observed steady-state is ~825 MiB.
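
What the failure looks like and how the fix was rolled out (a sketch; exact status text varies by kubelet version):

kubectl get pods -n ai
# open-webui-0   0/1   CrashLoopBackOff   3 (25s ago)   2m
kubectl describe pod open-webui-0 -n ai | grep -A4 "Last State"
#   Last State:   Terminated
#     Reason:     OOMKilled
#     Exit Code:  137
# After raising resources.limits.memory to 1500Mi in open-webui-values.yaml:
helm upgrade open-webui open-webui/open-webui -n ai -f open-webui-values.yaml --wait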

2. F5 NGINX nginx.org/websocket-services annotation breaks routing

I initially added nginx.org/websocket-services: "open-webui" thinking WebSocket-aware routing was needed for the streaming chat. The F5 NGINX controller misinterprets this and forwards traffic to http://127.0.0.1:8181/ (a phantom internal target). Result: 502 Bad Gateway on every request.

Fix: remove the annotation. WebSocket upgrade is handled automatically by NGINX without the websocket-services directive.

kubectl patch ingress chat -n ai --type=json \
-p='[{"op":"remove","path":"/metadata/annotations/nginx.org~1websocket-services"}]'
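
The misroute is visible in the ingress controller logs. The namespace and workload name below are assumptions; point them at wherever your F5 NGINX Ingress controller runs:

kubectl logs -n nginx-ingress deploy/nginx-ingress --tail=200 | grep "127.0.0.1:8181"
# upstream: "http://127.0.0.1:8181/" on requests that should hit the open-webui Service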

End-to-end verification

# Open WebUI HTTPS reachable
curl -sI --cacert ~/minicloud-ca.crt -m 5 https://chat.10.0.0.200.nip.io/
# HTTP/1.1 200 OK
# Server: nginx/1.29.7

# HTTP redirects to HTTPS
curl -sI -m 5 http://chat.10.0.0.200.nip.io/
# HTTP/1.1 301 Moved Permanently

# HTML title
curl -s --cacert ~/minicloud-ca.crt https://chat.10.0.0.200.nip.io/ \
| grep -oE "<title>[^<]+</title>"
# <title>Open WebUI</title>

In the browser:

  1. Open https://chat.10.0.0.200.nip.io
  2. Click "Get started" → first user becomes admin (signup)
  3. Send a prompt: "Hello, who are you?"
  4. Response streams in within ~5-10 seconds at ~13 tokens/sec
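
For a scriptable version of the same smoke test, Open WebUI serves a small health endpoint (assumed present in this release; it backs the chart's readiness probe):

curl -s --cacert ~/minicloud-ca.crt https://chat.10.0.0.200.nip.io/health
# {"status":true}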

Memory pressure monitoring

Phase 17's reviewer flagged this:

When the model is "hot" in RAM, the Ollama pod will sit close to its 4GB request. Keep an eye on kubectl top pods -n ai.

Real numbers:

$ kubectl top pod -n ai
NAME           CPU(cores)   MEMORY(bytes)
ollama-...     814m         2630Mi          # model loaded + actively answering
open-webui-0   11m          825Mi           # idle

Total ai namespace footprint at idle: ~3.5 GiB out of the cluster's ~48 GiB. Comfortable.

If you load a 7B model alongside the 3B, expect ~6.5 GiB additional usage. Pull on demand:

kubectl exec -n ai deploy/ollama -- ollama pull mistral:7b

Done When

✔ 2 pods Running in ai namespace (ollama + open-webui-0)
✔ 2 PVCs Bound on Longhorn (10 GiB ollama, 1 GiB open-webui)
✔ ollama list shows llama3.2:3b loaded
✔ Cert + Ingress for chat.10.0.0.200.nip.io serves 200 over HTTPS
✔ Browser signup works; first prompt returns a coherent response
✔ Ollama-reported eval rate ≥ 10 tokens/sec
✔ Homer has a "Chat" tile under Apps
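
A minimal scripted pass over the checklist (a sketch; it reuses the CA file path from the verification section):

set -euo pipefail
kubectl get pods -n ai                                    # expect ollama + open-webui-0 Running
kubectl get pvc -n ai                                     # expect both PVCs Bound on longhorn
kubectl exec -n ai deploy/ollama -- ollama list | grep llama3.2:3b
curl -sfI --cacert ~/minicloud-ca.crt https://chat.10.0.0.200.nip.io/ | head -n1   # HTTP/1.1 200 OK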

Real-world skills demonstrated

| Skill | Industry context |
| --- | --- |
| Self-hosted LLM inference on bare-metal CPU | The fastest-growing portfolio piece in 2026. "I run my own AI" without API costs is a recruiter-recognizable headline. |
| Frontend/backend separation: Ollama + Open WebUI | Same pattern as every production AI deployment: inference engine separated from the user-facing UI by an authenticated boundary |
| Cluster-internal API exposure pattern | Unauthenticated LLM endpoints exposed publicly are a real abuse vector. Keeping Ollama internal and routing through an authenticated Open WebUI is the production-correct shape. |
| Memory budget tuning | The OOMKilled-then-debug-then-bump cycle is real production work. The 512 → 1500 MiB jump for Open WebUI is the kind of empirical sizing every shop does on first install. |
| F5 NGINX annotation gotcha discovery | Incorrect annotations silently misroute traffic. Reading ingress controller logs to find the 127.0.0.1:8181 upstream is real Day-2 debugging. |
| Tokens/sec measurement | Reading eval_count / eval_duration from Ollama's API response is the canonical way to measure inference performance. Same shape as tokens_per_second in vLLM, llama.cpp, etc. |
| Senior scope reduction | Choosing Ollama only over Ollama + MLflow + Kubeflow is the same skill as every prior deferral (Crossplane, GitLab, Vault, Backstage plugins). |
| Honest "we don't have GPUs" framing | Naming the constraint (CPU-only ThinkPads) and choosing a model size that works there (3B) is more credible than pretending CPU inference is "the same as GPU." |