cai-exos-systems/daveadmin-exos-demo:exosneeds/infrastructure.md
VPS
Use one Linux VPS for the always-on demo control plane.
Recommended minimum:
- Ubuntu 24.04 LTS
- 4 vCPU
- 8 GB RAM
- 160 GB NVMe/SSD
- 20 TB of included transfer, or higher
- public IPv4
- daily snapshots
- firewall restricted to HTTP/HTTPS/SSH and private service ports
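A minimal UFW sketch of that firewall posture, assuming SSH on the default port and WireGuard on 51820 as the only private service port; adjust to your actual services:

```bash
# Default-deny inbound, allow outbound
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Public web + SSH (optionally restrict SSH to an admin IP:
#   ufw allow from <admin-ip> to any port 22 proto tcp)
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# Example private service port: WireGuard (assumed 51820/udp)
sudo ufw allow 51820/udp

sudo ufw enable
sudo ufw status verbose
```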
Preferred starting provider:
- Hetzner CPX31 or current equivalent: enough RAM/disk for PHP, Nginx/Caddy, database, Qdrant, n8n, Directus, FOSSBilling, OPA, and traces.
If the full stack becomes cramped, move to:
- 8 vCPU
- 16 GB RAM
- 240-320 GB NVMe
GPU
Use hourly GPU rental rather than buying hardware. Start with:
- SimplePod-style RTX 4090, 24 GB VRAM
- Ubuntu image with NVIDIA drivers/CUDA
- Ollama or vLLM serving Qwen 7B
- SSH or WireGuard between VPS and GPU pod
- stop/destroy the GPU pod when not developing
Why RTX 4090:
- enough VRAM for Qwen 7B and many quantized 14B tests
- cheap hourly development
- ideal for short demo/testing windows
Use A100/H100 only when:
- context windows grow large
- multiple users need concurrent answers
- batch evaluation speed matters
- Azure/enterprise production sizing needs performance proof
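One way to wire the VPS-to-pod link, sketched here with an SSH tunnel (WireGuard works equally well); the pod host and user are placeholders for the rental provider's SSH details:

```bash
# On the GPU pod: pull and serve Qwen 7B with Ollama
# (ollama serve binds to 127.0.0.1:11434 by default, so nothing is public)
ollama pull qwen2.5:7b
ollama serve &

# On the VPS: forward the pod's Ollama port over SSH
ssh -f -N -L 11434:127.0.0.1:11434 ubuntu@gpu-pod.example

# Smoke test from the VPS through the tunnel
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Say hello.", "stream": false}'
```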
Phase 1: Install List
Install on the VPS:
- Nginx or Caddy
- PHP 8.3/8.4 with FPM
- MariaDB or PostgreSQL
- Redis
- Docker Compose
- FOSSBilling demo BSS
- Directus catalog/data admin
- OPA policy engine
- n8n or equivalent workflow runner
- Langfuse or OpenTelemetry-based trace store
- Qdrant vector database
- backup job to object storage
- UFW firewall and fail2ban
- Certbot or Caddy automatic TLS
Install on the GPU pod:
- NVIDIA driver/CUDA runtime, if not preinstalled
- Docker + NVIDIA Container Toolkit
- Ollama or vLLM
- Qwen 7B model, for example `qwen2.5:7b`
- optional Qwen 14B or Qwen3 8B for comparison
- a single HTTP inference endpoint exposed only over VPN or IP allowlist
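A hedged sketch of the VPS base layer on Ubuntu 24.04; package names are the stock Ubuntu ones (24.04 ships PHP 8.3), and the containerized pieces (Qdrant shown here; n8n, Directus, FOSSBilling, OPA, and Langfuse follow the same pattern) would normally live in a compose file:

```bash
# Base packages from the Ubuntu repos
sudo apt update
sudo apt install -y nginx php8.3-fpm mariadb-server redis-server \
  ufw fail2ban certbot

# Docker Engine + Compose plugin via Docker's convenience script
curl -fsSL https://get.docker.com | sh

# Example container: Qdrant, bound to localhost only
docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  -v /srv/qdrant:/qdrant/storage \
  qdrant/qdrant
```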
Phase 2: EXOS Controlled Evidence Layer
Add retrieval after the Qwen-only baseline works.
Components:
- document ingestion for TM Forum docs, EXOS architecture docs, BSS runbooks, catalog notes, API schemas, and demo scenarios
- chunking pipeline with stable document IDs and section titles
- embedding model such as `nomic-embed-text` or BGE-M3
- Qdrant collections such as:
  - `exos_tmforum`
  - `exos_bss`
  - `exos_runbooks`
  - `exos_customer_scenarios`
- retrieval endpoint
- pinned evidence packet passed to the LLM
- validation runner that records question, expected source, retrieved chunks, model output, tool calls, policy decision, and score
For EXOS, present this as the EXOS controlled evidence layer, not CaveauAI.
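A minimal sketch of the collection setup and one embedding round-trip, assuming `nomic-embed-text` served by Ollama (768-dimensional vectors), Qdrant on localhost, and illustrative document IDs:

```bash
# Create one evidence collection (768 dims matches nomic-embed-text)
curl -X PUT http://127.0.0.1:6333/collections/exos_tmforum \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

# Embed a chunk via Ollama's embeddings endpoint
curl http://127.0.0.1:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "TMF622 order state machine overview"}'

# Upsert the chunk; replace VECTOR with the 768 floats returned above,
# and keep stable doc IDs and section titles in the payload
curl -X PUT http://127.0.0.1:6333/collections/exos_tmforum/points \
  -H 'Content-Type: application/json' \
  -d '{"points": [{"id": 1, "vector": VECTOR,
        "payload": {"doc_id": "tmf622", "section": "order state machine"}}]}'
```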
Phase 3: Azure Target
Map the demo stack to Azure as follows:

| Demo Component | Azure Target |
| --- | --- |
| PHP demo app | Azure App Service, Container Apps, or AKS |
| Nginx/Caddy | Azure Front Door + App Gateway where needed |
| MariaDB/PostgreSQL | Azure Database for MySQL/PostgreSQL |
| Qdrant | Qdrant on AKS/Container Apps, or Azure AI Search if acceptable |
| Object/files | Azure Blob Storage |
| Secrets | Azure Key Vault |
| Logs/traces | Azure Monitor + Application Insights |
| n8n workflows | Azure Logic Apps or Power Automate |
| Dify-style agents | Microsoft Copilot Studio / Azure AI Foundry |
| OPA | Containerized OPA or Azure Policy-adjacent governance |
| GPU inference | Azure NCasT4_v3 for light inference, NC A100/H100 families for heavier workloads, or Azure ML managed online endpoints |
| TM Forum API control plane | Exosphere / API Management / Container Apps / AKS |

Development Operating Model
1. Keep the VPS always on.
2. Start the GPU pod only when developing, testing, or demoing LLM behavior.
3. Run one model request at a time on small GPUs.
4. Keep the Qwen 7B baseline separate from evidence-layer runs.
5. Log every agent action and answer.
6. Promote only validated flows into Azure design.
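Point 3 can be enforced mechanically rather than by discipline; a small sketch that serializes model calls with `flock`, where `prompt.txt` is a hypothetical prompt file:

```bash
# flock blocks until the previous request releases the lock,
# so only one request hits the small GPU at a time
flock /tmp/exos-llm.lock \
  curl http://127.0.0.1:11434/api/generate \
    -d "{\"model\": \"qwen2.5:7b\", \"prompt\": $(jq -Rs . < prompt.txt), \"stream\": false}"
```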
Security Basics
- No public GPU inference endpoint.
- Use WireGuard or an SSH tunnel from VPS to GPU.
- Put secrets in `.env` or a server-side secret store, never in repo files.
- Use per-demo accounts and scoped API keys.
- Log tool calls, model ID, prompt mode, and policy decisions.
- Mask customer/personally identifiable data in traces when moving to shared demos.
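A hedged WireGuard sketch for the VPS-to-GPU tunnel; the keys, addresses, and the 10.77.0.0/24 subnet are placeholders:

```bash
# /etc/wireguard/wg0.conf on the VPS
cat <<'EOF' | sudo tee /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <vps-private-key>
Address = 10.77.0.1/24
ListenPort = 51820

[Peer]
# GPU pod
PublicKey = <pod-public-key>
AllowedIPs = 10.77.0.2/32
EOF

sudo wg-quick up wg0

# On the pod, set OLLAMA_HOST=10.77.0.2 so Ollama binds to the tunnel
# address; the endpoint is then reachable only as http://10.77.0.2:11434
```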