cai-exos-systems/daveadmin-exos-demo:exosneeds/infrastructure.md
VPS
Use one Linux VPS for the always-on demo control plane.
Recommended minimum:
- Ubuntu 24.04 LTS
- 4 vCPU
- 8 GB RAM
- 160 GB NVMe/SSD
- 20 TB of included transfer, or higher
- public IPv4
- daily snapshots
- firewall restricted to HTTP/HTTPS/SSH and private service ports
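A minimal UFW sketch of that firewall posture, assuming SSH on the default port and WireGuard on 51820 as the only private service port; adjust to your actual services:

```bash
# Default-deny inbound, allow outbound
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Public web + SSH (optionally restrict SSH to an admin IP:
#   ufw allow from <admin-ip> to any port 22 proto tcp)
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# Example private service port: WireGuard (assumed 51820/udp)
sudo ufw allow 51820/udp

sudo ufw enable
sudo ufw status verbose
```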
Preferred starting provider:
- Hetzner CPX31 or current equivalent: enough RAM/disk for PHP, Nginx/Caddy, database, Qdrant, n8n, Directus, FOSSBilling, OPA, and traces.
If the full stack becomes cramped, move to:
- 8 vCPU
- 16 GB RAM
- 240-320 GB NVMe
GPU
Use hourly GPU rental rather than buying hardware. Start with:
- SimplePod-style RTX 4090, 24 GB VRAM
- Ubuntu image with NVIDIA drivers/CUDA
- Ollama or vLLM serving Qwen 7B
- SSH or WireGuard between VPS and GPU pod
- stop/destroy the GPU pod when not developing
Why RTX 4090:
- enough VRAM for Qwen 7B and many quantized 14B tests
- cheap hourly development
- ideal for short demo/testing windows
Use A100/H100 only when:
- context windows grow large
- multiple users need concurrent answers
- batch evaluation speed matters
- Azure/enterprise production sizing needs performance proof
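One way to wire the VPS-to-pod link, sketched here with an SSH tunnel (WireGuard works equally well); the pod host and user are placeholders for the rental provider's SSH details:

```bash
# On the GPU pod: pull and serve Qwen 7B with Ollama
# (ollama serve binds to 127.0.0.1:11434 by default, so nothing is public)
ollama pull qwen2.5:7b
ollama serve &

# On the VPS: forward the pod's Ollama port over SSH
ssh -f -N -L 11434:127.0.0.1:11434 ubuntu@gpu-pod.example

# Smoke test from the VPS through the tunnel
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Say hello.", "stream": false}'
```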
Phase 1: Install List
Install on the VPS:
- Nginx or Caddy
- PHP 8.3/8.4 with FPM
- MariaDB or PostgreSQL
- Redis
- Docker Compose
- FOSSBilling demo BSS
- Directus catalog/data admin
- OPA policy engine
- n8n or equivalent workflow runner
- Langfuse or OpenTelemetry-based trace store
- Qdrant vector database
- backup job to object storage
- UFW firewall and fail2ban
- Certbot or Caddy automatic TLS
Install on the GPU pod:
- NVIDIA driver/CUDA runtime, if not preinstalled
- Docker + NVIDIA Container Toolkit
- Ollama or vLLM
- Qwen 7B model, for example `qwen2.5:7b`
- optional Qwen 14B or Qwen3 8B for comparison
- a single HTTP inference endpoint exposed only over VPN or IP allowlist
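A hedged sketch of the VPS base layer on Ubuntu 24.04; package names are the stock Ubuntu ones (24.04 ships PHP 8.3), and the containerized pieces (Qdrant shown here; n8n, Directus, FOSSBilling, OPA, and Langfuse follow the same pattern) would normally live in a compose file:

```bash
# Base packages from the Ubuntu repos
sudo apt update
sudo apt install -y nginx php8.3-fpm mariadb-server redis-server \
  ufw fail2ban certbot

# Docker Engine + Compose plugin via Docker's convenience script
curl -fsSL https://get.docker.com | sh

# Example container: Qdrant, bound to localhost only
docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  -v /srv/qdrant:/qdrant/storage \
  qdrant/qdrant
```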
Phase 2: EXOS Controlled Evidence Layer
Add retrieval after the Qwen-only baseline works.
Components:
- document ingestion for TM Forum docs, EXOS architecture docs, BSS runbooks, catalog notes, API schemas, and demo scenarios
- chunking pipeline with stable document IDs and section titles
- embedding model such as `nomic-embed-text` or BGE-M3
- Qdrant collections such as:
  - `exos_tmforum`
  - `exos_bss`
  - `exos_runbooks`
  - `exos_customer_scenarios`
- retrieval endpoint
- pinned evidence packet passed to the LLM
- validation runner that records question, expected source, retrieved chunks, model output, tool calls, policy decision, and score
For EXOS, present this as the EXOS controlled evidence layer, not CaveauAI.
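A minimal sketch of the collection setup and one embedding round-trip, assuming `nomic-embed-text` served by Ollama (768-dimensional vectors), Qdrant on localhost, and illustrative document IDs:

```bash
# Create one evidence collection (768 dims matches nomic-embed-text)
curl -X PUT http://127.0.0.1:6333/collections/exos_tmforum \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

# Embed a chunk via Ollama's embeddings endpoint
curl http://127.0.0.1:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "TMF622 order state machine overview"}'

# Upsert the chunk; replace VECTOR with the 768 floats returned above,
# and keep stable doc IDs and section titles in the payload
curl -X PUT http://127.0.0.1:6333/collections/exos_tmforum/points \
  -H 'Content-Type: application/json' \
  -d '{"points": [{"id": 1, "vector": VECTOR,
        "payload": {"doc_id": "tmf622", "section": "order state machine"}}]}'
```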
Phase 3: Azure Target
Map the demo stack to Azure as follows:

| Demo Component | Azure Target |
| --- | --- |
| PHP demo app | Azure App Service, Container Apps, or AKS |
| Nginx/Caddy | Azure Front Door + App Gateway where needed |
| MariaDB/PostgreSQL | Azure Database for MySQL/PostgreSQL |
| Qdrant | Qdrant on AKS/Container Apps, or Azure AI Search if acceptable |
| Object/files | Azure Blob Storage |
| Secrets | Azure Key Vault |
| Logs/traces | Azure Monitor + Application Insights |
| n8n workflows | Azure Logic Apps or Power Automate |
| Dify-style agents | Microsoft Copilot Studio / Azure AI Foundry |
| OPA | Containerized OPA or Azure Policy-adjacent governance |
| GPU inference | Azure NCasT4_v3 for light inference, NC A100/H100 families for heavier workloads, or Azure ML managed online endpoints |
| TM Forum API control plane | Exosphere / API Management / Container Apps / AKS |

Development Operating Model
1. Keep the VPS always on.
2. Start the GPU pod only when developing, testing, or demoing LLM behavior.
3. Run one model request at a time on small GPUs.
4. Keep the Qwen 7B baseline separate from evidence-layer runs.
5. Log every agent action and answer.
6. Promote only validated flows into Azure design.
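Point 3 can be enforced mechanically rather than by discipline; a small sketch that serializes model calls with `flock`, where `prompt.txt` is a hypothetical prompt file:

```bash
# flock blocks until the previous request releases the lock,
# so only one request hits the small GPU at a time
flock /tmp/exos-llm.lock \
  curl http://127.0.0.1:11434/api/generate \
    -d "{\"model\": \"qwen2.5:7b\", \"prompt\": $(jq -Rs . < prompt.txt), \"stream\": false}"
```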
Security Basics
- No public GPU inference endpoint.
- Use WireGuard or an SSH tunnel from VPS to GPU.
- Put secrets in `.env` or a server-side secret store, never in repo files.
- Use per-demo accounts and scoped API keys.
- Log tool calls, model ID, prompt mode, and policy decisions.
- Mask customer/personally identifiable data in traces when moving to shared demos.
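A hedged WireGuard sketch for the VPS-to-GPU tunnel; the keys, addresses, and the 10.77.0.0/24 subnet are placeholders:

```bash
# /etc/wireguard/wg0.conf on the VPS
cat <<'EOF' | sudo tee /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <vps-private-key>
Address = 10.77.0.1/24
ListenPort = 51820

[Peer]
# GPU pod
PublicKey = <pod-public-key>
AllowedIPs = 10.77.0.2/32
EOF

sudo wg-quick up wg0

# On the pod, set OLLAMA_HOST=10.77.0.2 so Ollama binds to the tunnel
# address; the endpoint is then reachable only as http://10.77.0.2:11434
```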