diff --git a/.gitignore b/.gitignore index 1f4a400..0bab75f 100644 --- a/.gitignore +++ b/.gitignore @@ -11,8 +11,10 @@ wheels/ .venv local_context_openadapt_ml_internal.md -# Environment variables +# Environment variables and secrets .env +config.json +vendor/WindowsAgentArena/config.json # Ephemeral synthetic assets (frames, debug sessions, etc.) synthetic/ diff --git a/CLAUDE.md b/CLAUDE.md index c8c22fe..481d264 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,343 +4,124 @@ **Philosophy**: "Less is more. 80/20 impact/complexity. Working code beats elegant design." -**Before writing code, ask**: -1. Can this be <100 lines? (ideally <50) -2. Does this provide 80% of value? -3. Is this the simplest approach? +**Before writing code**: Can this be <100 lines? Does this provide 80% of value? Is this the simplest approach? -**Red flags to avoid**: -- Classes when functions work -- Abstractions before 3rd use -- Design docs for non-existent code -- Multiple implementations of same thing +**Avoid**: Classes when functions work, abstractions before 3rd use, design docs for non-existent code. -**See**: `/Users/abrichr/oa/src/openadapt-evals/SIMPLICITY_PRINCIPLES.md` for full guidelines. +See: `/Users/abrichr/oa/src/openadapt-evals/SIMPLICITY_PRINCIPLES.md` for full guidelines. --- -## ๐Ÿšจ๐Ÿšจ๐Ÿšจ CRITICAL: CLI-FIRST, NEVER RAW COMMANDS ๐Ÿšจ๐Ÿšจ๐Ÿšจ +## CRITICAL RULES -### THIS IS THE #1 RULE. VIOLATIONS FRUSTRATE THE USER. +### 0. CHECK RESOURCES ON SESSION START -**NEVER run commands that require user permission. ALWAYS use or extend the CLI.** +**After context compaction or session start, check for running Azure resources:** -โŒ **BANNED** (these require permission, waste user's time): ```bash -# Raw Azure CLI -az vm start --name ... -az vm run-command invoke ... +uv run python -m openadapt_ml.benchmarks.cli resources +``` -# Raw SSH -ssh azureuser@IP "command" +This prevents: +- Forgetting about running VMs (costs ~$0.19-0.38/hr) +- Creating duplicate resources +- Losing track of what's deployed -# Raw Python one-liners -uv run python -c "import subprocess; ..." +See `RESOURCES.md` for current status (auto-updated by the command). -# Any command not in the pre-approved CLI -``` +### 1. CLI-FIRST, NEVER RAW COMMANDS + +**NEVER run raw commands. ALWAYS use or extend the CLI.** -โœ… **REQUIRED** (these are pre-approved, don't ask permission): ```bash -# ALL VM operations go through the CLI +# BANNED (require user permission, waste time) +ssh azureuser@IP "anything" +az vm start --name ... +az vm run-command invoke ... +uv run python -c "import subprocess; ..." + +# REQUIRED (pre-approved, don't ask permission) uv run python -m openadapt_ml.benchmarks.cli vm start uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd "command" uv run python -m openadapt_ml.benchmarks.cli vm diag uv run python -m openadapt_ml.benchmarks.cli vm logs ``` -### When Functionality Is Missing - -**If a CLI command doesn't exist for what you need:** -1. **EDIT the CLI** to add the new command/action -2. **THEN call the CLI** command you just added -3. **NEVER use raw commands** as a workaround - -**Example**: Need to restart Docker services? -```python -# 1. Add to cli.py under cmd_vm(): -elif action == "fix-docker": - # Restart containerd and docker - commands = [ - "sudo systemctl restart containerd", - "sudo systemctl restart docker", - "docker ps" - ] - for cmd in commands: - run_on_vm(cmd) - -# 2. 
Then call it: -uv run python -m openadapt_ml.benchmarks.cli vm fix-docker -``` - -**This rule exists because:** -- Raw commands require user approval every time -- CLI commands are pre-approved and don't interrupt workflow -- CLI commands are documented and reusable -- The user has told you this MANY times - LISTEN - ---- - -## ๐Ÿ”„ STANDARD WORKFLOW: VM Configuration Changes - -**When VM config needs to change (disk size, VM size, etc.):** - -1. **Delete the current VM** (if running): - ```bash - uv run python -m openadapt_ml.benchmarks.cli vm delete -y - ``` - -2. **Update the code** that launches the VM (e.g., `cli.py` defaults) - -3. **Launch new VM** with the updated code: - ```bash - uv run python -m openadapt_ml.benchmarks.cli vm setup-waa # API key loaded from .env - ``` - -**DO NOT** try to resize/modify running VMs. It's simpler and faster to delete + recreate. - -**Current VM defaults** (in `cli.py`): -- Size: `Standard_D8ds_v5` (300GB temp storage on /mnt) -- Location: `eastus` -- OS: Ubuntu 22.04 LTS - ---- - -## Project Status & Priorities - -**IMPORTANT**: Before starting work, always check the project-wide status document: -- **Location**: `/Users/abrichr/oa/src/STATUS.md` -- **Purpose**: Tracks P0 priorities, active background tasks, blockers, and strategic decisions -- **Action**: Read this file at the start of every session to understand current priorities - -This ensures continuity between Claude Code sessions and context compactions. - ---- - -This file helps maintain context across sessions. - ---- -## โš ๏ธโš ๏ธโš ๏ธ MANDATORY: START DASHBOARD FIRST โš ๏ธโš ๏ธโš ๏ธ - -### STOP. READ THIS BEFORE DOING ANYTHING. - -**If ANY of these are true, you MUST run the dashboard command IMMEDIATELY:** -- Session just started or was compacted -- User mentions VMs, Azure, WAA, benchmark, or Windows -- You're about to run ANY `vm` subcommand (probe, diag, logs, run-waa, etc.) 
-- You want to check benchmark status - -**THE COMMAND (run this FIRST, not after other commands):** -```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor -``` - -**ENHANCED FEATURES (as of Jan 2026):** -The `vm monitor` command now provides comprehensive VM usage visibility: -- **VM Status**: Real-time VM state, size, and IP -- **Activity Detection**: What the VM is currently doing (idle, benchmark running, setup) -- **Cost Tracking**: Current uptime, hourly rate, and total cost for session -- **Azure ML Jobs**: Recent jobs from last 7 days with status -- **Evaluation History**: Past benchmark runs and success rates (with --details flag) -- **Dashboard & Tunnels**: Auto-starts web dashboard and SSH/VNC tunnels - -**Usage:** -```bash -# Basic monitoring -uv run python -m openadapt_ml.benchmarks.cli vm monitor - -# With detailed information (costs per day/week, evaluation history) -uv run python -m openadapt_ml.benchmarks.cli vm monitor --details - -# With auto-shutdown after 2 hours -uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2 -``` - -**WHY THIS MATTERS:** -- VNC is ONLY accessible via SSH tunnel at `localhost:8006` (NOT the public IP like `http://20.x.x.x:8006`) -- Azure NSG blocks port 8006 by design - direct access to public IP will NOT work -- The dashboard auto-manages SSH tunnels for VNC access -- Shows real-time costs to prevent budget overruns -- Tracks all Azure ML jobs for visibility into what's running -- Without it, you cannot see what Windows is doing -- The user WILL be frustrated if you keep forgetting this +**If a CLI command doesn't exist**: Edit cli.py to add it, THEN use it. NEVER use raw commands as workaround. -**WRONG (what you keep doing):** -```bash -# DON'T do this - checking probe/diag/logs WITHOUT dashboard running -uv run python -m openadapt_ml.benchmarks.cli vm probe -uv run python -m openadapt_ml.benchmarks.cli vm diag -# Then telling user to "run vm monitor" - NO! YOU run it FIRST! -``` +### 2. START DASHBOARD FIRST FOR VM WORK -**RIGHT (what you should do):** +**Before ANY vm subcommand (probe, diag, logs, etc.):** ```bash -# ALWAYS start dashboard FIRST, then it handles everything uv run python -m openadapt_ml.benchmarks.cli vm monitor ``` -**After every /compact or session restart, your LITERAL FIRST ACTION must be starting this dashboard if VMs are involved.** - ---- -## ๐Ÿ”ด MANDATORY: VERIFY URLs BEFORE RECOMMENDING ๐Ÿ”ด +This manages: +- SSH tunnels (VNC at localhost:8006, WAA at localhost:5001) +- Real-time cost tracking +- Azure ML job visibility +- Auto-opens web dashboard -**BEFORE telling the user to access ANY URL (localhost:XXXX, VNC, dashboard, etc.):** +**WRONG**: Running `vm probe` then `vm diag` then telling user to run `vm monitor` +**RIGHT**: Run `vm monitor` FIRST - it handles everything -1. **MANUALLY VERIFY** the URL is accessible by running a curl/check command -2. **NEVER assume** a service is running just because it was started earlier -3. **NEVER recommend** a URL based on documentation alone - ALWAYS test first +### 3. 
VERIFY URLs BEFORE RECOMMENDING -**Example verification:** +Always test URLs with curl before telling user to access them: ```bash -# ALWAYS do this BEFORE telling user to visit localhost:8006 -curl -s --connect-timeout 5 http://localhost:8006/ > /dev/null && echo "VNC accessible" || echo "VNC NOT accessible" +curl -s --connect-timeout 5 http://localhost:8006/ > /dev/null && echo "accessible" || echo "NOT accessible" ``` -**If verification fails:** -- Do NOT tell user to access the URL -- Diagnose why it's not working -- Fix it first, THEN provide the URL - -**This rule exists because:** The user was told to access localhost:8006 when the container was gone. This is unacceptable. - ---- -## ๐Ÿšจ๐Ÿšจ๐Ÿšจ STOP! READ THIS BEFORE EVERY COMMAND ๐Ÿšจ๐Ÿšจ๐Ÿšจ - -### ABSOLUTELY NEVER USE RAW SSH COMMANDS - -**This is the #1 rule. You have been told this MANY times. STOP IGNORING IT.** - -โŒ **BANNED** (never type these): -- `ssh azureuser@IP "anything"` -- `ssh $SSH_OPTS ...` -- Any command starting with `ssh` to the VM - -โœ… **REQUIRED** (always use these instead): -- `uv run python -m openadapt_ml.benchmarks.cli vm exec --cmd "your command"` -- `uv run python -m openadapt_ml.benchmarks.cli vm diag` -- `uv run python -m openadapt_ml.benchmarks.cli vm logs` - -**If a CLI command doesn't exist, ADD IT TO THE CLI FIRST, then use it.** - -**Before running ANY command involving the VM, ask yourself:** -1. Does this start with `ssh`? โ†’ STOP, use CLI instead -2. Is this a raw shell command to the VM? โ†’ STOP, use CLI instead -3. Can I use `vm exec --cmd`? โ†’ YES, use it - -This has been explained to you repeatedly. FOLLOW IT. - --- -## ๐Ÿ”ง DOCKERFILE/VM CHANGES: TEST INSIDE CONTAINER FIRST -**Problem**: Each Dockerfile change triggers: rebuild (10 min) โ†’ Windows boot (15 min) โ†’ test โ†’ repeat. Hours wasted on tiny changes. +## Project Status -**Solution**: Test fixes INSIDE a running container BEFORE rebuilding: - -```bash -# 1. Start a test container with bash entrypoint (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - 'docker run -d --name test-fix --entrypoint /bin/bash windowsarena/winarena:latest -c "sleep 3600"' - -# 2. Apply your fix manually INSIDE the container (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix sed -i 's/old/new/' /some/file.sh" - -# 3. Verify the fix works (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix cat /some/file.sh" - -# 4. Test the actual behavior (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix /some/script.sh && ls /expected/output" - -# 5. Cleanup -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'docker rm -f test-fix' - -# 6. ONLY AFTER fix is verified: Update Dockerfile and rebuild ONCE -``` - -**Why this matters**: -- Testing a fix takes SECONDS instead of 30+ minutes -- Iterate 10x on the fix before committing to a rebuild -- Don't lose context waiting for long builds -- Each rebuild should be the LAST rebuild, not a guess - ---- +**IMPORTANT**: Check `/Users/abrichr/oa/src/STATUS.md` at session start for P0 priorities. ## Project Overview -openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation agents. It provides: -- Schemas for GUI interaction trajectories -- Synthetic UI generation for bootstrapping +openadapt-ml: Model-agnostic ML engine for GUI automation agents. 
+- Schemas for GUI trajectories - VLM adapters (Qwen3-VL, Qwen2.5-VL, API backends) - Supervised fine-tuning pipeline - Runtime policy API ## Current Focus: Demo Retrieval -**Validated**: Demo-conditioned prompting improves action accuracy (Dec 2024) +**Validated (Dec 2024)**: Demo-conditioned prompting improves accuracy - Zero-shot: 33% correct first actions - With demo: 100% correct first actions - See `docs/experiments/demo_conditioned_prompting_results.md` -**โœ… VALIDATED (Jan 17, 2026)**: Demo persistence fix is working -- The P0 fix in `openadapt-evals` ensures demo is included at EVERY step, not just step 1 -- Mock test confirms: agent behavior changes from 6.8 avg steps (random) to 3.0 avg steps (focused) -- See `openadapt-evals/CLAUDE.md` for full validation details -- **Next step**: Run full WAA evaluation (154 tasks) to measure episode success improvement - -**Next step**: Build demo retrieval to automatically select relevant demos from a library. +**Validated (Jan 2026)**: Demo persistence fix working in openadapt-evals +- Agent behavior: 6.8 avg steps (random) -> 3.0 avg steps (focused) +- Next: Run full WAA evaluation (154 tasks) -**Key insight**: OpenAdapt's value is **trajectory-conditioned disambiguation of UI affordances**, not "better reasoning". +**Key insight**: OpenAdapt's value is trajectory-conditioned disambiguation of UI affordances. ## Benchmark Integration -**Primary benchmark**: Windows Agent Arena (WAA) +**Primary**: Windows Agent Arena (WAA) - 154 tasks across 11 Windows domains -- MIT licensed, can run locally or on Azure +- MIT licensed, runs locally or on Azure - SOTA: ~19.5% success (GPT-5.1 + OmniParser) -**Future benchmarks** (not yet implemented): -- WebArena/VisualWebArena (browser) -- OSWorld (cross-platform desktop) +**Future benchmarks** (not yet implemented): WebArena, OSWorld ---- - -## ๐ŸŽฏ WAA BENCHMARK WORKFLOW (COMPLETE GUIDE) +**Code location**: Benchmark code moved to `openadapt-evals` package. openadapt-ml handles VM management only. 
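The backward-compatible imports shown below keep working because `openadapt_ml/benchmarks/` is reduced to thin stubs that re-export from `openadapt-evals`. A minimal hypothetical sketch of such a stub (the alias of `ApiAgent` to `APIBenchmarkAgent` and the deprecation warning are assumptions, not the actual module):

```python
# Hypothetical sketch of openadapt_ml/benchmarks/__init__.py as a re-export stub.
import warnings

from openadapt_evals import (  # noqa: F401
    ApiAgent as APIBenchmarkAgent,
    WAAMockAdapter,
    evaluate_agent_on_benchmark,
)

warnings.warn(
    "openadapt_ml.benchmarks is deprecated; import from openadapt_evals instead",
    DeprecationWarning,
    stacklevel=2,
)
```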
-### Architecture Overview +```python +# NEW (preferred) +from openadapt_evals import ApiAgent, WAAMockAdapter, evaluate_agent_on_benchmark -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LOCAL MACHINE โ”‚ -โ”‚ โ”‚ -โ”‚ openadapt-ml CLI openadapt-evals CLI โ”‚ -โ”‚ (VM management) (benchmark execution) โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ vm monitor โ”‚ live --server localhost:5001 โ”‚ -โ”‚ โ”‚ vm setup-waa โ”‚ run (shortcut) โ”‚ -โ”‚ โ”‚ vm diag โ”‚ โ”‚ -โ”‚ โ–ผ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ SSH TUNNELS (auto-managed) โ”‚ โ”‚ -โ”‚ โ”‚ localhost:5001 โ”€โ”€โ”€โ”€โ”€โ”€โ–บ VM:5000 (WAA Flask API) โ”‚ โ”‚ -โ”‚ โ”‚ localhost:8006 โ”€โ”€โ”€โ”€โ”€โ”€โ–บ VM:8006 (noVNC) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ”‚ SSH (port 22) - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ AZURE VM (Ubuntu) โ”‚ -โ”‚ โ”‚ -โ”‚ Docker โ”‚ -โ”‚ โ””โ”€โ”€ windowsarena/winarena:latest โ”‚ -โ”‚ โ””โ”€โ”€ QEMU (Windows 11 Enterprise) โ”‚ -โ”‚ โ”œโ”€โ”€ WAA Flask server (port 5000) โ”‚ -โ”‚ โ””โ”€โ”€ Navi agent (executes tasks) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +# Backward compat +from openadapt_ml.benchmarks import APIBenchmarkAgent, WAAMockAdapter ``` +--- + +## WAA Workflow + ### Two CLIs, Two Purposes | CLI | Repo | Purpose | @@ -350,1139 +131,485 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a ### API Keys -**API keys are auto-loaded from `.env` via `config.py`**. No need to pass explicitly. +Auto-loaded from `.env` via `config.py`. No need to pass explicitly. ```bash -# .env file (create in repo root, not committed to git) +# .env file (not committed to git) OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-... ``` -Optional override: `[--api-key KEY]` on any command that needs it. +### Complete Workflow (Pool - Recommended) -### Complete Workflow (Step by Step) - -**Step 1: Setup Azure VM with WAA (first time, ~15 min)** +**Step 1: Create VM Pool (~10 min)** ```bash -cd /Users/abrichr/oa/src/openadapt-ml -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa +# Single VM for quick tests +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 1 + +# Multiple VMs for parallel evaluation +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 ``` -This creates VM, installs Docker, pulls Windows image, starts WAA server. 
-**Step 2: Start Dashboard and Tunnels** +**Step 2: Wait for WAA Ready (~5-15 min)** ```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor +uv run python -m openadapt_ml.benchmarks.cli pool-wait ``` -This auto-manages SSH tunnels: -- `localhost:5001` -> VM:5000 (WAA API) -- `localhost:8006` -> VM:8006 (VNC) -**Step 3: Run Benchmark (from openadapt-evals)** +**Step 3: Run Benchmark** ```bash -cd /Users/abrichr/oa/src/openadapt-evals +# Run 3 tasks for quick validation +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 3 -# Quick smoke test (no API key needed) -uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1 - -# Run with OpenAI (uses OPENAI_API_KEY from .env) -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1 +# Run all 154 tasks +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154 +``` -# Run with Claude (uses ANTHROPIC_API_KEY from .env) -uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1 +**Step 4: View Progress and VNC** +```bash +# Check status +uv run python -m openadapt_ml.benchmarks.cli pool-status -# Override API key if needed -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1 --api-key sk-... +# Open VNC to view Windows desktops +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -# Multiple tasks -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2,browser_1 +# Stream logs +uv run python -m openadapt_ml.benchmarks.cli pool-logs ``` -**Step 4: View Results** +**Step 5: Cleanup (Stop Billing)** ```bash -uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` -**Step 5: Deallocate VM (stops billing)** +### CLI Commands Reference + ```bash -cd /Users/abrichr/oa/src/openadapt-ml -uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y +# === POOL COMMANDS (Parallel VMs - Recommended) === +pool-create --workers N # Create N VMs with Docker + WAA image +pool-create --workers N --auto-shutdown-hours 6 # Custom auto-shutdown (default: 4h) +pool-wait # Wait for WAA server ready on all workers +pool-run --tasks N # Run N tasks distributed across workers +pool-status # Show status of all pool VMs +pool-vnc # Open VNC to pool workers (SSH tunnels) +pool-logs # Stream logs from all workers +pool-exec --cmd '' # Execute command on all workers +pool-cleanup -y # Delete all pool VMs and resources (no prompt) + +# === SINGLE VM COMMANDS === +create --fast # Create single VM (D8ds_v5) +create --fast --auto-shutdown-hours 6 # Custom auto-shutdown (default: 4h) +delete # Delete VM and all resources +status # Show VM status +start # Start WAA container +stop # Stop WAA container +probe # Check if WAA server is ready +run --num-tasks N # Run benchmark on single VM +vm-start # Start a deallocated VM +deallocate # Stop VM (preserves disk, stops billing) +logs # Show WAA logs +vnc # Open VNC (SSH tunnel) +exec --cmd '' # Run command in container +docker-exec --cmd '' # Run command on VM host + +# === AZURE ML COMMANDS (Legacy) === +run-azure-ml --workers N # Run on Azure ML compute instances +azure-ml-quota # Check quota status +azure-ml-quota-wait # Wait for quota approval ``` -### Quick Reference Commands +### Quota Auto-Detection + +Wait for quota approval before running evaluation: -**From openadapt-ml (VM management):** ```bash -vm monitor # Start dashboard, tunnels, show status -vm 
setup-waa # First-time VM + WAA setup -vm diag # Check disk, Docker, containers -vm probe # Check WAA server status -vm logs # View container logs -vm deallocate # Stop VM billing -vm delete # Remove VM entirely +# Wait for quota (polls every 60 seconds, 24h timeout) +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait + +# Wait and automatically run evaluation when quota is approved +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --auto-run --tasks 20 + +# Custom target (e.g., 16 vCPUs for 2 parallel workers) +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --target 16 + +# Run in background (survives terminal close) +nohup uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --auto-run & ``` -**From openadapt-evals (benchmarks):** +See `docs/QUOTA_AUTO_DETECTION_DESIGN.md` for full documentation. + +### VM Auto-Shutdown and Orphan Prevention + +**Auto-shutdown policy**: All VMs are automatically configured with an Azure auto-shutdown policy as a safety net to prevent orphaned VMs from running indefinitely and consuming quota/money. + +- **Default**: 4 hours after VM creation +- **Customizable**: `--auto-shutdown-hours N` (0 to disable) +- **Azure-level enforcement**: Even if SSH connection drops, the VM will still be deallocated + ```bash -run # Simplified live evaluation (uses localhost:5001) -live # Full control over server URL -mock # Mock evaluation (no VM needed) -probe # Check if WAA server is ready -view # Generate HTML results viewer +# Default: auto-shutdown in 4 hours +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 + +# Custom: auto-shutdown in 8 hours for long-running evaluations +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 --auto-shutdown-hours 8 + +# Disable auto-shutdown (not recommended) +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 --auto-shutdown-hours 0 ``` -### Key Points to Remember +**Test VM cleanup**: During `pool-create`, a test VM is created to check quota availability. This test VM is always cleaned up via try/finally, even if the command is interrupted or fails. -1. **SSH tunnels are required** - Azure NSG blocks direct access to ports 5000/8006 -2. **WAA server runs INSIDE Windows** - The Flask server (port 5000) runs in Windows, not on the Ubuntu host -3. **Default tunnel port is 5001** - Use `--server http://localhost:5001` (not 5000) -4. **Monitor auto-manages tunnels** - Running `vm monitor` sets up everything -5. **Results saved to benchmark_results/** - View with `view --run-name ` +**Manual cleanup**: Use `pool-cleanup -y` to clean up orphaned resources without confirmation prompts (useful for automation): +```bash +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup -y +``` -### Troubleshooting +### Azure ML Automated Workflow + +For parallel benchmark execution on Azure ML compute instances: -**Problem: "Cannot connect to WAA server"** ```bash -# 1. Is VM running? -uv run python -m openadapt_ml.benchmarks.cli vm status +# Single command handles everything: +# 1. Create/start VM if needed +# 2. Start Windows container with VERSION=11e +# 3. Wait for WAA server ready (~15-20 min first time) +# 4. Upload golden image to blob storage +# 5. Run Azure ML benchmark with N workers -# 2. Are tunnels active? -uv run python -m openadapt_ml.benchmarks.cli vm monitor +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml-auto --workers 4 -# 3. 
Check container -uv run python -m openadapt_ml.benchmarks.cli vm diag +# Setup only (golden image, no benchmark) +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml-auto --skip-benchmark + +# Cleanup when done (IMPORTANT - stops billing!) +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml --teardown --confirm ``` -**Problem: "Connection refused on localhost:5001"** -```bash -# Start tunnels via monitor -uv run python -m openadapt_ml.benchmarks.cli vm monitor +See `docs/AZURE_ML_AUTOMATED_WORKFLOW.md` for full documentation. + +### Architecture + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LOCAL MACHINE โ”‚ +โ”‚ openadapt-ml CLI openadapt-evals CLI โ”‚ +โ”‚ (VM management) (benchmark execution) โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ SSH TUNNELS (auto-managed by monitor) โ”‚ โ”‚ +โ”‚ โ”‚ localhost:5001 โ†’ VM:5000 (WAA API) โ”‚ โ”‚ +โ”‚ โ”‚ localhost:8006 โ†’ VM:8006 (noVNC) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ SSH (port 22) + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ AZURE VM (Ubuntu) โ”‚ +โ”‚ Docker โ”‚ +โ”‚ โ””โ”€โ”€ windowsarena/winarena:latest (Microsoft official) โ”‚ +โ”‚ โ””โ”€โ”€ QEMU (Windows 11 Enterprise) โ”‚ +โ”‚ โ”œโ”€โ”€ WAA Flask server (port 5000) โ”‚ +โ”‚ โ””โ”€โ”€ Navi agent (executes tasks) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` -**Problem: "Windows not booting"** +**Key Points**: +1. SSH tunnels required - Azure NSG blocks direct port access +2. WAA server runs INSIDE Windows, not on Ubuntu host +3. Default tunnel port is 5001 (not 5000) +4. Uses vanilla Microsoft WAA image, no custom Dockerfile +5. `VERSION=11e` auto-downloads Windows 11 Enterprise Evaluation + +--- + +## VM Configuration Changes + +Delete + recreate (don't try to resize running VMs): ```bash -# Check VNC (opens in browser via monitor) -# Look at container logs -uv run python -m openadapt_ml.benchmarks.cli vm logs +uv run python -m openadapt_ml.benchmarks.cli vm delete -y +# Update cli.py defaults +uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ``` +**Current defaults** (in cli.py): +- Size: `Standard_D8ds_v5` (8 vCPU, 32GB RAM, 300GB temp on /mnt) +- Location: `eastus` +- OS: Ubuntu 22.04 LTS + --- ## Key Architecture Decisions -1. **SoM (Set-of-Marks) mode** - Achieves 100% on synthetic benchmarks by using element IDs instead of coordinates (`CLICK([1])` not `CLICK(x=0.42, y=0.31)`) - -2. **Grounding module** - Keep but deprioritize. Useful for deployment on real UIs without SoM overlays. Located in `openadapt_ml/grounding/` - -3. 
**Schema design** - Actions should carry both coordinates AND element grounding (node_id, role, name, bbox) when available - -4. **Lossless preservation** - Always store raw benchmark configs verbatim in `raw_config`, `raw_observation`, `raw_action` fields - -5. **DOM/AX is mandatory in schema, optional at runtime** - Observations must support `accessibility_tree` and `dom_html` fields for evaluator compatibility (WebArena, WorkArena, Mind2Web need DOM for scoring), even if agents choose vision-only - -6. **Cloud-First Development** - While features should work locally for testing, immediately build out cloud compatibility (Azure free tier, Lambda Labs) because: - - Most users won't have 96GB RAM locally for VLM training - - Developer productivity suffers waiting for long training runs - - Training should be as short as possible with feedback as quickly as possible - - **Everything should feel fast** - offload heavy compute to cloud GPUs - - Cloud providers: Azure (primary, free tier available), Lambda Labs (GPU rental) - - See `docs/live_inference_design.md` for async inference architecture - -7. **Schema Purity** - The schema must remain domain-agnostic and generic: - - **External systems adapt TO the schema**, not the other way around - - Never add fields to accommodate specific external data structures - - Data transformation belongs in importers/exporters, not core schema - - Use `raw` and `metadata` dict fields for integration-specific data - - If a proposed field feels specific to one use case, it doesn't belong in the schema - - This is a standard open-source library: users import and call functions, they don't shape the API - - See `openadapt_ml/schemas/` for canonical definitions - -8. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first: - - **Never wait on real training to test UI/code changes** - - Use `--stub` flag to simulate training progress without GPU - - Generates fake loss curves, evaluations, checkpoints in seconds - - Enables rapid iteration on dashboard, viewer, stop button, etc. - - See `docs/stub_training_adapter.md` for implementation details - - Usage: `uv run python -m openadapt_ml.cloud.lambda_labs monitor --stub --open` - -## Expert Feedback - -1. **Prompting first** - Establish baselines with off-the-shelf models before fine-tuning -2. **Prompt engineering matters** - Use structured format: Observation summary โ†’ Planning โ†’ Possible actions โ†’ Action -3. **Element-based actions** - `Click [8]` instead of coordinates, similar to SoM -4. **Larger base models** - They used Gemma3 27B; current 2B/8B might be too small - -## Benchmark Integration (MIGRATED TO openadapt-evals) - -> **IMPORTANT**: Benchmark code has been consolidated into the `openadapt-evals` package. -> The `openadapt_ml/benchmarks/` directory now contains deprecation stubs that re-export from `openadapt-evals`. 
-> -> **Use the new package:** -> ```python -> # NEW (preferred) -> from openadapt_evals import ApiAgent, WAAMockAdapter, evaluate_agent_on_benchmark -> -> # Also works (backward compat) -> from openadapt_ml.benchmarks import APIBenchmarkAgent, WAAMockAdapter -> ``` -> -> **CLI (now in openadapt-evals):** -> ```bash -> # NEW (preferred) -> uv run python -m openadapt_evals.benchmarks.cli mock --tasks 10 -> uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://vm:5000 -> -> # openadapt-ml CLI still works for VM management -> uv run python -m openadapt_ml.benchmarks.cli vm monitor -> ``` - -The benchmark integration module is now in `openadapt-evals`: -- `openadapt_evals/adapters/` - BenchmarkAdapter, WAAAdapter, WAALiveAdapter -- `openadapt_evals/agents/` - BenchmarkAgent, ApiAgent (with P0 demo persistence fix), PolicyAgent -- `openadapt_evals/benchmarks/` - runner, metrics, viewer, data_collection - -### APIBenchmarkAgent - -The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-5.1) for benchmark evaluation baselines. -This enables comparing fine-tuned models against off-the-shelf VLMs. +1. **SoM mode** - Element IDs (`CLICK([1])`) instead of coordinates for 100% accuracy on synthetic benchmarks -```python -from openadapt_ml.benchmarks import APIBenchmarkAgent, evaluate_agent_on_benchmark +2. **Grounding module** - Keep but deprioritize. Useful for real UIs without SoM overlays. Located in `openadapt_ml/grounding/` -# Claude baseline -agent = APIBenchmarkAgent(provider="anthropic") -results = evaluate_agent_on_benchmark(agent, adapter) +3. **Schema design** - Actions carry both coordinates AND element grounding when available -# GPT-5.1 baseline -agent = APIBenchmarkAgent(provider="openai") -results = evaluate_agent_on_benchmark(agent, adapter) -``` - -CLI usage: -```bash -# Run Claude evaluation on mock tasks -uv run python -m openadapt_ml.benchmarks.cli run-api --provider anthropic --tasks 5 +4. **Lossless preservation** - Store raw benchmark configs in `raw_config`, `raw_observation`, `raw_action` fields -# Run GPT-5.1 evaluation -uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai --tasks 5 +5. **Schema purity** - Domain-agnostic; external systems adapt TO the schema, not vice versa. See `openadapt_ml/schemas/` -# Disable accessibility tree in prompts -uv run python -m openadapt_ml.benchmarks.cli run-api --no-a11y --tasks 5 -``` +6. **Cloud-first** - Offload heavy compute to cloud GPUs (Azure, Lambda Labs). Everything should feel fast. -The agent: -- Converts BenchmarkObservation to API format (screenshot + structured prompt) -- Parses VLM responses into BenchmarkActions using regex patterns -- Supports CLICK(x,y), CLICK([id]), TYPE("text"), KEY(key), SCROLL(dir), DONE() -- Stores raw VLM responses in `action.raw_action` for debugging - -### Azure Automation - -`scripts/setup_azure.py` fully automates Azure setup with 15 steps: -1. Check Azure CLI installation -2. Login to Azure -3. Select subscription -4. Register resource providers (Compute, ML, Storage, ContainerRegistry) -5. Create resource group -6. Create service principal with Contributor role -7. Create ML workspace -8. Create Azure Container Registry (ACR) -9. Import WAA Docker image from Docker Hub to ACR -10. Attach ACR to ML workspace -11. Grant AcrPull role to workspace managed identity -12. Sync workspace keys for ACR authentication -13. Request GPU quota -14. Create storage account -15. 
Create inference queue and blob containers - -The script writes all credentials to `.env` including: -- Service principal credentials (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID) -- Workspace config (AZURE_SUBSCRIPTION_ID, AZURE_ML_RESOURCE_GROUP, AZURE_ML_WORKSPACE_NAME) -- Docker image path (AZURE_DOCKER_IMAGE) pointing to ACR - -**Why ACR?** Azure ML cannot pull from Docker Hub or ghcr.io directly. The image must be in ACR. - -**ACR Authentication**: The script automatically configures ACR authentication by granting the workspace's managed identity AcrPull role on the ACR. This ensures compute instances can pull Docker images without requiring admin credentials. - -CLI usage: -```bash -# Set up Azure (creates resources, ACR, imports image, writes credentials to .env) -python scripts/setup_azure.py +7. **Stub training** - Use `--stub` flag for rapid UI iteration without GPU -# Clean up all Azure resources -python scripts/setup_azure.py --cleanup +8. **DOM/AX mandatory in schema** - For evaluator compatibility (WebArena, Mind2Web need DOM), even if agents use vision-only -# Estimate Azure costs -python -m openadapt_ml.benchmarks.cli estimate --workers 40 +--- -# Test with mock adapter (no Windows required) -python -m openadapt_ml.benchmarks.cli test-mock --tasks 20 +## Azure Automation -# Check Azure status -python -m openadapt_ml.benchmarks.cli status +`scripts/setup_azure.py` automates 15-step Azure setup: +- Creates resource group, service principal, ML workspace, ACR +- Imports WAA Docker image to ACR +- Configures ACR authentication (AcrPull role) +- Writes credentials to `.env` -# Run on Azure (WAA submodule auto-detected) -python -m openadapt_ml.benchmarks.cli run-azure --workers 1 +```bash +python scripts/setup_azure.py # Setup +python scripts/setup_azure.py --cleanup # Cleanup ``` -Schema extensions completed in `openadapt_ml/schemas/sessions.py`: -- `Action`: `target_node_id`, `target_role`, `target_name`, `answer`, `key`, `modifiers`, `scroll_direction`, `scroll_amount`, `end_x`, `end_y` -- `Observation`: `accessibility_tree`, `dom_html`, `url`, `window_title`, `app_name`, `focused_element` +--- ## Cloud GPU Training See `docs/cloud_gpu_training.md` for full documentation. -**Quick start:** ```bash -# Lambda Labs - fully automated training pipeline -uv run python -m openadapt_ml.cloud.lambda_labs train \ - --capture /path/to/capture \ - --goal "Task description" +# Lambda Labs - automated pipeline +uv run python -m openadapt_ml.cloud.lambda_labs train --capture /path --goal "Task" -# Or step by step: +# Step by step uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a10 uv run python -m openadapt_ml.cloud.lambda_labs train-status uv run python -m openadapt_ml.cloud.lambda_labs terminate ``` -**Important**: All cloud operations should be wrapped in CLI commands, not raw SSH. 
The Lambda Labs module provides: -- `LambdaLabsClient.setup_instance()` - Clone repo, install deps -- `LambdaLabsClient.upload_capture()` - rsync capture data -- `LambdaLabsClient.run_training()` - Execute training -- `LambdaLabsClient.get_training_status()` - Poll training progress +--- -## Training & Visualization Commands +## Training Commands ```bash -# Train on a capture recording +# Train on capture uv run python -m openadapt_ml.scripts.train \ --config configs/qwen3vl_capture.yaml \ --capture /path/to/capture \ - --open # opens dashboard in browser + --open -# Serve dashboard/viewer via HTTP (RECOMMENDED) -# Auto-regenerates dashboard.html and viewer.html before serving +# Serve dashboard (auto-regenerates HTML) uv run python -m openadapt_ml.cloud.local serve --port 8080 --open -# Skip regeneration if files are already up to date -uv run python -m openadapt_ml.cloud.local serve --port 8080 --open --no-regenerate - -# Regenerate viewer/dashboard without serving -# Useful after training completes or to refresh with latest code changes +# Regenerate viewer without serving uv run python -m openadapt_ml.cloud.local viewer -# Compare human vs model predictions +# Compare human vs model uv run python -m openadapt_ml.scripts.compare \ --capture /path/to/capture \ --checkpoint checkpoints/model \ --open ``` -## Benchmark Data Collection & Testing - -```bash -# Test benchmark data collection (Phase 1) -# Creates directory structure with screenshots, execution traces, and metadata -uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5 - -# Custom run name and output directory -uv run python -m openadapt_ml.benchmarks.cli test-collection \ - --tasks 10 \ - --run-name my_test_run \ - --output benchmark_results \ - --model-id "my-agent-v1" - -# Run the standalone test script (equivalent to test-collection) -uv run python test_data_collection.py -``` - -**Output directory structure:** -``` -benchmark_results/ -โ”œโ”€โ”€ {run_name}/ -โ”‚ โ”œโ”€โ”€ metadata.json # Benchmark name, model ID, timestamp -โ”‚ โ”œโ”€โ”€ summary.json # Aggregate metrics (success rate, avg steps) -โ”‚ โ””โ”€โ”€ tasks/ -โ”‚ โ”œโ”€โ”€ task_001/ -โ”‚ โ”‚ โ”œโ”€โ”€ task.json # Task definition -โ”‚ โ”‚ โ”œโ”€โ”€ execution.json # Execution trace with steps -โ”‚ โ”‚ โ””โ”€โ”€ screenshots/ -โ”‚ โ”‚ โ”œโ”€โ”€ step_000.png -โ”‚ โ”‚ โ”œโ”€โ”€ step_001.png -โ”‚ โ”‚ โ””โ”€โ”€ ... -โ”‚ โ””โ”€โ”€ task_002/ -โ”‚ โ””โ”€โ”€ ... -``` - -**Key files:** -- `execution.json`: Contains step-by-step trace with actions, reasoning, timestamps -- `task.json`: Task definition with instruction, domain, time limits -- `summary.json`: High-level metrics suitable for benchmark viewer -- `screenshots/`: PNG screenshots at each step - -## Viewer Setup Troubleshooting - -**Problem**: Viewer shows "No model loaded" after training. - -**Root cause**: The viewer requires: -1. A base `comparison.html` file (from capture or generated during training) -2. 
Prediction JSON files (`predictions_*.json`) - -**Solution**: -```bash -# If comparison.html is missing, copy from the capture directory: -cp /path/to/capture/comparison.html training_output/ - -# Then regenerate the viewer: -uv run python -m openadapt_ml.cloud.local viewer - -# Serve and open: -uv run python -m openadapt_ml.cloud.local serve --open -``` - -**Key files in training_output/**: -- `training_log.json` - Training progress, loss curves, evaluations -- `dashboard.html` - Training dashboard (auto-regenerated by serve command) -- `viewer.html` - Capture viewer with predictions (auto-regenerated by serve command) -- `comparison.html` - Base viewer from capture (needed for viewer generation) -- `predictions_*.json` - Model predictions by checkpoint (e.g., `predictions_epoch3.json`) - -## Files to Know - -- `docs/cloud_gpu_training.md` - Lambda Labs and Azure GPU training guide -- `docs/benchmark_integration_plan.md` - Benchmark integration architecture -- `docs/azure_waa_setup.md` - Azure WAA setup guide (quota increase, costs, troubleshooting) -- `docs/design.md` - Overall system design -- `docs/experiments/demo_conditioned_prompting_results.md` - Demo experiment results (validated Dec 2024) -- `openadapt_ml/cloud/` - Cloud GPU providers (Lambda Labs, Azure) -- `openadapt_ml/benchmarks/` - Benchmark integration module (WAA, base classes) -- `openadapt_ml/experiments/demo_prompt/` - Demo-conditioned prompting experiment -- `openadapt_ml/grounding/` - Grounding module (GeminiGrounder, etc.) -- `openadapt_ml/ingest/capture.py` - Converts openadapt-capture recordings to Episodes -- `scripts/run_demo_experiment.py` - Run demo-conditioned experiment -- `configs/qwen3vl_synthetic_som.yaml` - SoM training config +--- ## Code Patterns ### Environment Variables -Always load env vars through `openadapt_ml/config.py` using pydantic-settings, NOT directly from `os.environ`: - +Use `config.settings`, NOT `os.environ`: ```python # Good from openadapt_ml.config import settings -api_key = settings.lambda_api_key +api_key = settings.openai_api_key # Bad -api_key = os.environ.get("LAMBDA_API_KEY") +api_key = os.environ.get("OPENAI_API_KEY") ``` -This ensures `.env` file is automatically loaded. When adding new env vars: +When adding new env vars: 1. Add to `Settings` class in `config.py` -2. Add to `.env.example` with documentation - -### API Keys for CLI Commands +2. Add to `.env.example` -CLI commands that need API keys (e.g., `waa`, `run-api`) follow this priority: -1. Command-line argument: `--api-key YOUR_KEY` -2. Config file: `settings.openai_api_key` from `.env` -3. Environment variable: `$OPENAI_API_KEY` +### API Keys for CLI +Priority: `--api-key` flag > `.env` file > environment variable -**Best practice**: Store keys in `.env` file (not committed to git): -```bash -# .env -OPENAI_API_KEY=sk-... -ANTHROPIC_API_KEY=sk-ant-... -``` - -Then CLI commands work without `--api-key`: -```bash -# These load API key from .env automatically -uv run python -m openadapt_ml.benchmarks.cli waa -uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai -``` - -## File Access - -The user has pre-approved read access to: -- `~/oa/src/` - Parent directory containing related projects (openadapt-capture, etc.) - -Related paths: -- Capture recordings: `/Users/abrichr/oa/src/openadapt-capture/` -- Screenshots: `/Users/abrichr/oa/src/openadapt-capture//screenshots/` - -## Shared Dashboard Components - -The training dashboard and capture viewer share UI components for visual consistency. 
When modifying dashboard UI: - -**Key files:** -- `openadapt_ml/training/trainer.py` - Contains shared component functions: - - `_get_shared_header_css()` - CSS for the unified header - - `_generate_shared_header_html()` - HTML generator for nav tabs + controls - -**Pattern:** -1. Define shared CSS/HTML in dedicated functions (prefixed with `_`) -2. Both `generate_training_dashboard()` and `_enhance_comparison_to_unified_viewer()` call these functions -3. Changes to shared functions automatically propagate to all dashboards +--- -**Why this matters:** -- Prevents visual inconsistencies when switching between Training and Viewer tabs -- Single source of truth for styling (no duplicate CSS to maintain) -- Easier to add new dashboards that match existing style +## Dockerfile Testing -## CRITICAL: Always Start Dashboard When Running Azure Resources +Test fixes INSIDE container before rebuilding (saves 30+ min): -See the โš ๏ธ MANDATORY section at the TOP of this file. Use: ```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor -``` - -## โš ๏ธ SAFE PROCESS MANAGEMENT โš ๏ธ +# 1. Start test container +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + 'docker run -d --name test-fix --entrypoint /bin/bash windowsarena/winarena:latest -c "sleep 3600"' -**NEVER use broad pkill patterns** - they can kill unrelated applications! +# 2. Apply fix +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + "docker exec test-fix sed -i 's/old/new/' /some/file.sh" -**WRONG (DANGEROUS):** -```bash -# These patterns are TOO BROAD and will kill unrelated apps: -pkill -f "openadapt" # Kills anything with "openadapt" in path -pkill -f "python" # Kills ALL Python processes -pkill -9 -f "openadapt_ml" # Killed Claude Code, Windsurf, Signal, Chrome tabs! -``` +# 3. Verify +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + "docker exec test-fix cat /some/file.sh" -**RIGHT (SAFE):** -```bash -# Use specific PID-based killing: -lsof -i :8765 | grep python | awk '{print $2}' | xargs kill 2>/dev/null +# 4. Cleanup +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'docker rm -f test-fix' -# Or use specific process names with full path matching: -pkill -f "python.*-m openadapt_ml.cloud.local serve" +# 5. ONLY rebuild after fix is verified +``` -# Or kill only the specific port listener: -kill $(lsof -t -i :8765) 2>/dev/null +--- -# Check what would be killed FIRST: -pgrep -f "openadapt" -l # Lists matching processes before killing -``` +## Files to Know -**Before any pkill command:** -1. Run `pgrep -f "pattern" -l` to see what matches -2. Verify only intended processes are listed -3. Use the most specific pattern possible -4. Prefer port-based or PID-based killing +- `docs/WAA_APPROACH_REVIEW.md` - Full WAA setup documentation +- `docs/cloud_gpu_training.md` - Lambda Labs/Azure training guide +- `docs/azure_waa_setup.md` - Azure quota, costs, troubleshooting +- `docs/design.md` - System design +- `openadapt_ml/benchmarks/cli.py` - VM CLI commands +- `openadapt_ml/cloud/ssh_tunnel.py` - SSH tunnel manager +- `openadapt_ml/config.py` - Settings (pydantic-settings) +- `openadapt_ml/schemas/` - Canonical schema definitions -## Git Commit Style (Angular Convention) +--- -**ALWAYS use Angular-style commit messages** for all commits across all OpenAdapt repositories. 
+## Git Commit Style (Angular) -**Format:** ``` (): - - Co-Authored-By: Claude Opus 4.5 ``` -**Types:** -- `feat`: New feature -- `fix`: Bug fix -- `docs`: Documentation only -- `style`: Code style (formatting, semicolons, etc.) -- `refactor`: Code change that neither fixes a bug nor adds a feature -- `perf`: Performance improvement -- `test`: Adding or fixing tests -- `chore`: Maintenance tasks (deps, build, etc.) -- `ci`: CI/CD changes - -**Examples:** -```bash -# Feature -git commit -m "feat(viewer): add keyboard shortcuts for navigation" - -# Bug fix -git commit -m "fix(waa): resolve Docker storage path issue" - -# Documentation -git commit -m "docs: remove archived OpenAdapter from repository listing" +**Types**: feat, fix, docs, style, refactor, perf, test, chore, ci -# Refactor -git commit -m "refactor(cli): consolidate VM commands into single subcommand" -``` - -**Subject line rules:** -- Use imperative mood ("add" not "added" or "adds") -- No period at the end -- Max 50 characters -- Lowercase first letter after type +**Rules**: Imperative mood, no period, max 50 chars, lowercase after type --- ## Don't Do +- Don't use `os.environ` - use `config.settings` +- Don't use `pip install` - use `uv add` or `uv sync` +- Don't run VM ops without `vm monitor` first +- Don't use raw SSH/shell commands - use CLI +- Don't tell user to run commands - YOU run them +- Don't use broad pkill patterns (they kill unrelated apps) - Don't add timelines/estimates to plans -- Don't mention specific clients by name in public docs -- Don't over-engineer - keep solutions minimal -- Don't use `os.environ` directly - use `config.settings` instead -- Don't use `pip install` - always use `uv add` for dependencies or `uv sync` for the project -- Don't use non-Angular commit messages -- **Don't run Azure/VM operations without starting the dashboard first** - - โŒ WRONG: `vm probe` then `vm diag` then telling user to run `vm monitor` - - โœ… RIGHT: `vm monitor` FIRST (it does probe, tunnels, everything) - - This is the #1 mistake you keep making. STOP IT. -- **Don't use raw SSH/shell commands** - always use or create CLI commands instead (see below) -- **Don't tell user to run commands** - YOU run them. The CLI exists so YOU can use it. - -## CLI-First Development (IMPORTANT) - -**ALWAYS** use CLI commands instead of raw SSH/shell commands: -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm diag` (not `ssh ... df -h`) -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm logs` (not `ssh ... docker logs`) -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm probe` (not `ssh ... curl`) - -**Why**: CLI commands are documented, tested, and persist across context compactions. Raw commands are forgotten. - -**When you need a new operation**: -1. Add a new action to the relevant CLI subcommand (e.g., `vm logs`, `vm exec`) -2. Document it in CLAUDE.md -3. 
Use the CLI command going forward - -**Available VM CLI commands**: -```bash -vm monitor # THE GO-TO COMMAND: Start dashboard, open browser, show probe status - # Options: --auto-shutdown-hours N (deallocate after N hours) -vm diag # Check disk, Docker, containers, WAA probe status -vm logs # View container logs (--lines N, --follow) -vm probe # Check WAA server status (--wait to poll) -vm exec # Run command in container (--cmd 'your command') -vm host-exec # Run command on VM host (not in container) (--cmd 'your command') -vm start-windows # Start Windows container with vanilla WAA image -vm restart-windows # Stop and restart the Windows container -vm reset-windows # Delete Windows storage and start fresh installation -vm docker-prune # Clean Docker images, containers, build cache (free disk space) -vm docker-move # Move Docker/containerd to /mnt via symlinks (300GB space with D8ds_v5) -vm status # Azure VM status -vm ssh # Interactive SSH -vm deallocate # Stop VM billing (preserves disk), use -y to skip confirmation -vm start # Start a deallocated VM -vm delete # Delete VM (use -y to skip confirmation) - -# Use 'waa' command instead of deprecated 'vm setup-waa' and 'vm run-waa': -waa --setup-only # Full VM setup with Docker and vanilla WAA image -waa --num-tasks N # Run benchmark with N tasks -``` +- Don't mention specific clients by name -## TODO / Known Issues - -### Session-Based Cost/Time Tracking -**Status**: FIXED (Jan 2026) - -**Problem**: Dashboard showed cumulative cost/time from VM creation, not current session. -- User deallocated VM overnight, restarted it today -- Dashboard showed "$8.82 running cost" and "22h 58m elapsed" -- This was lifetime cost, not current session cost - -**Root cause**: Session tracker (`session_tracker.py`) wasn't integrated with CLI commands. -- `vm deallocate` didn't call `pause_session()`, so timer kept running -- `vm start` didn't call `start_session()` to resume properly -- `vm delete` didn't call `end_session()` or `clear_session()` - -**Solution implemented**: - -1. **CLI integration**: Added session tracker calls to VM lifecycle commands - - `vm deallocate`: Calls `pause_session()` and shows session summary - - `vm start`: Calls `start_session()` to resume with accumulated time - - `vm delete`: Calls `end_session()` and `clear_session()` - - Auto-shutdown in monitor: Calls `pause_session()` - - cleanup-stale: Calls `pause_session()` for deallocated VMs - -2. **Dashboard hybrid display**: Shows BOTH session and total costs - - "This Session: $0.14" - current running time since last start - - "Total Cost: $8.82" - accumulated across all sessions - - "Total Elapsed: 23h" - total time VM has been running - -3. 
**API enhancements**: Added fields to status response - - `current_session_seconds`: Time since last resume - - `current_session_cost_usd`: Cost for current session only - - `accumulated_seconds`: Time from previous sessions - -**Files changed**: -- `openadapt_ml/benchmarks/cli.py` - Session tracker calls in VM commands -- `openadapt_ml/cloud/local.py` - API returns session breakdown -- `openadapt_ml/training/azure_ops_viewer.py` - Dashboard shows both session and total - -### PyPI Publishing -**Status**: DONE +--- -Completed by background agent: -- Updated `pyproject.toml` with package metadata (description, authors, classifiers, URLs, license) -- Created `LICENSE` (MIT, matching related projects) -- Created `.github/workflows/publish.yml` for automated PyPI publishing on version tags -- Build system: hatchling +## Safe Process Management -To publish: -1. Set up PyPI trusted publishing (PyPI โ†’ Account Settings โ†’ Publishing) -2. `git tag v0.1.0 && git push origin v0.1.0` +```bash +# WRONG (kills unrelated apps) +pkill -f "openadapt" +pkill -f "python" -### Azure WAA Evaluation - ACR Auth Issue -**Status**: FIXED - setup_azure.py now configures ACR authentication automatically +# RIGHT (specific) +kill $(lsof -t -i :8765) 2>/dev/null +pkill -f "python.*-m openadapt_ml.cloud.local serve" -**Problem**: Azure ML compute instances cannot pull from ACR even after attaching ACR to workspace. +# Check before killing +pgrep -f "pattern" -l ``` -Failed to pull Docker image openadaptacr.azurecr.io/winarena:latest -``` - -**Root cause**: The workspace's managed identity needed AcrPull role on the ACR, which wasn't being granted automatically. - -**Solution implemented**: -1. Added `grant_acr_pull_role()` function to setup_azure.py that: - - Gets workspace managed identity principal ID - - Assigns AcrPull role on ACR to that identity -2. Added `sync_workspace_keys()` to refresh workspace credentials -3. Updated setup flow from 12 steps to 15 steps: - - Step 10: Attach ACR to workspace - - Step 11: Grant AcrPull role to workspace managed identity - - Step 12: Sync workspace keys - -**Related files**: -- `scripts/setup_azure.py` - Azure setup automation (includes ACR auth) -- `openadapt_ml/benchmarks/azure.py` - Azure orchestration -- `.env` - AZURE_DOCKER_IMAGE setting -**References**: -- [Azure ML Managed Identity ACR Authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-identity-based-service-authentication) -- [ACR Pull Role Assignment](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-authentication-managed-identity) - -### Azure WAA Evaluation - Dedicated VM Setup -**Status**: WORKING - Vanilla Microsoft WAA (Jan 2026) +--- -**IMPORTANT**: See `docs/WAA_APPROACH_REVIEW.md` for full documentation. +## File Access -**CRITICAL**: Uses vanilla Microsoft WAA (windowsarena/winarena). No custom Dockerfile. +Pre-approved read access to `~/oa/src/` (related projects like openadapt-capture). -**How it works**: -- Uses official `windowsarena/winarena:latest` Docker image from Microsoft -- Uses `VERSION=11e` env var to auto-download Windows 11 Enterprise Evaluation -- Container runs `entry.sh` which boots Windows and starts WAA server automatically -- First run: Downloads Windows + installs (~15-20 min) -- Subsequent runs: Boots from cached disk image (~2-3 min) +## Current Capture -**FULLY AUTOMATED - Via CLI**: +Path: `/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift` +Task: Turn off Night Shift in macOS System Settings -```bash -# 1. 
Setup Azure VM with Docker and pull vanilla WAA image (~10 min) -uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --setup-only +--- -# 2. Run benchmark -uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --num-tasks 20 +## TODO / Known Issues -# 3. Monitor (optional, for debugging) -uv run python -m openadapt_ml.benchmarks.cli vm monitor -# Opens browser to VNC at http://localhost:8006 +### Benchmark Viewer - Phase 4 +**Status**: TODO -# 4. Delete VM when done (IMPORTANT: stops billing!) -uv run python -m openadapt_ml.benchmarks.cli vm delete -y -``` +Add failure clustering and regression detection. Phases 1-3 done: +- Data collection with ExecutionTraceCollector +- Viewer generation with `view --run-name {name}` +- UI with summary, task list, step replay, playback controls -**Diagnostic commands**: -```bash -uv run python -m openadapt_ml.benchmarks.cli vm diag # Check disk, Docker, containers -uv run python -m openadapt_ml.benchmarks.cli vm status # Azure VM status -uv run python -m openadapt_ml.benchmarks.cli vm ssh # Interactive SSH -uv run python -m openadapt_ml.benchmarks.cli vm probe # Check WAA server readiness -uv run python -m openadapt_ml.benchmarks.cli vm logs # View container logs -``` +### Azure ML Experiment ID +**Status**: TODO -**Screenshot capture** (for PR documentation): -```bash -# List available screenshot targets -uv run python -m openadapt_ml.benchmarks.cli screenshot --list - -# Capture WAA-specific screenshots for PR -uv run python -m openadapt_ml.benchmarks.cli screenshot --waa --pr-mode - -# Capture specific targets -uv run python -m openadapt_ml.benchmarks.cli screenshot --target status --target probe --pr-mode - -# Available targets: -# status - Azure VM status -# probe - WAA probe endpoint status -# diag - VM diagnostic info -# vm-screen - Windows VM screen (via QEMU) -# vnc - VNC viewer (localhost:8006) -# terminal - VM monitor terminal output -# azure-ops - Azure ops dashboard -# training - Training dashboard -``` +Retrieve experiment_id dynamically instead of hardcoded UUID. -**Key requirements**: -1. **VM Size**: `Standard_D8ds_v5` recommended (8 vCPU, 32GB RAM, 300GB temp storage for nested virtualization) -2. **API key**: `config.json` with OPENAI_API_KEY (or set env var) -3. **Valid model**: Use real OpenAI model name (gpt-4o, gpt-4o-mini) +### Azure ML Port 80 Conflict +**Status**: INVESTIGATING -**Architecture**: +Azure ML compute instances have Microsoft infrastructure services on port 80. When vanilla WAA's dockur/windows container starts, nginx tries to bind to port 80 and fails: ``` -Azure VM (Standard_D8ds_v5, nested virt enabled, 300GB /mnt) - โ””โ”€โ”€ Docker (data on /mnt) - โ””โ”€โ”€ windowsarena/winarena:latest (official Microsoft image) - โ””โ”€โ”€ QEMU running Windows 11 (IP: 172.30.0.2) - โ””โ”€โ”€ WAA Flask server on port 5000 - โ””โ”€โ”€ Navi agent executing tasks +nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use) ``` -**How vanilla WAA works**: -1. Uses `windowsarena/winarena:latest` from Docker Hub -2. `VERSION=11e` triggers auto-download of Windows 11 Enterprise Evaluation -3. `entry.sh` handles Windows boot and server startup -4. 
No custom patching or Dockerfile required - -**Monitor progress**: -- VNC: `http://localhost:8006` (via SSH tunnel, auto-managed by dashboard) -- Logs: `uv run python -m openadapt_ml.benchmarks.cli vm logs` - -**Files**: -- `docs/WAA_APPROACH_REVIEW.md` - Full analysis (updated Jan 2026) -- `vendor/WindowsAgentArena/` - Official WAA scripts (run-local.sh, etc.) -- `openadapt_ml/benchmarks/cli.py` - CLI commands - -### Docker Disk Space Management -**Status**: FIXED - Automatic cleanup (Jan 2026) - -**Problem**: Docker build cache on /mnt was growing to 90+ GB during builds, exhausting disk space and causing builds to fail with "no space left on device". Note: With Standard_D8ds_v5, /mnt is now 300GB which should be sufficient. - -**Root cause**: Docker's build cache and containerd snapshotter accumulate data that isn't cleaned by `docker system prune`: -- `/mnt/docker/buildkit/containerd-overlayfs` - BuildKit layer cache -- `/mnt/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots` - Containerd snapshots -- These can grow to 30-40 GB each, even with no images present +**Key insight**: Port 80 is just nginx redirecting to noVNC on port 8006. **NOT essential for WAA**. +- Port 5000: WAA Flask API (benchmark execution) - ESSENTIAL +- Port 8006: noVNC (browser VNC) - ESSENTIAL +- Port 80: nginx redirect - NOT ESSENTIAL -**Solution implemented** (3 parts): +**What we're testing**: +1. `WEB=N` env var to disable nginx entirely +2. SSH tunnel to access ports 8006 and 5000 for debugging +3. Enhanced diagnostics in run_entry.py to verify Windows boots despite nginx failure -1. **Automatic pre-build cleanup**: Before Docker builds, the CLI now runs `docker builder prune -af` and checks available disk space, warning if < 50GB. +**SSH key support added**: Compute instances now use your local SSH key (~/.ssh/id_rsa) for direct SSH access. -2. **Automatic post-build cleanup**: After successful builds, the CLI cleans build cache and dangling images to prevent accumulation. +See `docs/AZURE_ML_PORT_80_FIX.md` for full analysis and options. -3. **BuildKit garbage collection**: New VMs are configured with `/etc/buildkit/buildkitd.toml` that limits cache to 30GB max. +### Azure ML CLI Commands -4. **Enhanced docker-prune command**: Now includes "deep cleanup" that stops Docker/containerd and removes orphaned snapshots that normal prune misses. - -**Usage**: ```bash -# Quick cleanup (standard prune + deep cleanup + configure GC) -uv run python -m openadapt_ml.benchmarks.cli vm docker-prune - -# For severe disk issues, delete VM and recreate (comes with GC pre-configured) -uv run python -m openadapt_ml.benchmarks.cli vm delete -y -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ``` - -**Files changed**: -- `openadapt_ml/benchmarks/cli.py` - Pre/post build cleanup, enhanced docker-prune -- New VMs get BuildKit GC config during setup - -### Windows "Select Operating System" Prompt Fix -**Status**: N/A with vanilla WAA (Jan 2026) - -**Note**: This issue was specific to the custom waa-auto Dockerfile approach which has been deprecated. - -With vanilla WAA (`windowsarena/winarena:latest`), using `VERSION=11e` automatically selects Windows 11 Enterprise Evaluation which has proper autounattend.xml handling. - -**If you still see the prompt**: -1. Delete cached storage: `uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'rm -rf /mnt/waa-storage/*'` -2. 
Re-run setup: `uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --fresh` - -### SSH Tunnel Management (VNC/WAA Access) -**Status**: DONE +# Status and monitoring +azure-ml-status # Show compute instances and recent jobs +azure-ml-logs --job NAME # Stream logs from running job +azure-ml-monitor # Interactive monitor with VNC tunnel -**Problem**: Azure VMs have Network Security Groups (NSGs) that only expose port 22 (SSH) by default. Ports 8006 (VNC) and 5000 (WAA) are not accessible directly. +# Run benchmarks +run-azure-ml-auto --workers N # Fully automated workflow -**Solution**: Automatic SSH tunnel management via `SSHTunnelManager`: +# Cleanup (IMPORTANT - stop billing!) +azure-ml-cancel # Cancel running job (or --job NAME) +azure-ml-delete-compute # Delete compute instance (--name NAME or --all) +azure-ml-cleanup --yes # Cancel all jobs + delete all instances -``` -Browser โ†’ localhost:8006 โ†’ SSH Tunnel โ†’ Azure VM:8006 โ†’ Docker โ†’ noVNC -Browser โ†’ localhost:5001 โ†’ SSH Tunnel โ†’ Azure VM:5000 โ†’ WAA Flask -``` - -**Architecture**: -1. When VM's WAA probe becomes "ready", tunnels auto-start -2. When VM goes offline, tunnels auto-stop -3. Dashboard shows tunnel status next to VNC button -4. VNC button links to localhost:port (tunnel endpoint) - -**Files**: -- `openadapt_ml/cloud/ssh_tunnel.py` - SSHTunnelManager class -- `openadapt_ml/cloud/local.py` - Integration with dashboard server -- `openadapt_ml/training/benchmark_viewer.py` - UI showing tunnel status - -**API Endpoints**: -- `GET /api/tunnels` - Returns tunnel status for VNC and WAA -- `GET /api/vms` - Includes `tunnels` field with per-tunnel status - -**Key features**: -- Auto-start on VM online (idempotent - safe to call repeatedly) -- Auto-stop on VM offline -- Port conflict detection -- Graceful shutdown on process exit -- No manual SSH commands needed - -**Manual usage** (if needed): -```python -from openadapt_ml.cloud.ssh_tunnel import get_tunnel_manager - -manager = get_tunnel_manager() -manager.start_tunnels_for_vm("172.171.112.41", "azureuser") -status = manager.get_tunnel_status() -manager.stop_all_tunnels() -``` - -**Why not open NSG ports?** -1. VNC has no authentication by default - anyone can connect -2. SSH tunnel encrypts all traffic -3. Requires SSH key auth - no password guessing -4. No Azure NSG changes needed - -**Alternative: Mock evaluation** for testing without Windows: -```bash -uv run python -m openadapt_ml.benchmarks.cli test-mock --tasks 20 +# Resource management +resources # Show all Azure resources and costs ``` -**References**: -- [Windows Agent Arena GitHub](https://github.com/microsoft/WindowsAgentArena) -- [Azure nested virtualization](https://learn.microsoft.com/en-us/azure/virtual-machines/acu) - -### Training Dashboard - Terminal Output Streaming -**Status**: DONE - -**Goal**: Show training command line output in the browser dashboard in real-time. - -**Implementation**: File-based polling approach -1. Training writes stdout to `training_output/training.log` with timestamps -2. Browser polls training.log every 2 seconds alongside training_log.json -3. Displays last 500 lines in scrollable terminal panel with auto-scroll -4. 
Terminal panel features: - - Dark terminal theme (black background, green/colored text) - - Auto-scroll toggle (on by default) - - Text wrap toggle - - Collapse/expand button - - Line counter - - Syntax highlighting (errors in red, warnings in orange, success in green) - -**Files changed**: -- `openadapt_ml/training/trainer.py`: - - Added terminal panel CSS styles - - Added terminal panel HTML section - - Added JavaScript polling function `fetchTerminalOutput()` - - Added `TrainingLogger._log_to_terminal()` method - - Updated `train_supervised()` to log key messages to training.log -- `openadapt_ml/training/stub_provider.py`: - - Added `_log()` method for dual stdout/file logging - - All training output now written to training.log -- `openadapt_ml/cloud/local.py`: - - No changes needed - serve command already serves all files from training_output - -**Usage**: Terminal output automatically appears in dashboard during training. Works with both stub and real training. - -### Early Termination Controls -**Status**: DONE - -**Problem**: Training runs until completion even when loss is low enough. Wastes GPU credits ($0.75/hr for A10). - -**Solution implemented**: -1. **Auto-termination**: `early_stop_loss` and `early_stop_patience` in stub_provider.py -2. **Dashboard button**: "Stop Training" button calls `/api/stop` endpoint -3. **Stop signal**: Creates `STOP_TRAINING` file that training loop checks -4. **Termination status**: Dashboard shows termination reason (auto_complete, auto_low_loss, user_stop) - -**Files changed**: -- `openadapt_ml/cloud/local.py` - Added `/api/stop` POST endpoint -- `openadapt_ml/training/stub_provider.py` - Added early stop logic, termination status -- `openadapt_ml/training/trainer.py` - Added `updateTerminationStatus()` JS function - -### Cloud Cost Estimation in Viewers -**Status**: DONE - -Added cost display panel to viewer that shows: -- Running cost based on instance type and elapsed time -- Instance type and hourly rate -- Only visible for cloud training (hidden for local/stub) - -Supported rates: -- Lambda Labs: $0.75/hr for A10, $1.29/hr for A100 -- Automatic detection from `instance_type` in training_log.json - -### Current Working Capture -**Path**: `/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift` -**Task**: Turn off Night Shift in macOS System Settings -**Screenshots**: 20 frames -**Notes**: Real-world macOS settings navigation capture for training/evaluation - -### Evaluation Samples Display Enhancement -**Status**: DONE - -Enhanced evaluation gallery in dashboard with: -- **Filter controls**: Dropdown filters for epoch and correctness (All/Correct/Incorrect) -- **Visual markers**: H (human) and AI (predicted) click markers on screenshots -- **Expandable model output**: "Show full output" toggle for raw model reasoning -- **Better layout**: Image container with overlay, content section with coordinates -- **Sample count**: "Showing X of Y samples" with filter status - -Files changed: -- `openadapt_ml/training/trainer.py` - Enhanced CSS, HTML, and JS for eval gallery - -### Viewer Playback Controls -**Status**: DONE - -Added full playback controls to the viewer: -- **Buttons**: โฎ Rewind, โ—€ Prev, โ–ถ Play/Pause, โ–ถ Next, โญ End -- **Speed control**: 0.5x, 1x, 2x, 4x playback speeds -- **Progress bar**: Click-to-seek to any step -- **Keyboard shortcuts**: Space (play/pause), Home/End (jump), Arrow keys (step) -- **Enhanced details panel**: Shows full model output with scrollable raw prediction data - -### Viewer Code Consolidation 
-**Status**: DONE - -**Problem**: Viewer code was fragmented across multiple locations: -1. `generate_training_dashboard()` - generates unified viewer template -2. `_enhance_comparison_to_unified_viewer()` - injected checkpoint_script into comparison.html -3. `comparison.html` from capture - had its own display logic - -**Solution implemented**: -- `generate_unified_viewer_from_output_dir()` now always uses `_generate_unified_viewer_from_extracted_data()` -- This generates a complete standalone viewer.html without script injection -- `_enhance_comparison_to_unified_viewer()` marked as deprecated -- All viewer display logic is now in one place (`_generate_unified_viewer_from_extracted_data`) -- Changes to viewer code now propagate reliably - -### README API Documentation -**Status**: VERIFIED - -The README ยง7.1 API-backed adapters section uses correct model names: -- "Claude Sonnet 4.5" โ†’ `claude-sonnet-4-5-20250929` in api_adapter.py โœ“ -- "GPT-5.1" โ†’ `gpt-5.1` in api_adapter.py โœ“ - -Verified: -- API key environment variable names: ANTHROPIC_API_KEY, OPENAI_API_KEY โœ“ -- Backend flag options: `claude`, `openai` in CLI โœ“ - -### Benchmark Viewer Integration -**Status**: Phases 1-3 DONE, Phase 4 TODO - -**Goal**: Integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer. - -**Design doc**: `docs/benchmark_viewer_integration.md` - -**Key features**: -1. **Benchmarks tab**: Third tab alongside Training and Viewer -2. **Task-level view**: List of benchmark tasks with pass/fail status -3. **Step-by-step replay**: Same UI as Viewer tab for benchmark executions -4. **Model comparison**: Side-by-side comparison of different models on same task (TODO) -5. **Aggregate metrics**: Success rate by domain, difficulty rankings - -**Implementation phases**: -1. โœ… **Data collection** (DONE): Save screenshots during benchmark runs - - Created `openadapt_ml/benchmarks/data_collection.py` with `ExecutionTraceCollector` - - Updated `runner.py` to save execution traces automatically - - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5` - - Directory structure: `benchmark_results/{run_name}/tasks/{task_id}/` - - Each task has: `task.json`, `execution.json`, `screenshots/` - - Test script: `test_data_collection.py` validates all files are created -2. โœ… **Viewer backend** (DONE): `generate_benchmark_viewer()` function - - Created `openadapt_ml/benchmarks/viewer.py` with viewer generation - - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli view --run-name {name}` - - Generates standalone HTML with same styling as training viewer - - Uses shared header components via `shared_ui.py` -3. โœ… **UI components** (DONE - Basic): Summary dashboard, task list, replay - - Summary panel with total tasks, passed/failed, success rate - - Domain breakdown with per-domain statistics - - Filter controls (domain, status) - - Task list with status badges - - Step-by-step viewer with screenshots, actions, reasoning - - Playback controls (prev/next, play/pause, speed) - - Keyboard shortcuts (Space, arrows, Home/End) -4. 
**Analysis** (TODO): Failure clustering, regression detection - -**View benchmark results:** -```bash -# Generate HTML viewer and serve it -uv run python -m openadapt_ml.benchmarks.cli view --run-name {name} - -# Options: -# --embed-screenshots Embed screenshots as base64 (standalone HTML) -# --no-open Don't auto-open browser -# --port 9000 Use custom port -``` - -## Preventing Stale Data Issues - -**CRITICAL**: When working on dashboard/viewer code, follow this process to avoid showing stale data: - -### After Code Changes - -1. **Always regenerate HTML files** after modifying trainer.py, viewer.py, or local.py: - ```bash - uv run python -m openadapt_ml.cloud.local viewer - ``` - -2. **Verify regeneration worked** by checking key values: - ```bash - # Check elapsed time was updated (should NOT be 0) - grep "baseElapsedTime" training_output/current/dashboard.html - - # Check comparison data exists in viewer - grep "predictionsByCheckpoint" training_output/current/viewer.html - ``` - -3. **Hard refresh browser** to bypass cache: - - macOS: `Cmd+Shift+R` - - Windows/Linux: `Ctrl+Shift+R` - - Or use DevTools โ†’ Network โ†’ "Disable cache" checkbox - -4. **Use HTTP serving** (not file://) for auto-refresh: - ```bash - uv run python -m openadapt_ml.cloud.local serve --port 8080 --open - ``` - -### Before Showing User - -Before presenting dashboard/viewer to user, verify: -- [ ] Elapsed time shows correct value (not 0m 0s) -- [ ] Comparison screenshots load (not blank/404) -- [ ] Model predictions appear in dropdown -- [ ] Loss curve shows data -- [ ] Timestamp info panel shows recent dates - -### Automatic Data Loading Checklist - -The viewer should automatically load: -- [ ] Capture data from `comparison_epoch*.html` files (extracts `window.comparisonData`) -- [ ] Predictions from same comparison HTML files (human + predicted actions per step) -- [ ] Evaluations from `training_log.json` (if present) -- [ ] Recording events from capture data (note: `recording.end` depends on capture source) - -### Common Issues +--- -| Symptom | Cause | Fix | -|---------|-------|-----| -| Elapsed time shows 0m 0s | `elapsed_time` not loaded from training_log.json | Check `state.elapsed_time = data.get("elapsed_time", 0.0)` in local.py | -| No comparison screenshots | Paths point to Lambda not local | Update `capture_path` in training_log.json to local path | -| Missing model predictions | No `comparison_epoch*.html` files or wrong data format | Run compare script: `uv run python -m openadapt_ml.scripts.compare --capture ... --checkpoint ...` | -| Predictions not extracted | HTML uses `window.comparisonData` but regex expects `const` | Use regex `(?:const\s+\|window\.)comparisonData` pattern | -| Stale data after code change | Browser caching HTML | Hard refresh (Cmd+Shift+R) or disable cache | -| Screenshots 404 | Screenshot symlink broken | Recreate: `ln -sf /path/to/capture/screenshots training_output/current/screenshots` | +## Troubleshooting -### UI/Display Guidelines +### Dashboard/Viewer Stale Data +After code changes: +1. Regenerate: `uv run python -m openadapt_ml.cloud.local viewer` +2. Hard-refresh browser: Cmd+Shift+R -**Placeholder data must be clearly marked** when displaying values that may not reflect actual data: -- If task counts, worker counts, etc. 
come from local tracking (not synced with Azure), mark them with an asterisk: "3* tasks โ€ข 1* worker(s)" -- Add a footnote: "[*: placeholder, actual values may differ]" -- This applies to any data that is locally cached but not confirmed from the authoritative source +### WAA Connection Issues +1. Is VM running? `vm status` +2. Are tunnels active? `vm monitor` +3. Check container: `vm diag` -### Azure ML Integration Notes +### Windows Not Booting +1. Check VNC via `vm monitor` +2. Check logs: `vm logs` -**Experiment ID**: The Azure ML experiments page URL requires an experiment ID which is workspace-specific: -- Current hardcoded ID: `ad29082c-0607-4fda-8cc7-38944eb5a518` -- **TODO**: Retrieve experiment_id dynamically from Azure using `az ml experiment list` -- The experiment name is `openadapt-ml` but the URL requires the UUID format +### Common Issues Table -**Azure ML URL format**: -- Jobs list: `https://ml.azure.com/experiments/id/{experiment_id}?wsid={workspace_id}` -- Specific job: `https://ml.azure.com/experiments/id/{experiment_id}/runs/{run_id}?wsid={workspace_id}` +| Symptom | Fix | +|---------|-----| +| Connection refused localhost:5001 | Run `vm monitor` to start tunnels | +| Windows not booting | Check VNC, check `vm logs` | +| Elapsed time shows 0 | Check training_log.json has elapsed_time | +| No comparison screenshots | Update capture_path in training_log.json | +| Stale data after code change | Hard refresh (Cmd+Shift+R) | -**WAA Docker command**: Use `python run.py` not `python -m client.run` (the client directory is not a Python package) +See `docs/` for detailed troubleshooting guides. diff --git a/README.md b/README.md index dd7b38c..9b3af6a 100644 --- a/README.md +++ b/README.md @@ -813,48 +813,31 @@ uv run python -m openadapt_ml.cloud.local serve --port 8080 --open *View benchmark evaluation results with task-level filtering, success/failure status, and run comparison. 
Shows Claude achieving 30% on mock evaluation tasks (simulated environment for testing the pipeline - real WAA evaluation requires Windows VMs).* -### 13.4 VM Monitoring Dashboard +### 13.4 VM Pool Monitoring -For managing Azure VMs used in benchmark evaluations, the `vm monitor` command provides a comprehensive dashboard: +For managing Azure VMs used in benchmark evaluations: ```bash -# Start VM monitoring dashboard (auto-opens browser) -uv run python -m openadapt_ml.benchmarks.cli vm monitor - -# Show detailed information (evaluation history, daily/weekly costs) -uv run python -m openadapt_ml.benchmarks.cli vm monitor --details -``` - -**VM Monitor Dashboard (Full View):** - -![VM Monitor Dashboard](docs/screenshots/vm_monitor_dashboard_full.png) - -*The VM monitor dashboard shows: (1) VM status (name, IP, size, state), (2) Current activity (idle/benchmark running), (3) Cost tracking (uptime, hourly rate, total cost), (4) Recent Azure ML jobs from last 7 days, and (6) Dashboard & access URLs.* - -**VM Monitor Dashboard (With --details Flag):** +# Check pool status (VM state, IPs, WAA readiness) +uv run python -m openadapt_ml.benchmarks.cli pool-status -![VM Monitor Dashboard Details](docs/screenshots/vm_monitor_details.png) +# Open VNC to view Windows desktops (via SSH tunnels) +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -*The --details flag adds: (5) Evaluation history with success rates and agent types, plus extended cost information (daily/weekly projections).* +# Stream logs from all workers +uv run python -m openadapt_ml.benchmarks.cli pool-logs +``` **Features:** - **Real-time VM status** - Shows VM size, power state, and IP address -- **Activity detection** - Identifies if VM is idle, running benchmarks, or in setup -- **Cost tracking** - Displays uptime hours, hourly rate, and total cost for current session -- **Azure ML jobs** - Lists recent jobs from last 7 days with status indicators -- **Evaluation history** - Shows past benchmark runs with success rates (with --details flag) -- **Dashboard & tunnels** - Auto-starts web dashboard and SSH/VNC tunnels for accessing Windows VM +- **WAA readiness** - Shows if WAA server is ready on each worker +- **VNC access** - Opens SSH tunnels to view Windows desktops +- **Log streaming** - Interleaved logs from all pool workers -**Mock mode for testing:** +**Cleanup (important to stop billing):** ```bash -# Generate screenshots or test dashboard without a VM running -uv run python -m openadapt_ml.benchmarks.cli vm monitor --mock -``` - -**Auto-shutdown option:** -```bash -# Automatically deallocate VM after 2 hours to prevent runaway costs -uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2 +# Delete all pool VMs and resources +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` ### 13.5 Benchmark Execution Logs @@ -1017,20 +1000,24 @@ Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. ### 14.2 Single VM Workflow -For quick testing or small runs: +For quick testing or small runs (use pool-create with --workers 1): ```bash -# Setup VM with WAA -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa +# 1. Create single-VM pool +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 1 -# Start monitoring dashboard (auto-opens VNC, manages SSH tunnels) -uv run python -m openadapt_ml.benchmarks.cli vm monitor +# 2. Wait for WAA ready +uv run python -m openadapt_ml.benchmarks.cli pool-wait + +# 3. 
Run benchmark (e.g., 3 tasks for quick test) +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 3 -# Run benchmark -uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10 +# 4. Check status / VNC +uv run python -m openadapt_ml.benchmarks.cli pool-status +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -# Deallocate when done (stops billing) -uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y +# 5. Cleanup (stop billing) +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` ### 14.3 Parallel Pool Workflow (Recommended) @@ -1102,8 +1089,7 @@ Azure (N VMs, Standard_D8ds_v5) **Tips:** - Always run `pool-cleanup` when done to delete VMs and stop billing -- Use `vm deallocate` (not delete) to pause billing but keep disk -- Set `--auto-shutdown-hours 2` on `vm monitor` for safety +- Use `deallocate` (not `delete`) to pause billing but keep disk for single VM - Prices vary by Azure region --- diff --git a/openadapt_ml/benchmarks/cli.py b/openadapt_ml/benchmarks/cli.py index b6504ad..010e2cc 100644 --- a/openadapt_ml/benchmarks/cli.py +++ b/openadapt_ml/benchmarks/cli.py @@ -80,6 +80,10 @@ "LogLevel=ERROR", "-o", "ConnectTimeout=10", + "-o", + "ServerAliveInterval=60", # Send keepalive every 60s to prevent timeout + "-o", + "ServerAliveCountMax=10", # Allow 10 missed keepalives (~10 min) before disconnect ] @@ -329,6 +333,101 @@ def wait_for_ssh(ip: str, timeout: int = 120) -> bool: return False +def set_vm_auto_shutdown( + vm_name: str, + resource_group: str = RESOURCE_GROUP, + shutdown_hours: int = 4, +) -> bool: + """Set Azure auto-shutdown policy on a VM. + + This is a safety net to prevent orphaned VMs from running indefinitely. + The VM will be automatically deallocated after the specified hours. + + Args: + vm_name: Name of the VM + resource_group: Azure resource group + shutdown_hours: Hours from now when VM should auto-shutdown (default 4) + + Returns: + True if auto-shutdown was set successfully + """ + # Calculate shutdown time (hours from now) + from datetime import timedelta + + shutdown_time = datetime.utcnow() + timedelta(hours=shutdown_hours) + # Format: HH:MM in UTC + shutdown_time_str = shutdown_time.strftime("%H:%M") + + result = subprocess.run( + [ + "az", + "vm", + "auto-shutdown", + "-g", + resource_group, + "-n", + vm_name, + "--time", + shutdown_time_str, + ], + capture_output=True, + text=True, + ) + + return result.returncode == 0 + + +def delete_test_vm_resources(test_name: str, resource_group: str = RESOURCE_GROUP): + """Delete a test VM and its associated resources. + + Used for cleanup after quota checking or failed operations. 
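+
+    Example (illustrative; cmd_pool_create generates names like
+    "waa-pool-test-{int(time.time())}", so the value below is hypothetical):
+
+        delete_test_vm_resources("waa-pool-test-1738000000")
+
+    Deletion is best-effort: the VM, the "{test_name}VMNic" NIC, and the
+    "{test_name}PublicIP" public IP are each removed via az, and any az
+    errors are captured rather than raised.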
+ """ + # Delete VM + subprocess.run( + [ + "az", + "vm", + "delete", + "-g", + resource_group, + "-n", + test_name, + "--yes", + "--force-deletion", + "true", + ], + capture_output=True, + ) + # Delete NIC + subprocess.run( + [ + "az", + "network", + "nic", + "delete", + "-g", + resource_group, + "-n", + f"{test_name}VMNic", + ], + capture_output=True, + ) + # Delete public IP + subprocess.run( + [ + "az", + "network", + "public-ip", + "delete", + "-g", + resource_group, + "-n", + f"{test_name}PublicIP", + ], + capture_output=True, + ) + + # ============================================================================= # Commands # ============================================================================= @@ -420,6 +519,15 @@ def cmd_create(args): f"Successfully created {successful_size} (${successful_cost:.2f}/hr) in {region}", ) + # Set auto-shutdown as safety net (prevents orphaned VMs) + auto_shutdown_hours = getattr(args, "auto_shutdown_hours", 4) + if auto_shutdown_hours > 0: + log("CREATE", f"Setting auto-shutdown in {auto_shutdown_hours} hours...") + if set_vm_auto_shutdown(VM_NAME, RESOURCE_GROUP, auto_shutdown_hours): + log("CREATE", "Auto-shutdown configured") + else: + log("CREATE", "Warning: Failed to set auto-shutdown (VM will stay running)") + # Wait for SSH log("CREATE", "Waiting for SSH...") if not wait_for_ssh(ip): @@ -789,88 +897,58 @@ def cmd_pool_create(args): working_size = None working_region = None working_cost = None + test_vm_to_cleanup = None # Track test VM for cleanup log("POOL", "Finding available region and VM size...") - for vm_size, cost in sizes_to_try: - for region in VM_REGIONS: - # Quick check if this size/region combo works - test_name = f"waa-pool-test-{int(time.time())}" - result = subprocess.run( - [ - "az", - "vm", - "create", - "--resource-group", - RESOURCE_GROUP, - "--name", - test_name, - "--location", - region, - "--image", - "Ubuntu2204", - "--size", - vm_size, - "--admin-username", - "azureuser", - "--generate-ssh-keys", - "--public-ip-sku", - "Standard", - "--no-wait", # Don't wait for completion - ], - capture_output=True, - text=True, - ) - if result.returncode == 0: - working_size = vm_size - working_region = region - working_cost = cost - # Delete the test VM and wait for completion - log("POOL", " Found working combo, cleaning up test VM...") - subprocess.run( + try: + for vm_size, cost in sizes_to_try: + for region in VM_REGIONS: + # Quick check if this size/region combo works + test_name = f"waa-pool-test-{int(time.time())}" + test_vm_to_cleanup = test_name # Track for cleanup + result = subprocess.run( [ "az", "vm", - "delete", - "-g", + "create", + "--resource-group", RESOURCE_GROUP, - "-n", + "--name", test_name, - "--yes", - "--force-deletion", - "true", - ], - capture_output=True, - ) - # Also clean up associated resources - subprocess.run( - [ - "az", - "network", - "nic", - "delete", - "-g", - RESOURCE_GROUP, - "-n", - f"{test_name}VMNic", - ], - capture_output=True, - ) - subprocess.run( - [ - "az", - "network", - "public-ip", - "delete", - "-g", - RESOURCE_GROUP, - "-n", - f"{test_name}PublicIP", + "--location", + region, + "--image", + "Ubuntu2204", + "--size", + vm_size, + "--admin-username", + "azureuser", + "--generate-ssh-keys", + "--public-ip-sku", + "Standard", + "--no-wait", # Don't wait for completion ], capture_output=True, + text=True, ) + if result.returncode == 0: + working_size = vm_size + working_region = region + working_cost = cost + # Delete the test VM and wait for completion + log("POOL", " Found 
working combo, cleaning up test VM...") + delete_test_vm_resources(test_name, RESOURCE_GROUP) + test_vm_to_cleanup = None # Cleanup done + break + else: + test_vm_to_cleanup = None # Creation failed, nothing to cleanup + if working_size: break - if working_size: - break + finally: + # Ensure test VM is cleaned up even if an exception occurred + if test_vm_to_cleanup: + log("POOL", f"Cleaning up test VM {test_vm_to_cleanup}...") + delete_test_vm_resources(test_vm_to_cleanup, RESOURCE_GROUP) if not working_size: log("POOL", "ERROR: No available VM size/region found") @@ -882,6 +960,11 @@ def cmd_pool_create(args): log("POOL", f"Using {working_size} (${working_cost:.2f}/hr) in {working_region}") + # Get auto-shutdown hours (default 4 hours as safety net) + auto_shutdown_hours = getattr(args, "auto_shutdown_hours", 4) + if auto_shutdown_hours > 0: + log("POOL", f"VMs will auto-shutdown in {auto_shutdown_hours} hours") + def create_worker(worker_idx: int) -> tuple[str, str | None, str | None]: """Create a single worker VM. Returns (name, ip, error).""" name = f"waa-pool-{worker_idx:02d}" @@ -967,6 +1050,8 @@ def create_worker(worker_idx: int) -> tuple[str, str | None, str | None]: try: vm_info = json.loads(result.stdout) ip = vm_info.get("publicIpAddress", "") + # Set auto-shutdown as safety net (prevents orphaned VMs) + set_vm_auto_shutdown(name, RESOURCE_GROUP, auto_shutdown_hours) return (name, ip, None) except json.JSONDecodeError: return (name, None, "Failed to parse VM creation output") @@ -8138,6 +8223,60 @@ def cmd_azure_ml_teardown(args): return 0 +def cmd_view_pool(args): + """Generate HTML viewer for WAA pool benchmark results. + + Parses log files from pool_run_* directories and generates an interactive + HTML viewer with summary stats, per-worker breakdown, and task list. 
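+
+    Typical invocation (the run name is an example; see the view-pool
+    subparser registered in main() for all flags):
+
+        uv run python -m openadapt_ml.benchmarks.cli view-pool
+        uv run python -m openadapt_ml.benchmarks.cli view-pool --run-name pool_run_20260204 --no-open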
+ """ + import webbrowser + + from openadapt_ml.benchmarks.pool_viewer import generate_pool_results_viewer + + results_dir = Path(args.results_dir) if args.results_dir else Path("benchmark_results") + + # Find pool run directory + if args.run_name: + pool_dir = results_dir / args.run_name + if not pool_dir.exists(): + # Try with pool_run_ prefix + pool_dir = results_dir / f"pool_run_{args.run_name}" + else: + # Find most recent pool_run_* directory + pool_dirs = sorted(results_dir.glob("pool_run_*"), reverse=True) + if not pool_dirs: + print("No pool_run_* directories found in benchmark_results/") + print("Run 'pool-run' to generate benchmark results") + return 1 + pool_dir = pool_dirs[0] + + if not pool_dir.exists(): + print(f"Directory not found: {pool_dir}") + return 1 + + # Check for log files + log_files = list(pool_dir.glob("waa-pool-*.log")) + if not log_files: + print(f"No waa-pool-*.log files found in {pool_dir}") + return 1 + + print(f"Generating viewer for: {pool_dir}") + print(f"Found {len(log_files)} log file(s)") + + # Generate viewer + output_path = pool_dir / "results.html" + generate_pool_results_viewer(pool_dir, output_path) + + print(f"Generated: {output_path}") + + # Open in browser + if not args.no_open: + print("Opening in browser...") + webbrowser.open(f"file://{output_path.absolute()}") + + return 0 + + def cmd_tail_output(args): """List or tail background task output files.""" task_dir = Path("/private/tmp/claude-501/-Users-abrichr-oa-src-openadapt-ml/tasks/") @@ -8312,6 +8451,12 @@ def main(): default=1, help="Number of worker VMs to create for parallel evaluation (default: 1)", ) + p_create.add_argument( + "--auto-shutdown-hours", + type=int, + default=4, + help="Auto-shutdown VM after N hours (0 to disable, default: 4)", + ) p_create.set_defaults(func=cmd_create) # delete @@ -8358,6 +8503,12 @@ def main(): p_pool_create.add_argument( "--standard", action="store_true", help="Use D4 (4 vCPU) VMs to save costs" ) + p_pool_create.add_argument( + "--auto-shutdown-hours", + type=int, + default=4, + help="Auto-shutdown VMs after N hours (0 to disable, default: 4)", + ) p_pool_create.set_defaults(func=cmd_pool_create) # pool-wait @@ -9270,6 +9421,42 @@ def main(): ) p_resources.set_defaults(func=cmd_resources) + # view-pool - Generate HTML viewer for pool benchmark results + p_view_pool = subparsers.add_parser( + "view-pool", + help="Generate HTML viewer for WAA pool benchmark results", + description=""" +Generate an interactive HTML viewer for WAA pool benchmark results. 
+ +Parses log files from pool_run_* directories to extract task results and +generates a standalone HTML viewer with: + - Summary stats (total tasks, success rate, avg time per task) + - Per-worker breakdown + - Task list with pass/fail status + - Domain breakdown (success rate per domain) + - Filters for domain and status + +Examples: + view-pool # View most recent pool_run_* results + view-pool --run-name pool_run_20260204 # View specific run + view-pool --no-open # Generate HTML without opening browser +""", + ) + p_view_pool.add_argument( + "--run-name", + help="Name of pool run directory (e.g., pool_run_20260204)", + ) + p_view_pool.add_argument( + "--results-dir", + help="Base results directory (default: benchmark_results/)", + ) + p_view_pool.add_argument( + "--no-open", + action="store_true", + help="Don't auto-open browser", + ) + p_view_pool.set_defaults(func=cmd_view_pool) + args = parser.parse_args() sys.exit(args.func(args)) diff --git a/openadapt_ml/benchmarks/pool_viewer.py b/openadapt_ml/benchmarks/pool_viewer.py new file mode 100644 index 0000000..aa3eeb9 --- /dev/null +++ b/openadapt_ml/benchmarks/pool_viewer.py @@ -0,0 +1,685 @@ +"""WAA Pool Results Viewer - HTML viewer for parallel benchmark runs. + +Parses log files from pool_run_* directories to extract task results and +generates a standalone HTML viewer with summary stats, per-worker breakdown, +and domain analysis. + +Usage: + from openadapt_ml.benchmarks.pool_viewer import generate_pool_results_viewer + + generate_pool_results_viewer( + pool_dir=Path("benchmark_results/pool_run_20260204"), + output_path=Path("benchmark_results/pool_run_20260204/results.html"), + ) +""" + +from __future__ import annotations + +import json +import re +from datetime import datetime +from pathlib import Path +from typing import Any + + +def parse_pool_logs(pool_dir: Path) -> dict[str, Any]: + """Parse WAA pool log files to extract task results. + + Args: + pool_dir: Directory containing waa-pool-*.log files + + Returns: + Dictionary with: + - tasks: List of task results + - workers: Per-worker stats + - metadata: Run metadata (timestamps, model, etc.) 
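+
+        Shape sketch (keys mirror the parsing code below; the values shown
+        are illustrative, not real results):
+
+            {
+                "tasks": [{"task_id": "...", "domain": "chrome", "success": True,
+                           "result": 1.0, "steps": 12, "worker_id": "00"}],
+                "workers": {"00": {"tasks": 1, "successes": 1, "failures": 0}},
+                "metadata": {"run_name": "pool_run_20260204", "log_count": 1,
+                             "model": "gpt-4o", "num_workers": 1,
+                             "first_timestamp": "...", "last_timestamp": "..."},
+            }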
+ """ + log_files = sorted(pool_dir.glob("waa-pool-*.log")) + if not log_files: + return {"tasks": [], "workers": {}, "metadata": {}} + + tasks = [] + workers = {} + metadata = { + "run_name": pool_dir.name, + "log_count": len(log_files), + "first_timestamp": None, + "last_timestamp": None, + "model": None, + "num_workers": None, + } + + # Regex patterns + timestamp_re = re.compile(r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})') + domain_re = re.compile(r'\[Domain\]: (\S+)') + example_re = re.compile(r'\[Example ID\]: (\S+)') + instruction_re = re.compile(r'\[Instruction\]: (.+)') + finished_re = re.compile(r'Finished (\S+)/(\S+)') + result_re = re.compile(r'Result: ([0-9.]+)') + worker_re = re.compile(r'worker_id=(\d+)') + model_re = re.compile(r"model='([^']+)'") + num_workers_re = re.compile(r'num_workers=(\d+)') + step_re = re.compile(r'Step (\d+):') + + for log_file in log_files: + worker_id = log_file.stem.replace("waa-pool-", "") + workers[worker_id] = {"tasks": 0, "successes": 0, "failures": 0} + + current_task = None + last_result = None + + with open(log_file, "r", errors="ignore") as f: + for line in f: + # Strip ANSI codes + clean = re.sub(r'\x1b\[[0-9;]*m', '', line) + + # Extract timestamp + ts_match = timestamp_re.search(clean) + if ts_match: + ts_str = ts_match.group(1) + if metadata["first_timestamp"] is None: + metadata["first_timestamp"] = ts_str + metadata["last_timestamp"] = ts_str + + # Extract model name + if metadata["model"] is None: + model_match = model_re.search(clean) + if model_match: + metadata["model"] = model_match.group(1) + + # Extract num workers + if metadata["num_workers"] is None: + nw_match = num_workers_re.search(clean) + if nw_match: + metadata["num_workers"] = int(nw_match.group(1)) + + # Domain (comes before Example ID) + domain_match = domain_re.search(clean) + if domain_match: + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + current_task["domain"] = domain_match.group(1) + + # Example ID + example_match = example_re.search(clean) + if example_match: + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + current_task["task_id"] = example_match.group(1) + + # Instruction + instr_match = instruction_re.search(clean) + if instr_match and current_task: + current_task["instruction"] = instr_match.group(1) + + # Step count + step_match = step_re.search(clean) + if step_match and current_task: + step_num = int(step_match.group(1)) + if step_num > current_task.get("steps", 0): + current_task["steps"] = step_num + + # Result line + result_match = result_re.search(clean) + if result_match: + last_result = float(result_match.group(1)) + + # Finished line - finalize task + finished_match = finished_re.search(clean) + if finished_match: + domain = finished_match.group(1) + task_id = finished_match.group(2) + + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + + current_task["domain"] = domain + current_task["task_id"] = task_id + current_task["result"] = last_result if last_result is not None else 0.0 + current_task["success"] = last_result is not None and last_result > 0 + current_task["timestamp"] = metadata["last_timestamp"] + + # Update worker stats + workers[worker_id]["tasks"] += 1 + if current_task["success"]: + workers[worker_id]["successes"] += 1 + else: + workers[worker_id]["failures"] += 1 + + tasks.append(current_task) + current_task = None + last_result = None + + return { + "tasks": tasks, + "workers": workers, + "metadata": metadata, + } + + +def 
get_domain_stats(tasks: list[dict]) -> dict[str, dict[str, int]]: + """Calculate per-domain statistics.""" + domain_stats = {} + + for task in tasks: + domain = task.get("domain", "unknown") + if domain not in domain_stats: + domain_stats[domain] = {"total": 0, "success": 0, "fail": 0} + + domain_stats[domain]["total"] += 1 + if task.get("success"): + domain_stats[domain]["success"] += 1 + else: + domain_stats[domain]["fail"] += 1 + + return domain_stats + + +def generate_pool_results_viewer( + pool_dir: Path, + output_path: Path | None = None, +) -> Path: + """Generate HTML viewer for WAA pool benchmark results. + + Args: + pool_dir: Directory containing waa-pool-*.log files + output_path: Output HTML path. Defaults to pool_dir/results.html + + Returns: + Path to generated HTML file. + """ + pool_dir = Path(pool_dir) + if output_path is None: + output_path = pool_dir / "results.html" + + # Parse logs + data = parse_pool_logs(pool_dir) + tasks = data["tasks"] + workers = data["workers"] + metadata = data["metadata"] + + # Calculate stats + num_tasks = len(tasks) + num_success = sum(1 for t in tasks if t.get("success")) + success_rate = (num_success / num_tasks * 100) if num_tasks > 0 else 0 + + # Domain stats + domain_stats = get_domain_stats(tasks) + + # Calculate elapsed time + elapsed_str = "N/A" + if metadata.get("first_timestamp") and metadata.get("last_timestamp"): + try: + fmt = "%Y-%m-%d %H:%M:%S" + start = datetime.strptime(metadata["first_timestamp"], fmt) + end = datetime.strptime(metadata["last_timestamp"], fmt) + elapsed = end - start + hours, remainder = divmod(int(elapsed.total_seconds()), 3600) + minutes, seconds = divmod(remainder, 60) + if hours > 0: + elapsed_str = f"{hours}h {minutes}m {seconds}s" + elif minutes > 0: + elapsed_str = f"{minutes}m {seconds}s" + else: + elapsed_str = f"{seconds}s" + except Exception: + pass + + # Avg time per task + avg_time_str = "N/A" + if num_tasks > 0 and metadata.get("first_timestamp") and metadata.get("last_timestamp"): + try: + fmt = "%Y-%m-%d %H:%M:%S" + start = datetime.strptime(metadata["first_timestamp"], fmt) + end = datetime.strptime(metadata["last_timestamp"], fmt) + elapsed = end - start + avg_seconds = elapsed.total_seconds() / num_tasks + if avg_seconds >= 60: + avg_time_str = f"{avg_seconds / 60:.1f}m" + else: + avg_time_str = f"{avg_seconds:.0f}s" + except Exception: + pass + + # Generate HTML + html = _generate_pool_viewer_html( + tasks=tasks, + workers=workers, + metadata=metadata, + domain_stats=domain_stats, + num_tasks=num_tasks, + num_success=num_success, + success_rate=success_rate, + elapsed_str=elapsed_str, + avg_time_str=avg_time_str, + ) + + # Write output + output_path = Path(output_path) + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(html) + + return output_path + + +def _generate_pool_viewer_html( + tasks: list[dict], + workers: dict, + metadata: dict, + domain_stats: dict, + num_tasks: int, + num_success: int, + success_rate: float, + elapsed_str: str, + avg_time_str: str, +) -> str: + """Generate HTML content for pool results viewer.""" + + # Worker rows HTML + worker_rows = "" + for worker_id, stats in sorted(workers.items()): + rate = (stats["successes"] / stats["tasks"] * 100) if stats["tasks"] > 0 else 0 + worker_rows += f""" + + Worker {worker_id} + {stats["tasks"]} + {stats["successes"]} + {stats["failures"]} + {rate:.1f}% + + """ + + # Domain breakdown HTML + domain_tags = "" + for domain in sorted(domain_stats.keys()): + stats = domain_stats[domain] + rate = 
(stats["success"] / stats["total"] * 100) if stats["total"] > 0 else 0 + domain_tags += f""" +
+ {domain} + {stats["success"]}/{stats["total"]} ({rate:.0f}%) +
+ """ + + # Task rows HTML + task_rows = "" + for i, task in enumerate(tasks): + status_class = "success" if task.get("success") else "fail" + status_text = "PASS" if task.get("success") else "FAIL" + result = task.get("result", 0) + task_rows += f""" + + {task.get('task_id', 'N/A')} + {task.get('domain', 'unknown')} + {status_text} + {result:.2f} + {task.get('steps', 0)} + Worker {task.get('worker_id', '?')} + + """ + + # Domain filter options + domain_options = '' + for domain in sorted(domain_stats.keys()): + domain_options += f'' + + html = f""" + + + + + WAA Pool Results - {metadata.get("run_name", "Unknown")} + + + +
+

WAA Pool Results

+
+ Run: {metadata.get("run_name", "Unknown")} | + Model: {metadata.get("model", "N/A")} | + Workers: {metadata.get("num_workers", len(workers))} | + Time: {elapsed_str} +
+ + +
+
+

Summary

+
+
+
+
{num_tasks}
+
Total Tasks
+
+
+
{num_success}
+
Passed
+
+
+
{num_tasks - num_success}
+
Failed
+
+
+
{success_rate:.1f}%
+
Success Rate
+
+
+
{avg_time_str}
+
Avg Time/Task
+
+
+
+ {domain_tags} +
+
+ + +
+
+

Per-Worker Breakdown

+
+ + + + + + + + + + + + {worker_rows} + +
WorkerTasksPassedFailedSuccess Rate
+
+ + +
+
+

Task Results

+
+
+
+ Domain: + +
+
+ Status: + +
+ {num_tasks} tasks +
+
+ + + + + + + + + + + + + {task_rows} + +
Task IDDomainStatusResultStepsWorker
+
+
+
+ + + + +""" + + return html diff --git a/pyproject.toml b/pyproject.toml index 35d2720..3808938 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -24,8 +24,9 @@ classifiers = [ dependencies = [ "azure-ai-ml>=1.30.0", "azure-identity>=1.25.1", + "azureml-core>=1.61.0.post1", "bitsandbytes>=0.41.0", # For 4-bit quantization - "click>=8.1.0", # CLI framework + "click>=8.1.0", # CLI framework "google-generativeai>=0.8.5", "matplotlib>=3.10.7", "openadapt-capture>=0.1.0", diff --git a/scripts/analyze_pool_logs.py b/scripts/analyze_pool_logs.py new file mode 100644 index 0000000..80bdfac --- /dev/null +++ b/scripts/analyze_pool_logs.py @@ -0,0 +1,213 @@ +#!/usr/bin/env python3 +"""Analyze WAA pool benchmark logs and generate HTML summary. + +Usage: + python scripts/analyze_pool_logs.py benchmark_results/pool_run_20260204/ +""" + +import re +import sys +from pathlib import Path +from datetime import datetime + + +def parse_log_file(log_path: Path) -> dict: + """Parse a WAA benchmark log file.""" + content = log_path.read_text() + + # Extract task completions + tasks = [] + finished_pattern = r"Finished (\w+)/([a-f0-9-]+-WOS)" + result_pattern = r"Result: ([\d.]+)" + + # Find all finished tasks with their results + finished_matches = list(re.finditer(finished_pattern, content)) + result_matches = list(re.finditer(result_pattern, content)) + + for i, match in enumerate(finished_matches): + domain = match.group(1) + task_id = match.group(2) + # Find the result that precedes this finish + result = 0.0 + for rm in result_matches: + if rm.start() < match.start(): + result = float(rm.group(1)) + tasks.append({ + "domain": domain, + "task_id": task_id, + "result": result, + "success": result > 0.0 + }) + + # Extract total task count from progress bar + total_match = re.search(r"Example:\s+\d+%\|.*?\|\s+\d+/(\d+)", content) + total_tasks = int(total_match.group(1)) if total_match else 0 + + return { + "file": log_path.name, + "tasks_completed": len(tasks), + "tasks_total": total_tasks, + "tasks": tasks, + "successes": sum(1 for t in tasks if t["success"]), + } + + +def generate_html_report(results: list, output_path: Path) -> None: + """Generate HTML summary report.""" + total_completed = sum(r["tasks_completed"] for r in results) + total_tasks = sum(r["tasks_total"] for r in results) + total_success = sum(r["successes"] for r in results) + success_rate = (total_success / total_completed * 100) if total_completed > 0 else 0 + + # Group by domain + domain_stats = {} + for r in results: + for task in r["tasks"]: + domain = task["domain"] + if domain not in domain_stats: + domain_stats[domain] = {"total": 0, "success": 0} + domain_stats[domain]["total"] += 1 + if task["success"]: + domain_stats[domain]["success"] += 1 + + html = f""" + + + WAA Benchmark Results + + + +
+

WAA Benchmark Results

+ +
+

Summary

+
+
+
{total_completed}/{total_tasks}
+
Tasks Completed
+
+
+
{total_success}
+
Successes
+
+
+
{success_rate:.1f}%
+
Success Rate
+
+
+
{len(results)}
+
Workers
+
+
+
+ +
+

By Domain

+ + +""" + + for domain, stats in sorted(domain_stats.items()): + rate = (stats["success"] / stats["total"] * 100) if stats["total"] > 0 else 0 + html += f""" + + + + + +""" + + html += """
DomainCompletedSuccessRate
{domain}{stats['total']}{stats['success']}{rate:.0f}%
+
+""" + + for r in results: + html += f""" +
+

{r['file']}

+

Completed: {r['tasks_completed']}/{r['tasks_total']} tasks

+ + +""" + for task in r["tasks"]: + status_class = "badge-success" if task["success"] else "badge-fail" + status_text = "PASS" if task["success"] else "FAIL" + html += f""" + + + + + +""" + html += """
DomainTask IDResultStatus
{task['domain']}{task['task_id'][:20]}...{task['result']:.2f}{status_text}
+
+""" + + html += f""" +
+ Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} +
+
+ + +""" + + output_path.write_text(html) + print(f"Generated: {output_path}") + + +def main(): + if len(sys.argv) < 2: + print("Usage: python scripts/analyze_pool_logs.py ") + sys.exit(1) + + results_dir = Path(sys.argv[1]) + if not results_dir.exists(): + print(f"Directory not found: {results_dir}") + sys.exit(1) + + # Find log files + log_files = list(results_dir.glob("waa-pool-*.log")) + if not log_files: + print(f"No log files found in {results_dir}") + sys.exit(1) + + print(f"Found {len(log_files)} log files") + + # Parse logs + results = [] + for log_file in sorted(log_files): + print(f" Parsing {log_file.name}...") + results.append(parse_log_file(log_file)) + + # Generate HTML + output_path = results_dir / "results.html" + generate_html_report(results, output_path) + + # Print summary + total_completed = sum(r["tasks_completed"] for r in results) + total_success = sum(r["successes"] for r in results) + print(f"\nSummary: {total_completed} tasks completed, {total_success} successes ({total_success/total_completed*100:.0f}% rate)" if total_completed > 0 else "\nNo tasks completed") + + +if __name__ == "__main__": + main()