diff --git a/.gitignore b/.gitignore index 1f4a400..0bab75f 100644 --- a/.gitignore +++ b/.gitignore @@ -11,8 +11,10 @@ wheels/ .venv local_context_openadapt_ml_internal.md -# Environment variables +# Environment variables and secrets .env +config.json +vendor/WindowsAgentArena/config.json # Ephemeral synthetic assets (frames, debug sessions, etc.) synthetic/ diff --git a/CLAUDE.md b/CLAUDE.md index c8c22fe..481d264 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,343 +4,124 @@ **Philosophy**: "Less is more. 80/20 impact/complexity. Working code beats elegant design." -**Before writing code, ask**: -1. Can this be <100 lines? (ideally <50) -2. Does this provide 80% of value? -3. Is this the simplest approach? +**Before writing code**: Can this be <100 lines? Does this provide 80% of value? Is this the simplest approach? -**Red flags to avoid**: -- Classes when functions work -- Abstractions before 3rd use -- Design docs for non-existent code -- Multiple implementations of same thing +**Avoid**: Classes when functions work, abstractions before 3rd use, design docs for non-existent code. -**See**: `/Users/abrichr/oa/src/openadapt-evals/SIMPLICITY_PRINCIPLES.md` for full guidelines. +See: `/Users/abrichr/oa/src/openadapt-evals/SIMPLICITY_PRINCIPLES.md` for full guidelines. --- -## ๐Ÿšจ๐Ÿšจ๐Ÿšจ CRITICAL: CLI-FIRST, NEVER RAW COMMANDS ๐Ÿšจ๐Ÿšจ๐Ÿšจ +## CRITICAL RULES -### THIS IS THE #1 RULE. VIOLATIONS FRUSTRATE THE USER. +### 0. CHECK RESOURCES ON SESSION START -**NEVER run commands that require user permission. ALWAYS use or extend the CLI.** +**After context compaction or session start, check for running Azure resources:** -โŒ **BANNED** (these require permission, waste user's time): ```bash -# Raw Azure CLI -az vm start --name ... -az vm run-command invoke ... +uv run python -m openadapt_ml.benchmarks.cli resources +``` -# Raw SSH -ssh azureuser@IP "command" +This prevents: +- Forgetting about running VMs (costs ~$0.19-0.38/hr) +- Creating duplicate resources +- Losing track of what's deployed -# Raw Python one-liners -uv run python -c "import subprocess; ..." +See `RESOURCES.md` for current status (auto-updated by the command). -# Any command not in the pre-approved CLI -``` +### 1. CLI-FIRST, NEVER RAW COMMANDS + +**NEVER run raw commands. ALWAYS use or extend the CLI.** -โœ… **REQUIRED** (these are pre-approved, don't ask permission): ```bash -# ALL VM operations go through the CLI +# BANNED (require user permission, waste time) +ssh azureuser@IP "anything" +az vm start --name ... +az vm run-command invoke ... +uv run python -c "import subprocess; ..." + +# REQUIRED (pre-approved, don't ask permission) uv run python -m openadapt_ml.benchmarks.cli vm start uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd "command" uv run python -m openadapt_ml.benchmarks.cli vm diag uv run python -m openadapt_ml.benchmarks.cli vm logs ``` -### When Functionality Is Missing - -**If a CLI command doesn't exist for what you need:** -1. **EDIT the CLI** to add the new command/action -2. **THEN call the CLI** command you just added -3. **NEVER use raw commands** as a workaround - -**Example**: Need to restart Docker services? -```python -# 1. Add to cli.py under cmd_vm(): -elif action == "fix-docker": - # Restart containerd and docker - commands = [ - "sudo systemctl restart containerd", - "sudo systemctl restart docker", - "docker ps" - ] - for cmd in commands: - run_on_vm(cmd) - -# 2. 
Then call it: -uv run python -m openadapt_ml.benchmarks.cli vm fix-docker -``` - -**This rule exists because:** -- Raw commands require user approval every time -- CLI commands are pre-approved and don't interrupt workflow -- CLI commands are documented and reusable -- The user has told you this MANY times - LISTEN - ---- - -## ๐Ÿ”„ STANDARD WORKFLOW: VM Configuration Changes - -**When VM config needs to change (disk size, VM size, etc.):** - -1. **Delete the current VM** (if running): - ```bash - uv run python -m openadapt_ml.benchmarks.cli vm delete -y - ``` - -2. **Update the code** that launches the VM (e.g., `cli.py` defaults) - -3. **Launch new VM** with the updated code: - ```bash - uv run python -m openadapt_ml.benchmarks.cli vm setup-waa # API key loaded from .env - ``` - -**DO NOT** try to resize/modify running VMs. It's simpler and faster to delete + recreate. - -**Current VM defaults** (in `cli.py`): -- Size: `Standard_D8ds_v5` (300GB temp storage on /mnt) -- Location: `eastus` -- OS: Ubuntu 22.04 LTS - ---- - -## Project Status & Priorities - -**IMPORTANT**: Before starting work, always check the project-wide status document: -- **Location**: `/Users/abrichr/oa/src/STATUS.md` -- **Purpose**: Tracks P0 priorities, active background tasks, blockers, and strategic decisions -- **Action**: Read this file at the start of every session to understand current priorities - -This ensures continuity between Claude Code sessions and context compactions. - ---- - -This file helps maintain context across sessions. - ---- -## โš ๏ธโš ๏ธโš ๏ธ MANDATORY: START DASHBOARD FIRST โš ๏ธโš ๏ธโš ๏ธ - -### STOP. READ THIS BEFORE DOING ANYTHING. - -**If ANY of these are true, you MUST run the dashboard command IMMEDIATELY:** -- Session just started or was compacted -- User mentions VMs, Azure, WAA, benchmark, or Windows -- You're about to run ANY `vm` subcommand (probe, diag, logs, run-waa, etc.) 
-- You want to check benchmark status - -**THE COMMAND (run this FIRST, not after other commands):** -```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor -``` - -**ENHANCED FEATURES (as of Jan 2026):** -The `vm monitor` command now provides comprehensive VM usage visibility: -- **VM Status**: Real-time VM state, size, and IP -- **Activity Detection**: What the VM is currently doing (idle, benchmark running, setup) -- **Cost Tracking**: Current uptime, hourly rate, and total cost for session -- **Azure ML Jobs**: Recent jobs from last 7 days with status -- **Evaluation History**: Past benchmark runs and success rates (with --details flag) -- **Dashboard & Tunnels**: Auto-starts web dashboard and SSH/VNC tunnels - -**Usage:** -```bash -# Basic monitoring -uv run python -m openadapt_ml.benchmarks.cli vm monitor - -# With detailed information (costs per day/week, evaluation history) -uv run python -m openadapt_ml.benchmarks.cli vm monitor --details - -# With auto-shutdown after 2 hours -uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2 -``` - -**WHY THIS MATTERS:** -- VNC is ONLY accessible via SSH tunnel at `localhost:8006` (NOT the public IP like `http://20.x.x.x:8006`) -- Azure NSG blocks port 8006 by design - direct access to public IP will NOT work -- The dashboard auto-manages SSH tunnels for VNC access -- Shows real-time costs to prevent budget overruns -- Tracks all Azure ML jobs for visibility into what's running -- Without it, you cannot see what Windows is doing -- The user WILL be frustrated if you keep forgetting this +**If a CLI command doesn't exist**: Edit cli.py to add it, THEN use it. NEVER use raw commands as workaround. -**WRONG (what you keep doing):** -```bash -# DON'T do this - checking probe/diag/logs WITHOUT dashboard running -uv run python -m openadapt_ml.benchmarks.cli vm probe -uv run python -m openadapt_ml.benchmarks.cli vm diag -# Then telling user to "run vm monitor" - NO! YOU run it FIRST! -``` +### 2. START DASHBOARD FIRST FOR VM WORK -**RIGHT (what you should do):** +**Before ANY vm subcommand (probe, diag, logs, etc.):** ```bash -# ALWAYS start dashboard FIRST, then it handles everything uv run python -m openadapt_ml.benchmarks.cli vm monitor ``` -**After every /compact or session restart, your LITERAL FIRST ACTION must be starting this dashboard if VMs are involved.** - ---- -## ๐Ÿ”ด MANDATORY: VERIFY URLs BEFORE RECOMMENDING ๐Ÿ”ด +This manages: +- SSH tunnels (VNC at localhost:8006, WAA at localhost:5001) +- Real-time cost tracking +- Azure ML job visibility +- Auto-opens web dashboard -**BEFORE telling the user to access ANY URL (localhost:XXXX, VNC, dashboard, etc.):** +**WRONG**: Running `vm probe` then `vm diag` then telling user to run `vm monitor` +**RIGHT**: Run `vm monitor` FIRST - it handles everything -1. **MANUALLY VERIFY** the URL is accessible by running a curl/check command -2. **NEVER assume** a service is running just because it was started earlier -3. **NEVER recommend** a URL based on documentation alone - ALWAYS test first +### 3. 
VERIFY URLs BEFORE RECOMMENDING -**Example verification:** +Always test URLs with curl before telling user to access them: ```bash -# ALWAYS do this BEFORE telling user to visit localhost:8006 -curl -s --connect-timeout 5 http://localhost:8006/ > /dev/null && echo "VNC accessible" || echo "VNC NOT accessible" +curl -s --connect-timeout 5 http://localhost:8006/ > /dev/null && echo "accessible" || echo "NOT accessible" ``` -**If verification fails:** -- Do NOT tell user to access the URL -- Diagnose why it's not working -- Fix it first, THEN provide the URL - -**This rule exists because:** The user was told to access localhost:8006 when the container was gone. This is unacceptable. - ---- -## ๐Ÿšจ๐Ÿšจ๐Ÿšจ STOP! READ THIS BEFORE EVERY COMMAND ๐Ÿšจ๐Ÿšจ๐Ÿšจ - -### ABSOLUTELY NEVER USE RAW SSH COMMANDS - -**This is the #1 rule. You have been told this MANY times. STOP IGNORING IT.** - -โŒ **BANNED** (never type these): -- `ssh azureuser@IP "anything"` -- `ssh $SSH_OPTS ...` -- Any command starting with `ssh` to the VM - -โœ… **REQUIRED** (always use these instead): -- `uv run python -m openadapt_ml.benchmarks.cli vm exec --cmd "your command"` -- `uv run python -m openadapt_ml.benchmarks.cli vm diag` -- `uv run python -m openadapt_ml.benchmarks.cli vm logs` - -**If a CLI command doesn't exist, ADD IT TO THE CLI FIRST, then use it.** - -**Before running ANY command involving the VM, ask yourself:** -1. Does this start with `ssh`? โ†’ STOP, use CLI instead -2. Is this a raw shell command to the VM? โ†’ STOP, use CLI instead -3. Can I use `vm exec --cmd`? โ†’ YES, use it - -This has been explained to you repeatedly. FOLLOW IT. - --- -## ๐Ÿ”ง DOCKERFILE/VM CHANGES: TEST INSIDE CONTAINER FIRST -**Problem**: Each Dockerfile change triggers: rebuild (10 min) โ†’ Windows boot (15 min) โ†’ test โ†’ repeat. Hours wasted on tiny changes. +## Project Status -**Solution**: Test fixes INSIDE a running container BEFORE rebuilding: - -```bash -# 1. Start a test container with bash entrypoint (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - 'docker run -d --name test-fix --entrypoint /bin/bash windowsarena/winarena:latest -c "sleep 3600"' - -# 2. Apply your fix manually INSIDE the container (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix sed -i 's/old/new/' /some/file.sh" - -# 3. Verify the fix works (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix cat /some/file.sh" - -# 4. Test the actual behavior (seconds) -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ - "docker exec test-fix /some/script.sh && ls /expected/output" - -# 5. Cleanup -uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'docker rm -f test-fix' - -# 6. ONLY AFTER fix is verified: Update Dockerfile and rebuild ONCE -``` - -**Why this matters**: -- Testing a fix takes SECONDS instead of 30+ minutes -- Iterate 10x on the fix before committing to a rebuild -- Don't lose context waiting for long builds -- Each rebuild should be the LAST rebuild, not a guess - ---- +**IMPORTANT**: Check `/Users/abrichr/oa/src/STATUS.md` at session start for P0 priorities. ## Project Overview -openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation agents. It provides: -- Schemas for GUI interaction trajectories -- Synthetic UI generation for bootstrapping +openadapt-ml: Model-agnostic ML engine for GUI automation agents. 
+- Schemas for GUI trajectories - VLM adapters (Qwen3-VL, Qwen2.5-VL, API backends) - Supervised fine-tuning pipeline - Runtime policy API ## Current Focus: Demo Retrieval -**Validated**: Demo-conditioned prompting improves action accuracy (Dec 2024) +**Validated (Dec 2024)**: Demo-conditioned prompting improves accuracy - Zero-shot: 33% correct first actions - With demo: 100% correct first actions - See `docs/experiments/demo_conditioned_prompting_results.md` -**โœ… VALIDATED (Jan 17, 2026)**: Demo persistence fix is working -- The P0 fix in `openadapt-evals` ensures demo is included at EVERY step, not just step 1 -- Mock test confirms: agent behavior changes from 6.8 avg steps (random) to 3.0 avg steps (focused) -- See `openadapt-evals/CLAUDE.md` for full validation details -- **Next step**: Run full WAA evaluation (154 tasks) to measure episode success improvement - -**Next step**: Build demo retrieval to automatically select relevant demos from a library. +**Validated (Jan 2026)**: Demo persistence fix working in openadapt-evals +- Agent behavior: 6.8 avg steps (random) -> 3.0 avg steps (focused) +- Next: Run full WAA evaluation (154 tasks) -**Key insight**: OpenAdapt's value is **trajectory-conditioned disambiguation of UI affordances**, not "better reasoning". +**Key insight**: OpenAdapt's value is trajectory-conditioned disambiguation of UI affordances. ## Benchmark Integration -**Primary benchmark**: Windows Agent Arena (WAA) +**Primary**: Windows Agent Arena (WAA) - 154 tasks across 11 Windows domains -- MIT licensed, can run locally or on Azure +- MIT licensed, runs locally or on Azure - SOTA: ~19.5% success (GPT-5.1 + OmniParser) -**Future benchmarks** (not yet implemented): -- WebArena/VisualWebArena (browser) -- OSWorld (cross-platform desktop) +**Future benchmarks** (not yet implemented): WebArena, OSWorld ---- - -## ๐ŸŽฏ WAA BENCHMARK WORKFLOW (COMPLETE GUIDE) +**Code location**: Benchmark code moved to `openadapt-evals` package. openadapt-ml handles VM management only. 
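The backward-compatible imports shown below keep working because `openadapt_ml/benchmarks/` is reduced to thin stubs that re-export from `openadapt-evals`. A minimal hypothetical sketch of such a stub (the alias of `ApiAgent` to `APIBenchmarkAgent` and the deprecation warning are assumptions, not the actual module):

```python
# Hypothetical sketch of openadapt_ml/benchmarks/__init__.py as a re-export stub.
import warnings

from openadapt_evals import (  # noqa: F401
    ApiAgent as APIBenchmarkAgent,
    WAAMockAdapter,
    evaluate_agent_on_benchmark,
)

warnings.warn(
    "openadapt_ml.benchmarks is deprecated; import from openadapt_evals instead",
    DeprecationWarning,
    stacklevel=2,
)
```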
-### Architecture Overview +```python +# NEW (preferred) +from openadapt_evals import ApiAgent, WAAMockAdapter, evaluate_agent_on_benchmark -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LOCAL MACHINE โ”‚ -โ”‚ โ”‚ -โ”‚ openadapt-ml CLI openadapt-evals CLI โ”‚ -โ”‚ (VM management) (benchmark execution) โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ vm monitor โ”‚ live --server localhost:5001 โ”‚ -โ”‚ โ”‚ vm setup-waa โ”‚ run (shortcut) โ”‚ -โ”‚ โ”‚ vm diag โ”‚ โ”‚ -โ”‚ โ–ผ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ SSH TUNNELS (auto-managed) โ”‚ โ”‚ -โ”‚ โ”‚ localhost:5001 โ”€โ”€โ”€โ”€โ”€โ”€โ–บ VM:5000 (WAA Flask API) โ”‚ โ”‚ -โ”‚ โ”‚ localhost:8006 โ”€โ”€โ”€โ”€โ”€โ”€โ–บ VM:8006 (noVNC) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ”‚ SSH (port 22) - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ AZURE VM (Ubuntu) โ”‚ -โ”‚ โ”‚ -โ”‚ Docker โ”‚ -โ”‚ โ””โ”€โ”€ windowsarena/winarena:latest โ”‚ -โ”‚ โ””โ”€โ”€ QEMU (Windows 11 Enterprise) โ”‚ -โ”‚ โ”œโ”€โ”€ WAA Flask server (port 5000) โ”‚ -โ”‚ โ””โ”€โ”€ Navi agent (executes tasks) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +# Backward compat +from openadapt_ml.benchmarks import APIBenchmarkAgent, WAAMockAdapter ``` +--- + +## WAA Workflow + ### Two CLIs, Two Purposes | CLI | Repo | Purpose | @@ -350,1139 +131,485 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a ### API Keys -**API keys are auto-loaded from `.env` via `config.py`**. No need to pass explicitly. +Auto-loaded from `.env` via `config.py`. No need to pass explicitly. ```bash -# .env file (create in repo root, not committed to git) +# .env file (not committed to git) OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-... ``` -Optional override: `[--api-key KEY]` on any command that needs it. +### Complete Workflow (Pool - Recommended) -### Complete Workflow (Step by Step) - -**Step 1: Setup Azure VM with WAA (first time, ~15 min)** +**Step 1: Create VM Pool (~10 min)** ```bash -cd /Users/abrichr/oa/src/openadapt-ml -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa +# Single VM for quick tests +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 1 + +# Multiple VMs for parallel evaluation +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 ``` -This creates VM, installs Docker, pulls Windows image, starts WAA server. 
-**Step 2: Start Dashboard and Tunnels** +**Step 2: Wait for WAA Ready (~5-15 min)** ```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor +uv run python -m openadapt_ml.benchmarks.cli pool-wait ``` -This auto-manages SSH tunnels: -- `localhost:5001` -> VM:5000 (WAA API) -- `localhost:8006` -> VM:8006 (VNC) -**Step 3: Run Benchmark (from openadapt-evals)** +**Step 3: Run Benchmark** ```bash -cd /Users/abrichr/oa/src/openadapt-evals +# Run 3 tasks for quick validation +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 3 -# Quick smoke test (no API key needed) -uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1 - -# Run with OpenAI (uses OPENAI_API_KEY from .env) -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1 +# Run all 154 tasks +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154 +``` -# Run with Claude (uses ANTHROPIC_API_KEY from .env) -uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1 +**Step 4: View Progress and VNC** +```bash +# Check status +uv run python -m openadapt_ml.benchmarks.cli pool-status -# Override API key if needed -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1 --api-key sk-... +# Open VNC to view Windows desktops +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -# Multiple tasks -uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2,browser_1 +# Stream logs +uv run python -m openadapt_ml.benchmarks.cli pool-logs ``` -**Step 4: View Results** +**Step 5: Cleanup (Stop Billing)** ```bash -uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` -**Step 5: Deallocate VM (stops billing)** +### CLI Commands Reference + ```bash -cd /Users/abrichr/oa/src/openadapt-ml -uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y +# === POOL COMMANDS (Parallel VMs - Recommended) === +pool-create --workers N # Create N VMs with Docker + WAA image +pool-create --workers N --auto-shutdown-hours 6 # Custom auto-shutdown (default: 4h) +pool-wait # Wait for WAA server ready on all workers +pool-run --tasks N # Run N tasks distributed across workers +pool-status # Show status of all pool VMs +pool-vnc # Open VNC to pool workers (SSH tunnels) +pool-logs # Stream logs from all workers +pool-exec --cmd '' # Execute command on all workers +pool-cleanup -y # Delete all pool VMs and resources (no prompt) + +# === SINGLE VM COMMANDS === +create --fast # Create single VM (D8ds_v5) +create --fast --auto-shutdown-hours 6 # Custom auto-shutdown (default: 4h) +delete # Delete VM and all resources +status # Show VM status +start # Start WAA container +stop # Stop WAA container +probe # Check if WAA server is ready +run --num-tasks N # Run benchmark on single VM +vm-start # Start a deallocated VM +deallocate # Stop VM (preserves disk, stops billing) +logs # Show WAA logs +vnc # Open VNC (SSH tunnel) +exec --cmd '' # Run command in container +docker-exec --cmd '' # Run command on VM host + +# === AZURE ML COMMANDS (Legacy) === +run-azure-ml --workers N # Run on Azure ML compute instances +azure-ml-quota # Check quota status +azure-ml-quota-wait # Wait for quota approval ``` -### Quick Reference Commands +### Quota Auto-Detection + +Wait for quota approval before running evaluation: -**From openadapt-ml (VM management):** ```bash -vm monitor # Start dashboard, tunnels, show status -vm 
setup-waa # First-time VM + WAA setup -vm diag # Check disk, Docker, containers -vm probe # Check WAA server status -vm logs # View container logs -vm deallocate # Stop VM billing -vm delete # Remove VM entirely +# Wait for quota (polls every 60 seconds, 24h timeout) +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait + +# Wait and automatically run evaluation when quota is approved +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --auto-run --tasks 20 + +# Custom target (e.g., 16 vCPUs for 2 parallel workers) +uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --target 16 + +# Run in background (survives terminal close) +nohup uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota-wait --auto-run & ``` -**From openadapt-evals (benchmarks):** +See `docs/QUOTA_AUTO_DETECTION_DESIGN.md` for full documentation. + +### VM Auto-Shutdown and Orphan Prevention + +**Auto-shutdown policy**: All VMs are automatically configured with an Azure auto-shutdown policy as a safety net to prevent orphaned VMs from running indefinitely and consuming quota/money. + +- **Default**: 4 hours after VM creation +- **Customizable**: `--auto-shutdown-hours N` (0 to disable) +- **Azure-level enforcement**: Even if SSH connection drops, the VM will still be deallocated + ```bash -run # Simplified live evaluation (uses localhost:5001) -live # Full control over server URL -mock # Mock evaluation (no VM needed) -probe # Check if WAA server is ready -view # Generate HTML results viewer +# Default: auto-shutdown in 4 hours +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 + +# Custom: auto-shutdown in 8 hours for long-running evaluations +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 --auto-shutdown-hours 8 + +# Disable auto-shutdown (not recommended) +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 3 --auto-shutdown-hours 0 ``` -### Key Points to Remember +**Test VM cleanup**: During `pool-create`, a test VM is created to check quota availability. This test VM is always cleaned up via try/finally, even if the command is interrupted or fails. -1. **SSH tunnels are required** - Azure NSG blocks direct access to ports 5000/8006 -2. **WAA server runs INSIDE Windows** - The Flask server (port 5000) runs in Windows, not on the Ubuntu host -3. **Default tunnel port is 5001** - Use `--server http://localhost:5001` (not 5000) -4. **Monitor auto-manages tunnels** - Running `vm monitor` sets up everything -5. **Results saved to benchmark_results/** - View with `view --run-name ` +**Manual cleanup**: Use `pool-cleanup -y` to clean up orphaned resources without confirmation prompts (useful for automation): +```bash +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup -y +``` -### Troubleshooting +### Azure ML Automated Workflow + +For parallel benchmark execution on Azure ML compute instances: -**Problem: "Cannot connect to WAA server"** ```bash -# 1. Is VM running? -uv run python -m openadapt_ml.benchmarks.cli vm status +# Single command handles everything: +# 1. Create/start VM if needed +# 2. Start Windows container with VERSION=11e +# 3. Wait for WAA server ready (~15-20 min first time) +# 4. Upload golden image to blob storage +# 5. Run Azure ML benchmark with N workers -# 2. Are tunnels active? -uv run python -m openadapt_ml.benchmarks.cli vm monitor +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml-auto --workers 4 -# 3. 
Check container -uv run python -m openadapt_ml.benchmarks.cli vm diag +# Setup only (golden image, no benchmark) +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml-auto --skip-benchmark + +# Cleanup when done (IMPORTANT - stops billing!) +uv run python -m openadapt_ml.benchmarks.cli run-azure-ml --teardown --confirm ``` -**Problem: "Connection refused on localhost:5001"** -```bash -# Start tunnels via monitor -uv run python -m openadapt_ml.benchmarks.cli vm monitor +See `docs/AZURE_ML_AUTOMATED_WORKFLOW.md` for full documentation. + +### Architecture + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LOCAL MACHINE โ”‚ +โ”‚ openadapt-ml CLI openadapt-evals CLI โ”‚ +โ”‚ (VM management) (benchmark execution) โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ SSH TUNNELS (auto-managed by monitor) โ”‚ โ”‚ +โ”‚ โ”‚ localhost:5001 โ†’ VM:5000 (WAA API) โ”‚ โ”‚ +โ”‚ โ”‚ localhost:8006 โ†’ VM:8006 (noVNC) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ SSH (port 22) + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ AZURE VM (Ubuntu) โ”‚ +โ”‚ Docker โ”‚ +โ”‚ โ””โ”€โ”€ windowsarena/winarena:latest (Microsoft official) โ”‚ +โ”‚ โ””โ”€โ”€ QEMU (Windows 11 Enterprise) โ”‚ +โ”‚ โ”œโ”€โ”€ WAA Flask server (port 5000) โ”‚ +โ”‚ โ””โ”€โ”€ Navi agent (executes tasks) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` -**Problem: "Windows not booting"** +**Key Points**: +1. SSH tunnels required - Azure NSG blocks direct port access +2. WAA server runs INSIDE Windows, not on Ubuntu host +3. Default tunnel port is 5001 (not 5000) +4. Uses vanilla Microsoft WAA image, no custom Dockerfile +5. `VERSION=11e` auto-downloads Windows 11 Enterprise Evaluation + +--- + +## VM Configuration Changes + +Delete + recreate (don't try to resize running VMs): ```bash -# Check VNC (opens in browser via monitor) -# Look at container logs -uv run python -m openadapt_ml.benchmarks.cli vm logs +uv run python -m openadapt_ml.benchmarks.cli vm delete -y +# Update cli.py defaults +uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ``` +**Current defaults** (in cli.py): +- Size: `Standard_D8ds_v5` (8 vCPU, 32GB RAM, 300GB temp on /mnt) +- Location: `eastus` +- OS: Ubuntu 22.04 LTS + --- ## Key Architecture Decisions -1. **SoM (Set-of-Marks) mode** - Achieves 100% on synthetic benchmarks by using element IDs instead of coordinates (`CLICK([1])` not `CLICK(x=0.42, y=0.31)`) - -2. **Grounding module** - Keep but deprioritize. Useful for deployment on real UIs without SoM overlays. Located in `openadapt_ml/grounding/` - -3. 
**Schema design** - Actions should carry both coordinates AND element grounding (node_id, role, name, bbox) when available - -4. **Lossless preservation** - Always store raw benchmark configs verbatim in `raw_config`, `raw_observation`, `raw_action` fields - -5. **DOM/AX is mandatory in schema, optional at runtime** - Observations must support `accessibility_tree` and `dom_html` fields for evaluator compatibility (WebArena, WorkArena, Mind2Web need DOM for scoring), even if agents choose vision-only - -6. **Cloud-First Development** - While features should work locally for testing, immediately build out cloud compatibility (Azure free tier, Lambda Labs) because: - - Most users won't have 96GB RAM locally for VLM training - - Developer productivity suffers waiting for long training runs - - Training should be as short as possible with feedback as quickly as possible - - **Everything should feel fast** - offload heavy compute to cloud GPUs - - Cloud providers: Azure (primary, free tier available), Lambda Labs (GPU rental) - - See `docs/live_inference_design.md` for async inference architecture - -7. **Schema Purity** - The schema must remain domain-agnostic and generic: - - **External systems adapt TO the schema**, not the other way around - - Never add fields to accommodate specific external data structures - - Data transformation belongs in importers/exporters, not core schema - - Use `raw` and `metadata` dict fields for integration-specific data - - If a proposed field feels specific to one use case, it doesn't belong in the schema - - This is a standard open-source library: users import and call functions, they don't shape the API - - See `openadapt_ml/schemas/` for canonical definitions - -8. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first: - - **Never wait on real training to test UI/code changes** - - Use `--stub` flag to simulate training progress without GPU - - Generates fake loss curves, evaluations, checkpoints in seconds - - Enables rapid iteration on dashboard, viewer, stop button, etc. - - See `docs/stub_training_adapter.md` for implementation details - - Usage: `uv run python -m openadapt_ml.cloud.lambda_labs monitor --stub --open` - -## Expert Feedback - -1. **Prompting first** - Establish baselines with off-the-shelf models before fine-tuning -2. **Prompt engineering matters** - Use structured format: Observation summary โ†’ Planning โ†’ Possible actions โ†’ Action -3. **Element-based actions** - `Click [8]` instead of coordinates, similar to SoM -4. **Larger base models** - They used Gemma3 27B; current 2B/8B might be too small - -## Benchmark Integration (MIGRATED TO openadapt-evals) - -> **IMPORTANT**: Benchmark code has been consolidated into the `openadapt-evals` package. -> The `openadapt_ml/benchmarks/` directory now contains deprecation stubs that re-export from `openadapt-evals`. 
-> -> **Use the new package:** -> ```python -> # NEW (preferred) -> from openadapt_evals import ApiAgent, WAAMockAdapter, evaluate_agent_on_benchmark -> -> # Also works (backward compat) -> from openadapt_ml.benchmarks import APIBenchmarkAgent, WAAMockAdapter -> ``` -> -> **CLI (now in openadapt-evals):** -> ```bash -> # NEW (preferred) -> uv run python -m openadapt_evals.benchmarks.cli mock --tasks 10 -> uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://vm:5000 -> -> # openadapt-ml CLI still works for VM management -> uv run python -m openadapt_ml.benchmarks.cli vm monitor -> ``` - -The benchmark integration module is now in `openadapt-evals`: -- `openadapt_evals/adapters/` - BenchmarkAdapter, WAAAdapter, WAALiveAdapter -- `openadapt_evals/agents/` - BenchmarkAgent, ApiAgent (with P0 demo persistence fix), PolicyAgent -- `openadapt_evals/benchmarks/` - runner, metrics, viewer, data_collection - -### APIBenchmarkAgent - -The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-5.1) for benchmark evaluation baselines. -This enables comparing fine-tuned models against off-the-shelf VLMs. +1. **SoM mode** - Element IDs (`CLICK([1])`) instead of coordinates for 100% accuracy on synthetic benchmarks -```python -from openadapt_ml.benchmarks import APIBenchmarkAgent, evaluate_agent_on_benchmark +2. **Grounding module** - Keep but deprioritize. Useful for real UIs without SoM overlays. Located in `openadapt_ml/grounding/` -# Claude baseline -agent = APIBenchmarkAgent(provider="anthropic") -results = evaluate_agent_on_benchmark(agent, adapter) +3. **Schema design** - Actions carry both coordinates AND element grounding when available -# GPT-5.1 baseline -agent = APIBenchmarkAgent(provider="openai") -results = evaluate_agent_on_benchmark(agent, adapter) -``` - -CLI usage: -```bash -# Run Claude evaluation on mock tasks -uv run python -m openadapt_ml.benchmarks.cli run-api --provider anthropic --tasks 5 +4. **Lossless preservation** - Store raw benchmark configs in `raw_config`, `raw_observation`, `raw_action` fields -# Run GPT-5.1 evaluation -uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai --tasks 5 +5. **Schema purity** - Domain-agnostic; external systems adapt TO the schema, not vice versa. See `openadapt_ml/schemas/` -# Disable accessibility tree in prompts -uv run python -m openadapt_ml.benchmarks.cli run-api --no-a11y --tasks 5 -``` +6. **Cloud-first** - Offload heavy compute to cloud GPUs (Azure, Lambda Labs). Everything should feel fast. -The agent: -- Converts BenchmarkObservation to API format (screenshot + structured prompt) -- Parses VLM responses into BenchmarkActions using regex patterns -- Supports CLICK(x,y), CLICK([id]), TYPE("text"), KEY(key), SCROLL(dir), DONE() -- Stores raw VLM responses in `action.raw_action` for debugging - -### Azure Automation - -`scripts/setup_azure.py` fully automates Azure setup with 15 steps: -1. Check Azure CLI installation -2. Login to Azure -3. Select subscription -4. Register resource providers (Compute, ML, Storage, ContainerRegistry) -5. Create resource group -6. Create service principal with Contributor role -7. Create ML workspace -8. Create Azure Container Registry (ACR) -9. Import WAA Docker image from Docker Hub to ACR -10. Attach ACR to ML workspace -11. Grant AcrPull role to workspace managed identity -12. Sync workspace keys for ACR authentication -13. Request GPU quota -14. Create storage account -15. 
Create inference queue and blob containers - -The script writes all credentials to `.env` including: -- Service principal credentials (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID) -- Workspace config (AZURE_SUBSCRIPTION_ID, AZURE_ML_RESOURCE_GROUP, AZURE_ML_WORKSPACE_NAME) -- Docker image path (AZURE_DOCKER_IMAGE) pointing to ACR - -**Why ACR?** Azure ML cannot pull from Docker Hub or ghcr.io directly. The image must be in ACR. - -**ACR Authentication**: The script automatically configures ACR authentication by granting the workspace's managed identity AcrPull role on the ACR. This ensures compute instances can pull Docker images without requiring admin credentials. - -CLI usage: -```bash -# Set up Azure (creates resources, ACR, imports image, writes credentials to .env) -python scripts/setup_azure.py +7. **Stub training** - Use `--stub` flag for rapid UI iteration without GPU -# Clean up all Azure resources -python scripts/setup_azure.py --cleanup +8. **DOM/AX mandatory in schema** - For evaluator compatibility (WebArena, Mind2Web need DOM), even if agents use vision-only -# Estimate Azure costs -python -m openadapt_ml.benchmarks.cli estimate --workers 40 +--- -# Test with mock adapter (no Windows required) -python -m openadapt_ml.benchmarks.cli test-mock --tasks 20 +## Azure Automation -# Check Azure status -python -m openadapt_ml.benchmarks.cli status +`scripts/setup_azure.py` automates 15-step Azure setup: +- Creates resource group, service principal, ML workspace, ACR +- Imports WAA Docker image to ACR +- Configures ACR authentication (AcrPull role) +- Writes credentials to `.env` -# Run on Azure (WAA submodule auto-detected) -python -m openadapt_ml.benchmarks.cli run-azure --workers 1 +```bash +python scripts/setup_azure.py # Setup +python scripts/setup_azure.py --cleanup # Cleanup ``` -Schema extensions completed in `openadapt_ml/schemas/sessions.py`: -- `Action`: `target_node_id`, `target_role`, `target_name`, `answer`, `key`, `modifiers`, `scroll_direction`, `scroll_amount`, `end_x`, `end_y` -- `Observation`: `accessibility_tree`, `dom_html`, `url`, `window_title`, `app_name`, `focused_element` +--- ## Cloud GPU Training See `docs/cloud_gpu_training.md` for full documentation. -**Quick start:** ```bash -# Lambda Labs - fully automated training pipeline -uv run python -m openadapt_ml.cloud.lambda_labs train \ - --capture /path/to/capture \ - --goal "Task description" +# Lambda Labs - automated pipeline +uv run python -m openadapt_ml.cloud.lambda_labs train --capture /path --goal "Task" -# Or step by step: +# Step by step uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a10 uv run python -m openadapt_ml.cloud.lambda_labs train-status uv run python -m openadapt_ml.cloud.lambda_labs terminate ``` -**Important**: All cloud operations should be wrapped in CLI commands, not raw SSH. 
The Lambda Labs module provides: -- `LambdaLabsClient.setup_instance()` - Clone repo, install deps -- `LambdaLabsClient.upload_capture()` - rsync capture data -- `LambdaLabsClient.run_training()` - Execute training -- `LambdaLabsClient.get_training_status()` - Poll training progress +--- -## Training & Visualization Commands +## Training Commands ```bash -# Train on a capture recording +# Train on capture uv run python -m openadapt_ml.scripts.train \ --config configs/qwen3vl_capture.yaml \ --capture /path/to/capture \ - --open # opens dashboard in browser + --open -# Serve dashboard/viewer via HTTP (RECOMMENDED) -# Auto-regenerates dashboard.html and viewer.html before serving +# Serve dashboard (auto-regenerates HTML) uv run python -m openadapt_ml.cloud.local serve --port 8080 --open -# Skip regeneration if files are already up to date -uv run python -m openadapt_ml.cloud.local serve --port 8080 --open --no-regenerate - -# Regenerate viewer/dashboard without serving -# Useful after training completes or to refresh with latest code changes +# Regenerate viewer without serving uv run python -m openadapt_ml.cloud.local viewer -# Compare human vs model predictions +# Compare human vs model uv run python -m openadapt_ml.scripts.compare \ --capture /path/to/capture \ --checkpoint checkpoints/model \ --open ``` -## Benchmark Data Collection & Testing - -```bash -# Test benchmark data collection (Phase 1) -# Creates directory structure with screenshots, execution traces, and metadata -uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5 - -# Custom run name and output directory -uv run python -m openadapt_ml.benchmarks.cli test-collection \ - --tasks 10 \ - --run-name my_test_run \ - --output benchmark_results \ - --model-id "my-agent-v1" - -# Run the standalone test script (equivalent to test-collection) -uv run python test_data_collection.py -``` - -**Output directory structure:** -``` -benchmark_results/ -โ”œโ”€โ”€ {run_name}/ -โ”‚ โ”œโ”€โ”€ metadata.json # Benchmark name, model ID, timestamp -โ”‚ โ”œโ”€โ”€ summary.json # Aggregate metrics (success rate, avg steps) -โ”‚ โ””โ”€โ”€ tasks/ -โ”‚ โ”œโ”€โ”€ task_001/ -โ”‚ โ”‚ โ”œโ”€โ”€ task.json # Task definition -โ”‚ โ”‚ โ”œโ”€โ”€ execution.json # Execution trace with steps -โ”‚ โ”‚ โ””โ”€โ”€ screenshots/ -โ”‚ โ”‚ โ”œโ”€โ”€ step_000.png -โ”‚ โ”‚ โ”œโ”€โ”€ step_001.png -โ”‚ โ”‚ โ””โ”€โ”€ ... -โ”‚ โ””โ”€โ”€ task_002/ -โ”‚ โ””โ”€โ”€ ... -``` - -**Key files:** -- `execution.json`: Contains step-by-step trace with actions, reasoning, timestamps -- `task.json`: Task definition with instruction, domain, time limits -- `summary.json`: High-level metrics suitable for benchmark viewer -- `screenshots/`: PNG screenshots at each step - -## Viewer Setup Troubleshooting - -**Problem**: Viewer shows "No model loaded" after training. - -**Root cause**: The viewer requires: -1. A base `comparison.html` file (from capture or generated during training) -2. 
Prediction JSON files (`predictions_*.json`) - -**Solution**: -```bash -# If comparison.html is missing, copy from the capture directory: -cp /path/to/capture/comparison.html training_output/ - -# Then regenerate the viewer: -uv run python -m openadapt_ml.cloud.local viewer - -# Serve and open: -uv run python -m openadapt_ml.cloud.local serve --open -``` - -**Key files in training_output/**: -- `training_log.json` - Training progress, loss curves, evaluations -- `dashboard.html` - Training dashboard (auto-regenerated by serve command) -- `viewer.html` - Capture viewer with predictions (auto-regenerated by serve command) -- `comparison.html` - Base viewer from capture (needed for viewer generation) -- `predictions_*.json` - Model predictions by checkpoint (e.g., `predictions_epoch3.json`) - -## Files to Know - -- `docs/cloud_gpu_training.md` - Lambda Labs and Azure GPU training guide -- `docs/benchmark_integration_plan.md` - Benchmark integration architecture -- `docs/azure_waa_setup.md` - Azure WAA setup guide (quota increase, costs, troubleshooting) -- `docs/design.md` - Overall system design -- `docs/experiments/demo_conditioned_prompting_results.md` - Demo experiment results (validated Dec 2024) -- `openadapt_ml/cloud/` - Cloud GPU providers (Lambda Labs, Azure) -- `openadapt_ml/benchmarks/` - Benchmark integration module (WAA, base classes) -- `openadapt_ml/experiments/demo_prompt/` - Demo-conditioned prompting experiment -- `openadapt_ml/grounding/` - Grounding module (GeminiGrounder, etc.) -- `openadapt_ml/ingest/capture.py` - Converts openadapt-capture recordings to Episodes -- `scripts/run_demo_experiment.py` - Run demo-conditioned experiment -- `configs/qwen3vl_synthetic_som.yaml` - SoM training config +--- ## Code Patterns ### Environment Variables -Always load env vars through `openadapt_ml/config.py` using pydantic-settings, NOT directly from `os.environ`: - +Use `config.settings`, NOT `os.environ`: ```python # Good from openadapt_ml.config import settings -api_key = settings.lambda_api_key +api_key = settings.openai_api_key # Bad -api_key = os.environ.get("LAMBDA_API_KEY") +api_key = os.environ.get("OPENAI_API_KEY") ``` -This ensures `.env` file is automatically loaded. When adding new env vars: +When adding new env vars: 1. Add to `Settings` class in `config.py` -2. Add to `.env.example` with documentation - -### API Keys for CLI Commands +2. Add to `.env.example` -CLI commands that need API keys (e.g., `waa`, `run-api`) follow this priority: -1. Command-line argument: `--api-key YOUR_KEY` -2. Config file: `settings.openai_api_key` from `.env` -3. Environment variable: `$OPENAI_API_KEY` +### API Keys for CLI +Priority: `--api-key` flag > `.env` file > environment variable -**Best practice**: Store keys in `.env` file (not committed to git): -```bash -# .env -OPENAI_API_KEY=sk-... -ANTHROPIC_API_KEY=sk-ant-... -``` - -Then CLI commands work without `--api-key`: -```bash -# These load API key from .env automatically -uv run python -m openadapt_ml.benchmarks.cli waa -uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai -``` - -## File Access - -The user has pre-approved read access to: -- `~/oa/src/` - Parent directory containing related projects (openadapt-capture, etc.) - -Related paths: -- Capture recordings: `/Users/abrichr/oa/src/openadapt-capture/` -- Screenshots: `/Users/abrichr/oa/src/openadapt-capture//screenshots/` - -## Shared Dashboard Components - -The training dashboard and capture viewer share UI components for visual consistency. 
When modifying dashboard UI: - -**Key files:** -- `openadapt_ml/training/trainer.py` - Contains shared component functions: - - `_get_shared_header_css()` - CSS for the unified header - - `_generate_shared_header_html()` - HTML generator for nav tabs + controls - -**Pattern:** -1. Define shared CSS/HTML in dedicated functions (prefixed with `_`) -2. Both `generate_training_dashboard()` and `_enhance_comparison_to_unified_viewer()` call these functions -3. Changes to shared functions automatically propagate to all dashboards +--- -**Why this matters:** -- Prevents visual inconsistencies when switching between Training and Viewer tabs -- Single source of truth for styling (no duplicate CSS to maintain) -- Easier to add new dashboards that match existing style +## Dockerfile Testing -## CRITICAL: Always Start Dashboard When Running Azure Resources +Test fixes INSIDE container before rebuilding (saves 30+ min): -See the โš ๏ธ MANDATORY section at the TOP of this file. Use: ```bash -uv run python -m openadapt_ml.benchmarks.cli vm monitor -``` - -## โš ๏ธ SAFE PROCESS MANAGEMENT โš ๏ธ +# 1. Start test container +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + 'docker run -d --name test-fix --entrypoint /bin/bash windowsarena/winarena:latest -c "sleep 3600"' -**NEVER use broad pkill patterns** - they can kill unrelated applications! +# 2. Apply fix +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + "docker exec test-fix sed -i 's/old/new/' /some/file.sh" -**WRONG (DANGEROUS):** -```bash -# These patterns are TOO BROAD and will kill unrelated apps: -pkill -f "openadapt" # Kills anything with "openadapt" in path -pkill -f "python" # Kills ALL Python processes -pkill -9 -f "openadapt_ml" # Killed Claude Code, Windsurf, Signal, Chrome tabs! -``` +# 3. Verify +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \ + "docker exec test-fix cat /some/file.sh" -**RIGHT (SAFE):** -```bash -# Use specific PID-based killing: -lsof -i :8765 | grep python | awk '{print $2}' | xargs kill 2>/dev/null +# 4. Cleanup +uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'docker rm -f test-fix' -# Or use specific process names with full path matching: -pkill -f "python.*-m openadapt_ml.cloud.local serve" +# 5. ONLY rebuild after fix is verified +``` -# Or kill only the specific port listener: -kill $(lsof -t -i :8765) 2>/dev/null +--- -# Check what would be killed FIRST: -pgrep -f "openadapt" -l # Lists matching processes before killing -``` +## Files to Know -**Before any pkill command:** -1. Run `pgrep -f "pattern" -l` to see what matches -2. Verify only intended processes are listed -3. Use the most specific pattern possible -4. Prefer port-based or PID-based killing +- `docs/WAA_APPROACH_REVIEW.md` - Full WAA setup documentation +- `docs/cloud_gpu_training.md` - Lambda Labs/Azure training guide +- `docs/azure_waa_setup.md` - Azure quota, costs, troubleshooting +- `docs/design.md` - System design +- `openadapt_ml/benchmarks/cli.py` - VM CLI commands +- `openadapt_ml/cloud/ssh_tunnel.py` - SSH tunnel manager +- `openadapt_ml/config.py` - Settings (pydantic-settings) +- `openadapt_ml/schemas/` - Canonical schema definitions -## Git Commit Style (Angular Convention) +--- -**ALWAYS use Angular-style commit messages** for all commits across all OpenAdapt repositories. 
+## Git Commit Style (Angular) -**Format:** ``` (): - - Co-Authored-By: Claude Opus 4.5 ``` -**Types:** -- `feat`: New feature -- `fix`: Bug fix -- `docs`: Documentation only -- `style`: Code style (formatting, semicolons, etc.) -- `refactor`: Code change that neither fixes a bug nor adds a feature -- `perf`: Performance improvement -- `test`: Adding or fixing tests -- `chore`: Maintenance tasks (deps, build, etc.) -- `ci`: CI/CD changes - -**Examples:** -```bash -# Feature -git commit -m "feat(viewer): add keyboard shortcuts for navigation" - -# Bug fix -git commit -m "fix(waa): resolve Docker storage path issue" - -# Documentation -git commit -m "docs: remove archived OpenAdapter from repository listing" +**Types**: feat, fix, docs, style, refactor, perf, test, chore, ci -# Refactor -git commit -m "refactor(cli): consolidate VM commands into single subcommand" -``` - -**Subject line rules:** -- Use imperative mood ("add" not "added" or "adds") -- No period at the end -- Max 50 characters -- Lowercase first letter after type +**Rules**: Imperative mood, no period, max 50 chars, lowercase after type --- ## Don't Do +- Don't use `os.environ` - use `config.settings` +- Don't use `pip install` - use `uv add` or `uv sync` +- Don't run VM ops without `vm monitor` first +- Don't use raw SSH/shell commands - use CLI +- Don't tell user to run commands - YOU run them +- Don't use broad pkill patterns (they kill unrelated apps) - Don't add timelines/estimates to plans -- Don't mention specific clients by name in public docs -- Don't over-engineer - keep solutions minimal -- Don't use `os.environ` directly - use `config.settings` instead -- Don't use `pip install` - always use `uv add` for dependencies or `uv sync` for the project -- Don't use non-Angular commit messages -- **Don't run Azure/VM operations without starting the dashboard first** - - โŒ WRONG: `vm probe` then `vm diag` then telling user to run `vm monitor` - - โœ… RIGHT: `vm monitor` FIRST (it does probe, tunnels, everything) - - This is the #1 mistake you keep making. STOP IT. -- **Don't use raw SSH/shell commands** - always use or create CLI commands instead (see below) -- **Don't tell user to run commands** - YOU run them. The CLI exists so YOU can use it. - -## CLI-First Development (IMPORTANT) - -**ALWAYS** use CLI commands instead of raw SSH/shell commands: -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm diag` (not `ssh ... df -h`) -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm logs` (not `ssh ... docker logs`) -- โœ… `uv run python -m openadapt_ml.benchmarks.cli vm probe` (not `ssh ... curl`) - -**Why**: CLI commands are documented, tested, and persist across context compactions. Raw commands are forgotten. - -**When you need a new operation**: -1. Add a new action to the relevant CLI subcommand (e.g., `vm logs`, `vm exec`) -2. Document it in CLAUDE.md -3. 
Use the CLI command going forward - -**Available VM CLI commands**: -```bash -vm monitor # THE GO-TO COMMAND: Start dashboard, open browser, show probe status - # Options: --auto-shutdown-hours N (deallocate after N hours) -vm diag # Check disk, Docker, containers, WAA probe status -vm logs # View container logs (--lines N, --follow) -vm probe # Check WAA server status (--wait to poll) -vm exec # Run command in container (--cmd 'your command') -vm host-exec # Run command on VM host (not in container) (--cmd 'your command') -vm start-windows # Start Windows container with vanilla WAA image -vm restart-windows # Stop and restart the Windows container -vm reset-windows # Delete Windows storage and start fresh installation -vm docker-prune # Clean Docker images, containers, build cache (free disk space) -vm docker-move # Move Docker/containerd to /mnt via symlinks (300GB space with D8ds_v5) -vm status # Azure VM status -vm ssh # Interactive SSH -vm deallocate # Stop VM billing (preserves disk), use -y to skip confirmation -vm start # Start a deallocated VM -vm delete # Delete VM (use -y to skip confirmation) - -# Use 'waa' command instead of deprecated 'vm setup-waa' and 'vm run-waa': -waa --setup-only # Full VM setup with Docker and vanilla WAA image -waa --num-tasks N # Run benchmark with N tasks -``` +- Don't mention specific clients by name -## TODO / Known Issues - -### Session-Based Cost/Time Tracking -**Status**: FIXED (Jan 2026) - -**Problem**: Dashboard showed cumulative cost/time from VM creation, not current session. -- User deallocated VM overnight, restarted it today -- Dashboard showed "$8.82 running cost" and "22h 58m elapsed" -- This was lifetime cost, not current session cost - -**Root cause**: Session tracker (`session_tracker.py`) wasn't integrated with CLI commands. -- `vm deallocate` didn't call `pause_session()`, so timer kept running -- `vm start` didn't call `start_session()` to resume properly -- `vm delete` didn't call `end_session()` or `clear_session()` - -**Solution implemented**: - -1. **CLI integration**: Added session tracker calls to VM lifecycle commands - - `vm deallocate`: Calls `pause_session()` and shows session summary - - `vm start`: Calls `start_session()` to resume with accumulated time - - `vm delete`: Calls `end_session()` and `clear_session()` - - Auto-shutdown in monitor: Calls `pause_session()` - - cleanup-stale: Calls `pause_session()` for deallocated VMs - -2. **Dashboard hybrid display**: Shows BOTH session and total costs - - "This Session: $0.14" - current running time since last start - - "Total Cost: $8.82" - accumulated across all sessions - - "Total Elapsed: 23h" - total time VM has been running - -3. 
**API enhancements**: Added fields to status response - - `current_session_seconds`: Time since last resume - - `current_session_cost_usd`: Cost for current session only - - `accumulated_seconds`: Time from previous sessions - -**Files changed**: -- `openadapt_ml/benchmarks/cli.py` - Session tracker calls in VM commands -- `openadapt_ml/cloud/local.py` - API returns session breakdown -- `openadapt_ml/training/azure_ops_viewer.py` - Dashboard shows both session and total - -### PyPI Publishing -**Status**: DONE +--- -Completed by background agent: -- Updated `pyproject.toml` with package metadata (description, authors, classifiers, URLs, license) -- Created `LICENSE` (MIT, matching related projects) -- Created `.github/workflows/publish.yml` for automated PyPI publishing on version tags -- Build system: hatchling +## Safe Process Management -To publish: -1. Set up PyPI trusted publishing (PyPI โ†’ Account Settings โ†’ Publishing) -2. `git tag v0.1.0 && git push origin v0.1.0` +```bash +# WRONG (kills unrelated apps) +pkill -f "openadapt" +pkill -f "python" -### Azure WAA Evaluation - ACR Auth Issue -**Status**: FIXED - setup_azure.py now configures ACR authentication automatically +# RIGHT (specific) +kill $(lsof -t -i :8765) 2>/dev/null +pkill -f "python.*-m openadapt_ml.cloud.local serve" -**Problem**: Azure ML compute instances cannot pull from ACR even after attaching ACR to workspace. +# Check before killing +pgrep -f "pattern" -l ``` -Failed to pull Docker image openadaptacr.azurecr.io/winarena:latest -``` - -**Root cause**: The workspace's managed identity needed AcrPull role on the ACR, which wasn't being granted automatically. - -**Solution implemented**: -1. Added `grant_acr_pull_role()` function to setup_azure.py that: - - Gets workspace managed identity principal ID - - Assigns AcrPull role on ACR to that identity -2. Added `sync_workspace_keys()` to refresh workspace credentials -3. Updated setup flow from 12 steps to 15 steps: - - Step 10: Attach ACR to workspace - - Step 11: Grant AcrPull role to workspace managed identity - - Step 12: Sync workspace keys - -**Related files**: -- `scripts/setup_azure.py` - Azure setup automation (includes ACR auth) -- `openadapt_ml/benchmarks/azure.py` - Azure orchestration -- `.env` - AZURE_DOCKER_IMAGE setting -**References**: -- [Azure ML Managed Identity ACR Authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-identity-based-service-authentication) -- [ACR Pull Role Assignment](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-authentication-managed-identity) - -### Azure WAA Evaluation - Dedicated VM Setup -**Status**: WORKING - Vanilla Microsoft WAA (Jan 2026) +--- -**IMPORTANT**: See `docs/WAA_APPROACH_REVIEW.md` for full documentation. +## File Access -**CRITICAL**: Uses vanilla Microsoft WAA (windowsarena/winarena). No custom Dockerfile. +Pre-approved read access to `~/oa/src/` (related projects like openadapt-capture). -**How it works**: -- Uses official `windowsarena/winarena:latest` Docker image from Microsoft -- Uses `VERSION=11e` env var to auto-download Windows 11 Enterprise Evaluation -- Container runs `entry.sh` which boots Windows and starts WAA server automatically -- First run: Downloads Windows + installs (~15-20 min) -- Subsequent runs: Boots from cached disk image (~2-3 min) +## Current Capture -**FULLY AUTOMATED - Via CLI**: +Path: `/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift` +Task: Turn off Night Shift in macOS System Settings -```bash -# 1. 
Setup Azure VM with Docker and pull vanilla WAA image (~10 min) -uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --setup-only +--- -# 2. Run benchmark -uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --num-tasks 20 +## TODO / Known Issues -# 3. Monitor (optional, for debugging) -uv run python -m openadapt_ml.benchmarks.cli vm monitor -# Opens browser to VNC at http://localhost:8006 +### Benchmark Viewer - Phase 4 +**Status**: TODO -# 4. Delete VM when done (IMPORTANT: stops billing!) -uv run python -m openadapt_ml.benchmarks.cli vm delete -y -``` +Add failure clustering and regression detection. Phases 1-3 done: +- Data collection with ExecutionTraceCollector +- Viewer generation with `view --run-name {name}` +- UI with summary, task list, step replay, playback controls -**Diagnostic commands**: -```bash -uv run python -m openadapt_ml.benchmarks.cli vm diag # Check disk, Docker, containers -uv run python -m openadapt_ml.benchmarks.cli vm status # Azure VM status -uv run python -m openadapt_ml.benchmarks.cli vm ssh # Interactive SSH -uv run python -m openadapt_ml.benchmarks.cli vm probe # Check WAA server readiness -uv run python -m openadapt_ml.benchmarks.cli vm logs # View container logs -``` +### Azure ML Experiment ID +**Status**: TODO -**Screenshot capture** (for PR documentation): -```bash -# List available screenshot targets -uv run python -m openadapt_ml.benchmarks.cli screenshot --list - -# Capture WAA-specific screenshots for PR -uv run python -m openadapt_ml.benchmarks.cli screenshot --waa --pr-mode - -# Capture specific targets -uv run python -m openadapt_ml.benchmarks.cli screenshot --target status --target probe --pr-mode - -# Available targets: -# status - Azure VM status -# probe - WAA probe endpoint status -# diag - VM diagnostic info -# vm-screen - Windows VM screen (via QEMU) -# vnc - VNC viewer (localhost:8006) -# terminal - VM monitor terminal output -# azure-ops - Azure ops dashboard -# training - Training dashboard -``` +Retrieve experiment_id dynamically instead of hardcoded UUID. -**Key requirements**: -1. **VM Size**: `Standard_D8ds_v5` recommended (8 vCPU, 32GB RAM, 300GB temp storage for nested virtualization) -2. **API key**: `config.json` with OPENAI_API_KEY (or set env var) -3. **Valid model**: Use real OpenAI model name (gpt-4o, gpt-4o-mini) +### Azure ML Port 80 Conflict +**Status**: INVESTIGATING -**Architecture**: +Azure ML compute instances have Microsoft infrastructure services on port 80. When vanilla WAA's dockur/windows container starts, nginx tries to bind to port 80 and fails: ``` -Azure VM (Standard_D8ds_v5, nested virt enabled, 300GB /mnt) - โ””โ”€โ”€ Docker (data on /mnt) - โ””โ”€โ”€ windowsarena/winarena:latest (official Microsoft image) - โ””โ”€โ”€ QEMU running Windows 11 (IP: 172.30.0.2) - โ””โ”€โ”€ WAA Flask server on port 5000 - โ””โ”€โ”€ Navi agent executing tasks +nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use) ``` -**How vanilla WAA works**: -1. Uses `windowsarena/winarena:latest` from Docker Hub -2. `VERSION=11e` triggers auto-download of Windows 11 Enterprise Evaluation -3. `entry.sh` handles Windows boot and server startup -4. 
No custom patching or Dockerfile required - -**Monitor progress**: -- VNC: `http://localhost:8006` (via SSH tunnel, auto-managed by dashboard) -- Logs: `uv run python -m openadapt_ml.benchmarks.cli vm logs` - -**Files**: -- `docs/WAA_APPROACH_REVIEW.md` - Full analysis (updated Jan 2026) -- `vendor/WindowsAgentArena/` - Official WAA scripts (run-local.sh, etc.) -- `openadapt_ml/benchmarks/cli.py` - CLI commands - -### Docker Disk Space Management -**Status**: FIXED - Automatic cleanup (Jan 2026) - -**Problem**: Docker build cache on /mnt was growing to 90+ GB during builds, exhausting disk space and causing builds to fail with "no space left on device". Note: With Standard_D8ds_v5, /mnt is now 300GB which should be sufficient. - -**Root cause**: Docker's build cache and containerd snapshotter accumulate data that isn't cleaned by `docker system prune`: -- `/mnt/docker/buildkit/containerd-overlayfs` - BuildKit layer cache -- `/mnt/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots` - Containerd snapshots -- These can grow to 30-40 GB each, even with no images present +**Key insight**: Port 80 is just nginx redirecting to noVNC on port 8006. **NOT essential for WAA**. +- Port 5000: WAA Flask API (benchmark execution) - ESSENTIAL +- Port 8006: noVNC (browser VNC) - ESSENTIAL +- Port 80: nginx redirect - NOT ESSENTIAL -**Solution implemented** (3 parts): +**What we're testing**: +1. `WEB=N` env var to disable nginx entirely +2. SSH tunnel to access ports 8006 and 5000 for debugging +3. Enhanced diagnostics in run_entry.py to verify Windows boots despite nginx failure -1. **Automatic pre-build cleanup**: Before Docker builds, the CLI now runs `docker builder prune -af` and checks available disk space, warning if < 50GB. +**SSH key support added**: Compute instances now use your local SSH key (~/.ssh/id_rsa) for direct SSH access. -2. **Automatic post-build cleanup**: After successful builds, the CLI cleans build cache and dangling images to prevent accumulation. +See `docs/AZURE_ML_PORT_80_FIX.md` for full analysis and options. -3. **BuildKit garbage collection**: New VMs are configured with `/etc/buildkit/buildkitd.toml` that limits cache to 30GB max. +### Azure ML CLI Commands -4. **Enhanced docker-prune command**: Now includes "deep cleanup" that stops Docker/containerd and removes orphaned snapshots that normal prune misses. - -**Usage**: ```bash -# Quick cleanup (standard prune + deep cleanup + configure GC) -uv run python -m openadapt_ml.benchmarks.cli vm docker-prune - -# For severe disk issues, delete VM and recreate (comes with GC pre-configured) -uv run python -m openadapt_ml.benchmarks.cli vm delete -y -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ``` - -**Files changed**: -- `openadapt_ml/benchmarks/cli.py` - Pre/post build cleanup, enhanced docker-prune -- New VMs get BuildKit GC config during setup - -### Windows "Select Operating System" Prompt Fix -**Status**: N/A with vanilla WAA (Jan 2026) - -**Note**: This issue was specific to the custom waa-auto Dockerfile approach which has been deprecated. - -With vanilla WAA (`windowsarena/winarena:latest`), using `VERSION=11e` automatically selects Windows 11 Enterprise Evaluation which has proper autounattend.xml handling. - -**If you still see the prompt**: -1. Delete cached storage: `uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'rm -rf /mnt/waa-storage/*'` -2. 
Re-run setup: `uv run python -m openadapt_ml.benchmarks.cli waa --api-key $OPENAI_API_KEY --fresh` - -### SSH Tunnel Management (VNC/WAA Access) -**Status**: DONE +# Status and monitoring +azure-ml-status # Show compute instances and recent jobs +azure-ml-logs --job NAME # Stream logs from running job +azure-ml-monitor # Interactive monitor with VNC tunnel -**Problem**: Azure VMs have Network Security Groups (NSGs) that only expose port 22 (SSH) by default. Ports 8006 (VNC) and 5000 (WAA) are not accessible directly. +# Run benchmarks +run-azure-ml-auto --workers N # Fully automated workflow -**Solution**: Automatic SSH tunnel management via `SSHTunnelManager`: +# Cleanup (IMPORTANT - stop billing!) +azure-ml-cancel # Cancel running job (or --job NAME) +azure-ml-delete-compute # Delete compute instance (--name NAME or --all) +azure-ml-cleanup --yes # Cancel all jobs + delete all instances -``` -Browser โ†’ localhost:8006 โ†’ SSH Tunnel โ†’ Azure VM:8006 โ†’ Docker โ†’ noVNC -Browser โ†’ localhost:5001 โ†’ SSH Tunnel โ†’ Azure VM:5000 โ†’ WAA Flask -``` - -**Architecture**: -1. When VM's WAA probe becomes "ready", tunnels auto-start -2. When VM goes offline, tunnels auto-stop -3. Dashboard shows tunnel status next to VNC button -4. VNC button links to localhost:port (tunnel endpoint) - -**Files**: -- `openadapt_ml/cloud/ssh_tunnel.py` - SSHTunnelManager class -- `openadapt_ml/cloud/local.py` - Integration with dashboard server -- `openadapt_ml/training/benchmark_viewer.py` - UI showing tunnel status - -**API Endpoints**: -- `GET /api/tunnels` - Returns tunnel status for VNC and WAA -- `GET /api/vms` - Includes `tunnels` field with per-tunnel status - -**Key features**: -- Auto-start on VM online (idempotent - safe to call repeatedly) -- Auto-stop on VM offline -- Port conflict detection -- Graceful shutdown on process exit -- No manual SSH commands needed - -**Manual usage** (if needed): -```python -from openadapt_ml.cloud.ssh_tunnel import get_tunnel_manager - -manager = get_tunnel_manager() -manager.start_tunnels_for_vm("172.171.112.41", "azureuser") -status = manager.get_tunnel_status() -manager.stop_all_tunnels() -``` - -**Why not open NSG ports?** -1. VNC has no authentication by default - anyone can connect -2. SSH tunnel encrypts all traffic -3. Requires SSH key auth - no password guessing -4. No Azure NSG changes needed - -**Alternative: Mock evaluation** for testing without Windows: -```bash -uv run python -m openadapt_ml.benchmarks.cli test-mock --tasks 20 +# Resource management +resources # Show all Azure resources and costs ``` -**References**: -- [Windows Agent Arena GitHub](https://github.com/microsoft/WindowsAgentArena) -- [Azure nested virtualization](https://learn.microsoft.com/en-us/azure/virtual-machines/acu) - -### Training Dashboard - Terminal Output Streaming -**Status**: DONE - -**Goal**: Show training command line output in the browser dashboard in real-time. - -**Implementation**: File-based polling approach -1. Training writes stdout to `training_output/training.log` with timestamps -2. Browser polls training.log every 2 seconds alongside training_log.json -3. Displays last 500 lines in scrollable terminal panel with auto-scroll -4. 
Terminal panel features: - - Dark terminal theme (black background, green/colored text) - - Auto-scroll toggle (on by default) - - Text wrap toggle - - Collapse/expand button - - Line counter - - Syntax highlighting (errors in red, warnings in orange, success in green) - -**Files changed**: -- `openadapt_ml/training/trainer.py`: - - Added terminal panel CSS styles - - Added terminal panel HTML section - - Added JavaScript polling function `fetchTerminalOutput()` - - Added `TrainingLogger._log_to_terminal()` method - - Updated `train_supervised()` to log key messages to training.log -- `openadapt_ml/training/stub_provider.py`: - - Added `_log()` method for dual stdout/file logging - - All training output now written to training.log -- `openadapt_ml/cloud/local.py`: - - No changes needed - serve command already serves all files from training_output - -**Usage**: Terminal output automatically appears in dashboard during training. Works with both stub and real training. - -### Early Termination Controls -**Status**: DONE - -**Problem**: Training runs until completion even when loss is low enough. Wastes GPU credits ($0.75/hr for A10). - -**Solution implemented**: -1. **Auto-termination**: `early_stop_loss` and `early_stop_patience` in stub_provider.py -2. **Dashboard button**: "Stop Training" button calls `/api/stop` endpoint -3. **Stop signal**: Creates `STOP_TRAINING` file that training loop checks -4. **Termination status**: Dashboard shows termination reason (auto_complete, auto_low_loss, user_stop) - -**Files changed**: -- `openadapt_ml/cloud/local.py` - Added `/api/stop` POST endpoint -- `openadapt_ml/training/stub_provider.py` - Added early stop logic, termination status -- `openadapt_ml/training/trainer.py` - Added `updateTerminationStatus()` JS function - -### Cloud Cost Estimation in Viewers -**Status**: DONE - -Added cost display panel to viewer that shows: -- Running cost based on instance type and elapsed time -- Instance type and hourly rate -- Only visible for cloud training (hidden for local/stub) - -Supported rates: -- Lambda Labs: $0.75/hr for A10, $1.29/hr for A100 -- Automatic detection from `instance_type` in training_log.json - -### Current Working Capture -**Path**: `/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift` -**Task**: Turn off Night Shift in macOS System Settings -**Screenshots**: 20 frames -**Notes**: Real-world macOS settings navigation capture for training/evaluation - -### Evaluation Samples Display Enhancement -**Status**: DONE - -Enhanced evaluation gallery in dashboard with: -- **Filter controls**: Dropdown filters for epoch and correctness (All/Correct/Incorrect) -- **Visual markers**: H (human) and AI (predicted) click markers on screenshots -- **Expandable model output**: "Show full output" toggle for raw model reasoning -- **Better layout**: Image container with overlay, content section with coordinates -- **Sample count**: "Showing X of Y samples" with filter status - -Files changed: -- `openadapt_ml/training/trainer.py` - Enhanced CSS, HTML, and JS for eval gallery - -### Viewer Playback Controls -**Status**: DONE - -Added full playback controls to the viewer: -- **Buttons**: โฎ Rewind, โ—€ Prev, โ–ถ Play/Pause, โ–ถ Next, โญ End -- **Speed control**: 0.5x, 1x, 2x, 4x playback speeds -- **Progress bar**: Click-to-seek to any step -- **Keyboard shortcuts**: Space (play/pause), Home/End (jump), Arrow keys (step) -- **Enhanced details panel**: Shows full model output with scrollable raw prediction data - -### Viewer Code Consolidation 
-**Status**: DONE - -**Problem**: Viewer code was fragmented across multiple locations: -1. `generate_training_dashboard()` - generates unified viewer template -2. `_enhance_comparison_to_unified_viewer()` - injected checkpoint_script into comparison.html -3. `comparison.html` from capture - had its own display logic - -**Solution implemented**: -- `generate_unified_viewer_from_output_dir()` now always uses `_generate_unified_viewer_from_extracted_data()` -- This generates a complete standalone viewer.html without script injection -- `_enhance_comparison_to_unified_viewer()` marked as deprecated -- All viewer display logic is now in one place (`_generate_unified_viewer_from_extracted_data`) -- Changes to viewer code now propagate reliably - -### README API Documentation -**Status**: VERIFIED - -The README ยง7.1 API-backed adapters section uses correct model names: -- "Claude Sonnet 4.5" โ†’ `claude-sonnet-4-5-20250929` in api_adapter.py โœ“ -- "GPT-5.1" โ†’ `gpt-5.1` in api_adapter.py โœ“ - -Verified: -- API key environment variable names: ANTHROPIC_API_KEY, OPENAI_API_KEY โœ“ -- Backend flag options: `claude`, `openai` in CLI โœ“ - -### Benchmark Viewer Integration -**Status**: Phases 1-3 DONE, Phase 4 TODO - -**Goal**: Integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer. - -**Design doc**: `docs/benchmark_viewer_integration.md` - -**Key features**: -1. **Benchmarks tab**: Third tab alongside Training and Viewer -2. **Task-level view**: List of benchmark tasks with pass/fail status -3. **Step-by-step replay**: Same UI as Viewer tab for benchmark executions -4. **Model comparison**: Side-by-side comparison of different models on same task (TODO) -5. **Aggregate metrics**: Success rate by domain, difficulty rankings - -**Implementation phases**: -1. โœ… **Data collection** (DONE): Save screenshots during benchmark runs - - Created `openadapt_ml/benchmarks/data_collection.py` with `ExecutionTraceCollector` - - Updated `runner.py` to save execution traces automatically - - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5` - - Directory structure: `benchmark_results/{run_name}/tasks/{task_id}/` - - Each task has: `task.json`, `execution.json`, `screenshots/` - - Test script: `test_data_collection.py` validates all files are created -2. โœ… **Viewer backend** (DONE): `generate_benchmark_viewer()` function - - Created `openadapt_ml/benchmarks/viewer.py` with viewer generation - - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli view --run-name {name}` - - Generates standalone HTML with same styling as training viewer - - Uses shared header components via `shared_ui.py` -3. โœ… **UI components** (DONE - Basic): Summary dashboard, task list, replay - - Summary panel with total tasks, passed/failed, success rate - - Domain breakdown with per-domain statistics - - Filter controls (domain, status) - - Task list with status badges - - Step-by-step viewer with screenshots, actions, reasoning - - Playback controls (prev/next, play/pause, speed) - - Keyboard shortcuts (Space, arrows, Home/End) -4. 
**Analysis** (TODO): Failure clustering, regression detection - -**View benchmark results:** -```bash -# Generate HTML viewer and serve it -uv run python -m openadapt_ml.benchmarks.cli view --run-name {name} - -# Options: -# --embed-screenshots Embed screenshots as base64 (standalone HTML) -# --no-open Don't auto-open browser -# --port 9000 Use custom port -``` - -## Preventing Stale Data Issues - -**CRITICAL**: When working on dashboard/viewer code, follow this process to avoid showing stale data: - -### After Code Changes - -1. **Always regenerate HTML files** after modifying trainer.py, viewer.py, or local.py: - ```bash - uv run python -m openadapt_ml.cloud.local viewer - ``` - -2. **Verify regeneration worked** by checking key values: - ```bash - # Check elapsed time was updated (should NOT be 0) - grep "baseElapsedTime" training_output/current/dashboard.html - - # Check comparison data exists in viewer - grep "predictionsByCheckpoint" training_output/current/viewer.html - ``` - -3. **Hard refresh browser** to bypass cache: - - macOS: `Cmd+Shift+R` - - Windows/Linux: `Ctrl+Shift+R` - - Or use DevTools โ†’ Network โ†’ "Disable cache" checkbox - -4. **Use HTTP serving** (not file://) for auto-refresh: - ```bash - uv run python -m openadapt_ml.cloud.local serve --port 8080 --open - ``` - -### Before Showing User - -Before presenting dashboard/viewer to user, verify: -- [ ] Elapsed time shows correct value (not 0m 0s) -- [ ] Comparison screenshots load (not blank/404) -- [ ] Model predictions appear in dropdown -- [ ] Loss curve shows data -- [ ] Timestamp info panel shows recent dates - -### Automatic Data Loading Checklist - -The viewer should automatically load: -- [ ] Capture data from `comparison_epoch*.html` files (extracts `window.comparisonData`) -- [ ] Predictions from same comparison HTML files (human + predicted actions per step) -- [ ] Evaluations from `training_log.json` (if present) -- [ ] Recording events from capture data (note: `recording.end` depends on capture source) - -### Common Issues +--- -| Symptom | Cause | Fix | -|---------|-------|-----| -| Elapsed time shows 0m 0s | `elapsed_time` not loaded from training_log.json | Check `state.elapsed_time = data.get("elapsed_time", 0.0)` in local.py | -| No comparison screenshots | Paths point to Lambda not local | Update `capture_path` in training_log.json to local path | -| Missing model predictions | No `comparison_epoch*.html` files or wrong data format | Run compare script: `uv run python -m openadapt_ml.scripts.compare --capture ... --checkpoint ...` | -| Predictions not extracted | HTML uses `window.comparisonData` but regex expects `const` | Use regex `(?:const\s+\|window\.)comparisonData` pattern | -| Stale data after code change | Browser caching HTML | Hard refresh (Cmd+Shift+R) or disable cache | -| Screenshots 404 | Screenshot symlink broken | Recreate: `ln -sf /path/to/capture/screenshots training_output/current/screenshots` | +## Troubleshooting -### UI/Display Guidelines +### Dashboard/Viewer Stale Data +After code changes: +1. Regenerate: `uv run python -m openadapt_ml.cloud.local viewer` +2. Hard-refresh browser: Cmd+Shift+R -**Placeholder data must be clearly marked** when displaying values that may not reflect actual data: -- If task counts, worker counts, etc. 
come from local tracking (not synced with Azure), mark them with an asterisk: "3* tasks โ€ข 1* worker(s)" -- Add a footnote: "[*: placeholder, actual values may differ]" -- This applies to any data that is locally cached but not confirmed from the authoritative source +### WAA Connection Issues +1. Is VM running? `vm status` +2. Are tunnels active? `vm monitor` +3. Check container: `vm diag` -### Azure ML Integration Notes +### Windows Not Booting +1. Check VNC via `vm monitor` +2. Check logs: `vm logs` -**Experiment ID**: The Azure ML experiments page URL requires an experiment ID which is workspace-specific: -- Current hardcoded ID: `ad29082c-0607-4fda-8cc7-38944eb5a518` -- **TODO**: Retrieve experiment_id dynamically from Azure using `az ml experiment list` -- The experiment name is `openadapt-ml` but the URL requires the UUID format +### Common Issues Table -**Azure ML URL format**: -- Jobs list: `https://ml.azure.com/experiments/id/{experiment_id}?wsid={workspace_id}` -- Specific job: `https://ml.azure.com/experiments/id/{experiment_id}/runs/{run_id}?wsid={workspace_id}` +| Symptom | Fix | +|---------|-----| +| Connection refused localhost:5001 | Run `vm monitor` to start tunnels | +| Windows not booting | Check VNC, check `vm logs` | +| Elapsed time shows 0 | Check training_log.json has elapsed_time | +| No comparison screenshots | Update capture_path in training_log.json | +| Stale data after code change | Hard refresh (Cmd+Shift+R) | -**WAA Docker command**: Use `python run.py` not `python -m client.run` (the client directory is not a Python package) +See `docs/` for detailed troubleshooting guides. diff --git a/README.md b/README.md index dd7b38c..9b3af6a 100644 --- a/README.md +++ b/README.md @@ -813,48 +813,31 @@ uv run python -m openadapt_ml.cloud.local serve --port 8080 --open *View benchmark evaluation results with task-level filtering, success/failure status, and run comparison. 
Shows Claude achieving 30% on mock evaluation tasks (simulated environment for testing the pipeline - real WAA evaluation requires Windows VMs).* -### 13.4 VM Monitoring Dashboard +### 13.4 VM Pool Monitoring -For managing Azure VMs used in benchmark evaluations, the `vm monitor` command provides a comprehensive dashboard: +For managing Azure VMs used in benchmark evaluations: ```bash -# Start VM monitoring dashboard (auto-opens browser) -uv run python -m openadapt_ml.benchmarks.cli vm monitor - -# Show detailed information (evaluation history, daily/weekly costs) -uv run python -m openadapt_ml.benchmarks.cli vm monitor --details -``` - -**VM Monitor Dashboard (Full View):** - -![VM Monitor Dashboard](docs/screenshots/vm_monitor_dashboard_full.png) - -*The VM monitor dashboard shows: (1) VM status (name, IP, size, state), (2) Current activity (idle/benchmark running), (3) Cost tracking (uptime, hourly rate, total cost), (4) Recent Azure ML jobs from last 7 days, and (6) Dashboard & access URLs.* - -**VM Monitor Dashboard (With --details Flag):** +# Check pool status (VM state, IPs, WAA readiness) +uv run python -m openadapt_ml.benchmarks.cli pool-status -![VM Monitor Dashboard Details](docs/screenshots/vm_monitor_details.png) +# Open VNC to view Windows desktops (via SSH tunnels) +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -*The --details flag adds: (5) Evaluation history with success rates and agent types, plus extended cost information (daily/weekly projections).* +# Stream logs from all workers +uv run python -m openadapt_ml.benchmarks.cli pool-logs +``` **Features:** - **Real-time VM status** - Shows VM size, power state, and IP address -- **Activity detection** - Identifies if VM is idle, running benchmarks, or in setup -- **Cost tracking** - Displays uptime hours, hourly rate, and total cost for current session -- **Azure ML jobs** - Lists recent jobs from last 7 days with status indicators -- **Evaluation history** - Shows past benchmark runs with success rates (with --details flag) -- **Dashboard & tunnels** - Auto-starts web dashboard and SSH/VNC tunnels for accessing Windows VM +- **WAA readiness** - Shows if WAA server is ready on each worker +- **VNC access** - Opens SSH tunnels to view Windows desktops +- **Log streaming** - Interleaved logs from all pool workers -**Mock mode for testing:** +**Cleanup (important to stop billing):** ```bash -# Generate screenshots or test dashboard without a VM running -uv run python -m openadapt_ml.benchmarks.cli vm monitor --mock -``` - -**Auto-shutdown option:** -```bash -# Automatically deallocate VM after 2 hours to prevent runaway costs -uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2 +# Delete all pool VMs and resources +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` ### 13.5 Benchmark Execution Logs @@ -1017,20 +1000,24 @@ Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. ### 14.2 Single VM Workflow -For quick testing or small runs: +For quick testing or small runs (use pool-create with --workers 1): ```bash -# Setup VM with WAA -uv run python -m openadapt_ml.benchmarks.cli vm setup-waa +# 1. Create single-VM pool +uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 1 -# Start monitoring dashboard (auto-opens VNC, manages SSH tunnels) -uv run python -m openadapt_ml.benchmarks.cli vm monitor +# 2. Wait for WAA ready +uv run python -m openadapt_ml.benchmarks.cli pool-wait + +# 3. 
Run benchmark (e.g., 3 tasks for quick test) +uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 3 -# Run benchmark -uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10 +# 4. Check status / VNC +uv run python -m openadapt_ml.benchmarks.cli pool-status +uv run python -m openadapt_ml.benchmarks.cli pool-vnc -# Deallocate when done (stops billing) -uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y +# 5. Cleanup (stop billing) +uv run python -m openadapt_ml.benchmarks.cli pool-cleanup ``` ### 14.3 Parallel Pool Workflow (Recommended) @@ -1102,8 +1089,7 @@ Azure (N VMs, Standard_D8ds_v5) **Tips:** - Always run `pool-cleanup` when done to delete VMs and stop billing -- Use `vm deallocate` (not delete) to pause billing but keep disk -- Set `--auto-shutdown-hours 2` on `vm monitor` for safety +- Use `deallocate` (not `delete`) to pause billing but keep disk for single VM - Prices vary by Azure region --- diff --git a/openadapt_ml/benchmarks/cli.py b/openadapt_ml/benchmarks/cli.py index b6504ad..010e2cc 100644 --- a/openadapt_ml/benchmarks/cli.py +++ b/openadapt_ml/benchmarks/cli.py @@ -80,6 +80,10 @@ "LogLevel=ERROR", "-o", "ConnectTimeout=10", + "-o", + "ServerAliveInterval=60", # Send keepalive every 60s to prevent timeout + "-o", + "ServerAliveCountMax=10", # Allow 10 missed keepalives (~10 min) before disconnect ] @@ -329,6 +333,101 @@ def wait_for_ssh(ip: str, timeout: int = 120) -> bool: return False +def set_vm_auto_shutdown( + vm_name: str, + resource_group: str = RESOURCE_GROUP, + shutdown_hours: int = 4, +) -> bool: + """Set Azure auto-shutdown policy on a VM. + + This is a safety net to prevent orphaned VMs from running indefinitely. + The VM will be automatically deallocated after the specified hours. + + Args: + vm_name: Name of the VM + resource_group: Azure resource group + shutdown_hours: Hours from now when VM should auto-shutdown (default 4) + + Returns: + True if auto-shutdown was set successfully + """ + # Calculate shutdown time (hours from now) + from datetime import timedelta + + shutdown_time = datetime.utcnow() + timedelta(hours=shutdown_hours) + # Format: HH:MM in UTC + shutdown_time_str = shutdown_time.strftime("%H:%M") + + result = subprocess.run( + [ + "az", + "vm", + "auto-shutdown", + "-g", + resource_group, + "-n", + vm_name, + "--time", + shutdown_time_str, + ], + capture_output=True, + text=True, + ) + + return result.returncode == 0 + + +def delete_test_vm_resources(test_name: str, resource_group: str = RESOURCE_GROUP): + """Delete a test VM and its associated resources. + + Used for cleanup after quota checking or failed operations. 
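+
+    Example (illustrative; cmd_pool_create generates names like
+    "waa-pool-test-{int(time.time())}", so the value below is hypothetical):
+
+        delete_test_vm_resources("waa-pool-test-1738000000")
+
+    Deletion is best-effort: the VM, the "{test_name}VMNic" NIC, and the
+    "{test_name}PublicIP" public IP are each removed via az, and any az
+    errors are captured rather than raised.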
+ """ + # Delete VM + subprocess.run( + [ + "az", + "vm", + "delete", + "-g", + resource_group, + "-n", + test_name, + "--yes", + "--force-deletion", + "true", + ], + capture_output=True, + ) + # Delete NIC + subprocess.run( + [ + "az", + "network", + "nic", + "delete", + "-g", + resource_group, + "-n", + f"{test_name}VMNic", + ], + capture_output=True, + ) + # Delete public IP + subprocess.run( + [ + "az", + "network", + "public-ip", + "delete", + "-g", + resource_group, + "-n", + f"{test_name}PublicIP", + ], + capture_output=True, + ) + + # ============================================================================= # Commands # ============================================================================= @@ -420,6 +519,15 @@ def cmd_create(args): f"Successfully created {successful_size} (${successful_cost:.2f}/hr) in {region}", ) + # Set auto-shutdown as safety net (prevents orphaned VMs) + auto_shutdown_hours = getattr(args, "auto_shutdown_hours", 4) + if auto_shutdown_hours > 0: + log("CREATE", f"Setting auto-shutdown in {auto_shutdown_hours} hours...") + if set_vm_auto_shutdown(VM_NAME, RESOURCE_GROUP, auto_shutdown_hours): + log("CREATE", "Auto-shutdown configured") + else: + log("CREATE", "Warning: Failed to set auto-shutdown (VM will stay running)") + # Wait for SSH log("CREATE", "Waiting for SSH...") if not wait_for_ssh(ip): @@ -789,88 +897,58 @@ def cmd_pool_create(args): working_size = None working_region = None working_cost = None + test_vm_to_cleanup = None # Track test VM for cleanup log("POOL", "Finding available region and VM size...") - for vm_size, cost in sizes_to_try: - for region in VM_REGIONS: - # Quick check if this size/region combo works - test_name = f"waa-pool-test-{int(time.time())}" - result = subprocess.run( - [ - "az", - "vm", - "create", - "--resource-group", - RESOURCE_GROUP, - "--name", - test_name, - "--location", - region, - "--image", - "Ubuntu2204", - "--size", - vm_size, - "--admin-username", - "azureuser", - "--generate-ssh-keys", - "--public-ip-sku", - "Standard", - "--no-wait", # Don't wait for completion - ], - capture_output=True, - text=True, - ) - if result.returncode == 0: - working_size = vm_size - working_region = region - working_cost = cost - # Delete the test VM and wait for completion - log("POOL", " Found working combo, cleaning up test VM...") - subprocess.run( + try: + for vm_size, cost in sizes_to_try: + for region in VM_REGIONS: + # Quick check if this size/region combo works + test_name = f"waa-pool-test-{int(time.time())}" + test_vm_to_cleanup = test_name # Track for cleanup + result = subprocess.run( [ "az", "vm", - "delete", - "-g", + "create", + "--resource-group", RESOURCE_GROUP, - "-n", + "--name", test_name, - "--yes", - "--force-deletion", - "true", - ], - capture_output=True, - ) - # Also clean up associated resources - subprocess.run( - [ - "az", - "network", - "nic", - "delete", - "-g", - RESOURCE_GROUP, - "-n", - f"{test_name}VMNic", - ], - capture_output=True, - ) - subprocess.run( - [ - "az", - "network", - "public-ip", - "delete", - "-g", - RESOURCE_GROUP, - "-n", - f"{test_name}PublicIP", + "--location", + region, + "--image", + "Ubuntu2204", + "--size", + vm_size, + "--admin-username", + "azureuser", + "--generate-ssh-keys", + "--public-ip-sku", + "Standard", + "--no-wait", # Don't wait for completion ], capture_output=True, + text=True, ) + if result.returncode == 0: + working_size = vm_size + working_region = region + working_cost = cost + # Delete the test VM and wait for completion + log("POOL", " Found 
working combo, cleaning up test VM...") + delete_test_vm_resources(test_name, RESOURCE_GROUP) + test_vm_to_cleanup = None # Cleanup done + break + else: + test_vm_to_cleanup = None # Creation failed, nothing to cleanup + if working_size: break - if working_size: - break + finally: + # Ensure test VM is cleaned up even if an exception occurred + if test_vm_to_cleanup: + log("POOL", f"Cleaning up test VM {test_vm_to_cleanup}...") + delete_test_vm_resources(test_vm_to_cleanup, RESOURCE_GROUP) if not working_size: log("POOL", "ERROR: No available VM size/region found") @@ -882,6 +960,11 @@ def cmd_pool_create(args): log("POOL", f"Using {working_size} (${working_cost:.2f}/hr) in {working_region}") + # Get auto-shutdown hours (default 4 hours as safety net) + auto_shutdown_hours = getattr(args, "auto_shutdown_hours", 4) + if auto_shutdown_hours > 0: + log("POOL", f"VMs will auto-shutdown in {auto_shutdown_hours} hours") + def create_worker(worker_idx: int) -> tuple[str, str | None, str | None]: """Create a single worker VM. Returns (name, ip, error).""" name = f"waa-pool-{worker_idx:02d}" @@ -967,6 +1050,8 @@ def create_worker(worker_idx: int) -> tuple[str, str | None, str | None]: try: vm_info = json.loads(result.stdout) ip = vm_info.get("publicIpAddress", "") + # Set auto-shutdown as safety net (prevents orphaned VMs) + set_vm_auto_shutdown(name, RESOURCE_GROUP, auto_shutdown_hours) return (name, ip, None) except json.JSONDecodeError: return (name, None, "Failed to parse VM creation output") @@ -8138,6 +8223,60 @@ def cmd_azure_ml_teardown(args): return 0 +def cmd_view_pool(args): + """Generate HTML viewer for WAA pool benchmark results. + + Parses log files from pool_run_* directories and generates an interactive + HTML viewer with summary stats, per-worker breakdown, and task list. 
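+
+    Typical invocation (the run name is an example; see the view-pool
+    subparser registered in main() for all flags):
+
+        uv run python -m openadapt_ml.benchmarks.cli view-pool
+        uv run python -m openadapt_ml.benchmarks.cli view-pool --run-name pool_run_20260204 --no-open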
+ """ + import webbrowser + + from openadapt_ml.benchmarks.pool_viewer import generate_pool_results_viewer + + results_dir = Path(args.results_dir) if args.results_dir else Path("benchmark_results") + + # Find pool run directory + if args.run_name: + pool_dir = results_dir / args.run_name + if not pool_dir.exists(): + # Try with pool_run_ prefix + pool_dir = results_dir / f"pool_run_{args.run_name}" + else: + # Find most recent pool_run_* directory + pool_dirs = sorted(results_dir.glob("pool_run_*"), reverse=True) + if not pool_dirs: + print("No pool_run_* directories found in benchmark_results/") + print("Run 'pool-run' to generate benchmark results") + return 1 + pool_dir = pool_dirs[0] + + if not pool_dir.exists(): + print(f"Directory not found: {pool_dir}") + return 1 + + # Check for log files + log_files = list(pool_dir.glob("waa-pool-*.log")) + if not log_files: + print(f"No waa-pool-*.log files found in {pool_dir}") + return 1 + + print(f"Generating viewer for: {pool_dir}") + print(f"Found {len(log_files)} log file(s)") + + # Generate viewer + output_path = pool_dir / "results.html" + generate_pool_results_viewer(pool_dir, output_path) + + print(f"Generated: {output_path}") + + # Open in browser + if not args.no_open: + print("Opening in browser...") + webbrowser.open(f"file://{output_path.absolute()}") + + return 0 + + def cmd_tail_output(args): """List or tail background task output files.""" task_dir = Path("/private/tmp/claude-501/-Users-abrichr-oa-src-openadapt-ml/tasks/") @@ -8312,6 +8451,12 @@ def main(): default=1, help="Number of worker VMs to create for parallel evaluation (default: 1)", ) + p_create.add_argument( + "--auto-shutdown-hours", + type=int, + default=4, + help="Auto-shutdown VM after N hours (0 to disable, default: 4)", + ) p_create.set_defaults(func=cmd_create) # delete @@ -8358,6 +8503,12 @@ def main(): p_pool_create.add_argument( "--standard", action="store_true", help="Use D4 (4 vCPU) VMs to save costs" ) + p_pool_create.add_argument( + "--auto-shutdown-hours", + type=int, + default=4, + help="Auto-shutdown VMs after N hours (0 to disable, default: 4)", + ) p_pool_create.set_defaults(func=cmd_pool_create) # pool-wait @@ -9270,6 +9421,42 @@ def main(): ) p_resources.set_defaults(func=cmd_resources) + # view-pool - Generate HTML viewer for pool benchmark results + p_view_pool = subparsers.add_parser( + "view-pool", + help="Generate HTML viewer for WAA pool benchmark results", + description=""" +Generate an interactive HTML viewer for WAA pool benchmark results. 
+ +Parses log files from pool_run_* directories to extract task results and +generates a standalone HTML viewer with: + - Summary stats (total tasks, success rate, avg time per task) + - Per-worker breakdown + - Task list with pass/fail status + - Domain breakdown (success rate per domain) + - Filters for domain and status + +Examples: + view-pool # View most recent pool_run_* results + view-pool --run-name pool_run_20260204 # View specific run + view-pool --no-open # Generate HTML without opening browser +""", + ) + p_view_pool.add_argument( + "--run-name", + help="Name of pool run directory (e.g., pool_run_20260204)", + ) + p_view_pool.add_argument( + "--results-dir", + help="Base results directory (default: benchmark_results/)", + ) + p_view_pool.add_argument( + "--no-open", + action="store_true", + help="Don't auto-open browser", + ) + p_view_pool.set_defaults(func=cmd_view_pool) + args = parser.parse_args() sys.exit(args.func(args)) diff --git a/openadapt_ml/benchmarks/pool_viewer.py b/openadapt_ml/benchmarks/pool_viewer.py new file mode 100644 index 0000000..aa3eeb9 --- /dev/null +++ b/openadapt_ml/benchmarks/pool_viewer.py @@ -0,0 +1,685 @@ +"""WAA Pool Results Viewer - HTML viewer for parallel benchmark runs. + +Parses log files from pool_run_* directories to extract task results and +generates a standalone HTML viewer with summary stats, per-worker breakdown, +and domain analysis. + +Usage: + from openadapt_ml.benchmarks.pool_viewer import generate_pool_results_viewer + + generate_pool_results_viewer( + pool_dir=Path("benchmark_results/pool_run_20260204"), + output_path=Path("benchmark_results/pool_run_20260204/results.html"), + ) +""" + +from __future__ import annotations + +import json +import re +from datetime import datetime +from pathlib import Path +from typing import Any + + +def parse_pool_logs(pool_dir: Path) -> dict[str, Any]: + """Parse WAA pool log files to extract task results. + + Args: + pool_dir: Directory containing waa-pool-*.log files + + Returns: + Dictionary with: + - tasks: List of task results + - workers: Per-worker stats + - metadata: Run metadata (timestamps, model, etc.) 
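+
+        Shape sketch (keys mirror the parsing code below; the values shown
+        are illustrative, not real results):
+
+            {
+                "tasks": [{"task_id": "...", "domain": "chrome", "success": True,
+                           "result": 1.0, "steps": 12, "worker_id": "00"}],
+                "workers": {"00": {"tasks": 1, "successes": 1, "failures": 0}},
+                "metadata": {"run_name": "pool_run_20260204", "log_count": 1,
+                             "model": "gpt-4o", "num_workers": 1,
+                             "first_timestamp": "...", "last_timestamp": "..."},
+            }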
+ """ + log_files = sorted(pool_dir.glob("waa-pool-*.log")) + if not log_files: + return {"tasks": [], "workers": {}, "metadata": {}} + + tasks = [] + workers = {} + metadata = { + "run_name": pool_dir.name, + "log_count": len(log_files), + "first_timestamp": None, + "last_timestamp": None, + "model": None, + "num_workers": None, + } + + # Regex patterns + timestamp_re = re.compile(r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})') + domain_re = re.compile(r'\[Domain\]: (\S+)') + example_re = re.compile(r'\[Example ID\]: (\S+)') + instruction_re = re.compile(r'\[Instruction\]: (.+)') + finished_re = re.compile(r'Finished (\S+)/(\S+)') + result_re = re.compile(r'Result: ([0-9.]+)') + worker_re = re.compile(r'worker_id=(\d+)') + model_re = re.compile(r"model='([^']+)'") + num_workers_re = re.compile(r'num_workers=(\d+)') + step_re = re.compile(r'Step (\d+):') + + for log_file in log_files: + worker_id = log_file.stem.replace("waa-pool-", "") + workers[worker_id] = {"tasks": 0, "successes": 0, "failures": 0} + + current_task = None + last_result = None + + with open(log_file, "r", errors="ignore") as f: + for line in f: + # Strip ANSI codes + clean = re.sub(r'\x1b\[[0-9;]*m', '', line) + + # Extract timestamp + ts_match = timestamp_re.search(clean) + if ts_match: + ts_str = ts_match.group(1) + if metadata["first_timestamp"] is None: + metadata["first_timestamp"] = ts_str + metadata["last_timestamp"] = ts_str + + # Extract model name + if metadata["model"] is None: + model_match = model_re.search(clean) + if model_match: + metadata["model"] = model_match.group(1) + + # Extract num workers + if metadata["num_workers"] is None: + nw_match = num_workers_re.search(clean) + if nw_match: + metadata["num_workers"] = int(nw_match.group(1)) + + # Domain (comes before Example ID) + domain_match = domain_re.search(clean) + if domain_match: + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + current_task["domain"] = domain_match.group(1) + + # Example ID + example_match = example_re.search(clean) + if example_match: + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + current_task["task_id"] = example_match.group(1) + + # Instruction + instr_match = instruction_re.search(clean) + if instr_match and current_task: + current_task["instruction"] = instr_match.group(1) + + # Step count + step_match = step_re.search(clean) + if step_match and current_task: + step_num = int(step_match.group(1)) + if step_num > current_task.get("steps", 0): + current_task["steps"] = step_num + + # Result line + result_match = result_re.search(clean) + if result_match: + last_result = float(result_match.group(1)) + + # Finished line - finalize task + finished_match = finished_re.search(clean) + if finished_match: + domain = finished_match.group(1) + task_id = finished_match.group(2) + + if current_task is None: + current_task = {"worker_id": worker_id, "steps": 0} + + current_task["domain"] = domain + current_task["task_id"] = task_id + current_task["result"] = last_result if last_result is not None else 0.0 + current_task["success"] = last_result is not None and last_result > 0 + current_task["timestamp"] = metadata["last_timestamp"] + + # Update worker stats + workers[worker_id]["tasks"] += 1 + if current_task["success"]: + workers[worker_id]["successes"] += 1 + else: + workers[worker_id]["failures"] += 1 + + tasks.append(current_task) + current_task = None + last_result = None + + return { + "tasks": tasks, + "workers": workers, + "metadata": metadata, + } + + +def 
get_domain_stats(tasks: list[dict]) -> dict[str, dict[str, int]]: + """Calculate per-domain statistics.""" + domain_stats = {} + + for task in tasks: + domain = task.get("domain", "unknown") + if domain not in domain_stats: + domain_stats[domain] = {"total": 0, "success": 0, "fail": 0} + + domain_stats[domain]["total"] += 1 + if task.get("success"): + domain_stats[domain]["success"] += 1 + else: + domain_stats[domain]["fail"] += 1 + + return domain_stats + + +def generate_pool_results_viewer( + pool_dir: Path, + output_path: Path | None = None, +) -> Path: + """Generate HTML viewer for WAA pool benchmark results. + + Args: + pool_dir: Directory containing waa-pool-*.log files + output_path: Output HTML path. Defaults to pool_dir/results.html + + Returns: + Path to generated HTML file. + """ + pool_dir = Path(pool_dir) + if output_path is None: + output_path = pool_dir / "results.html" + + # Parse logs + data = parse_pool_logs(pool_dir) + tasks = data["tasks"] + workers = data["workers"] + metadata = data["metadata"] + + # Calculate stats + num_tasks = len(tasks) + num_success = sum(1 for t in tasks if t.get("success")) + success_rate = (num_success / num_tasks * 100) if num_tasks > 0 else 0 + + # Domain stats + domain_stats = get_domain_stats(tasks) + + # Calculate elapsed time + elapsed_str = "N/A" + if metadata.get("first_timestamp") and metadata.get("last_timestamp"): + try: + fmt = "%Y-%m-%d %H:%M:%S" + start = datetime.strptime(metadata["first_timestamp"], fmt) + end = datetime.strptime(metadata["last_timestamp"], fmt) + elapsed = end - start + hours, remainder = divmod(int(elapsed.total_seconds()), 3600) + minutes, seconds = divmod(remainder, 60) + if hours > 0: + elapsed_str = f"{hours}h {minutes}m {seconds}s" + elif minutes > 0: + elapsed_str = f"{minutes}m {seconds}s" + else: + elapsed_str = f"{seconds}s" + except Exception: + pass + + # Avg time per task + avg_time_str = "N/A" + if num_tasks > 0 and metadata.get("first_timestamp") and metadata.get("last_timestamp"): + try: + fmt = "%Y-%m-%d %H:%M:%S" + start = datetime.strptime(metadata["first_timestamp"], fmt) + end = datetime.strptime(metadata["last_timestamp"], fmt) + elapsed = end - start + avg_seconds = elapsed.total_seconds() / num_tasks + if avg_seconds >= 60: + avg_time_str = f"{avg_seconds / 60:.1f}m" + else: + avg_time_str = f"{avg_seconds:.0f}s" + except Exception: + pass + + # Generate HTML + html = _generate_pool_viewer_html( + tasks=tasks, + workers=workers, + metadata=metadata, + domain_stats=domain_stats, + num_tasks=num_tasks, + num_success=num_success, + success_rate=success_rate, + elapsed_str=elapsed_str, + avg_time_str=avg_time_str, + ) + + # Write output + output_path = Path(output_path) + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(html) + + return output_path + + +def _generate_pool_viewer_html( + tasks: list[dict], + workers: dict, + metadata: dict, + domain_stats: dict, + num_tasks: int, + num_success: int, + success_rate: float, + elapsed_str: str, + avg_time_str: str, +) -> str: + """Generate HTML content for pool results viewer.""" + + # Worker rows HTML + worker_rows = "" + for worker_id, stats in sorted(workers.items()): + rate = (stats["successes"] / stats["tasks"] * 100) if stats["tasks"] > 0 else 0 + worker_rows += f""" + + Worker {worker_id} + {stats["tasks"]} + {stats["successes"]} + {stats["failures"]} + {rate:.1f}% + + """ + + # Domain breakdown HTML + domain_tags = "" + for domain in sorted(domain_stats.keys()): + stats = domain_stats[domain] + rate = 
(stats["success"] / stats["total"] * 100) if stats["total"] > 0 else 0 + domain_tags += f""" +
+ {domain} + {stats["success"]}/{stats["total"]} ({rate:.0f}%) +
+ """ + + # Task rows HTML + task_rows = "" + for i, task in enumerate(tasks): + status_class = "success" if task.get("success") else "fail" + status_text = "PASS" if task.get("success") else "FAIL" + result = task.get("result", 0) + task_rows += f""" + + {task.get('task_id', 'N/A')} + {task.get('domain', 'unknown')} + {status_text} + {result:.2f} + {task.get('steps', 0)} + Worker {task.get('worker_id', '?')} + + """ + + # Domain filter options + domain_options = '' + for domain in sorted(domain_stats.keys()): + domain_options += f'' + + html = f""" + + + + + WAA Pool Results - {metadata.get("run_name", "Unknown")} + + + +
+

WAA Pool Results

+
+ Run: {metadata.get("run_name", "Unknown")} | + Model: {metadata.get("model", "N/A")} | + Workers: {metadata.get("num_workers", len(workers))} | + Time: {elapsed_str} +
+ + +
+
+

Summary

+
+
+
+
{num_tasks}
+
Total Tasks
+
+
+
{num_success}
+
Passed
+
+
+
{num_tasks - num_success}
+
Failed
+
+
+
{success_rate:.1f}%
+
Success Rate
+
+
+
{avg_time_str}
+
Avg Time/Task
+
+
+
+ {domain_tags} +
+
+ + +
+
+

Per-Worker Breakdown

+
+ + + + + + + + + + + + {worker_rows} + +
WorkerTasksPassedFailedSuccess Rate
+
+ + +
+
+

Task Results

+
+
+
+ Domain: + +
+
+ Status: + +
+ {num_tasks} tasks +
+
+ + + + + + + + + + + + + {task_rows} + +
Task IDDomainStatusResultStepsWorker
+
+
+
+ + + + +""" + + return html diff --git a/pyproject.toml b/pyproject.toml index 35d2720..3808938 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -24,8 +24,9 @@ classifiers = [ dependencies = [ "azure-ai-ml>=1.30.0", "azure-identity>=1.25.1", + "azureml-core>=1.61.0.post1", "bitsandbytes>=0.41.0", # For 4-bit quantization - "click>=8.1.0", # CLI framework + "click>=8.1.0", # CLI framework "google-generativeai>=0.8.5", "matplotlib>=3.10.7", "openadapt-capture>=0.1.0", diff --git a/scripts/analyze_pool_logs.py b/scripts/analyze_pool_logs.py new file mode 100644 index 0000000..80bdfac --- /dev/null +++ b/scripts/analyze_pool_logs.py @@ -0,0 +1,213 @@ +#!/usr/bin/env python3 +"""Analyze WAA pool benchmark logs and generate HTML summary. + +Usage: + python scripts/analyze_pool_logs.py benchmark_results/pool_run_20260204/ +""" + +import re +import sys +from pathlib import Path +from datetime import datetime + + +def parse_log_file(log_path: Path) -> dict: + """Parse a WAA benchmark log file.""" + content = log_path.read_text() + + # Extract task completions + tasks = [] + finished_pattern = r"Finished (\w+)/([a-f0-9-]+-WOS)" + result_pattern = r"Result: ([\d.]+)" + + # Find all finished tasks with their results + finished_matches = list(re.finditer(finished_pattern, content)) + result_matches = list(re.finditer(result_pattern, content)) + + for i, match in enumerate(finished_matches): + domain = match.group(1) + task_id = match.group(2) + # Find the result that precedes this finish + result = 0.0 + for rm in result_matches: + if rm.start() < match.start(): + result = float(rm.group(1)) + tasks.append({ + "domain": domain, + "task_id": task_id, + "result": result, + "success": result > 0.0 + }) + + # Extract total task count from progress bar + total_match = re.search(r"Example:\s+\d+%\|.*?\|\s+\d+/(\d+)", content) + total_tasks = int(total_match.group(1)) if total_match else 0 + + return { + "file": log_path.name, + "tasks_completed": len(tasks), + "tasks_total": total_tasks, + "tasks": tasks, + "successes": sum(1 for t in tasks if t["success"]), + } + + +def generate_html_report(results: list, output_path: Path) -> None: + """Generate HTML summary report.""" + total_completed = sum(r["tasks_completed"] for r in results) + total_tasks = sum(r["tasks_total"] for r in results) + total_success = sum(r["successes"] for r in results) + success_rate = (total_success / total_completed * 100) if total_completed > 0 else 0 + + # Group by domain + domain_stats = {} + for r in results: + for task in r["tasks"]: + domain = task["domain"] + if domain not in domain_stats: + domain_stats[domain] = {"total": 0, "success": 0} + domain_stats[domain]["total"] += 1 + if task["success"]: + domain_stats[domain]["success"] += 1 + + html = f""" + + + WAA Benchmark Results + + + +
+

WAA Benchmark Results

+ +
+

Summary

+
+
+
{total_completed}/{total_tasks}
+
Tasks Completed
+
+
+
{total_success}
+
Successes
+
+
+
{success_rate:.1f}%
+
Success Rate
+
+
+
{len(results)}
+
Workers
+
+
+
+ +
+

By Domain

+ + +""" + + for domain, stats in sorted(domain_stats.items()): + rate = (stats["success"] / stats["total"] * 100) if stats["total"] > 0 else 0 + html += f""" + + + + + +""" + + html += """
DomainCompletedSuccessRate
{domain}{stats['total']}{stats['success']}{rate:.0f}%
+
+""" + + for r in results: + html += f""" +
+

{r['file']}

+

Completed: {r['tasks_completed']}/{r['tasks_total']} tasks

+ + +""" + for task in r["tasks"]: + status_class = "badge-success" if task["success"] else "badge-fail" + status_text = "PASS" if task["success"] else "FAIL" + html += f""" + + + + + +""" + html += """
DomainTask IDResultStatus
{task['domain']}{task['task_id'][:20]}...{task['result']:.2f}{status_text}
+
+""" + + html += f""" +
+ Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} +
+
+ + +""" + + output_path.write_text(html) + print(f"Generated: {output_path}") + + +def main(): + if len(sys.argv) < 2: + print("Usage: python scripts/analyze_pool_logs.py ") + sys.exit(1) + + results_dir = Path(sys.argv[1]) + if not results_dir.exists(): + print(f"Directory not found: {results_dir}") + sys.exit(1) + + # Find log files + log_files = list(results_dir.glob("waa-pool-*.log")) + if not log_files: + print(f"No log files found in {results_dir}") + sys.exit(1) + + print(f"Found {len(log_files)} log files") + + # Parse logs + results = [] + for log_file in sorted(log_files): + print(f" Parsing {log_file.name}...") + results.append(parse_log_file(log_file)) + + # Generate HTML + output_path = results_dir / "results.html" + generate_html_report(results, output_path) + + # Print summary + total_completed = sum(r["tasks_completed"] for r in results) + total_success = sum(r["successes"] for r in results) + print(f"\nSummary: {total_completed} tasks completed, {total_success} successes ({total_success/total_completed*100:.0f}% rate)" if total_completed > 0 else "\nNo tasks completed") + + +if __name__ == "__main__": + main()