From b389d3edc889055ed5744e72f15db2d322eb3ba5 Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Wed, 4 Feb 2026 21:14:32 -0500
Subject: [PATCH 1/2] docs(readme): add parallel WAA evaluation section, fix build badge
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (#14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5
---
 README.md | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 136 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 2c5164f..833f876 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # OpenAdapt-ML
 
-[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml)
+[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml)
 [![PyPI version](https://img.shields.io/pypi/v/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -30,6 +30,38 @@ The design is described in detail in [`docs/design.md`](docs/design.md).
 
 ---
 
+## Parallel WAA Benchmark Evaluation (New in v0.3.0)
+
+Run Windows Agent Arena benchmarks across multiple Azure VMs in parallel for faster evaluation:
+
+```bash
+# Create a pool of 5 workers
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# Wait for all workers to be ready
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# Run 154 tasks distributed across workers (~5x faster)
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+```
+
+**Key features:**
+- **Parallel execution**: Distribute 154 WAA tasks across N workers
+- **Automatic task distribution**: Uses WAA's native `--worker_id`/`--num_workers` for round-robin assignment
+- **VNC access**: View each Windows VM via SSH tunnels (`localhost:8006`, `localhost:8007`, etc.)
+- **Cost tracking**: Monitor Azure VM costs in real-time
+
+**Performance:**
+| Workers | Estimated Time (154 tasks) |
+|---------|---------------------------|
+| 1 | ~50-80 hours |
+| 5 | ~10-16 hours |
+| 10 | ~5-8 hours |
+
+See [WAA Benchmark Workflow](#waa-benchmark-workflow) for complete setup instructions.
+
+---
+
 ## 1. Installation
 
 ### 1.1 From PyPI (recommended)
@@ -971,7 +1003,108 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t
 
 ---
 
-## 14. Limitations & Notes
+## 14. WAA Benchmark Workflow
+
+
+
+Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
+
+### 14.1 Prerequisites
+
+1. **Azure CLI**: `brew install azure-cli && az login`
+2. **OpenAI API Key**: Set in `.env` file (`OPENAI_API_KEY=sk-...`)
+3. **Azure quota**: Ddsv5 family VMs (8+ vCPUs per worker)
+
+### 14.2 Single VM Workflow
+
+For quick testing or small runs:
+
+```bash
+# Setup VM with WAA
+uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
+
+# Start monitoring dashboard (auto-opens VNC, manages SSH tunnels)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+
+# Run benchmark
+uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10
+
+# Deallocate when done (stops billing)
+uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+```
+
+### 14.3 Parallel Pool Workflow (Recommended)
+
+For full 154-task evaluations, use multiple VMs:
+
+```bash
+# 1. Create pool (provisions N Azure VMs with Docker + WAA)
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# 2. Wait for all workers to be ready (Windows boot + WAA server startup)
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# 3. Run benchmark across all workers
+# Tasks are distributed using WAA's native --worker_id/--num_workers
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+
+# 4. Monitor progress
+uv run python -m openadapt_ml.benchmarks.cli pool-status
+uv run python -m openadapt_ml.benchmarks.cli pool-logs
+
+# 5. Cleanup (delete all VMs - IMPORTANT to stop billing!)
+uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
+```
+
+### 14.4 VNC Access to Workers
+
+View what each Windows VM is doing:
+
+```bash
+# Set up SSH tunnels (tunnels are created automatically, but you can also do this manually)
+ssh -f -N -L 8006:localhost:8006 azureuser@ # localhost:8006
+ssh -f -N -L 8007:localhost:8006 azureuser@ # localhost:8007
+# etc.
+
+# Open in browser
+open http://localhost:8006 # Worker 0
+open http://localhost:8007 # Worker 1
+```
+
+### 14.5 Architecture
+
+```
+Local Machine
+├── openadapt-ml CLI (pool-create, pool-wait, pool-run)
+│   └── SSH tunnels to each worker
+│
+Azure (N VMs, Standard_D8ds_v5)
+├── waa-pool-00
+│   └── Docker
+│       └── windowsarena/winarena:latest
+│           └── QEMU (Windows 11)
+│               ├── WAA Flask server (port 5000)
+│               └── Navi agent (GPT-4o-mini)
+├── waa-pool-01
+│   └── ...
+└── waa-pool-N
+    └── ...
+```
+
+### 14.6 Cost Estimates
+
+| VM Size | vCPUs | RAM | Cost/hr | 5 VMs for 10hrs |
+|---------|-------|-----|---------|-----------------|
+| Standard_D8ds_v5 | 8 | 32GB | ~$0.38 | ~$19 |
+
+**Tips:**
+- Always run `pool-delete -y` when done
+- Use `vm deallocate` (not delete) to pause billing but keep disk
+- Set `--auto-shutdown-hours 2` on `vm monitor` for safety
+
+---
+
+## 15. Limitations & Notes
 
 - **Apple Silicon / bitsandbytes**:
   - Example configs are sized for CPU / Apple Silicon development runs; see
@@ -995,7 +1128,7 @@ For deeper architectural details, see [`docs/design.md`](docs/design.md).
 
 ---
 
-## 15. Roadmap
+## 16. Roadmap
 
 For the up-to-date, prioritized roadmap (including concrete implementation
 targets and agent-executable acceptance criteria), see

From 5d3f4f86ff825b5fdd867b7402acc9cf84ad6c27 Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Wed, 4 Feb 2026 21:17:46 -0500
Subject: [PATCH 2/2] fix(readme): address self-review feedback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fix anchor placement (move before heading for proper navigation)
- Correct pool-delete → pool-cleanup (actual command name)
- Add pool-status example for getting worker IPs
- Add "prices vary by region" caveat

Co-Authored-By: Claude Opus 4.5
---
 README.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 833f876..dd7b38c 100644
--- a/README.md
+++ b/README.md
@@ -1003,10 +1003,10 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t
 
 ---
 
-## 14. WAA Benchmark Workflow
-
 
 
+## 14. WAA Benchmark Workflow
+
 Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
 
 ### 14.1 Prerequisites
@@ -1053,7 +1053,7 @@ uv run python -m openadapt_ml.benchmarks.cli pool-status
 uv run python -m openadapt_ml.benchmarks.cli pool-logs
 
 # 5. Cleanup (delete all VMs - IMPORTANT to stop billing!)
-uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
+uv run python -m openadapt_ml.benchmarks.cli pool-cleanup
 ```
 
 ### 14.4 VNC Access to Workers
@@ -1061,6 +1061,9 @@ uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
 View what each Windows VM is doing:
 
 ```bash
+# Get worker IPs
+uv run python -m openadapt_ml.benchmarks.cli pool-status
+
 # Set up SSH tunnels (tunnels are created automatically, but you can also do this manually)
 ssh -f -N -L 8006:localhost:8006 azureuser@ # localhost:8006
 ssh -f -N -L 8007:localhost:8006 azureuser@ # localhost:8007
@@ -1098,9 +1101,10 @@ Azure (N VMs, Standard_D8ds_v5)
 | Standard_D8ds_v5 | 8 | 32GB | ~$0.38 | ~$19 |
 
 **Tips:**
-- Always run `pool-delete -y` when done
+- Always run `pool-cleanup` when done to delete VMs and stop billing
 - Use `vm deallocate` (not delete) to pause billing but keep disk
 - Set `--auto-shutdown-hours 2` on `vm monitor` for safety
+- Prices vary by Azure region
 
 ---
 
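
The figures this series adds to the README (154 tasks, round-robin assignment, ~$0.38/hr, ~$19 for 5 VMs over 10 hours) can be sanity-checked with a short Python sketch. It is not code from `openadapt_ml` or WAA: `shard_tasks` and `pool_cost_usd` are hypothetical helper names, and the sketch assumes "round-robin" means task `i` goes to worker `i % num_workers`, which the README does not spell out.

```python
"""Back-of-envelope helpers for sizing a WAA worker pool (illustrative only)."""

TOTAL_TASKS = 154        # WAA task count quoted in the README
HOURLY_RATE_USD = 0.38   # approximate Standard_D8ds_v5 rate quoted in the README


def shard_tasks(task_ids, worker_id, num_workers):
    """Return the tasks one worker would receive under round-robin assignment.

    Assumption (not confirmed by the README): task i goes to worker
    i % num_workers, which is what the slice below implements.
    """
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return task_ids[worker_id::num_workers]


def pool_cost_usd(num_workers, hours, hourly_rate=HOURLY_RATE_USD):
    """Estimated compute cost for num_workers VMs each running for `hours`."""
    return num_workers * hours * hourly_rate


if __name__ == "__main__":
    tasks = list(range(TOTAL_TASKS))
    shards = [shard_tasks(tasks, w, 5) for w in range(5)]
    print([len(s) for s in shards])        # tasks per worker
    print(f"${pool_cost_usd(5, 10):.2f}")  # 5 VMs for 10 hours
```

With 5 workers the split is 31, 31, 31, 31, 30 tasks, and the cost line prints `$19.00`, matching the "5 VMs for 10hrs" column. The cost figure covers VM compute only at the quoted rate and, as the second patch notes, prices vary by Azure region.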