From b389d3edc889055ed5744e72f15db2d322eb3ba5 Mon Sep 17 00:00:00 2001
From: Richard Abrich <richard.abrich@gmail.com>
Date: Wed, 4 Feb 2026 21:14:32 -0500
Subject: [PATCH 1/2] docs(readme): add parallel WAA evaluation section, fix
 build badge
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (section 14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 README.md | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 136 insertions(+), 3 deletions(-)
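
Note for reviewers (text between the diffstat and the diff is ignored by
`git am`): the Performance table added below follows from splitting the 154
tasks round-robin and waiting for the busiest worker. A minimal Python sketch
of that arithmetic is given here. The rule "task i goes to worker
i % num_workers" is an assumption about how `--worker_id`/`--num_workers`
behave, and `assign_round_robin` and `estimate_hours` are hypothetical helper
names, not part of the openadapt-ml CLI.

```python
from math import ceil

TOTAL_TASKS = 154
SINGLE_WORKER_HOURS = (50, 80)  # single-worker estimate from the README table


def assign_round_robin(num_tasks: int, num_workers: int) -> list[list[int]]:
    """Give task i to worker i % num_workers (assumed round-robin rule)."""
    return [list(range(w, num_tasks, num_workers)) for w in range(num_workers)]


def estimate_hours(num_workers: int) -> tuple[float, float]:
    """Wall-clock range, bounded by the worker that holds the most tasks."""
    busiest = ceil(TOTAL_TASKS / num_workers)
    low, high = SINGLE_WORKER_HOURS
    return (low * busiest / TOTAL_TASKS, high * busiest / TOTAL_TASKS)


if __name__ == "__main__":
    for n in (1, 5, 10):
        sizes = [len(tasks) for tasks in assign_round_robin(TOTAL_TASKS, n)]
        low, high = estimate_hours(n)
        print(f"{n:>2} workers: max {max(sizes)} tasks/worker, ~{low:.0f}-{high:.0f} h")
```

With 5 workers the busiest worker holds 31 tasks and with 10 workers it holds
16, which reproduces the ~10-16 h and ~5-8 h rows of the table. The sketch
leaves out VM provisioning and Windows boot time.
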
diff --git a/README.md b/README.md
index 2c5164f..833f876 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # OpenAdapt-ML
 
-[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml)
+[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml)
 [![PyPI version](https://img.shields.io/pypi/v/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -30,6 +30,38 @@ The design is described in detail in [`docs/design.md`](docs/design.md).
 
 ---
 
+## Parallel WAA Benchmark Evaluation (New in v0.3.0)
+
+Run Windows Agent Arena benchmarks across multiple Azure VMs in parallel for faster evaluation:
+
+```bash
+# Create a pool of 5 workers
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# Wait for all workers to be ready
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# Run 154 tasks distributed across workers (~5x faster with 5 workers)
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+```
+
+**Key features:**
+- **Parallel execution**: Distribute 154 WAA tasks across N workers
+- **Automatic task distribution**: Uses WAA's native `--worker_id`/`--num_workers` for round-robin assignment
+- **VNC access**: View each Windows VM via SSH tunnels (`localhost:8006`, `localhost:8007`, etc.)
+- **Cost tracking**: Monitor Azure VM costs in real time
+
+**Performance:**
+| Workers | Estimated Time (154 tasks) |
+|---------|---------------------------|
+| 1       | ~50-80 hours              |
+| 5       | ~10-16 hours              |
+| 10      | ~5-8 hours                |
+
+See [WAA Benchmark Workflow](#waa-benchmark-workflow) for complete setup instructions.
+
+---
+
 ## 1. Installation
 
 ### 1.1 From PyPI (recommended)
@@ -971,7 +1003,108 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t
 
 ---
 
-## 14. Limitations & Notes
+## 14. WAA Benchmark Workflow
+
+<a id="waa-benchmark-workflow"></a>
+
+Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
+
+### 14.1 Prerequisites
+
+1. **Azure CLI**: `brew install azure-cli && az login`
+2. **OpenAI API Key**: Set in `.env` file (`OPENAI_API_KEY=sk-...`)
+3. **Azure quota**: Ddsv5 family VMs (8+ vCPUs per worker)
+
+### 14.2 Single VM Workflow
+
+For quick testing or small runs:
+
+```bash
+# Setup VM with WAA
+uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
+
+# Start monitoring dashboard (auto-opens VNC, manages SSH tunnels)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+
+# Run benchmark
+uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10
+
+# Deallocate when done (stops compute billing; the disk is still billed)
+uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+```
+
+### 14.3 Parallel Pool Workflow (Recommended)
+
+For full 154-task evaluations, use multiple VMs:
+
+```bash
+# 1. Create pool (provisions N Azure VMs with Docker + WAA)
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# 2. Wait for all workers to be ready (Windows boot + WAA server startup)
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# 3. Run benchmark across all workers
+#    Tasks are distributed using WAA's native --worker_id/--num_workers
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+
+# 4. Monitor progress
+uv run python -m openadapt_ml.benchmarks.cli pool-status
+uv run python -m openadapt_ml.benchmarks.cli pool-logs
+
+# 5. Cleanup (delete all VMs - IMPORTANT to stop billing!)
+uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
+```
+
+### 14.4 VNC Access to Workers
+
+View what each Windows VM is doing:
+
+```bash
+# Set up SSH tunnels (tunnels are created automatically, but you can also do this manually)
+ssh -f -N -L 8006:localhost:8006 azureuser@<worker-0-ip>  # localhost:8006
+ssh -f -N -L 8007:localhost:8006 azureuser@<worker-1-ip>  # localhost:8007
+# etc.
+
+# Open in browser
+open http://localhost:8006  # Worker 0
+open http://localhost:8007  # Worker 1
+```
+
+### 14.5 Architecture
+
+```
+Local Machine
+├── openadapt-ml CLI (pool-create, pool-wait, pool-run)
+│   └── SSH tunnels to each worker
+│
+Azure (N VMs, Standard_D8ds_v5)
+├── waa-pool-00
+│   └── Docker
+│       └── windowsarena/winarena:latest
+│           └── QEMU (Windows 11)
+│               ├── WAA Flask server (port 5000)
+│               └── Navi agent (GPT-4o-mini)
+├── waa-pool-01
+│   └── ...
+└── waa-pool-N
+    └── ...
+```
+
+### 14.6 Cost Estimates
+
+| VM Size | vCPUs | RAM | Cost/hr | 5 VMs for 10hrs |
+|---------|-------|-----|---------|-----------------|
+| Standard_D8ds_v5 | 8 | 32GB | ~$0.38 | ~$19 |
+
+**Tips:**
+- Always run `pool-delete -y` when done
+- Use `vm deallocate` (not delete) to pause billing but keep disk
+- Set `--auto-shutdown-hours 2` on `vm monitor` for safety
+
+---
+
+## 15. Limitations & Notes
 
 - **Apple Silicon / bitsandbytes**:
   - Example configs are sized for CPU / Apple Silicon development runs; see
@@ -995,7 +1128,7 @@ For deeper architectural details, see [`docs/design.md`](docs/design.md).
 
 ---
 
-## 15. Roadmap
+## 16. Roadmap
 
 For the up-to-date, prioritized roadmap (including concrete implementation
 targets and agent-executable acceptance criteria), see

From 5d3f4f86ff825b5fdd867b7402acc9cf84ad6c27 Mon Sep 17 00:00:00 2001
From: Richard Abrich <richard.abrich@gmail.com>
Date: Wed, 4 Feb 2026 21:17:46 -0500
Subject: [PATCH 2/2] fix(readme): address self-review feedback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fix anchor placement (move before heading for proper navigation)
- Correct pool-delete → pool-cleanup (actual command name)
- Add pool-status example for getting worker IPs
- Add "prices vary by region" caveat

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 README.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
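
Note for reviewers (ignored by `git am`): the "~$19" figure in the Cost
Estimates table is VM count times hours times hourly rate. The sketch below
shows that arithmetic, assuming the approximate $0.38/hr rate from the table.
`pool_cost` and `HOURLY_RATE_USD` are hypothetical names, and the estimate
covers compute only (no disk or network charges).

```python
HOURLY_RATE_USD = 0.38  # approximate Standard_D8ds_v5 rate from the README table


def pool_cost(num_vms: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Compute-only cost in USD for num_vms VMs that each run for `hours`."""
    return num_vms * hours * rate


if __name__ == "__main__":
    print(f"5 VMs x 10 h: ~${pool_cost(5, 10):.2f}")
    print(f"5 VMs x 16 h: ~${pool_cost(5, 16):.2f}")
```

At the upper end of the 5-worker estimate (16 h) the same rate gives roughly
$30, so the table's $19 is best read as a lower bound for a full run.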

diff --git a/README.md b/README.md
index 833f876..dd7b38c 100644
--- a/README.md
+++ b/README.md
@@ -1003,10 +1003,10 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t
 
 ---
 
-## 14. WAA Benchmark Workflow
-
 <a id="waa-benchmark-workflow"></a>
 
+## 14. WAA Benchmark Workflow
+
 Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
 
 ### 14.1 Prerequisites
@@ -1053,7 +1053,7 @@ uv run python -m openadapt_ml.benchmarks.cli pool-status
 uv run python -m openadapt_ml.benchmarks.cli pool-logs
 
 # 5. Cleanup (delete all VMs - IMPORTANT to stop billing!)
-uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
+uv run python -m openadapt_ml.benchmarks.cli pool-cleanup
 ```
 
 ### 14.4 VNC Access to Workers
@@ -1061,6 +1061,9 @@ uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
 View what each Windows VM is doing:
 
 ```bash
+# Get worker IPs
+uv run python -m openadapt_ml.benchmarks.cli pool-status
+
 # Set up SSH tunnels (tunnels are created automatically, but you can also do this manually)
 ssh -f -N -L 8006:localhost:8006 azureuser@<worker-0-ip>  # localhost:8006
 ssh -f -N -L 8007:localhost:8006 azureuser@<worker-1-ip>  # localhost:8007
@@ -1098,9 +1101,10 @@ Azure (N VMs, Standard_D8ds_v5)
 | Standard_D8ds_v5 | 8 | 32GB | ~$0.38 | ~$19 |
 
 **Tips:**
-- Always run `pool-delete -y` when done
+- Always run `pool-cleanup` when done to delete VMs and stop billing
 - Use `vm deallocate` (not delete) to pause billing but keep disk
 - Set `--auto-shutdown-hours 2` on `vm monitor` for safety
+- Prices vary by Azure region
 
 ---