111 changes: 111 additions & 0 deletions scripts/e2e_eval/README.md
@@ -91,6 +91,117 @@ uv run python scripts/e2e_eval/run_eval.py --retry-failed
| `--verbose` | off | Print stderr for failed models |
| `--continue` | off | Skip models with existing results |
| `--retry-failed [TYPE ...]` | — | Re-run failed models (implies `--continue`) |
| `--build-only` | off | Build with `--no-compile`, writing each stage's ONNX (no EP needed). Loops the EP matrix when `--ep`/`--device` omitted |

#### `--build-only` — Generate per-stage models (no EP required)

`--build-only` runs the config and build steps with `--no-compile` and writes each
stage's ONNX file: `export.onnx`, `optimized.onnx`, and `quantized.onnx`. Because the
compile step is skipped, this needs **no execution-provider hardware** and runs on any
CPU machine. The perf and accuracy phases are skipped.

When `--ep`/`--device` are **omitted**, every model is built once per EP in the
build-only matrix, each into a `<ep>_<device>/` subdir:

| Label | EP | Device |
|---|---|---|
| `qnn_npu` | qnn | npu |
| `qnn_gpu` | qnn | gpu |
| `ov_cpu` | openvino | cpu |
| `ov_npu` | openvino | npu |
| `ov_gpu` | openvino | gpu |
| `mlas_cpu` | cpu (MLAS) | cpu |
| `dml_gpu` | dml | gpu |
| `vitisai_npu` | vitisai | npu |

Precision per combo follows the eval policy: NPU combos default to `w8a16`, CPU and GPU
combos omit the precision flag (winml chooses automatically), and native-quant EPs
(VitisAI) are built unquantized (`--no-quant`).
When `--ep` or `--device` is pinned, a single build is written directly into
`<output-dir>/models/<slug>/`.

```bash
# Build all EP-matrix variants for P0 models (8 builds per model)
uv run python scripts/e2e_eval/run_eval.py --build-only --priority P0

# Pin a single EP/device (no matrix; writes directly to model dir)
uv run python scripts/e2e_eval/run_eval.py --build-only --hf-model microsoft/resnet-50 --ep qnn --device npu
```
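
The matrix and precision rules above can be sketched in Python. This is a minimal illustration, not the script's actual code: `Combo`, `precision_policy`, and `plan_builds` are hypothetical names, the policy values are descriptive strings rather than real CLI flags, and it assumes the table's Label column doubles as the `<ep>_<device>/` subdir name.

```python
from pathlib import Path
from typing import List, NamedTuple, Optional, Tuple


class Combo(NamedTuple):
    label: str   # assumed to double as the <ep>_<device>/ subdir name
    ep: str
    device: str


# The eight rows of the build-only matrix table.
BUILD_ONLY_MATRIX: List[Combo] = [
    Combo("qnn_npu", "qnn", "npu"),
    Combo("qnn_gpu", "qnn", "gpu"),
    Combo("ov_cpu", "openvino", "cpu"),
    Combo("ov_npu", "openvino", "npu"),
    Combo("ov_gpu", "openvino", "gpu"),
    Combo("mlas_cpu", "cpu", "cpu"),
    Combo("dml_gpu", "dml", "gpu"),
    Combo("vitisai_npu", "vitisai", "npu"),
]

# Assumption: VitisAI is the only native-quant EP; the text names no others.
NATIVE_QUANT_EPS = {"vitisai"}


def precision_policy(combo: Combo) -> str:
    """Return a descriptive policy name (not a real CLI flag) for one combo."""
    if combo.ep in NATIVE_QUANT_EPS:
        return "no-quant"   # built unquantized
    if combo.device == "npu":
        return "w8a16"      # NPU default
    return "auto"           # CPU/GPU: precision flag omitted


def plan_builds(
    output_dir: Path, slug: str, pinned: Optional[Combo] = None
) -> List[Tuple[Combo, Path]]:
    """Pair each combo to build with the directory it is written to."""
    model_dir = output_dir / "models" / slug
    if pinned is not None:
        # Pinned EP/device: one build, written directly into the model dir.
        return [(pinned, model_dir)]
    # No pin: one build per matrix row, each in its own subdir.
    return [(combo, model_dir / combo.label) for combo in BUILD_ONLY_MATRIX]
```

Checking the native-quant rule before the NPU rule is what makes `vitisai_npu` come out unquantized even though its device is an NPU.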

Composite models (multiple sub-components) are built into per-component subdirectories
under each EP subdir.

**Export dedup** (without `--upload`): the `export.onnx` stage is EP/device-independent,
so it is identical across all matrix combos. It is stored once under
`<model_dir>/_shared/export.onnx` and removed from each `<ep>_<device>/` subdir,
keeping only one copy on disk. With `--upload` each combo is published and deleted on
its own, so there is nothing to share and dedup is skipped.
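
As a rough sketch of the dedup step, the function below keeps the first `export.onnx` it finds under `_shared/` and removes the rest. `dedup_export` is a hypothetical helper, not the script's real function, and it assumes a flat layout with one `export.onnx` per combo subdir; composite models with per-component subdirs would need an extra directory level.

```python
import shutil
from pathlib import Path
from typing import Optional


def dedup_export(model_dir: Path) -> Optional[Path]:
    """Keep one export.onnx under _shared/ and remove the per-combo copies.

    Returns the shared path, or None when there is nothing to dedup.
    """
    shared = model_dir / "_shared" / "export.onnx"
    copies = sorted(
        p for p in model_dir.glob("*/export.onnx") if p.parent.name != "_shared"
    )
    if not copies and not shared.exists():
        return None
    for copy in copies:
        if not shared.exists():
            # The first copy becomes the shared one.
            shared.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(copy), str(shared))
        else:
            # Later copies are assumed identical (the export stage is
            # EP/device-independent), so they are simply deleted.
            copy.unlink()
    return shared
```

The sketch trusts that the copies are identical rather than comparing them; a more defensive variant could compare file hashes before deleting.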

#### Streaming upload to the Azure Artifacts feed (`--upload`)

Running the full matrix over many models fills the local disk quickly. `--upload`
publishes each **EP/device combo** to the **`Modelkit`** Azure Artifacts feed
(as a Universal Package) as soon as it is built, then deletes that combo's local copy.
Peak disk usage stays at roughly one combo, so even a large or slow upload of one
combo cannot fill the disk.

- **Auth**: uses `az login` (Entra ID) — no PAT. The script verifies the
`azure-devops` az extension is installed (auto-adds it) and that you're logged in;
if not, it aborts (so disk isn't silently filled).
- **Package**: one package, `winml-cli-models`, with **one version per combo**, named
`0.0.0-<run-stamp>-<ep>-<device>-<model-slug>`, where the run-stamp is a date
(`YYYYMMDD`, default today). For example:
`0.0.0-20260609-qnn-npu-microsoft-resnet-50-image-classification`. The `0.0.0`
core keeps the version valid SemVer 2.0; everything after the first hyphen is the
pre-release segment. Uploading per combo keeps each package small, which lowers the
per-upload timeout risk and lets a single combo be retried on its own.
- **Disk is always bounded**: unless `--keep-local` is set, each combo's local dir is
deleted after *every* outcome: uploaded, version-exists, upload-failed, **timed-out**,
or build-failed. A failed or timed-out combo is recorded and the run continues; a
host-level az failure (not logged in, or token expired) aborts the run so you can
re-authenticate and resume.
- A `build_only_results.json` log (combo version → build status + upload status +
error tail + timestamps) is written in the output dir for *every* run (with or
without `--upload`), so you can audit which combos succeeded, failed, or timed
out. It also drives `--continue` (skips combos already in the feed).
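
The version naming scheme in the list above can be sketched as follows. `combo_version` is a hypothetical helper, and the regular expression is a simplified check of the pre-release grammar, not the full SemVer 2.0 rule set. How the script derives the model slug is not shown here, so the slug is taken as an input.

```python
import re
from datetime import date
from typing import Optional

# Simplified check: a 0.0.0 core, a hyphen, then dot-separated pre-release
# identifiers made of ASCII letters, digits, and hyphens.
_VERSION_RE = re.compile(r"0\.0\.0-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*")


def combo_version(
    ep: str, device: str, model_slug: str, run_stamp: Optional[str] = None
) -> str:
    """Build `0.0.0-<run-stamp>-<ep>-<device>-<model-slug>` for one combo."""
    stamp = run_stamp or date.today().strftime("%Y%m%d")
    version = f"0.0.0-{stamp}-{ep}-{device}-{model_slug}"
    if not _VERSION_RE.fullmatch(version):
        # e.g. a slug containing "/" or "_" would not be valid SemVer.
        raise ValueError(f"not a valid pre-release version: {version!r}")
    return version
```

Because the run-stamp is part of the version, passing the same stamp on a later run reproduces the same version strings, which is what makes resuming possible.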

```bash
# Build the matrix and stream each combo to the feed, deleting local copies
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --priority P0

# Resume an interrupted batch: same run-stamp + --continue skips combos already
# uploaded (per the results log / feed) without rebuilding them.
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --continue \
    --run-stamp 20260609 --priority P0

# --upload-skip-existing: if the feed already has a version (e.g. results log lost),
# treat the publish conflict as done and delete the local copy.
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --upload-skip-existing

# Upload but keep local copies (debug)
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --keep-local
```
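
The "disk is always bounded" behaviour boils down to a build, upload, then delete loop in which cleanup runs whatever the outcome. The sketch below illustrates that pattern under stated assumptions: `process_combos`, `HostAuthError`, and the `build`/`upload` callables are hypothetical stand-ins, not the script's real functions, and deleting the combo dir on a host-level abort is an assumption.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable


class HostAuthError(RuntimeError):
    """Hypothetical: az is not logged in, or its token has expired."""


def process_combos(
    combo_dirs: Iterable[Path],
    build: Callable[[Path], None],
    upload: Callable[[Path], None],
    keep_local: bool = False,
) -> Dict[str, str]:
    """Build and upload each combo, always cleaning up its local dir."""
    results: Dict[str, str] = {}
    for combo_dir in combo_dirs:
        phase = "build"
        try:
            build(combo_dir)
            phase = "upload"
            upload(combo_dir)
            results[combo_dir.name] = "uploaded"
        except HostAuthError:
            raise  # host-level failure: abort the whole run
        except subprocess.TimeoutExpired:
            results[combo_dir.name] = "timed-out"
        except Exception:
            results[combo_dir.name] = f"{phase}-failed"
        finally:
            # Runs after every outcome, so peak disk stays near one combo.
            # Assumption: this also applies when the run aborts.
            if not keep_local:
                shutil.rmtree(combo_dir, ignore_errors=True)
    return results
```

Putting the delete in `finally` rather than after the upload call is the design point: a failed or timed-out combo cannot leave its directory behind.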

To download a single file from one combo's package later, pass `--file-filter`:

```bash
az artifacts universal download \
    --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \
    --scope project --feed Modelkit --name winml-cli-models \
    --version 0.0.0-20260609-qnn-npu-microsoft-resnet-50-image-classification \
    --path ./out --file-filter 'quantized.onnx'
```
```
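
If the download needs to be scripted, the same command can be assembled in Python. This is a sketch only: `download_args` and `run_download` are hypothetical helper names, and the flags and defaults are copied from the command above rather than from the script itself.

```python
import shutil
import subprocess
from typing import List, Optional


def download_args(
    version: str,
    dest: str,
    file_filter: Optional[str] = None,
    organization: str = "https://dev.azure.com/microsoft",
    project: str = "windows.ai.toolkit",
    feed: str = "Modelkit",
    name: str = "winml-cli-models",
) -> List[str]:
    """Assemble the `az artifacts universal download` command line."""
    argv = [
        "az", "artifacts", "universal", "download",
        "--organization", organization,
        "--project", project,
        "--scope", "project",
        "--feed", feed,
        "--name", name,
        "--version", version,
        "--path", dest,
    ]
    if file_filter:
        argv += ["--file-filter", file_filter]
    return argv


def run_download(version: str, dest: str, file_filter: Optional[str] = None) -> None:
    """Run the download; requires a logged-in az CLI."""
    argv = download_args(version, dest, file_filter)
    # Resolve the executable first: on Windows the CLI is a .cmd wrapper,
    # which a bare "az" may not find without a shell.
    argv[0] = shutil.which("az") or "az"
    subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the filter pattern.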

| Upload flag | Default | Description |
|---|---|---|
| `--upload` | off | Publish each EP/device combo to the feed, then delete it locally |
| `--run-stamp` | today (`YYYYMMDD`) | Date stamp embedded in each combo's version; pass the same stamp + `--continue` to resume |
| `--continue` | off | Skip combos already uploaded for this run-stamp (no rebuild) |
| `--feed` | `Modelkit` | Azure Artifacts feed name |
| `--feed-org` | `https://dev.azure.com/microsoft` | Azure DevOps org URL |
| `--feed-project` | `windows.ai.toolkit` | Project for the project-scoped feed |
| `--package-name` | `winml-cli-models` | Universal Package name |
| `--keep-local` | off | Upload but do not delete local combos (also keeps build-failed combos) |
| `--upload-skip-existing` | off | Treat an existing feed version as done (feed-based resume) |

### `generate_report.py` — Regenerate Reports
