diff --git a/.agents/verification.md b/.agents/verification.md
index 0f02e1237..6f1dac4d4 100644
--- a/.agents/verification.md
+++ b/.agents/verification.md
@@ -129,6 +129,8 @@ Use live dogfood before marking PRs ready when they affect eval execution, exper
 - For native experiment changes, run through `agentv eval run ... --experiment ` so resolution, setup, scripts, target selection, run knobs, and artifact metadata are exercised together.
 - For repeat-run changes, use an experiment-level repeat config with `count >= 2`, `early_exit: false` when validating all attempts are persisted. Inspect root `index.jsonl`, root `benchmark.json`, and the repeated case folder. The repeated case folder should carry aggregate `summary.json` with flattened snake_case timing fields plus AgentV aggregate `grading.json`; attempt-specific outputs and transcripts live under `run-N/`. Each `run-N/` folder should contain `result.json`, `grading.json`, `transcript.json`, `transcript-raw.jsonl`, and `outputs/answer.md`. Do not write per-run `metrics.json`; timing and o11y fields belong in `result.json`, and `result.json` points at `./grading.json` through `grading_path`.
 - For local OpenAI-compatible grading through the OAuth proxy, use `endpoint: http://127.0.0.1:10531/v1`, but still route `api_key` and `model` through environment references such as `${{ LOCAL_OPENAI_PROXY_API_KEY }}` and `${{ LOCAL_OPENAI_PROXY_MODEL }}`. Literal secrets and literal model values are intentionally rejected by target validation unless a resolver explicitly allows them.
+- For `codex`/Codex SDK live dogfood through the same local proxy, configure the agent target with `provider: codex`, `base_url: ${{ LOCAL_OPENAI_PROXY_BASE_URL }}`, `api_key: ${{ LOCAL_OPENAI_PROXY_API_KEY }}`, `model: ${{ LOCAL_OPENAI_PROXY_MODEL }}`, `api_format: responses`, `grader_target: `, `workers: 1`, and a bounded `timeout_seconds`. Configure the grader target as `provider: openai`, `api_format: chat`, and the same local proxy env references. A minimal run should use `bun apps/cli/src/cli.ts eval run --targets --target --workers 1`.
+- If the local proxy returns `401 token_expired`, the blocker is stale Codex OAuth, not AgentV target configuration. Refresh from a trusted local terminal with `codex logout`, `codex login --device-auth`, then restart `openai-oauth` and rerun the same eval command.
 - Preserve review evidence in `agentv-private` on an `evidence/` branch. Include the run bundle, source eval/experiment/targets files, a short README, an artifact tree, and screenshots when folder structure or UI behavior is under review.
 - If comparing against an external convention such as Vercel `agent-eval`, verify both semantic provenance and the physical `run-N` artifact layout for repeat runs.
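
Reviewer note: the repeat-run artifact layout described in the context lines can be spot-checked mechanically. The sketch below is a minimal illustration, not part of AgentV. The function name `check_repeated_case` is hypothetical. It assumes at least two `run-N/` folders (matching `count >= 2` with `early_exit: false`) and that `grading_path` holds the literal string `./grading.json`. Adjust it if the real artifacts differ.

```python
import json
import re
from pathlib import Path

# Files each run-N/ folder should contain, per the verification notes.
REQUIRED_RUN_FILES = (
    "result.json",
    "grading.json",
    "transcript.json",
    "transcript-raw.jsonl",
    "outputs/answer.md",
)

RUN_DIR_PATTERN = re.compile(r"^run-\d+$")


def check_repeated_case(case_dir):
    """Return a list of problems found under one repeated case folder.

    Hypothetical helper: an empty list means the layout matches the
    expectations stated in the verification notes.
    """
    case_dir = Path(case_dir)
    problems = []

    # Aggregate artifacts live directly in the repeated case folder.
    for name in ("summary.json", "grading.json"):
        if not (case_dir / name).is_file():
            problems.append(f"missing aggregate {name}")

    run_dirs = sorted(
        p for p in case_dir.iterdir()
        if p.is_dir() and RUN_DIR_PATTERN.match(p.name)
    )
    # Assumption: count >= 2 and early_exit: false, so every attempt persists.
    if len(run_dirs) < 2:
        problems.append(f"expected at least 2 run-N folders, found {len(run_dirs)}")

    for run_dir in run_dirs:
        for rel in REQUIRED_RUN_FILES:
            if not (run_dir / rel).is_file():
                problems.append(f"{run_dir.name}: missing {rel}")

        # Per-run metrics.json should not be written.
        if (run_dir / "metrics.json").exists():
            problems.append(f"{run_dir.name}: unexpected metrics.json")

        result_path = run_dir / "result.json"
        if result_path.is_file():
            try:
                result = json.loads(result_path.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                problems.append(f"{run_dir.name}: result.json is not valid JSON")
                continue
            # Assumption: grading_path is stored exactly as "./grading.json".
            if not isinstance(result, dict) or result.get("grading_path") != "./grading.json":
                problems.append(f"{run_dir.name}: result.json grading_path is not ./grading.json")

    return problems
```

Calling `check_repeated_case(path_to_case_folder)` on a repeated case folder from a dogfood run returns an empty list when the layout matches. Any returned strings can be pasted into the evidence README alongside the artifact tree.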