From ee2715bf2bbde6ac5d82fd54e84fc8422323fbe8 Mon Sep 17 00:00:00 2001
From: "const.koutsakis@aurecongroup.com"
Date: Mon, 25 May 2026 23:56:10 +1000
Subject: [PATCH 1/4] feat: eval pattern examples calling Azure OpenAI (#94)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The eval slice previously shipped one toy case (echo-hello) and a
disabled-by-default nightly. A reader expecting an LLM-eval story found
the infrastructure without conviction.

Adds four worked-pattern cases that exercise the existing three tolerance
modes against a real Azure OpenAI deployment. These are not benchmarks —
they demonstrate what an eval case *looks like* for the four LLM-eval
patterns you most often need to write:

  - factual-http-200              exact_match       format-constrained recall
  - numeric-seconds-per-day       numeric_close     numeric reasoning + tolerance
  - definitional-fastapi-depends  semantic_similar  free-form judge-scored prose
  - structured-json-status        exact_match       structured-output adherence

When the template is forked for a real project, replace these four with
cases that exercise the project's own prompts; the patterns transfer
regardless of what product is bolted on.

Provider choice — Azure OpenAI via the openai SDK with AzureOpenAI
client — is intentionally distinct from the rest of the harness (which
uses Claude via Claude Code). Demonstrates that the LLMClient Protocol in
src/eval/judge.py does its job: the eval core never imports openai,
vendor lock-in lives only in the adapter.

Changes:

- src/eval/adapters/azure_openai.py — implements LLMClient via the
  openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version from
  env. Lazy-imports the SDK so the module is importable without the
  optional extra installed; the adapter raises a clear
  AzureOpenAIConfigError if the env or SDK is missing.
- eval/golden_patterns.json — the four cases with notes explaining which
  pattern each demonstrates.
- eval/test_golden_patterns.py — separate test file gated on the Azure
  env vars via pytestmark. Skipped on a stock checkout, so
  `uv run pytest eval/` always exits 0. The toy test_golden_qa.py keeps
  running as before.
- pyproject.toml — new optional [project.optional-dependencies] eval
  extra (just `openai>=1.40.0`), mypy override for openai.* matching the
  existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11 self-version
  bump.
- .github/workflows/eval-nightly.yml — env vars renamed from the
  placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated with
  the Azure setup recipe. uv sync now passes --extra eval.
- docs/EVAL_HARNESS.md — new "Worked patterns" section with the table
  mapping case -> tolerance -> pattern, the local setup recipe, and a
  "Swapping providers" note documenting the Protocol-based extension
  path.

Local gates: mypy --strict clean on 42 source files (was 31), ruff clean,
ruff format clean, import-linter both contracts kept, 192 unit tests
pass, eval/ runs 1 passed + 4 skipped without LLM env.

Closes #94
---
 .github/workflows/eval-nightly.yml |  35 ++++----
 docs/EVAL_HARNESS.md               |  52 ++++++++++--
 eval/golden_patterns.json          |  38 +++++++++
 eval/test_golden_patterns.py       |  81 +++++++++++++++++++
 pyproject.toml                     |  12 +++
 src/eval/adapters/__init__.py      |  13 +++
 src/eval/adapters/azure_openai.py  | 123 +++++++++++++++++++++++++++++
 uv.lock                            |  90 ++++++++++++++++++++-
 8 files changed, 421 insertions(+), 23 deletions(-)
 create mode 100644 eval/golden_patterns.json
 create mode 100644 eval/test_golden_patterns.py
 create mode 100644 src/eval/adapters/__init__.py
 create mode 100644 src/eval/adapters/azure_openai.py

diff --git a/.github/workflows/eval-nightly.yml b/.github/workflows/eval-nightly.yml
index 9020446..2ca8981 100644
--- a/.github/workflows/eval-nightly.yml
+++ b/.github/workflows/eval-nightly.yml
@@ -1,12 +1,15 @@
 # Eval harness nightly — disabled-by-default.
 #
-# This workflow runs the golden QA dataset against the agent / LLM loop. It
-# is `workflow_dispatch`-only by default to prevent accidental LLM API
-# spend. To enable nightly runs:
+# This workflow runs the golden QA dataset + worked-pattern cases against a
+# real Azure OpenAI deployment. It is `workflow_dispatch`-only by default
+# to prevent accidental API spend. To enable nightly runs:
+#
+# 1. Set the Azure OpenAI secrets in repo settings:
+#      AZURE_OPENAI_ENDPOINT       e.g. https://my.openai.azure.com
+#      AZURE_OPENAI_API_KEY        the Azure resource key
+#      AZURE_OPENAI_DEPLOYMENT     deployment name, e.g. gpt-4o-mini
+#      AZURE_OPENAI_API_VERSION    optional, defaults to 2024-10-21
 #
-# 1. Set the LLM secrets in repo settings (LLM_API_KEY at minimum;
-#    LLM_BASE_URL / LLM_MODEL / LLM_PROVIDER if your judge differs from
-#    OpenAI defaults).
 # 2. Replace the `on:` block below with:
 #
 #    on:
@@ -14,9 +17,13 @@
 #        - cron: "0 6 * * *"  # daily 06:00 UTC
 #      workflow_dispatch:
 #
-# 3. Add the `eval-nightly.yml` to EXEMPT_WORKFLOWS in
-#    `.github/scripts/check_required_contexts.py` if it's not already
-#    there (it is, by default — scheduled runs never gate PRs).
+# 3. Confirm `eval-nightly.yml` is in EXEMPT_WORKFLOWS in
+#    `.github/scripts/check_required_contexts.py` (it is, by default
+#    — scheduled runs never gate PRs).
+#
+# When the Azure secrets are absent, eval/test_golden_patterns.py is
+# skipped via pytestmark — the toy eval/test_golden_qa.py case still
+# runs as a smoke check on the runner mechanics.
 #
 # See docs/EVAL_HARNESS.md for the full setup story.

@@ -43,11 +50,11 @@ jobs:
       - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
         with:
           python-version: ${{ inputs.python_version || '3.14' }}
-      - run: uv sync --frozen --extra dev
+      - run: uv sync --frozen --extra dev --extra eval
       - name: Run pytest eval/
         env:
-          LLM_PROVIDER: ${{ secrets.LLM_PROVIDER }}
-          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
-          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
-          LLM_MODEL: ${{ secrets.LLM_MODEL }}
+          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
+          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
+          AZURE_OPENAI_DEPLOYMENT: ${{ secrets.AZURE_OPENAI_DEPLOYMENT }}
+          AZURE_OPENAI_API_VERSION: ${{ secrets.AZURE_OPENAI_API_VERSION }}
         run: uv run pytest eval/ -v
diff --git a/docs/EVAL_HARNESS.md b/docs/EVAL_HARNESS.md
index ec115b1..d352636 100644
--- a/docs/EVAL_HARNESS.md
+++ b/docs/EVAL_HARNESS.md
@@ -6,15 +6,19 @@ LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, th

 ```
 src/eval/
-├── models.py     # EvalCase, EvalResult (Pydantic)
-├── runner.py     # EvalRunner — generic, takes a Callable[[str], str]
-├── judge.py      # LLMClient Protocol + semantic-similarity judge
-├── report.py     # Markdown report generator
-└── __main__.py   # python -m src.eval
+├── models.py             # EvalCase, EvalResult (Pydantic)
+├── runner.py             # EvalRunner — generic, takes a Callable[[str], str]
+├── judge.py              # LLMClient Protocol + semantic-similarity judge
+├── report.py             # Markdown report generator
+├── __main__.py           # python -m src.eval
+└── adapters/
+    └── azure_openai.py   # Concrete LLMClient for Azure OpenAI (optional extra)

 eval/
-├── golden_qa.json       # The dataset (one trivial example case ships)
-└── test_golden_qa.py    # Parametrised pytest runner
+├── golden_qa.json            # Toy smoke case — runs without LLM credentials
+├── test_golden_qa.py         # Parametrised runner for the toy case
+├── golden_patterns.json      # Four worked-pattern cases — require Azure OpenAI
+└── test_golden_patterns.py   # Skipped unless AZURE_OPENAI_* env vars are set
 ```

 ## How it works
@@ -86,11 +90,43 @@ python -m src.eval # CLI runner — prints the markdown report

 The pytest invocation is marked `@pytest.mark.eval`, so the default `pytest tests/` skips it.

+## Worked patterns (Azure OpenAI)
+
+The four cases in `eval/golden_patterns.json` are *not* benchmarks. They exist to demonstrate what an eval case looks like against each of the runner's tolerance modes; together they cover the four LLM-eval patterns you most often need to write:
+
+| Case ID | Tolerance | Pattern demonstrated |
+|---|---|---|
+| `factual-http-200` | `exact_match` | Format-constrained factual recall. The prompt forces a single canonical token; if the model wraps the answer in prose, the case fails loudly. |
+| `numeric-seconds-per-day` | `numeric_close` | Numeric reasoning with extraction tolerance. The runner pulls the first number from each side and compares within 1 %, so `86,400` and `86400 seconds` both match. |
+| `definitional-fastapi-depends` | `semantic_similar` | Free-form prose scored by an LLM judge at ≥ 0.8. Use for explanations and any case where wording can vary but the underlying claim is checkable. |
+| `structured-json-status` | `exact_match` | Structured-output adherence. The prompt asks for raw JSON; markdown-fenced or prose-wrapped responses fail — which is the failure mode downstream parsers also hit. |
+
+The cases all call a real Azure OpenAI deployment via the adapter at `src/eval/adapters/azure_openai.py`. When you fork the template for a real project, replace these four with cases that exercise your own product's prompts; the patterns transfer.
+
+### Setup
+
+```sh
+uv sync --extra dev --extra eval  # installs the openai SDK
+
+export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
+export AZURE_OPENAI_API_KEY="..."
+export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"  # or whatever you deployed
+export AZURE_OPENAI_API_VERSION="2024-10-21"  # optional, this is the default
+
+uv run pytest eval/test_golden_patterns.py -v
+```
+
+Without the env vars, `eval/test_golden_patterns.py` is skipped via `pytestmark` — `eval/test_golden_qa.py` still runs as a smoke check on the runner mechanics, so `uv run pytest eval/` always exits 0 on a fresh checkout.
+
+### Swapping providers
+
+`src/eval/judge.py` defines `LLMClient` as a `Protocol` — the eval core does not import `openai` anywhere. To target a different provider (Anthropic, vLLM, vanilla OpenAI), write a new adapter under `src/eval/adapters/` that implements `complete_json(*, model, prompt) -> str` and update the runner fixture in your test file. Nothing in `src/eval/` itself changes.
+
 ## Nightly opt-in

 `.github/workflows/eval-nightly.yml` ships `workflow_dispatch`-only by default to avoid accidental LLM API spend. To turn on a real nightly:

-1. Add the LLM secrets in repo settings: `LLM_API_KEY` (required), `LLM_PROVIDER`, `LLM_BASE_URL`, `LLM_MODEL` (optional, depending on adapter).
+1. Add the Azure OpenAI secrets in repo settings: `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, and optionally `AZURE_OPENAI_API_VERSION`.

 2. Replace the workflow's `on:` block with:

diff --git a/eval/golden_patterns.json b/eval/golden_patterns.json
new file mode 100644
index 0000000..d0b7316
--- /dev/null
+++ b/eval/golden_patterns.json
@@ -0,0 +1,38 @@
+[
+  {
+    "id": "factual-http-200",
+    "question": "What HTTP status code means OK? Respond with only the number, no prose.",
+    "category": "factual-recall",
+    "expected_answer": "200",
+    "tolerance": "exact_match",
+    "difficulty": "easy",
+    "notes": "Pattern: factual recall with format-constrained output. exact_match works because the prompt forces a single canonical token. If the model adds prose (\"The status code is 200.\") this fails loudly — which is the point: format adherence is part of the assertion."
+  },
+  {
+    "id": "numeric-seconds-per-day",
+    "question": "How many seconds are in 24 hours? Respond with the integer only.",
+    "category": "numeric-reasoning",
+    "expected_answer": "86400",
+    "tolerance": "numeric_close",
+    "difficulty": "easy",
+    "notes": "Pattern: numeric extraction with 1% tolerance. The runner pulls the first number from each side and compares ratios, so '86,400', '86400 seconds', and '86400.0' all match. Use this tolerance for math, conversions, and any case where formatting around the number is uninteresting."
+  },
+  {
+    "id": "definitional-fastapi-depends",
+    "question": "In one sentence: what does FastAPI's Depends() do?",
+    "category": "definitional",
+    "expected_answer": "Depends declares a callable that FastAPI resolves at request time and injects the result into the parameter, enabling dependency injection for things like authentication, database sessions, or settings.",
+    "tolerance": "semantic_similar",
+    "difficulty": "medium",
+    "notes": "Pattern: free-form prose scored by LLM judge. semantic_similar passes at score >= 0.8 via the judge in src/eval/judge.py. Use this for definitions, explanations, and any case where wording can legitimately vary but the underlying claim is checkable."
+  },
+  {
+    "id": "structured-json-status",
+    "question": "Return exactly this JSON object and nothing else (no markdown fence, no prose, no trailing newline): {\"ok\": true, \"version\": 1}",
+    "category": "structured-output",
+    "expected_answer": "{\"ok\": true, \"version\": 1}",
+    "tolerance": "exact_match",
+    "difficulty": "medium",
+    "notes": "Pattern: format adherence on structured output. Models commonly wrap JSON in ```json``` fences or add a preamble; exact_match after normalisation (lowercase + whitespace-collapse) accepts a clean response but rejects the fenced or prose-wrapped version. This is the failure mode you want to catch — downstream parsers break the same way."
+  }
+]
diff --git a/eval/test_golden_patterns.py b/eval/test_golden_patterns.py
new file mode 100644
index 0000000..932970a
--- /dev/null
+++ b/eval/test_golden_patterns.py
@@ -0,0 +1,81 @@
+"""LLM-eval pattern showcase — four worked cases that exercise the existing
+tolerance modes against a real Azure OpenAI deployment.
+
+Each case demonstrates a different eval *pattern* (see notes inside
+`eval/golden_patterns.json`):
+
+  - factual recall with exact_match
+  - numeric reasoning with numeric_close
+  - free-form definitional with semantic_similar
+  - structured-output adherence with exact_match
+
+This file is *skipped entirely* unless the Azure OpenAI env vars are set
+(`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`).
+Run with::
+
+    uv sync --extra eval --extra dev
+    AZURE_OPENAI_ENDPOINT=... AZURE_OPENAI_API_KEY=... \\
+    AZURE_OPENAI_DEPLOYMENT=... uv run pytest eval/test_golden_patterns.py
+
+The toy `eval/test_golden_qa.py` runs without any credentials — that one
+exercises the runner mechanics; this one exercises the runner against a
+real model.
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+import pytest
+
+from src.eval.models import EvalCase
+from src.eval.runner import EvalRunner, load_golden_dataset
+
+_PATTERNS_PATH = Path(__file__).resolve().parent / "golden_patterns.json"
+_REQUIRED_ENV = (
+    "AZURE_OPENAI_ENDPOINT",
+    "AZURE_OPENAI_API_KEY",
+    "AZURE_OPENAI_DEPLOYMENT",
+)
+
+_missing = [name for name in _REQUIRED_ENV if not os.environ.get(name)]
+pytestmark = [
+    pytest.mark.eval,
+    pytest.mark.skipif(
+        bool(_missing),
+        reason=f"requires Azure OpenAI env vars: missing {', '.join(_missing)}",
+    ),
+]
+
+patterns = load_golden_dataset(_PATTERNS_PATH)
+
+
+@pytest.fixture(scope="module")
+def runner() -> EvalRunner:
+    """Construct the runner with one Azure client serving both roles
+    (answer_fn and judge_client). Same deployment for cost simplicity;
+    a real project might split subject and judge models."""
+    from src.eval.adapters.azure_openai import AzureOpenAIClient
+
+    client = AzureOpenAIClient()
+    return EvalRunner(
+        answer_fn=client.complete,
+        judge_client=client,
+        # Azure addresses by deployment, set at client construction. The
+        # runner still passes this through for Protocol conformance.
+        judge_model="azure-deployment",
+    )
+
+
+@pytest.mark.parametrize("case", patterns, ids=lambda c: c.id)
+def test_golden_patterns(case: EvalCase, runner: EvalRunner) -> None:
+    """Run one worked pattern case against the live Azure deployment."""
+    result = runner.evaluate(case)
+    assert result.pass_result, (
+        f"[{case.id}] {case.category}/{case.difficulty}\n"
+        f"Q: {case.question}\n"
+        f"Expected: {case.expected_answer}\n"
+        f"Got: {result.actual_answer}\n"
+        f"Reason: {result.failure_reason}"
+    )
diff --git a/pyproject.toml b/pyproject.toml
index 0651387..a416243 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -55,6 +55,13 @@ dev = [
     "commitizen>=4.0.0",
     "pyyaml>=6.0.3",
 ]
+# Optional extra for the eval harness's LLM-backed pattern cases. Kept
+# separate from `dev` so a contributor working on backend/frontend code
+# never pulls the openai SDK or its transitive deps. See
+# docs/EVAL_HARNESS.md for the full setup.
+eval = [
+    "openai>=1.40.0",
+]

 [project.urls]
 Homepage = "https://github.com/constk/harness-python-react"
@@ -122,6 +129,11 @@ warn_unused_ignores = true
 [[tool.mypy.overrides]]
 module = [
     "opentelemetry.*",
+    # `openai` is an optional extra (see [project.optional-dependencies]).
+    # mypy on a stock `uv sync --extra dev` checkout doesn't see it; the
+    # adapter in src/eval/adapters/azure_openai.py wraps it in `Any` at
+    # the import boundary so the rest of src/ stays fully typed.
+    "openai.*",
 ]
 ignore_missing_imports = true

diff --git a/src/eval/adapters/__init__.py b/src/eval/adapters/__init__.py
new file mode 100644
index 0000000..7a11e47
--- /dev/null
+++ b/src/eval/adapters/__init__.py
@@ -0,0 +1,13 @@
+"""Concrete LLM-client adapters for the eval harness.
+
+The judge in `src.eval.judge` calls an `LLMClient` Protocol — never an SDK
+directly. Each adapter in this package implements that Protocol for one
+provider, so the eval core stays vendor-neutral and a downstream consumer
+can swap providers by changing one wiring line.
+
+Adapters are intentionally thin: env-driven construction, lazy SDK import,
+one `complete_json(...)` method. No retries, no streaming, no batching —
+the goal is "works for nightly eval runs", not "production-grade client".
+"""
+
+from __future__ import annotations
diff --git a/src/eval/adapters/azure_openai.py b/src/eval/adapters/azure_openai.py
new file mode 100644
index 0000000..a8ed742
--- /dev/null
+++ b/src/eval/adapters/azure_openai.py
@@ -0,0 +1,123 @@
+"""Azure OpenAI adapter implementing the eval-harness `LLMClient` Protocol.
+
+Why Azure and not vanilla OpenAI: the eval slice is intentionally
+provider-distinct from the rest of the harness (which uses Claude via
+Claude Code). Demonstrates that the `LLMClient` Protocol does its job —
+the eval core in `src/eval/judge.py` doesn't import the `openai` SDK
+anywhere.
+
+Env vars (read at construction time; all required except API version):
+
+    AZURE_OPENAI_ENDPOINT       e.g. https://my-resource.openai.azure.com
+    AZURE_OPENAI_API_KEY        the Azure resource key
+    AZURE_OPENAI_DEPLOYMENT     deployment name, e.g. "gpt-4o-mini"
+    AZURE_OPENAI_API_VERSION    optional; defaults to 2024-10-21
+
+The `openai` SDK is an *optional* extra (`uv sync --extra eval`). Importing
+this module does not require the SDK; only constructing `AzureOpenAIClient`
+does. That keeps the rest of the harness importable on a stock
+`uv sync --extra dev` checkout.
+"""
+
+from __future__ import annotations
+
+import os
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    from collections.abc import Mapping
+
+
+class AzureOpenAIConfigError(RuntimeError):
+    """Raised when required Azure OpenAI configuration is missing."""
+
+
+_REQUIRED_ENV = (
+    "AZURE_OPENAI_ENDPOINT",
+    "AZURE_OPENAI_API_KEY",
+    "AZURE_OPENAI_DEPLOYMENT",
+)
+_DEFAULT_API_VERSION = "2024-10-21"
+
+
+def _resolve_config(env: Mapping[str, str]) -> tuple[str, str, str, str]:
+    """Read the four config values from env; raise with all missing names."""
+    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "")
+    api_key = env.get("AZURE_OPENAI_API_KEY", "")
+    deployment = env.get("AZURE_OPENAI_DEPLOYMENT", "")
+    api_version = env.get("AZURE_OPENAI_API_VERSION", "") or _DEFAULT_API_VERSION
+
+    missing = [name for name in _REQUIRED_ENV if not env.get(name)]
+    if missing:
+        raise AzureOpenAIConfigError(
+            f"Missing required Azure OpenAI env vars: {', '.join(missing)}. "
+            "See docs/EVAL_HARNESS.md for the full setup."
+        )
+    return endpoint, api_key, deployment, api_version
+
+
+class AzureOpenAIClient:
+    """Implements `src.eval.judge.LLMClient` against an Azure OpenAI deployment.
+
+    Used in two roles by `eval/test_golden_patterns.py`:
+
+    1. As the `answer_fn` — the thing whose output we are evaluating.
+    2. As the `judge_client` — the LLM that scores `semantic_similar`
+       cases. Same deployment serves both for cost simplicity; a real
+       project might split judge and subject.
+    """
+
+    def __init__(self) -> None:
+        endpoint, api_key, deployment, api_version = _resolve_config(os.environ)
+        self._deployment = deployment
+
+        # Lazy SDK import: keeps the module importable without `openai`
+        # installed. Constructing the client without the extra is the
+        # error case, not importing the module.
+        try:
+            from openai import AzureOpenAI
+        except ImportError as exc:  # pragma: no cover - env-dependent
+            raise AzureOpenAIConfigError(
+                "openai SDK not installed. Run: uv sync --extra eval"
+            ) from exc
+
+        self._client: Any = AzureOpenAI(
+            azure_endpoint=endpoint,
+            api_key=api_key,
+            api_version=api_version,
+        )
+
+    def complete(self, prompt: str) -> str:
+        """Return the assistant's plain-text response to `prompt`.
+
+        Used as the eval runner's `answer_fn`. Returns "" if the model
+        returns no content (rare but possible for safety-filtered prompts).
+        """
+        response = self._client.chat.completions.create(
+            model=self._deployment,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        return response.choices[0].message.content or ""
+
+    def complete_json(self, *, model: str, prompt: str) -> str:
+        """Return the assistant's response as a raw JSON string.
+
+        Implements the `LLMClient` Protocol. The `model` argument is
+        accepted for Protocol conformance but ignored — Azure addresses
+        by deployment name, set at construction time. Uses Azure's
+        structured-output mode (`response_format={"type": "json_object"}`)
+        to guarantee parseable JSON.
+        """
+        del model  # Azure dispatches by deployment, not model
+        response = self._client.chat.completions.create(
+            model=self._deployment,
+            messages=[
+                {
+                    "role": "system",
+                    "content": "Respond with valid JSON only. No prose, no markdown.",
+                },
+                {"role": "user", "content": prompt},
+            ],
+            response_format={"type": "json_object"},
+        )
+        return response.choices[0].message.content or "{}"
diff --git a/uv.lock b/uv.lock
index e8fcd8c..8a7b1b7 100644
--- a/uv.lock
+++ b/uv.lock
@@ -226,6 +226,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" },
 ]

+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload-time = "2023-12-24T09:54:32.31Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload-time = "2023-12-24T09:54:30.421Z" },
+]
+
 [[package]]
 name = "fastapi"
 version = "0.136.1"
@@ -357,6 +366,9 @@ dev = [
     { name = "pyyaml" },
     { name = "ruff" },
 ]
+eval = [
+    { name = "openai" },
+]

 [package.metadata]
 requires-dist = [
@@ -365,6 +377,7 @@ requires-dist = [
     { name = "httpx", specifier = ">=0.28.1" },
     { name = "import-linter", marker = "extra == 'dev'", specifier = ">=2.0.0" },
     { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.15.0" },
+    { name = "openai", marker = "extra == 'eval'", specifier = ">=1.40.0" },
     { name = "opentelemetry-api", specifier = ">=1.33.0" },
     { name = "opentelemetry-exporter-otlp-proto-grpc", specifier = ">=1.33.0" },
     { name = "opentelemetry-instrumentation-fastapi", specifier = ">=0.62b0" },
@@ -382,7
+395,7 @@ requires-dist = [ { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.11.0" }, { name = "uvicorn", extras = ["standard"], specifier = ">=0.34.0" }, ] -provides-extras = ["dev"] +provides-extras = ["dev", "eval"] [[package]] name = "httpcore" @@ -493,6 +506,41 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" }, ] +[[package]] +name = "jiter" +version = "0.15.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/66/b5/55f06bb281d92fb3cc86d14e1def2bd908bb77693183e7cb1f5a3c388b0c/jiter-0.15.0.tar.gz", hash = "sha256:4251acc80e2b7c9b7b8823456ea0fceeb0734dac2df7636d3c711b38476b5a76", size = 166640, upload-time = "2026-05-19T10:09:48.361Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/eb/d2/079f350ebf7859d081de30aa890f9e3be68516f754f3ba32366ffff4dcee/jiter-0.15.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:ac0d9ddea4350974be7a221fc25895f251a8fee748c889bdced2141c0fec1a49", size = 308884, upload-time = "2026-05-19T10:08:31.667Z" }, + { url = "https://files.pythonhosted.org/packages/04/4e/a2c30a7f69b48c03b20935d647479106fe932f6e63f75faf53937197e05d/jiter-0.15.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:01a8222cf05ab1128e239421156c207949808acaaea2bdfd33130ae666786e86", size = 310028, upload-time = "2026-05-19T10:08:33.304Z" }, + { url = "https://files.pythonhosted.org/packages/40/90/2e7cdfd3cf8ca967be38c48f5cf474d79f089efaf559a40f15984a77ae69/jiter-0.15.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:182226cbc930c9fab81bc2e41a4da672f89539906dadb05e75670ac07b94f71f", size = 337485, upload-time = "2026-05-19T10:08:35.259Z" }, + { url = 
"https://files.pythonhosted.org/packages/9b/11/15a1aa28b120b8ee5b4f1fb894c125046225f09847738bd64233d3b84883/jiter-0.15.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:71683c38c825452999b5717fcae07ea708e8c93003e808be4319c1b02e3d176e", size = 364223, upload-time = "2026-05-19T10:08:36.694Z" }, + { url = "https://files.pythonhosted.org/packages/b7/25/f442e8af5f3d0dcf47b39e83a0efd9ee45ea946aa6d04625dc3181eae3b6/jiter-0.15.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:30f2218e6a9e5c18bc10fe6d41ac189c442c88eacf11bad9f28ef95a9bef00e6", size = 456387, upload-time = "2026-05-19T10:08:38.143Z" }, + { url = "https://files.pythonhosted.org/packages/da/f4/37f2d2c9f64f49af7da652ed7532bb5a2372e588e6927c3fdd76f911db65/jiter-0.15.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5157de9f76eb4bc5ea74a1219366a25f945ad305641d74e04f59c54087091aa9", size = 374461, upload-time = "2026-05-19T10:08:39.869Z" }, + { url = "https://files.pythonhosted.org/packages/60/28/edcfbbbf0cb15436f36664a8908a0df47ab9006298d4cd937dc08ea932d6/jiter-0.15.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:90c5db5527c221249a876160663ab891ace358c17f7b9c93ec1478b7f0550e5c", size = 345924, upload-time = "2026-05-19T10:08:41.668Z" }, + { url = "https://files.pythonhosted.org/packages/47/13/89fba6398dab7f202b7278c4b4aac122399d2c0183971c4a57a3b7088df5/jiter-0.15.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:3e4540b8e74e4268811ac05db226a6a128ff572e7e0ce3f1163b693cadb184cd", size = 352283, upload-time = "2026-05-19T10:08:43.091Z" }, + { url = "https://files.pythonhosted.org/packages/1b/da/0f6af8cef2c565a1ab44d970f268c43ccaa72707386ea6388e6fe2b6cd26/jiter-0.15.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:62ebd14e47e9aed9df4472afcb2663668ce4d74891cd54f86bf6e44029d6dc89", size = 389985, upload-time = "2026-05-19T10:08:44.915Z" }, + { url = 
"https://files.pythonhosted.org/packages/a1/ec/b9cb7d6d29e24ee14910266157d2a279d7a8f60ee0df7fa840882976ba64/jiter-0.15.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:0be6f5ad41a809f303f416d17cec92a7a725902fb9b4f3de3d19362ac0ef8554", size = 517695, upload-time = "2026-05-19T10:08:46.486Z" }, + { url = "https://files.pythonhosted.org/packages/64/5e/6d1bda880723aae0ad86b4b763f044362448efe31e3e819635d41cb03451/jiter-0.15.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:813dfbb17d65328bf86e5f0905dd277ba2265d3ca20556e86c0c7035b7182e5a", size = 548868, upload-time = "2026-05-19T10:08:48.026Z" }, + { url = "https://files.pythonhosted.org/packages/0c/72/7de501cf38dcacaf35098796f3a50e0f2e338baba18a58946c618544b809/jiter-0.15.0-cp314-cp314-win32.whl", hash = "sha256:50e51156192722a9c58db112837d3f8ef96fb3c5ecc14e95f409134b08b158ec", size = 206380, upload-time = "2026-05-19T10:08:49.738Z" }, + { url = "https://files.pythonhosted.org/packages/1e/a9/e19addf4b0c1bdce52c6da12351e6bc42c340c45e7c09e2158e46d293ccc/jiter-0.15.0-cp314-cp314-win_amd64.whl", hash = "sha256:30ce1a5d16b5641dc935d50ef775af6a0871e3d14ab05d6fc54dff371b78e558", size = 197687, upload-time = "2026-05-19T10:08:51.088Z" }, + { url = "https://files.pythonhosted.org/packages/f2/c9/776b1db01db25fc6c1d58d1979a37b0a9fe787e5f5b1d062d2eaacb77923/jiter-0.15.0-cp314-cp314-win_arm64.whl", hash = "sha256:510c8b3c17a0ed9ac69850c0438dada3c9b82d9c4d589fcb62002a5a9cf3a866", size = 192571, upload-time = "2026-05-19T10:08:52.451Z" }, + { url = "https://files.pythonhosted.org/packages/a0/f6/45bb4670bacf300fd2c7abadbfb3af376e5f1b6ae75fd9bc069891d15870/jiter-0.15.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7553333dd0930c104a5a0db8df72bf7219fe663d731383b576bb6ed6351c984d", size = 317151, upload-time = "2026-05-19T10:08:53.867Z" }, + { url = 
"https://files.pythonhosted.org/packages/d7/68/ed635ad5acd7b73e454283083bbb7c8205ad10e88b0d9d7d793b09fe8226/jiter-0.15.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f2143ab06181d2b029eedcb6af3cebe95f11bbac62441781860f98ee9330a6a6", size = 341243, upload-time = "2026-05-19T10:08:55.383Z" }, + { url = "https://files.pythonhosted.org/packages/5d/db/3ff4176b817b8ea33879e71e13d8bc2b0d481a7ed3fe9e080f333d415c16/jiter-0.15.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6eac374c5c975709b69c10f09afd199df74150172156ad10c8d4fd785b7da995", size = 363629, upload-time = "2026-05-19T10:08:56.928Z" }, + { url = "https://files.pythonhosted.org/packages/ab/24/5f8270e0ba9c883582f96f722f8a0b58015c7ce1f8c6d4571cf394e99b6b/jiter-0.15.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b3b3b775e33d3bfaec9899edc526ae97b0da0bf9d071a46124ba419149a414f8", size = 456198, upload-time = "2026-05-19T10:08:58.618Z" }, + { url = "https://files.pythonhosted.org/packages/45/5b/76fc02b0b5c54c3d18c60653156e2f76fde1816f9b4722db68d6ee2c897e/jiter-0.15.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:eda3071db3346334beae1360b46da4606da57bf3528c167b3c38533afaf9f2c5", size = 373710, upload-time = "2026-05-19T10:09:00.151Z" }, + { url = "https://files.pythonhosted.org/packages/c4/52/4310821b0ea9277994d3e1f49fc6a4b34e4800caebacb2c0af81da59a454/jiter-0.15.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c6694a173ecabc12eb60efbc0b474464ead1951ff65cd8b1e72100715c64512b", size = 349901, upload-time = "2026-05-19T10:09:01.621Z" }, + { url = "https://files.pythonhosted.org/packages/93/fe/67648c35b3594fba8854ac64cc8a826d8bcd18324bbdb53d77697c60b6ef/jiter-0.15.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:a254e10b593624d230c365b6d616b22ca0ad65e63a16e6631c2b3466022e6ba8", size = 352438, upload-time = "2026-05-19T10:09:03.216Z" }, + { url = 
"https://files.pythonhosted.org/packages/cb/28/0a1879d07ad6b3e025a2750027363452ced93c2d16d1c9d4b153ffd51c91/jiter-0.15.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d8d2955167274e15d79a7a020afdd9b39c990eb80b2d89fca695d92dcfdd38ec", size = 388152, upload-time = "2026-05-19T10:09:04.741Z" }, + { url = "https://files.pythonhosted.org/packages/c1/78/46c6f6b56ba85c90021f4afd72ed42f691f8f84daacb5fe27277070e3858/jiter-0.15.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:acf4ee4d1fc55917239fe72972fb292dd773055d05eb040d36f4326e02cc2c0e", size = 517707, upload-time = "2026-05-19T10:09:06.231Z" }, + { url = "https://files.pythonhosted.org/packages/ca/cb/720662d4c88fcad606e826fef5424365527ba43ce4868a479aed8f8c507e/jiter-0.15.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:e7196e56f1cd69af1dbb07dff02dcfb260a50b45a82d409d92a06fedb32473b5", size = 548241, upload-time = "2026-05-19T10:09:08.093Z" }, + { url = "https://files.pythonhosted.org/packages/60/e3/935b8034fd143f21125c87d51404a9e0e1449186a494405721ff5d1d695e/jiter-0.15.0-cp314-cp314t-win32.whl", hash = "sha256:7f6163c0f10b055245f814dcc59f4818da60dfe72f3e72ab89fc24b6bd5e9c52", size = 207950, upload-time = "2026-05-19T10:09:09.616Z" }, + { url = "https://files.pythonhosted.org/packages/93/59/984fd9ece895953dad3e0880a650e766f5a2da2c5514f0eafdaaabbeb5f9/jiter-0.15.0-cp314-cp314t-win_amd64.whl", hash = "sha256:980c256edb05b78a111b99c4de3b1d32e31634b867fd1fc2cf726e7b7bba9854", size = 200055, upload-time = "2026-05-19T10:09:11.367Z" }, + { url = "https://files.pythonhosted.org/packages/0e/a4/cf8d779feb133a27a2e3bc833bccb9e13aa332cdf820497ebf72c10ce8c3/jiter-0.15.0-cp314-cp314t-win_arm64.whl", hash = "sha256:66b1880df2d01e206e8339769d1c7c1753bcb653efd6289e203f6f24ebada0c0", size = 191244, upload-time = "2026-05-19T10:09:12.74Z" }, +] + [[package]] name = "librt" version = "0.9.0" @@ -625,6 +673,25 @@ wheels = [ { url = 
"https://files.pythonhosted.org/packages/88/b2/d0896bdcdc8d28a7fc5717c305f1a861c26e18c05047949fb371034d98bd/nodeenv-1.10.0-py2.py3-none-any.whl", hash = "sha256:5bb13e3eed2923615535339b3c620e76779af4cb4c6a90deccc9e36b274d3827", size = 23438, upload-time = "2025-12-20T14:08:52.782Z" }, ] +[[package]] +name = "openai" +version = "2.38.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "distro" }, + { name = "httpx" }, + { name = "jiter" }, + { name = "pydantic" }, + { name = "sniffio" }, + { name = "tqdm" }, + { name = "typing-extensions" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/8f/12/cfa322c5f5dd8fa21aab9a7a8e979e7a11123800f86ca8d82eb68a83d213/openai-2.38.0.tar.gz", hash = "sha256:798694c6cf74145541fda94325b6f8f72d8e1fd0262cc137c8d728177a6a4ce3", size = 772764, upload-time = "2026-05-21T21:23:42.105Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/0a/bf/ccff9be562e24207716d04ef9dc931c76aff0c89a7265da43e2104d7fe06/openai-2.38.0-py3-none-any.whl", hash = "sha256:ec6661c57b2dcc47414a767e6e3335c7ed3d19c9696999283a3c82e95c756a3c", size = 1344910, upload-time = "2026-05-21T21:23:39.636Z" }, +] + [[package]] name = "opentelemetry-api" version = "1.41.1" @@ -1102,6 +1169,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/c0/98/6beb4b351e472e5f4c4613f7c35a5290b8be2497e183825310c4c3a3984b/ruff-0.15.12-py3-none-win_arm64.whl", hash = "sha256:a538f7a82d061cee7be55542aca1d86d1393d55d81d4fcc314370f4340930d4f", size = 11120821, upload-time = "2026-04-24T18:16:57.979Z" }, ] +[[package]] +name = "sniffio" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/a2/87/a6771e1546d97e7e041b6ae58d80074f81b7d5121207425c964ddf5cfdbd/sniffio-1.3.1.tar.gz", hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc", size = 20372, upload-time = "2024-02-25T23:20:04.057Z" } +wheels = [ + { url 
= "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" }, +] + [[package]] name = "starlette" version = "1.1.0" @@ -1132,6 +1208,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b5/11/87d6d29fb5d237229d67973a6c9e06e048f01cf4994dee194ab0ea841814/tomlkit-0.14.0-py3-none-any.whl", hash = "sha256:592064ed85b40fa213469f81ac584f67a4f2992509a7c3ea2d632208623a3680", size = 39310, upload-time = "2026-01-13T01:14:51.965Z" }, ] +[[package]] +name = "tqdm" +version = "4.67.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/09/a9/6ba95a270c6f1fbcd8dac228323f2777d886cb206987444e4bce66338dd4/tqdm-4.67.3.tar.gz", hash = "sha256:7d825f03f89244ef73f1d4ce193cb1774a8179fd96f31d7e1dcde62092b960bb", size = 169598, upload-time = "2026-02-03T17:35:53.048Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/16/e1/3079a9ff9b8e11b846c6ac5c8b5bfb7ff225eee721825310c91b3b50304f/tqdm-4.67.3-py3-none-any.whl", hash = "sha256:ee1e4c0e59148062281c49d80b25b67771a127c85fc9676d3be5f243206826bf", size = 78374, upload-time = "2026-02-03T17:35:50.982Z" }, +] + [[package]] name = "typing-extensions" version = "4.15.0" From 46df7f5c3be004dd9f4a4c03e9e1f7e6342a74a2 Mon Sep 17 00:00:00 2001 From: "const.koutsakis@aurecongroup.com" Date: Tue, 26 May 2026 00:29:00 +1000 Subject: [PATCH 2/4] test: add adapter unit tests + adapters README (#94 review fixes) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses two gate failures on #104 surfaced by code review: 1. 
"Tests required" gate — feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed. 2. "src/ README audit" gate — every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order. Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it. Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green. Refs #94 --- eval/test_golden_patterns.py | 11 +- src/eval/adapters/README.md | 33 ++++ tests/test_eval_azure_openai_adapter.py | 234 ++++++++++++++++++++++++ 3 files changed, 275 insertions(+), 3 deletions(-) create mode 100644 src/eval/adapters/README.md create mode 100644 tests/test_eval_azure_openai_adapter.py diff --git a/eval/test_golden_patterns.py b/eval/test_golden_patterns.py index 932970a..bf42432 100644 --- a/eval/test_golden_patterns.py +++ b/eval/test_golden_patterns.py @@ -50,6 +50,13 @@ patterns = load_golden_dataset(_PATTERNS_PATH) +# Sentinel passed to EvalRunner.judge_model. 
The runner threads this through +# to LLMClient.complete_json(model=...), where the Azure adapter discards it +# — Azure addresses by deployment name (set at adapter construction), not by +# the model parameter. Named constant makes the intent obvious to a reader +# of this fixture without needing to chase into the adapter. +_AZURE_DEPLOYMENT_SENTINEL = "azure-deployment-from-env" + @pytest.fixture(scope="module") def runner() -> EvalRunner: @@ -62,9 +69,7 @@ def runner() -> EvalRunner: return EvalRunner( answer_fn=client.complete, judge_client=client, - # Azure addresses by deployment, set at client construction. The - # runner still passes this through for Protocol conformance. - judge_model="azure-deployment", + judge_model=_AZURE_DEPLOYMENT_SENTINEL, ) diff --git a/src/eval/adapters/README.md b/src/eval/adapters/README.md new file mode 100644 index 0000000..4c52a51 --- /dev/null +++ b/src/eval/adapters/README.md @@ -0,0 +1,33 @@ +# `src/eval/adapters` + +Concrete `LLMClient` adapters for the eval harness. The judge in [`src/eval/judge.py`](../judge.py) calls an `LLMClient` Protocol — never a vendor SDK directly. Each adapter in this package implements that Protocol for one provider, so the eval core stays vendor-neutral and a downstream consumer can swap providers by changing one wiring line in their test fixture. + +## Why this layer exists + +Without the Protocol seam, swapping LLM providers would mean touching the eval core. With it, vendor lock-in is confined to one file per provider. The layer demonstrates that the harness's "provider-agnostic" claim is structural, not aspirational: the eval core has zero imports of any vendor SDK. 
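A minimal sketch of the seam described above, assuming only the `complete_json(*, model, prompt) -> str` signature quoted in this README. The `LLMClient` class below is a local stand-in, not the contents of `src/eval/judge.py`, and `CannedClient` and `judge_score` are hypothetical names used for illustration:

```python
import json
from typing import Protocol


class LLMClient(Protocol):
    """Local stand-in for the Protocol; only the signature comes from the README."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class CannedClient:
    """Hypothetical offline client: returns a fixed JSON verdict, no vendor SDK."""

    def complete_json(self, *, model: str, prompt: str) -> str:
        return json.dumps({"score": 1.0, "model": model})


def judge_score(client: LLMClient, *, model: str, prompt: str) -> float:
    """Hypothetical judge helper that depends on the Protocol, never on a vendor."""
    verdict = json.loads(client.complete_json(model=model, prompt=prompt))
    return float(verdict["score"])


# Any object with a matching complete_json method satisfies the Protocol.
score = judge_score(CannedClient(), model="any-model", prompt="judge this")
```

Because `judge_score` is typed against the Protocol, swapping providers in a sketch like this only changes which object is passed in.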
+ +## Current adapters + +| File | Provider | Optional extra | Env contract | +|---|---|---|---| +| [`azure_openai.py`](azure_openai.py) | Azure OpenAI | `uv sync --extra eval` | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, optional `AZURE_OPENAI_API_VERSION` (default `2024-10-21`) | + +## Adding a new adapter + +1. Add the SDK to `[project.optional-dependencies]` in `pyproject.toml` — either to the existing `eval` extra or a new provider-scoped one. +2. Add the SDK's top-level module to `[[tool.mypy.overrides]]` with `ignore_missing_imports = true`, matching the existing `openai.*` / `opentelemetry.*` entries. This keeps mypy clean on stock `uv sync --extra dev` checkouts. +3. Implement `complete_json(*, model: str, prompt: str) -> str` per the `LLMClient` Protocol in [`src/eval/judge.py`](../judge.py). Optionally add a `complete(prompt: str) -> str` for use as an `EvalRunner.answer_fn`. +4. **Lazy-import the SDK inside `__init__`** so the adapter module remains importable without the optional extra installed. The import error path should raise a clear, named exception (e.g. `AzureOpenAIConfigError`) telling the reader which `uv sync --extra ...` to run. +5. Read configuration from environment variables at construction time. Raise the same named exception listing every missing var when env is incomplete — fail fast, fail clear. +6. Add an offline unit test in [`tests/`](../../../tests/) that mocks the SDK at the `sys.modules` level (see `tests/test_eval_azure_openai_adapter.py` for the pattern). This keeps the unit suite credential-free; live-credential paths are exercised by [`eval/test_golden_patterns.py`](../../../eval/test_golden_patterns.py). +7. Document the env contract in this README's table above and in [`docs/EVAL_HARNESS.md`](../../../docs/EVAL_HARNESS.md)'s "Worked patterns" section. 
+ +## Why adapters live under `src/eval/` + +The import-linter contract in `pyproject.toml` puts `src.eval` at the top of the layered import order: + +``` +api | eval -> agent -> tools -> data -> observability -> models +``` + +Adapters can therefore depend on anything in `src/`; nothing in `src/` depends on them. That asymmetry is exactly what the layered architecture exists to encode — vendor-specific code stays at the boundary, never leaks down into the eval primitives or the model layer. diff --git a/tests/test_eval_azure_openai_adapter.py b/tests/test_eval_azure_openai_adapter.py new file mode 100644 index 0000000..7b08bca --- /dev/null +++ b/tests/test_eval_azure_openai_adapter.py @@ -0,0 +1,234 @@ +"""Offline unit tests for the Azure OpenAI eval adapter. + +These tests never hit the network. The `openai` SDK is replaced at the +`sys.modules` level so the adapter's lazy import resolves to a `MagicMock`, +which lets us assert on the constructor arguments and the chat-completions +call shape without an API key. + +The live-credential path is exercised by `eval/test_golden_patterns.py`, +which is skipped on stock checkouts. 
+""" + +from __future__ import annotations + +import sys +from types import SimpleNamespace +from unittest.mock import MagicMock + +import pytest + +from src.eval.adapters.azure_openai import ( + _DEFAULT_API_VERSION, + AzureOpenAIClient, + AzureOpenAIConfigError, + _resolve_config, +) + +# --------------------------------------------------------------------------- +# _resolve_config — pure function, no SDK involved +# --------------------------------------------------------------------------- + + +class TestResolveConfig: + """`_resolve_config` reads env, applies the default API version, and + raises a single `AzureOpenAIConfigError` naming every missing var.""" + + def test_returns_env_values_with_default_api_version(self) -> None: + env = { + "AZURE_OPENAI_ENDPOINT": "https://x.openai.azure.com", + "AZURE_OPENAI_API_KEY": "key", + "AZURE_OPENAI_DEPLOYMENT": "gpt-4o-mini", + } + endpoint, key, deploy, version = _resolve_config(env) + assert endpoint == "https://x.openai.azure.com" + assert key == "key" + assert deploy == "gpt-4o-mini" + assert version == _DEFAULT_API_VERSION + + def test_explicit_api_version_overrides_default(self) -> None: + env = { + "AZURE_OPENAI_ENDPOINT": "https://x.openai.azure.com", + "AZURE_OPENAI_API_KEY": "key", + "AZURE_OPENAI_DEPLOYMENT": "deploy", + "AZURE_OPENAI_API_VERSION": "2025-01-01", + } + _, _, _, version = _resolve_config(env) + assert version == "2025-01-01" + + def test_empty_api_version_falls_back_to_default(self) -> None: + env = { + "AZURE_OPENAI_ENDPOINT": "https://x.openai.azure.com", + "AZURE_OPENAI_API_KEY": "key", + "AZURE_OPENAI_DEPLOYMENT": "deploy", + "AZURE_OPENAI_API_VERSION": "", + } + _, _, _, version = _resolve_config(env) + assert version == _DEFAULT_API_VERSION + + def test_raises_listing_all_missing_when_none_set(self) -> None: + with pytest.raises(AzureOpenAIConfigError) as exc: + _resolve_config({}) + msg = str(exc.value) + assert "AZURE_OPENAI_ENDPOINT" in msg + assert "AZURE_OPENAI_API_KEY" in msg + 
assert "AZURE_OPENAI_DEPLOYMENT" in msg + + def test_raises_listing_only_missing(self) -> None: + env = { + "AZURE_OPENAI_ENDPOINT": "x", + "AZURE_OPENAI_DEPLOYMENT": "d", + # AZURE_OPENAI_API_KEY missing + } + with pytest.raises(AzureOpenAIConfigError) as exc: + _resolve_config(env) + msg = str(exc.value) + assert "AZURE_OPENAI_API_KEY" in msg + assert "AZURE_OPENAI_ENDPOINT" not in msg + assert "AZURE_OPENAI_DEPLOYMENT" not in msg + + +# --------------------------------------------------------------------------- +# AzureOpenAIClient — SDK is mocked at sys.modules level +# --------------------------------------------------------------------------- + + +@pytest.fixture +def _env(monkeypatch: pytest.MonkeyPatch) -> None: + """Populate the three required env vars with test values.""" + monkeypatch.setenv("AZURE_OPENAI_ENDPOINT", "https://x.openai.azure.com") + monkeypatch.setenv("AZURE_OPENAI_API_KEY", "test-key") + monkeypatch.setenv("AZURE_OPENAI_DEPLOYMENT", "test-deploy") + monkeypatch.delenv("AZURE_OPENAI_API_VERSION", raising=False) + + +@pytest.fixture +def _mock_openai(monkeypatch: pytest.MonkeyPatch) -> MagicMock: + """Install a fake `openai` module exporting an `AzureOpenAI` constructor. + + The adapter's lazy `from openai import AzureOpenAI` will resolve to the + `MagicMock` returned here, so call-args assertions work without any SDK + installed.
+ """ + mock_constructor = MagicMock(name="AzureOpenAI") + fake_module = SimpleNamespace(AzureOpenAI=mock_constructor) + monkeypatch.setitem(sys.modules, "openai", fake_module) + return mock_constructor + + +class TestAzureOpenAIClientConstruction: + """Constructor wires env config into the SDK client and surfaces clear + errors when prerequisites are missing.""" + + def test_init_constructs_sdk_with_resolved_env_config( + self, _env: None, _mock_openai: MagicMock + ) -> None: + AzureOpenAIClient() + _mock_openai.assert_called_once_with( + azure_endpoint="https://x.openai.azure.com", + api_key="test-key", + api_version=_DEFAULT_API_VERSION, + ) + + def test_init_passes_explicit_api_version( + self, + _env: None, + _mock_openai: MagicMock, + monkeypatch: pytest.MonkeyPatch, + ) -> None: + monkeypatch.setenv("AZURE_OPENAI_API_VERSION", "2025-01-01") + AzureOpenAIClient() + kwargs = _mock_openai.call_args.kwargs + assert kwargs["api_version"] == "2025-01-01" + + def test_init_raises_when_env_missing( + self, monkeypatch: pytest.MonkeyPatch + ) -> None: + for name in ( + "AZURE_OPENAI_ENDPOINT", + "AZURE_OPENAI_API_KEY", + "AZURE_OPENAI_DEPLOYMENT", + ): + monkeypatch.delenv(name, raising=False) + with pytest.raises(AzureOpenAIConfigError, match="AZURE_OPENAI_ENDPOINT"): + AzureOpenAIClient() + + def test_init_raises_when_openai_sdk_missing( + self, + _env: None, + monkeypatch: pytest.MonkeyPatch, + ) -> None: + # Force the lazy import inside __init__ to ImportError. Setting the + # module to None makes `from openai import AzureOpenAI` raise the + # exact ImportError the adapter catches. 
+ monkeypatch.setitem(sys.modules, "openai", None) + with pytest.raises(AzureOpenAIConfigError, match="openai SDK not installed"): + AzureOpenAIClient() + + +class TestAzureOpenAIClientCalls: + """`complete` and `complete_json` dispatch correctly to the SDK and + return the assistant message content.""" + + @staticmethod + def _mock_response(content: str | None) -> MagicMock: + """Build a ChatCompletion-shaped MagicMock with the given content.""" + message = MagicMock() + message.content = content + choice = MagicMock() + choice.message = message + response = MagicMock() + response.choices = [choice] + return response + + def test_complete_json_uses_structured_output_mode( + self, _env: None, _mock_openai: MagicMock + ) -> None: + sdk_instance = _mock_openai.return_value + sdk_instance.chat.completions.create.return_value = self._mock_response( + '{"ok": true}' + ) + + client = AzureOpenAIClient() + body = client.complete_json(model="ignored-per-Protocol", prompt="judge this") + + assert body == '{"ok": true}' + call = sdk_instance.chat.completions.create.call_args + assert call.kwargs["model"] == "test-deploy" + assert call.kwargs["response_format"] == {"type": "json_object"} + messages = call.kwargs["messages"] + assert messages[0]["role"] == "system" + assert "JSON" in messages[0]["content"] + assert messages[1] == {"role": "user", "content": "judge this"} + + def test_complete_json_returns_empty_json_on_null_content( + self, _env: None, _mock_openai: MagicMock + ) -> None: + sdk_instance = _mock_openai.return_value + sdk_instance.chat.completions.create.return_value = self._mock_response(None) + + client = AzureOpenAIClient() + assert client.complete_json(model="x", prompt="x") == "{}" + + def test_complete_dispatches_user_message_to_deployment( + self, _env: None, _mock_openai: MagicMock + ) -> None: + sdk_instance = _mock_openai.return_value + sdk_instance.chat.completions.create.return_value = self._mock_response("hi") + + client = AzureOpenAIClient() + 
assert client.complete("say hi") == "hi" + + call = sdk_instance.chat.completions.create.call_args + assert call.kwargs["model"] == "test-deploy" + assert call.kwargs["messages"] == [{"role": "user", "content": "say hi"}] + # complete() does not pin response_format — only complete_json does + assert "response_format" not in call.kwargs + + def test_complete_returns_empty_string_on_null_content( + self, _env: None, _mock_openai: MagicMock + ) -> None: + sdk_instance = _mock_openai.return_value + sdk_instance.chat.completions.create.return_value = self._mock_response(None) + + client = AzureOpenAIClient() + assert client.complete("x") == "" From 3f3ab782b4474506d52849176e36324c2d83322e Mon Sep 17 00:00:00 2001 From: "const.koutsakis@aurecongroup.com" Date: Tue, 26 May 2026 01:20:02 +1000 Subject: [PATCH 3/4] docs: add Key interfaces section to adapters README (#94 review) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit src/ README audit gate looks for a `## Key interfaces` (or `## Public surface`) anchor — the existing README had purpose / table / extension recipe / layering rationale, but no exported-names section. Adds a `## Key interfaces` section listing the two exported names: - AzureOpenAIClient — the LLMClient implementation with notes on complete() vs complete_json() and the discarded `model` arg (Azure dispatches by deployment, not model). - AzureOpenAIConfigError — the construction-time error type, noting that it batches every missing env var into a single message instead of failing-and-retrying. Both already documented in the adapter docstrings; this section hoists them to the README anchor the audit gate enforces. 
Refs #94 --- src/eval/adapters/README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/eval/adapters/README.md b/src/eval/adapters/README.md index 4c52a51..e5a66e4 100644 --- a/src/eval/adapters/README.md +++ b/src/eval/adapters/README.md @@ -2,6 +2,13 @@ Concrete `LLMClient` adapters for the eval harness. The judge in [`src/eval/judge.py`](../judge.py) calls an `LLMClient` Protocol — never a vendor SDK directly. Each adapter in this package implements that Protocol for one provider, so the eval core stays vendor-neutral and a downstream consumer can swap providers by changing one wiring line in their test fixture. +## Key interfaces + +Exported from this package: + +- **`AzureOpenAIClient`** — implements `src.eval.judge.LLMClient`. Construct from env via `AzureOpenAIClient()`; call `complete(prompt)` for runner `answer_fn` use, `complete_json(*, model, prompt)` for judge use. The `model` argument on `complete_json` is accepted for Protocol conformance and discarded — Azure addresses by deployment name (set at construction time, read from `AZURE_OPENAI_DEPLOYMENT`). +- **`AzureOpenAIConfigError`** — raised at construction when required env is missing or the optional `openai` extra is not installed. Subclass of `RuntimeError`. The error message names every missing env var in one go so the caller doesn't have to fix-and-retry. + ## Why this layer exists Without the Protocol seam, swapping LLM providers would mean touching the eval core. With it, vendor lock-in is confined to one file per provider. The layer demonstrates that the harness's "provider-agnostic" claim is structural, not aspirational: the eval core has zero imports of any vendor SDK. 
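The `AzureOpenAIConfigError` behaviour listed under Key interfaces (one error naming every missing variable) can be sketched as below. This is an illustration consistent with the env contract in this README, not the adapter's source: `resolve_config` and `ConfigError` are stand-in names, and treating an empty required variable as missing is an assumption.

```python
from collections.abc import Mapping

DEFAULT_API_VERSION = "2024-10-21"  # default stated in the README's env contract
REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)


class ConfigError(RuntimeError):
    """Stand-in for AzureOpenAIConfigError, which is also a RuntimeError subclass."""


def resolve_config(env: Mapping[str, str]) -> tuple[str, str, str, str]:
    """Return (endpoint, key, deployment, api_version), or raise once naming all gaps."""
    # Assumption: an empty string counts as missing, same as an unset variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ConfigError("missing required env vars: " + ", ".join(missing))
    # An unset or empty API version falls back to the default.
    version = env.get("AZURE_OPENAI_API_VERSION") or DEFAULT_API_VERSION
    endpoint, key, deployment = (env[name] for name in REQUIRED)
    return endpoint, key, deployment, version
```

Taking the environment as a plain mapping keeps a function like this testable with a dict, without patching `os.environ`.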
From 1a32080b9595ea5adc7040f43f7bb71e366ecafc Mon Sep 17 00:00:00 2001 From: "const.koutsakis@aurecongroup.com" Date: Tue, 26 May 2026 15:19:58 +1000 Subject: [PATCH 4/4] chore: bump version to 0.2.12 (rebase onto develop after #103) --- pyproject.toml | 2 +- uv.lock | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index a416243..71c6d76 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "harness-python-react" -version = "0.2.11" +version = "0.2.12" description = "Production-quality LLM-driven coding harness — Python (FastAPI) backend, Vite + React + TypeScript frontend." readme = "README.md" requires-python = ">=3.14" diff --git a/uv.lock b/uv.lock index 8a7b1b7..1b94326 100644 --- a/uv.lock +++ b/uv.lock @@ -337,7 +337,7 @@ wheels = [ [[package]] name = "harness-python-react" -version = "0.2.11" +version = "0.2.12" source = { virtual = "." } dependencies = [ { name = "fastapi" },