A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.
LLM teams lack a lightweight way to compare prompt and tool changes before shipping.
Agent builders, prompt engineers, and applied AI teams.
- Load datasets from JSON or CSV
- Run prompt or agent variants
- Score outputs with rubric functions
- Compare runs and export regressions
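The loop those four capabilities describe can be sketched in plain Python. Every name below (`load_dataset`, `exact_match`, `run_variant`, `find_regressions`, and the case schema) is a hypothetical illustration, not the project's actual API:

```python
import json

def load_dataset(path):
    """Load evaluation cases from a JSON file: a list of {"id", "input", "expected"}."""
    with open(path) as f:
        return json.load(f)

def exact_match(output, case):
    """A trivial rubric function: score 1.0 when the output matches the expectation."""
    return 1.0 if output == case["expected"] else 0.0

def run_variant(variant, cases, rubric):
    """Run one prompt/agent variant over all cases and score each output."""
    return {case["id"]: rubric(variant(case["input"]), case) for case in cases}

def find_regressions(baseline, candidate):
    """Case IDs where the candidate run scores below the baseline run."""
    return [cid for cid, score in candidate.items() if score < baseline[cid]]
```

A comparison then reduces to running two variants over the same dataset and diffing their per-case scores; anything in `find_regressions` is what would be exported.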
Evaluation is moving from optional best practice to baseline engineering hygiene.
- `core/`: domain logic for EvalOps Workbench.
- `cli/`: operator-facing entrypoint for local workflows and smoke checks.
- `docs/`: product notes, roadmap, and architecture decisions.
- `tests/`: baseline regression coverage for the project contract.
```
uv run evalops-workbench summary
uv run evalops-workbench capabilities
uv run evalops-workbench roadmap
```

Python, Typer, DuckDB, OpenTelemetry
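Since Typer is in the stack, the subcommands above could be wired roughly like this. This is a minimal sketch under that assumption; the command bodies and their output strings are invented for illustration:

```python
import typer

app = typer.Typer(help="EvalOps Workbench CLI (sketch)")

@app.command()
def summary():
    """Print a one-line product summary."""
    typer.echo("Local-first evaluation harness for prompts, tools, and agents.")

@app.command()
def capabilities():
    """List the core capabilities."""
    for cap in ("load datasets", "run variants", "score outputs", "compare runs"):
        typer.echo(f"- {cap}")

if __name__ == "__main__":
    app()
```

Typer turns each decorated function into a subcommand, so `evalops-workbench summary` maps to `summary()` once the app is registered as a console-script entry point.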
- Clear product thesis
- Setup that works locally
- Tests for the primary contract
- Documentation for roadmap and architecture
- Space for production integrations in the next iteration
This repository ships with a static Vercel-ready landing page for demos and previews.
```
vercel deploy -y
```

The deployed site presents EvalOps Workbench as a standalone product page.