26 changes: 12 additions & 14 deletions AGENTS.md
@@ -265,12 +265,11 @@ mkdocs serve

1. Create a feature branch (never commit to `main`)
2. Make changes following code style guidelines
-3. Run formatters and linters: `ruff format . && ruff check . --fix`
-4. Run tests: `pytest -v`
-5. Update documentation if needed
-6. Open PR against `main` branch
-7. Request review from `cemde`
-8. Ensure all CI checks pass
+3. Run `just all` before committing. This formats, lints, typechecks, and tests in one step. See the `justfile` for all available recipes.
+4. Update documentation if needed
+5. Open PR against `main` branch
+6. Request review from `cemde`
+7. Ensure all CI checks pass

**CI Pipeline:** GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.

@@ -301,25 +300,24 @@ Example workflow:
## Common Tasks Quick Reference

```bash
-# Fresh environment setup
-uv sync --all-extras --all-groups
+# Fresh environment setup / Update after pulling changes
+just install # uv sync --all-extras --all-groups

-# Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
+# Before committing (format, lint, typecheck, test)
+just all

# Run example
uv run python examples/amazon_collab.py

-# Update after pulling changes
-uv sync --all-extras --all-groups
-
# Add optional dependency
uv add --optional <extra-name> <package-name>

# Check specific test file
uv run pytest tests/test_core/test_agent.py -v
```

+For more comments see `justfile`.
+
## Security and Confidentiality

**IMPORTANT:** This project contains confidential research material.
@@ -540,4 +538,4 @@ class Evaluator:
```python
...
```

-**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
12 changes: 10 additions & 2 deletions CHANGELOG.md
@@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Core**

+- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
+- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
+- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
+
- `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
- `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
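
The freezing behaviour described in the `Task.freeze()` entry can be illustrated with a minimal sketch. This is not MASEval's implementation: `SketchTask`, `SketchFrozenError`, and `deep_freeze` are hypothetical names, and using `MappingProxyType` for read-only nested dicts is only an assumption about one way to get the effect.

```python
from collections.abc import Mapping
from types import MappingProxyType
from typing import Any


class SketchFrozenError(Exception):
    """Hypothetical stand-in for an error such as TaskFrozenError."""


def deep_freeze(value: Any) -> Any:
    """Return a read-only copy of nested dicts and lists (hypothetical helper)."""
    if isinstance(value, Mapping):
        # mappingproxy rejects item assignment with TypeError.
        return MappingProxyType({key: deep_freeze(item) for key, item in value.items()})
    if isinstance(value, list):
        return tuple(deep_freeze(item) for item in value)
    return value


class SketchTask:
    """Toy task that blocks mutation and attribute reassignment while frozen."""

    def __init__(self, environment_data: dict) -> None:
        object.__setattr__(self, "_frozen", False)
        self.environment_data = environment_data

    def __setattr__(self, name: str, value: Any) -> None:
        if self._frozen:
            raise SketchFrozenError(f"cannot set {name!r} on a frozen task")
        object.__setattr__(self, name, value)

    @property
    def is_frozen(self) -> bool:
        return self._frozen

    def freeze(self) -> None:
        # Bypass __setattr__ so the swap to read-only views always succeeds.
        object.__setattr__(self, "environment_data", deep_freeze(self.environment_data))
        object.__setattr__(self, "_frozen", True)
```

In this sketch a nested write such as `task.environment_data["outer"]["inner"] = 3` raises `TypeError` after `freeze()`, while reassigning `task.environment_data` raises `SketchFrozenError`. An unfreeze step would need to keep the original mutable data around, which the sketch leaves out.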

@@ -39,10 +43,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Examples**

+- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
+
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

+**Documentation**
+
+- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
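
The cost rule stated in this changelog (cost computed from user-supplied per-token rates, with provider-reported cost taking precedence) can be sketched in a few lines. The names `SketchTokenUsage`, `SketchRates`, and `static_cost` are hypothetical and do not reflect the actual signatures of `StaticPricingCalculator` or `TokenUsage`; the guide at `docs/guides/usage-tracking.md` is the place to check the real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchTokenUsage:
    """Hypothetical token counts for one model call."""

    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class SketchRates:
    """Hypothetical user-supplied prices, in currency units per token."""

    input_rate: float
    output_rate: float


def static_cost(
    usage: SketchTokenUsage,
    rates: SketchRates,
    provider_cost: Optional[float] = None,
) -> float:
    """Return the provider-reported cost if present, else price the tokens."""
    if provider_cost is not None:
        # Assumption taken from the changelog: provider cost wins.
        return provider_cost
    return usage.input_tokens * rates.input_rate + usage.output_tokens * rates.output_rate


usage = SketchTokenUsage(input_tokens=1000, output_tokens=500)
rates = SketchRates(input_rate=0.5, output_rate=1.5)
computed = static_cost(usage, rates)
reported = static_cost(usage, rates, provider_cost=0.25)
```

Here `computed` is priced from the token counts, while `reported` simply passes the provider's figure through.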

**Core**

- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
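
SHA-256-based seed derivation, as named in the `SeedGenerator` entry, might look roughly like the following. `derive_seed` and its path-style arguments are hypothetical; the entry does not say how `DefaultSeedGenerator` builds its hash input or how many bytes of the digest it keeps.

```python
import hashlib


def derive_seed(root_seed: int, *path: str) -> int:
    """Derive a 32-bit child seed from a root seed and a path of labels.

    Hypothetical helper: the same inputs always give the same seed, so a
    run can be reproduced without sharing random-number-generator state.
    """
    material = "/".join([str(root_seed), *path]).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Keep the first 4 bytes so the result fits in an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


agent_seed = derive_seed(42, "task-1", "agent")
```

Hashing the labels means a component's seed depends only on the root seed and the component's own path, not on the order in which other components request seeds.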
@@ -108,8 +118,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`

-**Documentation**
-
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)

**Testing**
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
| [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility |
| [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures |
| [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds |
+| [Usage & Cost Tracking](usage-tracking.md) | Track token usage and compute cost across providers |