parameterlab · cemde · Mar 18, 2026 · Mar 12, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -265,12 +265,11 @@ mkdocs serve
 
 1. Create a feature branch (never commit to `main`)
 2. Make changes following code style guidelines
-3. Run formatters and linters: `ruff format . && ruff check . --fix`
-4. Run tests: `pytest -v`
-5. Update documentation if needed
-6. Open PR against `main` branch
-7. Request review from `cemde`
-8. Ensure all CI checks pass
+3. Run `just all` before committing. This formats, lints, typechecks, and tests in one step. See the `justfile` for all available recipes.
+4. Update documentation if needed
+5. Open PR against `main` branch
+6. Request review from `cemde`
+7. Ensure all CI checks pass
 
 **CI Pipeline:** GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.
 
@@ -301,25 +300,24 @@ Example workflow:
 ## Common Tasks Quick Reference
 
 ```bash
-# Fresh environment setup
-uv sync --all-extras --all-groups
+# Fresh environment setup / Update after pulling changes
+just install  # runs: uv sync --all-extras --all-groups
 
-# Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
+# Before committing (format, lint, typecheck, test)
+just all
 
 # Run example
 uv run python examples/amazon_collab.py
 
-# Update after pulling changes
-uv sync --all-extras --all-groups
-
 # Add optional dependency
 uv add --optional <extra-name> <package-name>
 
 # Check specific test file
 uv run pytest tests/test_core/test_agent.py -v
 ```
 
+For more recipes and their comments, see the `justfile`.
+
 ## Security and Confidentiality
 
 **IMPORTANT:** This project contains confidential research material.
@@ -540,4 +538,4 @@ class Evaluator:
         ...
 ```
 
-**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Core**
 
+- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals are available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis is available via `UsageReporter.from_reports(benchmark.reports)`, with breakdowns by task, component, or model. (PR: #45)
+- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` provides automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence over calculated cost. (PR: #45)
+- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
+
 - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
 - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
 
@@ -39,10 +43,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Examples**
 
+- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
+
 - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
+**Documentation**
+
+- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
+
 **Core**
 
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
@@ -108,8 +118,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `LangGraphUser` → `LangGraphLLMUser`
   - `LlamaIndexUser` → `LlamaIndexLLMUser`
 
-**Documentation**
-
 - All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
 
 **Testing**

diff --git a/docs/guides/index.md b/docs/guides/index.md
@@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
 | [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility          |
 | [Exception Handling](exception-handling.md)    | Distinguish agent errors from infrastructure failures         |
 | [Seeding](seeding.md)                          | Enable reproducible benchmark runs with deterministic seeds   |
+| [Usage & Cost Tracking](usage-tracking.md)     | Track token usage and compute cost across providers           |
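
The CHANGELOG entries above describe a pluggable cost calculator that prices token usage from user-supplied per-token rates, with provider-reported cost taking precedence. The following is a minimal, self-contained sketch of that idea only. It does not use or reproduce the MASEval API: `TokenCounts`, `CostCalculatorSketch`, `StaticRatesCalculator`, and `resolve_cost` are hypothetical names invented for illustration, and the rates are made-up numbers.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class TokenCounts:
    """Hypothetical token record; NOT maseval's `TokenUsage`."""

    input_tokens: int
    output_tokens: int


class CostCalculatorSketch(Protocol):
    """Hypothetical protocol: anything with a matching `calculate` method."""

    def calculate(self, tokens: TokenCounts, model_id: str) -> Optional[float]: ...


@dataclass
class StaticRatesCalculator:
    """Prices usage from user-supplied per-token rates (illustrative only)."""

    # model_id -> (cost per input token, cost per output token)
    rates: dict[str, tuple[float, float]]

    def calculate(self, tokens: TokenCounts, model_id: str) -> Optional[float]:
        rate = self.rates.get(model_id)
        if rate is None:
            # Unknown model: report "no cost" rather than guessing a price.
            return None
        input_rate, output_rate = rate
        return tokens.input_tokens * input_rate + tokens.output_tokens * output_rate


def resolve_cost(
    tokens: TokenCounts,
    model_id: str,
    calculator: Optional[CostCalculatorSketch] = None,
    provider_cost: Optional[float] = None,
) -> Optional[float]:
    """Return the provider-reported cost if present, else the calculated one."""
    if provider_cost is not None:
        return provider_cost
    if calculator is None:
        return None
    return calculator.calculate(tokens, model_id)


# Made-up rates for a made-up model id.
calculator = StaticRatesCalculator(rates={"demo-model": (2e-6, 6e-6)})
tokens = TokenCounts(input_tokens=1_000, output_tokens=500)

calculated = resolve_cost(tokens, "demo-model", calculator)
reported = resolve_cost(tokens, "demo-model", calculator, provider_cost=0.01)
unknown = resolve_cost(tokens, "other-model", calculator)
```

Returning `None` for an unknown model keeps "cost not available" distinct from "cost is zero", which matters when totals are later summed across components. Whether MASEval makes the same choice is not stated in this diff.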