From 6c9614f6d39463140dc48821356057e806f62896 Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 16:30:00 +0100 Subject: [PATCH 01/19] updated plan --- usage_tracking/PLAN.md | 317 ++++++++++++++++ usage_tracking/api_usage_results.json | 523 ++++++++++++++++++++++++++ usage_tracking/api_usage_test.py | 154 ++++++++ 3 files changed, 994 insertions(+) create mode 100644 usage_tracking/PLAN.md create mode 100644 usage_tracking/api_usage_results.json create mode 100644 usage_tracking/api_usage_test.py diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md new file mode 100644 index 00000000..f9b45133 --- /dev/null +++ b/usage_tracking/PLAN.md @@ -0,0 +1,317 @@ +# Usage & Cost Tracking — Implementation Plan + +## Motivation + +Benchmarking multi-agent systems incurs real costs: LLM API calls (the primary driver), but also external service calls (e.g., Bloomberg data API, geocoding services, paid search APIs). MASEval currently extracts basic token counts into `ChatResponse.usage` but does not persist, enrich, or aggregate this data. We want first-class usage tracking that: + +- Captures token usage and cost per LLM call with provider-specific detail +- Supports non-token costs (external service calls billed per-request or per-unit) +- Aggregates across provider, task, component role, and total +- Is queryable live during benchmark execution (not just post-hoc) +- Captures usage even for failed tasks +- Requires zero changes from benchmark implementers for the common LLM case + +## Design Principles + +1. **LLM-first, not LLM-only.** The base abstraction is generic (cost + arbitrary units), with an LLM-specific subclass that adds token semantics. +2. **No hardcoded prices.** Pricing changes constantly. Users supply pricing or rely on provider-reported cost (e.g., OpenRouter). If neither is available, cost is `None`. +3. **Automatic for models, opt-in for tools.** ModelAdapter tracks usage automatically via the base `chat()` method. 
Tool/environment authors opt in via `UsageTrackableMixin`. +4. **Non-breaking.** `ChatResponse.usage` stays a `Dict[str, int]` with additional optional keys. Existing code that reads `usage["input_tokens"]` continues to work. +5. **First-class collection axis.** Usage is collected via `gather_usage()` / `collect_usage()`, parallel to `gather_traces()` / `collect_traces()` and `gather_config()` / `collect_configs()`. It is not embedded inside traces. +6. **Live queryable.** The registry maintains a running usage total across repetitions, queryable at any time via `benchmark.usage`. + +--- + +## Data Model + +### `Usage` (base) + +Generic usage record for any billable resource. Stored as a simple dataclass. + +``` +Usage + cost: Optional[float] # Total cost in USD (None = unknown) + cost_details: Dict[str, float] # Breakdown (e.g., {"input": 0.01, "output": 0.03}) + units: Dict[str, int | float] # Arbitrary countable units (e.g., {"api_calls": 3, "bytes": 1024}) + metadata: Dict[str, Any] # Provider-specific extras +``` + +Supports `__add__` to sum two records (costs sum if both known, else None; units sum; metadata merges). + +### `TokenUsage(Usage)` (LLM-specific) + +Extends `Usage` with token fields that every LLM provider reports. + +``` +TokenUsage(Usage) + input_tokens: int + output_tokens: int + total_tokens: int + # Optional provider-specific detail + cached_input_tokens: int # Anthropic cache_read, OpenAI cached_tokens + reasoning_tokens: int # OpenAI reasoning, Google thoughts + audio_tokens: int # OpenAI audio +``` + +`TokenUsage.__add__` sums all token fields plus delegates to `Usage.__add__` for cost/units. + +Class method `TokenUsage.from_chat_response_usage(usage_dict) -> TokenUsage` maps the dict returned by adapters today into a `TokenUsage` instance, handling provider-specific key names. + +--- + +## UsageTrackableMixin + +Follows the established mixin pattern (`TraceableMixin`, `ConfigurableMixin`). 
Any component that inherits `UsageTrackableMixin` will have its usage automatically collected by the registry when registered. + +```python +class UsageTrackableMixin: + """Mixin that provides usage tracking capability to any component.""" + + def gather_usage(self) -> Usage: + """Return accumulated usage for this component. + + Subclasses must override this to return their accumulated Usage. + Base implementation returns an empty Usage. + """ + return Usage() +``` + +Components internally accumulate `Usage` records however they see fit (typically a list + sum). The mixin only defines the collection protocol — `gather_usage() -> Usage`. + +### Usage in components + +**ModelAdapter** (automatic): + +```python +class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): + def __init__(self, seed=None): + super().__init__() + self._usage_records: List[Usage] = [] + + def chat(self, messages, ...): + response = self._chat_impl(messages, ...) + if response.usage: + self._usage_records.append( + TokenUsage.from_chat_response_usage(response.usage) + ) + return response + + def gather_usage(self) -> Usage: + if not self._usage_records: + return Usage() + return sum(self._usage_records[1:], self._usage_records[0]) +``` + +**Non-model components** (opt-in): + +```python +class BloombergEnvironment(Environment, UsageTrackableMixin): + def __init__(self, task_data): + super().__init__(task_data) + self._usage_records: List[Usage] = [] + + def _call_bloomberg(self, query): + result = bloomberg_client.query(query) + self._usage_records.append(Usage( + cost=result.billed_amount, + units={"api_calls": 1, "data_points": result.count}, + )) + return result + + def gather_usage(self) -> Usage: + if not self._usage_records: + return Usage() + return sum(self._usage_records[1:], self._usage_records[0]) +``` + +--- + +## Registry Integration + +The `ComponentRegistry` gains a third collection axis for usage, parallel to traces and configs. 
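Both the component-level accumulation shown above (`sum(records[1:], records[0])`) and the registry's running totals rely on `Usage` supporting `+`. A minimal sketch of the summing semantics from the data-model section — cost sums only when both sides are known, units and cost breakdowns sum per key, metadata merges — is below; field names follow the plan, but this is illustrative, not the final implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Usage:
    """Illustrative sketch of the Usage summing semantics described above."""

    cost: Optional[float] = None
    cost_details: Dict[str, float] = field(default_factory=dict)
    units: Dict[str, float] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __add__(self, other: "Usage") -> "Usage":
        # Cost is known only if both operands report one; otherwise the sum is unknown.
        cost = (
            self.cost + other.cost
            if self.cost is not None and other.cost is not None
            else None
        )

        def merge_sums(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
            # Per-key addition over the union of keys.
            return {k: a.get(k, 0) + b.get(k, 0) for k in {**a, **b}}

        return Usage(
            cost=cost,
            cost_details=merge_sums(self.cost_details, other.cost_details),
            units=merge_sums(self.units, other.units),
            metadata={**self.metadata, **other.metadata},  # later record wins on clashes
        )
```

This is what makes `self._usage_total += component_usage` in the registry well-defined even when some records carry no cost: one unknown cost poisons the total to `None` rather than silently under-reporting spend.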
+ +### Per-repetition collection + +`collect_usage()` walks all registered `UsageTrackableMixin` components and calls `gather_usage()` on each. Returns a structured dict (same shape as `collect_traces()`/`collect_configs()`). This goes into `report["usage"]`. + +```python +def collect_usage(self) -> Dict[str, Any]: + """Collect usage from all registered UsageTrackableMixin components.""" + usage = { + "metadata": {...}, + "agents": {}, + "models": {}, + "tools": {}, + ... + "environment": None, + "user": None, + } + + for key, component in self._usage_registry.items(): + category, comp_name = key.split(":", 1) + component_usage = component.gather_usage() + + # Store in structured dict (same pattern as traces/configs) + ... + + # Accumulate into persistent aggregates + self._usage_total += component_usage + self._usage_by_component[key] += component_usage + + return usage +``` + +### Persistent aggregates (survive `clear()`) + +The registry maintains running totals that persist across task repetitions: + +```python +class ComponentRegistry: + def __init__(self): + # ... existing per-repetition state ... + + # Persistent usage aggregates (NOT cleared between repetitions) + self._usage_total: Usage = Usage() + self._usage_by_component: Dict[str, Usage] = {} + + def clear(self): + # Clears per-repetition registrations + # Does NOT clear _usage_total or _usage_by_component + + @property + def total_usage(self) -> Usage: + """Running total across all repetitions. Queryable at any time.""" + return self._usage_total + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running totals across all repetitions.""" + return dict(self._usage_by_component) +``` + +### Registration + +The `register()` method gains an `isinstance(component, UsageTrackableMixin)` check, parallel to the existing `TraceableMixin` and `ConfigurableMixin` checks: + +```python +def register(self, category, name, component): + # ... existing trace/config registration ... 
+ + if isinstance(component, UsageTrackableMixin): + self._usage_registry[key] = component + self._usage_component_id_map[component_id] = key +``` + +`RegisterableComponent` type alias is updated to include `UsageTrackableMixin`. + +--- + +## Benchmark Integration + +### Report structure + +Each report gains a top-level `"usage"` key alongside `"traces"` and `"config"`: + +```python +report = { + "task_id": str(task.id), + "repeat_idx": repeat_idx, + "status": execution_status.value, + "traces": execution_traces, + "config": execution_configs, + "usage": execution_usage, # <-- new + "eval": eval_results, + "task": {...}, +} +``` + +### Live usage access + +```python +benchmark.usage # -> Usage (running grand total, delegates to registry) +benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) +``` + +### Failed task usage + +`collect_usage()` is called alongside `collect_all_traces()` and `collect_all_configs()` — before error status is determined. If a task fails mid-execution, whatever usage was accumulated up to the failure point is still collected and aggregated. + +--- + +## Adapter `_chat_impl` Enrichment (per-provider) + +Each adapter enriches the `ChatResponse.usage` dict with provider-specific fields beyond the basic three. The base class `TokenUsage.from_chat_response_usage()` handles mapping. + +| Adapter | Extra fields to extract | +|---------|------------------------| +| OpenAI | `reasoning_tokens` from `completion_tokens_details`, `cached_input_tokens` from `prompt_tokens_details.cached_tokens` | +| Anthropic | `cached_input_tokens` from `cache_read_input_tokens` | +| Google | `reasoning_tokens` from `thoughts_token_count` | +| LiteLLM | `reasoning_tokens` + `cached_input_tokens` from details; `cost` from `response._hidden_params` if available | +| HuggingFace | No change (local inference, no API cost) | + +--- + +## UsageReporter (post-hoc) + +Post-run utility that walks `report["usage"]` across all reports for sliced analysis. 
+ +``` +UsageReporter + @staticmethod from_reports(reports: List[Dict]) -> UsageReporter + + by_task() -> Dict[str, Usage] # keyed by task_id + by_component() -> Dict[str, Usage] # keyed by registry key (e.g., "models:main_model") + by_model() -> Dict[str, TokenUsage] # keyed by model_id (LLM-only) + total() -> Usage # grand total + + summary() -> Dict[str, Any] # nested dict with all breakdowns +``` + +Unlike the registry's live aggregates, `UsageReporter` can slice by task (since it sees the full report list with task IDs). + +--- + +## Evaluators + +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. That model should be registered in the benchmark via `self.register("evaluator_models", "judge", model)` inside `setup_evaluators()`. Since `ModelAdapter` now inherits `UsageTrackableMixin`, its usage is automatically collected under `usage.evaluator_models.judge`. + +No changes to the `Evaluator` base class. This is a registration convention. + +## LLMUser / AgenticLLMUser + +These already hold a `ModelAdapter`. Their model's usage is collected automatically (since `ModelAdapter` inherits `UsageTrackableMixin` and `chat()` accumulates records). The model is already registered by the benchmark. No changes needed. 
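As a concrete sketch of the key mapping `TokenUsage.from_chat_response_usage()` must perform, here is a plain-function version covering the two naming conventions visible in `api_usage_results.json` (Anthropic-style `input_tokens`/`output_tokens` vs. OpenAI/LiteLLM-style `prompt_tokens`/`completion_tokens`), plus the nested detail fields from the per-provider table above. The function name and dict return type are illustrative only; the real classmethod returns a `TokenUsage` instance:

```python
from typing import Any, Dict


def normalize_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Map a provider usage dict onto TokenUsage-style fields (illustrative sketch)."""
    input_tokens = int(usage.get("input_tokens") or usage.get("prompt_tokens") or 0)
    output_tokens = int(usage.get("output_tokens") or usage.get("completion_tokens") or 0)

    # Detail fields are nested differently per provider (see the enrichment table).
    completion_details = usage.get("completion_tokens_details") or {}
    prompt_details = usage.get("prompt_tokens_details") or {}

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": int(usage.get("total_tokens") or (input_tokens + output_tokens)),
        "reasoning_tokens": int(completion_details.get("reasoning_tokens") or 0),
        "cached_input_tokens": int(
            usage.get("cache_read_input_tokens")        # Anthropic
            or prompt_details.get("cached_tokens")      # OpenAI / LiteLLM
            or 0
        ),
    }
```

The `or 0` fallbacks matter in practice: the captured payloads show providers reporting `null` rather than omitting keys (e.g. LiteLLM's `completion_tokens_details: null` for Anthropic), so `dict.get` with a default alone is not enough.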
+ +--- + +## File Plan + +| File | Action | Content | +|------|--------|---------| +| `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` | +| `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property | +| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()` | +| `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict | +| `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility | +| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens` | +| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens` | +| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens` | +| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost` | +| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `UsageReporter` | +| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation | + +No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`. + +--- + +## Non-goals + +- **Hardcoded pricing tables** — prices change too often; user-supplied or provider-reported only. +- **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now. +- **Billing integration** — no webhook/billing system integration. +- **Streaming usage** — not supported yet (usage is captured after completion). + +## Open Questions + +1. 
**Pricing config format**: Should pricing be passed to `ModelAdapter.__init__` as a new param, or set externally after construction? Leaning toward a `pricing` kwarg on adapter init for ergonomics. When `pricing` is provided and a `TokenUsage` record has `cost=None`, cost is computed from `pricing["input"] * input_tokens + pricing["output"] * output_tokens`. +2. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. diff --git a/usage_tracking/api_usage_results.json b/usage_tracking/api_usage_results.json new file mode 100644 index 00000000..4dcd9b8e --- /dev/null +++ b/usage_tracking/api_usage_results.json @@ -0,0 +1,523 @@ +{ + "direct__openai__gpt5_mini": { + "id": "chatcmpl-DFJoysUJJeWtuOVc5UIq3EnA580Ok", + "choices": [ + { + "finish_reason": "length", + "index": 0, + "logprobs": null, + "message": { + "content": "", + "refusal": null, + "role": "assistant", + "annotations": [], + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1772543484, + "model": "gpt-5-mini-2025-08-07", + "object": "chat.completion", + "service_tier": "default", + "system_fingerprint": null, + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": 0, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0 + } + } + }, + "direct__anthropic__claude_haiku": { + "id": "msg_01UDvWsS78tyf4xQ1wwDsNop", + "content": [ + { + "citations": null, + "text": "# Hello! \ud83d\udc4b\n\nWelcome! I'm Claude, an AI assistant made by Anthropic. 
How can I help you today?", + "type": "text" + } + ], + "model": "claude-haiku-4-5-20251001", + "role": "assistant", + "stop_reason": "end_turn", + "stop_sequence": null, + "type": "message", + "usage": { + "cache_creation": { + "ephemeral_1h_input_tokens": 0, + "ephemeral_5m_input_tokens": 0 + }, + "cache_creation_input_tokens": 0, + "cache_read_input_tokens": 0, + "input_tokens": 11, + "output_tokens": 32, + "server_tool_use": null, + "service_tier": "standard", + "inference_geo": "not_available" + } + }, + "direct__google__gemini3_flash": { + "sdk_http_response": { + "headers": { + "content-type": "application/json; charset=UTF-8", + "vary": "Origin, X-Origin, Referer", + "content-encoding": "gzip", + "date": "Tue, 03 Mar 2026 13:11:32 GMT", + "server": "scaffolding on HTTPServer2", + "x-xss-protection": "0", + "x-frame-options": "SAMEORIGIN", + "x-content-type-options": "nosniff", + "server-timing": "gfet4t7; dur=4579", + "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000", + "transfer-encoding": "chunked" + }, + "body": null + }, + "candidates": [ + { + "content": { + "parts": [ + { + "media_resolution": null, + "code_execution_result": null, + "executable_code": null, + "file_data": null, + "function_call": null, + "function_response": null, + "inline_data": null, + "text": "Hello", + "thought": null, + "thought_signature": "EqwCCqkCAb4-9vtlfRlETMXR13Bw3xpBm-D3EzoUlVhmePvHy720UANX0hdyBGaq1d8FfiHVTMccuBl5r7sg3fy_2GoTexpytWLm15I7GfRloHt278ioOMDH3Ua8SVuGCIiRyIVSye1vkQw7p0KwZMzJ51fjhuBH-4_weZe24FglHg0p3eo79cKZIMz8eiWpcGtK6Xb25Gk1mXuKCi7GaifkKaOmhXTSjVZ-P-w5qERlTscMv-2YMD26Th8MUEg13PwFlz385A9RnHLH_oXkdr0lXAHemNj7dHdJEfNzjSgqCJdeVT3PCwH0v6-AIqdQuqD6jnvODLDPms5liN7VAVAAOZiq8tLDE771c3Xc-7eIPFdD3h9_cdvb82hjefYjEwC-aWNXQrl1SlVw0Un0", + "video_metadata": null + } + ], + "role": "model" + }, + "citation_metadata": null, + "finish_message": null, + "token_count": null, + "finish_reason": "MAX_TOKENS", + "avg_logprobs": null, + "grounding_metadata": null, + 
"index": 0, + "logprobs_result": null, + "safety_ratings": null, + "url_context_metadata": null + } + ], + "create_time": null, + "model_version": "gemini-3-flash-preview", + "prompt_feedback": null, + "response_id": "BN6maev1HoqB7M8P3fvBkAo", + "usage_metadata": { + "cache_tokens_details": null, + "cached_content_token_count": null, + "candidates_token_count": 1, + "candidates_tokens_details": null, + "prompt_token_count": 5, + "prompt_tokens_details": [ + { + "modality": "TEXT", + "token_count": 5 + } + ], + "thoughts_token_count": 59, + "tool_use_prompt_token_count": null, + "tool_use_prompt_tokens_details": null, + "total_token_count": 65, + "traffic_type": null + }, + "automatic_function_calling_history": [], + "parsed": null + }, + "litellm__openai__gpt5_mini": { + "id": "chatcmpl-DFJp7YfyGQH1HtypDlKDucycpUAdk", + "created": 1772543493, + "model": "gpt-5-mini-2025-08-07", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": "", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null + }, + "annotations": [] + }, + "provider_specific_fields": {} + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": 0, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": 0, + "text_tokens": null + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null + } + }, + "service_tier": "default" + }, + "litellm__anthropic__claude_haiku": { + "id": "chatcmpl-d251dec3-5b1a-424c-a432-cb71ea3d600f", + "created": 1772543495, + "model": "claude-haiku-4-5-20251001", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello! 
\ud83d\udc4b How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "citations": null, + "thinking_blocks": null + } + } + } + ], + "usage": { + "completion_tokens": 16, + "prompt_tokens": 11, + "total_tokens": 27, + "completion_tokens_details": null, + "prompt_tokens_details": { + "audio_tokens": null, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_creation_tokens": 0, + "cache_creation_token_details": { + "ephemeral_5m_input_tokens": 0, + "ephemeral_1h_input_tokens": 0 + } + }, + "cache_creation_input_tokens": 0, + "cache_read_input_tokens": 0 + } + }, + "litellm__google__gemini3_flash": { + "id": "DN6macurC97hnsEPvs-FmA0", + "created": 1772543495, + "model": "gemini-3-flash-preview", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": "Hello", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "images": [], + "thinking_blocks": [ + { + "type": "thinking", + "thinking": "{\"text\": \"Hello\"}", + "signature": "EpoCCpcCAb4+9vtdR++YPo/XeAmaLKPKkk7+YyeGjuHP9w646HEu9lG0xhb6qHOfkUTcH7xh08RlU6QXrTKAkXwfBAsSbiBfIBCGlzygFq+QGAS4LzUFaCLOD73MmSk7WiB393VWRw04NsxbhNtTH5aM9JFaxb7yvZMwWMckTON8L9Rv7gFlo6NmYjn01ct+kBKxleJzyD8d2AnAA4wMw9zqz8pLSAU9swKxmuqs0JkHt8WNRzwtw11xGt5zR909g/v/swLY/Oh+lcHiO7PMBsPHtBvzmPHTMM/ecn1VdA9sWqmoc8suFfzTaOPeegvtkhaytoZnaNZ/FoV9y9qVex5r8R0zvPd4ennA9/asI5P1i9HL0NedNJ78avW4" + } + ], + "provider_specific_fields": null + } + } + ], + "usage": { + "completion_tokens": 60, + "prompt_tokens": 5, + "total_tokens": 65, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": null, + "reasoning_tokens": 59, + "rejected_prediction_tokens": null, + "text_tokens": 1 + }, + "prompt_tokens_details": { + "audio_tokens": null, + "cached_tokens": null, + "text_tokens": 5, + "image_tokens": null + } + }, + 
"vertex_ai_grounding_metadata": [], + "vertex_ai_url_context_metadata": [], + "vertex_ai_safety_results": [], + "vertex_ai_citation_metadata": [] + }, + "openrouter__openai__gpt5_mini": { + "id": "gen-1772543500-cToh8SauCW1u8pGlb4qQ", + "created": 1772543500, + "model": "openai/gpt-5-mini-2025-08-07", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": null, + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null, + "reasoning_details": [ + { + "type": "reasoning.summary", + "summary": "**Greet and respond warmly**\n\nThe user says, \"Hello, World!\" \u2014 it feels like a friendly greeting. I should definitely greet back warmly and ask how I can help them! It\u2019s a classic reference in programming, but they might just want a simple hello. I'm thinking of keeping my response concise. So, I\u2019ll reply with a friendly greeting and a question about what they\u2019d like to know or discuss. That seems like a nice approach!**Greet and engage**\n\nThe user is saying \"Hello, World!\" which feels like a classic greeting\u2014possibly a nod to programming culture. I think it\u2019s best to respond warmly, so I\u2019ll greet back with enthusiasm. Since they might just be saying hello, I\u2019ll keep it concise and friendly by asking how I can help. 
It\u2019s always nice to invite a conversation!", + "format": "openai-responses-v1", + "index": 0 + }, + { + "type": "reasoning.encrypted", + "data": "gAAAAABppt4SyBWDwFHFtAmspfjQOATVwEjInh21J297wGrovStyUcpp_QsIc3E18qz3EreGtoiQVrUwb4UnffV87xCLDfmoBtxDaSxzbYEJUgYNtjZ6hPr8peySEgtsGPypJmtVJJQ2In9BN-57EeNEqAifTsmKxtCPCm4KHRRAmiXI3zXpokxr8IldC8LYXFM4stVdsWBJxwBYWM6G_vV4VWgmJr15jHIk0tVhx2Rtoca5JQ-0MAf0mQQQbLHBAFnGKhNgBoi_Qnq06A87xSejoUkb7Lb8N6_1u9nFYyixciACYaJIqMeRU5timTRIivBsypP8GPgx-6HyCfqRGhi2nbd5HvTKw4vLTFbtDBR2lRUUsFfJXnbLZvBZO2jbWhYAPvQnnQjpbU5jXE6jPM8z-J4eyGvUg49u3n7fFqe-Nxph4Fuophnbz1-ZCdboejHXfbz9-zKcX-FaVhCkuT82gUWNBq09lLpmOjGERQr5EHguZbhGC1QKkSG59iXMTfRlPssV42xYDpWL1ci0Jbg96TAq8sEnlaY9AMtJTh0NH14Ou8rX3g-g7U2MDomJbcZtY8oNZtyY_3s7ENSatmmaCsX6eQsRuuhSOrxXZSz1l4Zyxes-TseYCQya0YPu3eCNA7-7qhYBWDbtxdBqaTyN9krqM9rkC_p4fQn3q4-2S-Wt9kElCX-SrdMR_qXYZPz8O4BsJwM1aA8gQQji5X8CnYFWTBkBLEQuv2MaR6dDuwvZUsWuCf41YJGw4GJHmdBdDbZflvgpVmuBRwk476MqDac6jXl2VlOgQ0v0zk4M6j5Hb29uCgUFDv0aTyf24wqAZdYRsKQOSLV2Wke38K1qLvUfn99yqkBllsFpdk0DsJJBG4axiK4Kr10BhhApNJokRqIjkT8HU7w5PDRLPryFoc6kuMIuS72RhOKXxZrDu7D_fuWHseOMyVrDULSYhf_GfZEIcnFwGBcIRhhQZG-lzSs_wssCojIGjRX0J67fOZk8YCCvjeabRCbbGTbHDXZxhRL_5Niwz0V4Jgd_97pOlIsVVOgS2-IuIc4445WjpkqGk6mRplBTZEPwZV2ny3v9w3aq-W6_lasXOOmv342RTXXo-pSKaZrowkI3rQUJ_fR5y7mumdDI82C-2onxbNfWI65PUgRW5KUXVgL4RXPu0yI0wu2z7LTyNaVoLSaF9wOtOzEtLux9Pf50EYjqlfD7niQoVR8Pv9D-1fhvrFDmeAzmgBdaqmCWhJWJgZUvtN43Wv2UNjk=", + "format": "openai-responses-v1", + "id": "rs_0f89d92952e937610169a6de0d3f28819085143727619d92cb", + "index": 1 + } + ] + } + }, + "provider_specific_fields": { + "native_finish_reason": "max_output_tokens" + } + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + 
"cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 0.0001305, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 0.0001305, + "upstream_inference_prompt_cost": 2.5e-06, + "upstream_inference_completions_cost": 0.000128 + } + }, + "provider": "OpenAI" + }, + "openrouter__anthropic__claude_haiku": { + "id": "gen-1772543509-FAaKhTwazzoJmVDDd3ih", + "created": 1772543509, + "model": "anthropic/claude-4.5-haiku-20251001", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello! \ud83d\udc4b How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null + } + }, + "provider_specific_fields": { + "native_finish_reason": "stop" + } + } + ], + "usage": { + "completion_tokens": 16, + "prompt_tokens": 11, + "total_tokens": 27, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 0, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 9.1e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 9.1e-05, + "upstream_inference_prompt_cost": 1.1e-05, + "upstream_inference_completions_cost": 8e-05 + } + }, + "provider": "Google" + }, + "openrouter__google__gemini3_flash": { + "id": "gen-1772543512-Mxn343CzRXITNLaWa3uw", + "created": 1772543512, + "model": "google/gemini-3-flash-preview-20251217", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello, World! 
How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null, + "reasoning_details": [ + { + "type": "reasoning.encrypted", + "data": "CiEBjz1rX5mfLGj1Fml96xozj3K4fv7JeTBdOSaUUlxd96c=", + "format": "google-gemini-v1", + "index": 0 + } + ] + } + }, + "provider_specific_fields": { + "native_finish_reason": "STOP" + } + } + ], + "usage": { + "completion_tokens": 11, + "prompt_tokens": 4, + "total_tokens": 15, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 0, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 3.5e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 3.5e-05, + "upstream_inference_prompt_cost": 2e-06, + "upstream_inference_completions_cost": 3.3e-05 + } + }, + "provider": "Google" + }, + "openrouter__qwen__qwen3_30b": { + "id": "gen-1772543515-76qFgjV9ySYOE8mtplV6", + "created": 1772543515, + "model": "qwen/qwen3-30b-a3b-04-28", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": null, + "role": "assistant", + "tool_calls": null, + "function_call": null, + "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", + "provider_specific_fields": { + "refusal": null, + "reasoning": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. 
I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", + "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like," + } + }, + "provider_specific_fields": { + "native_finish_reason": "length" + } + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 13, + "total_tokens": 77, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 75, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 1.896e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 1.896e-05, + "upstream_inference_prompt_cost": 1.04e-06, + "upstream_inference_completions_cost": 1.792e-05 + } + }, + "provider": "DeepInfra" + } +} \ No newline at end of file diff --git a/usage_tracking/api_usage_test.py b/usage_tracking/api_usage_test.py new file mode 100644 index 00000000..1c0a34b2 --- /dev/null +++ b/usage_tracking/api_usage_test.py @@ -0,0 +1,154 @@ +""" +Test script that calls GPT-5 mini, Claude Haiku 4.5, and Gemini 3 Flash in three +conditions each — (1) native client, (2) LiteLLM, (3) LiteLLM via OpenRouter — +plus Qwen 3 via LiteLLM+OpenRouter. Saves full response dicts to JSON for +usage/cost analysis. 
+""" + +import json +import os +import time +from pathlib import Path + +import anthropic +import litellm +import requests +from dotenv import load_dotenv +from google import genai +from google.genai import types +from openai import OpenAI + +load_dotenv() + +OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] +ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"] +GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"] +OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"] + +# LiteLLM reads OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY from env. +# For Gemini it expects GEMINI_API_KEY, so alias it. +os.environ.setdefault("GEMINI_API_KEY", GOOGLE_API_KEY) + +PROMPT = "Hello, World!" +MAX_TOKENS = 64 +TOTAL = 10 + +results = {} + + +def step(n: int, label: str): + print(f"{n}/{TOTAL} {label} ...") + + +# =========================================================================== # +# CONDITION 1 — Native SDKs (direct) +# =========================================================================== # + +# -- 1. GPT-5 mini (OpenAI) ------------------------------------------------ # +step(1, "GPT-5 mini — direct (OpenAI SDK)") +openai_client = OpenAI(api_key=OPENAI_API_KEY) +resp = openai_client.chat.completions.create( + model="gpt-5-mini", + messages=[{"role": "user", "content": PROMPT}], + max_completion_tokens=MAX_TOKENS, +) +results["direct__openai__gpt5_mini"] = resp.model_dump() +print(f" done — {resp.usage.total_tokens} tokens") + +# -- 2. Claude Haiku 4.5 (Anthropic) --------------------------------------- # +step(2, "Claude Haiku 4.5 — direct (Anthropic SDK)") +anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) +resp = anthropic_client.messages.create( + model="claude-haiku-4-5-20251001", + max_tokens=MAX_TOKENS, + messages=[{"role": "user", "content": PROMPT}], +) +results["direct__anthropic__claude_haiku"] = resp.model_dump() +print(f" done — {resp.usage.input_tokens + resp.usage.output_tokens} tokens") + +# -- 3. 
Gemini 3 Flash (Google) -------------------------------------------- # +step(3, "Gemini 3 Flash — direct (Google GenAI SDK)") +google_client = genai.Client(api_key=GOOGLE_API_KEY) +resp = google_client.models.generate_content( + model="gemini-3-flash-preview", + contents=PROMPT, + config=types.GenerateContentConfig(max_output_tokens=MAX_TOKENS), +) +results["direct__google__gemini3_flash"] = resp.model_dump(mode="json") +total = resp.usage_metadata.total_token_count if resp.usage_metadata else "n/a" +print(f" done — {total} tokens") + + +# =========================================================================== # +# CONDITION 2 — LiteLLM (direct to providers) +# =========================================================================== # + +litellm_direct_models = { + "litellm__openai__gpt5_mini": "gpt-5-mini", + "litellm__anthropic__claude_haiku": "claude-haiku-4-5-20251001", + "litellm__google__gemini3_flash": "gemini/gemini-3-flash-preview", +} + +for i, (label, model) in enumerate(litellm_direct_models.items(), start=4): + step(i, f"{model} — LiteLLM (direct)") + resp = litellm.completion( + model=model, + messages=[{"role": "user", "content": PROMPT}], + max_tokens=MAX_TOKENS, + ) + results[label] = resp.model_dump() + usage_total = resp.usage.total_tokens if resp.usage else "n/a" + print(f" done — {usage_total} tokens") + + +# =========================================================================== # +# CONDITION 3 — LiteLLM via OpenRouter (+Qwen) +# =========================================================================== # + + +def fetch_openrouter_generation(gen_id: str) -> dict | None: + """Query OpenRouter's generation endpoint for cost metadata.""" + time.sleep(2) # brief wait for metadata to be available + r = requests.get( + f"https://openrouter.ai/api/v1/generation?id={gen_id}", + headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}, + ) + if r.status_code == 200: + return r.json() + return None + + +litellm_openrouter_models = { + 
"openrouter__openai__gpt5_mini": "openrouter/openai/gpt-5-mini", + "openrouter__anthropic__claude_haiku": "openrouter/anthropic/claude-haiku-4-5", + "openrouter__google__gemini3_flash": "openrouter/google/gemini-3-flash-preview", + "openrouter__qwen__qwen3_30b": "openrouter/qwen/qwen3-30b-a3b", +} + +for i, (label, model) in enumerate(litellm_openrouter_models.items(), start=7): + step(i, f"{model} — LiteLLM (OpenRouter)") + resp = litellm.completion( + model=model, + messages=[{"role": "user", "content": PROMPT}], + max_tokens=MAX_TOKENS, + ) + result = resp.model_dump() + + # Fetch OpenRouter generation metadata (cost, native tokens, etc.) + gen_meta = fetch_openrouter_generation(resp.id) + if gen_meta: + result["_openrouter_generation"] = gen_meta + + results[label] = result + usage_total = resp.usage.total_tokens if resp.usage else "n/a" + print(f" done — {usage_total} tokens") + + +# =========================================================================== # +# Save results +# =========================================================================== # +out_path = Path(__file__).parent / "api_usage_results.json" +with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + +print(f"\nResults saved to {out_path}") From 4eb847f7b28268bcc208702bb36f1c96535c4b4b Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 18:05:58 +0100 Subject: [PATCH 02/19] initial commit --- maseval/__init__.py | 10 + maseval/core/benchmark.py | 43 ++- maseval/core/cost.py | 132 +++++++++ maseval/core/model.py | 40 ++- maseval/core/registry.py | 125 +++++++- maseval/core/reporting.py | 148 ++++++++++ maseval/core/usage.py | 301 ++++++++++++++++++++ maseval/interface/cost.py | 102 +++++++ maseval/interface/inference/anthropic.py | 10 +- maseval/interface/inference/google_genai.py | 10 +- maseval/interface/inference/huggingface.py | 6 +- maseval/interface/inference/litellm.py | 25 +- maseval/interface/inference/openai.py | 20 +- usage_tracking/PLAN.md | 85 
+++++- 14 files changed, 1029 insertions(+), 28 deletions(-) create mode 100644 maseval/core/cost.py create mode 100644 maseval/core/reporting.py create mode 100644 maseval/core/usage.py create mode 100644 maseval/interface/cost.py diff --git a/maseval/__init__.py b/maseval/__init__.py index d50350ff..addedf3e 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -37,6 +37,9 @@ from .core.evaluator import Evaluator from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin +from .core.usage import Usage, TokenUsage, UsageTrackableMixin +from .core.cost import CostCalculator, StaticPricingCalculator +from .core.reporting import UsageReporter from .core.registry import ComponentRegistry from .core.context import TaskContext from .core.exceptions import ( @@ -87,6 +90,13 @@ "MessageHistory", "ToolInvocationHistory", "TraceableMixin", + # Usage tracking + "Usage", + "TokenUsage", + "UsageTrackableMixin", + "CostCalculator", + "StaticPricingCalculator", + "UsageReporter", # Registry and execution context "ComponentRegistry", "TaskContext", diff --git a/maseval/core/benchmark.py b/maseval/core/benchmark.py index a027eb9d..d59ad233 100644 --- a/maseval/core/benchmark.py +++ b/maseval/core/benchmark.py @@ -16,6 +16,7 @@ from .callback import BenchmarkCallback from .user import User from .tracing import TraceableMixin +from .usage import Usage from .registry import ComponentRegistry, RegisterableComponent from .context import TaskContext from .utils.system_info import gather_benchmark_config @@ -442,6 +443,36 @@ def collect_all_configs(self) -> Dict[str, Any]: """ return self._registry.collect_configs() + def collect_all_usage(self) -> Dict[str, Any]: + """Collect usage from all registered components for the current task repetition. + + This method is called automatically by ``run()`` after each task repetition + completes. 
It gathers usage from all registered ``UsageTrackableMixin`` + components and also accumulates into persistent running totals accessible + via ``usage`` and ``usage_by_component``. + + Returns: + Structured dictionary containing usage from all registered components. + """ + return self._registry.collect_usage() + + @property + def usage(self) -> Usage: + """Running usage total across all task repetitions. + + Queryable at any time, including while the benchmark is still running. + Returns the grand total of all usage collected so far. + """ + return self._registry.total_usage + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running usage totals across all repetitions. + + Keys are registry keys (e.g., ``"models:main_model"``). + """ + return self._registry.usage_by_component + def _invoke_callbacks(self, method_name: str, *args, suppress_errors: bool = True, **kwargs) -> List[Exception]: """Invoke a callback method on all registered callbacks (thread-safe). @@ -1176,14 +1207,16 @@ def _execute_task_repetition( final_answers = None - # 3. Collect traces and configs (always attempt this) + # 3. Collect traces, configs, and usage (always attempt this) + execution_usage: Optional[Dict[str, Any]] = None try: execution_configs = self.collect_all_configs() execution_traces = self.collect_all_traces() + execution_usage = self.collect_all_usage() # Store in context for potential timeout errors context.set_collected_traces(execution_traces) except Exception as e: - # If trace/config collection fails, record it but continue + # If collection fails, record it but continue execution_configs = { "error": f"Failed to collect configs: {e}", "error_type": type(e).__name__, @@ -1192,6 +1225,11 @@ def _execute_task_repetition( "error": f"Failed to collect traces: {e}", "error_type": type(e).__name__, } + if execution_usage is None: + execution_usage = { + "error": f"Failed to collect usage: {e}", + "error_type": type(e).__name__, + } # 4. 
Evaluate (skip if task execution failed) if execution_status == TaskExecutionStatus.SUCCESS: @@ -1234,6 +1272,7 @@ def _execute_task_repetition( "error": error_info, "traces": execution_traces, "config": execution_configs, + "usage": execution_usage, "eval": eval_results, "task": { "query": task.query, diff --git a/maseval/core/cost.py b/maseval/core/cost.py new file mode 100644 index 00000000..a6eb8583 --- /dev/null +++ b/maseval/core/cost.py @@ -0,0 +1,132 @@ +"""Pluggable cost calculation for usage records. + +This module provides the ``CostCalculator`` protocol and a built-in +``StaticPricingCalculator`` that computes cost from token counts and +user-supplied pricing tables. For automatic pricing via LiteLLM's +bundled model database, see ``maseval.interface.cost``. + +Cost calculators are optional — if no calculator is provided to a +``ModelAdapter``, cost is only set when the provider reports it directly +(e.g., LiteLLM's ``response._hidden_params.response_cost``). +""" + +from __future__ import annotations + +from typing import Any, Dict, Optional, Protocol, runtime_checkable + +from .usage import TokenUsage + + +@runtime_checkable +class CostCalculator(Protocol): + """Protocol for computing cost from token usage. + + Implementations receive a ``TokenUsage`` and the model ID, and return + the cost in whatever unit the calculator declares (typically USD). + + Example: + ```python + class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens + ``` + """ + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost for a single chat call. + + Args: + usage: Token usage from the call. + model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). 
+ + Returns: + Cost as a float, or ``None`` if pricing is unknown for this model. + """ + ... + + +class StaticPricingCalculator: + """Cost calculator using user-supplied per-model pricing. + + Pricing is specified as cost per token (not per 1K or 1M tokens). + If a model is not in the pricing table, ``calculate_cost`` returns ``None``. + + Args: + pricing: Dict mapping model IDs to their per-token rates. + Each value is a dict with keys: + + - ``"input"`` — cost per input token (required) + - ``"output"`` — cost per output token (required) + - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + + Example: + ```python + calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, + }) + + model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) + ``` + + For university clusters or custom credit systems, the "cost" unit + is whatever the pricing values represent (credits, EUR, etc.): + + ```python + calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token + }) + ``` + """ + + def __init__(self, pricing: Dict[str, Dict[str, float]]): + self._pricing = pricing + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost from static per-token rates. + + Args: + usage: Token usage from the call. + model_id: The model identifier to look up in the pricing table. + + Returns: + Computed cost, or ``None`` if the model is not in the pricing table. 
+ """ + rates = self._pricing.get(model_id) + if rates is None: + return None + + input_rate = rates.get("input", 0.0) + output_rate = rates.get("output", 0.0) + cached_rate = rates.get("cached_input", input_rate) + + # Non-cached input tokens = total input - cached + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + + cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + + return cost + + def add_model(self, model_id: str, rates: Dict[str, float]) -> None: + """Add or update pricing for a model. + + Args: + model_id: The model identifier. + rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). + """ + self._pricing[model_id] = rates + + @property + def models(self) -> list[str]: + """List of model IDs with pricing configured.""" + return list(self._pricing.keys()) + + def gather_config(self) -> Dict[str, Any]: + """Return pricing configuration for reproducibility.""" + return { + "type": type(self).__name__, + "pricing": dict(self._pricing), + } diff --git a/maseval/core/model.py b/maseval/core/model.py index cac1c2ed..f33e48f4 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -55,6 +55,8 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import Usage, TokenUsage, UsageTrackableMixin +from .cost import CostCalculator from .history import MessageHistory @@ -133,7 +135,7 @@ def to_message(self) -> Dict[str, Any]: return msg -class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): +class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Abstract base class for model adapters. ModelAdapter provides a consistent interface for LLM inference across @@ -169,17 +171,24 @@ class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): adapter's seed parameter. 
""" - def __init__(self, seed: Optional[int] = None): + def __init__(self, seed: Optional[int] = None, cost_calculator: Optional[CostCalculator] = None): """Initialize the model adapter with call tracing. Args: seed: Seed for deterministic generation. Passed to the underlying provider API if supported. If the provider doesn't support seeding, subclasses should raise SeedingError. + cost_calculator: Optional cost calculator for computing USD (or + other unit) cost from token counts. If provided and the + provider does not report cost directly, the calculator is + used to fill in ``Usage.cost`` after each call. Provider- + reported cost always takes precedence. """ super().__init__() self._seed = seed + self._cost_calculator = cost_calculator self.logs: List[Dict[str, Any]] = [] + self._usage_records: List[Usage] = [] @property def seed(self) -> Optional[int]: @@ -298,6 +307,17 @@ def chat( } ) + # Record token usage if available + if result.usage: + cost = result.usage.get("cost") if isinstance(result.usage.get("cost"), (int, float)) else None + token_usage = TokenUsage.from_chat_response_usage(result.usage, cost=cost, kind="llm") + + # If no provider-reported cost, try the cost calculator + if token_usage.cost is None and self._cost_calculator is not None: + token_usage.cost = self._cost_calculator.calculate_cost(token_usage, self.model_id) + + self._usage_records.append(token_usage) + return result except Exception as e: @@ -375,6 +395,16 @@ def generate( response = self.chat(messages, generation_params=generation_params, **kwargs) return response.content or "" + def gather_usage(self) -> Usage: + """Gather accumulated token usage from all chat calls. + + Returns: + Summed TokenUsage across all calls, or empty Usage if no calls were made. + """ + if not self._usage_records: + return Usage() + return sum(self._usage_records, Usage()) + def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this model adapter. 
@@ -431,9 +461,13 @@ def gather_config(self) -> Dict[str, Any]: Returns: Dictionary containing model configuration. """ - return { + config = { **super().gather_config(), "model_id": self.model_id, "adapter_type": type(self).__name__, "seed": self._seed, } + if self._cost_calculator is not None: + gather = getattr(self._cost_calculator, "gather_config", None) + config["cost_calculator"] = gather() if callable(gather) else type(self._cost_calculator).__name__ + return config diff --git a/maseval/core/registry.py b/maseval/core/registry.py index 267e9ee2..e34fc972 100644 --- a/maseval/core/registry.py +++ b/maseval/core/registry.py @@ -11,9 +11,10 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import Usage, UsageTrackableMixin # Type alias for components that can be registered -RegisterableComponent = Union[TraceableMixin, ConfigurableMixin] +RegisterableComponent = Union[TraceableMixin, ConfigurableMixin, UsageTrackableMixin] class ComponentRegistry: @@ -48,6 +49,12 @@ def __init__(self, benchmark_config: Optional[Dict[str, Any]] = None): self._local = threading.local() self._benchmark_config = benchmark_config or {} + # Persistent usage aggregates (NOT cleared between repetitions). + # Protected by a lock since multiple threads may call collect_usage(). 
+ self._usage_lock = threading.Lock() + self._usage_total: Usage = Usage() + self._usage_by_component: Dict[str, Usage] = {} + # --- Thread-local state properties --- @property @@ -74,6 +81,18 @@ def _config_component_id_map(self) -> Dict[int, str]: self._local.config_component_id_map = {} return self._local.config_component_id_map + @property + def _usage_registry(self) -> Dict[str, UsageTrackableMixin]: + if not hasattr(self._local, "usage_registry"): + self._local.usage_registry = {} + return self._local.usage_registry + + @property + def _usage_component_id_map(self) -> Dict[int, str]: + if not hasattr(self._local, "usage_component_id_map"): + self._local.usage_component_id_map = {} + return self._local.usage_component_id_map + # --- Public API --- def register(self, category: str, name: str, component: RegisterableComponent) -> RegisterableComponent: @@ -94,7 +113,11 @@ def register(self, category: str, name: str, component: RegisterableComponent) - key = f"{category}:{name}" # Check for duplicate registration under different key - existing_key = self._component_id_map.get(component_id) or self._config_component_id_map.get(component_id) + existing_key = ( + self._component_id_map.get(component_id) + or self._config_component_id_map.get(component_id) + or self._usage_component_id_map.get(component_id) + ) if existing_key and existing_key != key: raise ValueError( f"Component is already registered as '{existing_key}' and cannot be " @@ -114,14 +137,25 @@ def register(self, category: str, name: str, component: RegisterableComponent) - self._config_registry[key] = component self._config_component_id_map[component_id] = key + # Register for usage tracking if supported + if isinstance(component, UsageTrackableMixin): + self._usage_registry[key] = component + self._usage_component_id_map[component_id] = key + return component def clear(self) -> None: - """Clear all registrations for the current thread.""" + """Clear per-repetition registrations for the current 
thread. + + Does NOT clear persistent usage aggregates (``total_usage``, + ``usage_by_component``), which accumulate across all repetitions. + """ self._trace_registry.clear() self._component_id_map.clear() self._config_registry.clear() self._config_component_id_map.clear() + self._usage_registry.clear() + self._usage_component_id_map.clear() def collect_traces(self) -> Dict[str, Any]: """Collect execution traces from all registered components.""" @@ -238,6 +272,91 @@ def collect_configs(self) -> Dict[str, Any]: return configs + def collect_usage(self) -> Dict[str, Any]: + """Collect usage from all registered UsageTrackableMixin components. + + Returns a structured dict (same shape as ``collect_traces()`` and + ``collect_configs()``). Also accumulates into persistent aggregates + (``total_usage``, ``usage_by_component``) that survive ``clear()``. + """ + usage: Dict[str, Any] = { + "metadata": { + "timestamp": datetime.now().isoformat(), + "thread_id": threading.current_thread().ident, + "total_components": len(self._usage_registry), + }, + "agents": {}, + "models": {}, + "tools": {}, + "simulators": {}, + "callbacks": {}, + "environment": None, + "user": None, + "other": {}, + } + + for key, component in self._usage_registry.items(): + category, comp_name = key.split(":", 1) + + try: + component_usage = component.gather_usage() + + # Inject grouping fields from registry context if not set + if component_usage.category is None: + component_usage.category = category + if component_usage.component_name is None: + component_usage.component_name = comp_name + + usage_dict = component_usage.to_dict() + + # Handle environment and user as direct values (not nested in dict) + if category == "environment": + usage["environment"] = usage_dict + elif category == "user": + usage["user"] = usage_dict + else: + if category not in usage: + usage[category] = {} + usage[category][comp_name] = usage_dict + + # Accumulate into persistent aggregates (thread-safe) + with 
self._usage_lock: + self._usage_total = self._usage_total + component_usage + if key in self._usage_by_component: + self._usage_by_component[key] = self._usage_by_component[key] + component_usage + else: + self._usage_by_component[key] = component_usage + + except Exception as e: + error_info = { + "error": f"Failed to gather usage: {e}", + "error_type": type(e).__name__, + "component_type": type(component).__name__, + } + + if category == "environment": + usage["environment"] = error_info + elif category == "user": + usage["user"] = error_info + else: + if category not in usage: + usage[category] = {} + usage[category][comp_name] = error_info + + return usage + + @property + def total_usage(self) -> Usage: + """Running usage total across all repetitions. Queryable at any time.""" + with self._usage_lock: + return self._usage_total + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running totals across all repetitions.""" + with self._usage_lock: + return dict(self._usage_by_component) + def update_benchmark_config(self, benchmark_config: Dict[str, Any]) -> None: """Update the benchmark-level configuration. diff --git a/maseval/core/reporting.py b/maseval/core/reporting.py new file mode 100644 index 00000000..2465fba0 --- /dev/null +++ b/maseval/core/reporting.py @@ -0,0 +1,148 @@ +"""Post-hoc usage reporting utilities. + +This module provides ``UsageReporter`` for slicing and analyzing usage data +from benchmark reports. Unlike the registry's live aggregates (which provide +running totals), the reporter can slice by task since it sees the full report +list with task IDs. +""" + +from __future__ import annotations + +from typing import Any, Dict, List + +from .usage import Usage, TokenUsage + + +class UsageReporter: + """Post-hoc utility for analyzing usage across benchmark reports. + + Walks ``report["usage"]`` across all reports to produce breakdowns + by task, component, model, etc. 
+ + Example: + ```python + reporter = UsageReporter.from_reports(benchmark.reports) + print(reporter.total()) + print(reporter.by_task()) + print(reporter.by_component()) + ``` + """ + + def __init__(self, entries: List[Dict[str, Any]]): + """Initialize with raw entries extracted from reports. + + Args: + entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, + and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). + """ + self._entries = entries + + @staticmethod + def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: + """Create a UsageReporter from benchmark reports. + + Args: + reports: The ``benchmark.reports`` list. + + Returns: + A UsageReporter ready for analysis. + """ + entries = [] + for report in reports: + usage_data = report.get("usage") + if not usage_data or "error" in usage_data: + continue + + usage_items = [] + for category, value in usage_data.items(): + if category == "metadata": + continue + if isinstance(value, dict) and "cost" in value: + # Direct value (environment/user) — it's a usage dict + usage_items.append((category, value)) + elif isinstance(value, dict): + # Category dict with component names as keys + for comp_name, comp_usage in value.items(): + if isinstance(comp_usage, dict) and "error" not in comp_usage: + usage_items.append((f"{category}:{comp_name}", comp_usage)) + + entries.append( + { + "task_id": report.get("task_id"), + "repeat_idx": report.get("repeat_idx"), + "usage_items": usage_items, + } + ) + + return UsageReporter(entries) + + @staticmethod + def _usage_from_dict(d: Dict[str, Any]) -> Usage: + """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" + has_tokens = "input_tokens" in d + if has_tokens: + return TokenUsage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + input_tokens=d.get("input_tokens", 0), + output_tokens=d.get("output_tokens", 0), + 
total_tokens=d.get("total_tokens", 0), + cached_input_tokens=d.get("cached_input_tokens", 0), + reasoning_tokens=d.get("reasoning_tokens", 0), + audio_tokens=d.get("audio_tokens", 0), + ) + return Usage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + ) + + def by_task(self) -> Dict[str, Usage]: + """Aggregate usage by task_id across all repetitions.""" + result: Dict[str, Usage] = {} + for entry in self._entries: + task_id = entry["task_id"] + for _key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if task_id in result: + result[task_id] = result[task_id] + usage + else: + result[task_id] = usage + return result + + def by_component(self) -> Dict[str, Usage]: + """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" + result: Dict[str, Usage] = {} + for entry in self._entries: + for key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if key in result: + result[key] = result[key] + usage + else: + result[key] = usage + return result + + def total(self) -> Usage: + """Grand total across all tasks and components.""" + all_usages = [] + for entry in self._entries: + for _key, usage_dict in entry["usage_items"]: + all_usages.append(self._usage_from_dict(usage_dict)) + if not all_usages: + return Usage() + return sum(all_usages, Usage()) + + def summary(self) -> Dict[str, Any]: + """Nested dict with all breakdowns.""" + return { + "total": self.total().to_dict(), + "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, + "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, + } diff --git a/maseval/core/usage.py b/maseval/core/usage.py new file mode 100644 index 00000000..78edeab9 --- /dev/null +++ b/maseval/core/usage.py @@ -0,0 +1,301 @@ +"""Core usage tracking infrastructure for API cost and resource monitoring. 
+ +This module provides the `Usage` and `TokenUsage` data classes for recording +billable resource consumption, and the `UsageTrackableMixin` that enables +automatic usage collection through the component registry. + +Usage tracking is a first-class collection axis alongside tracing +(`TraceableMixin`) and configuration (`ConfigurableMixin`). Components that +inherit `UsageTrackableMixin` have their usage automatically collected by the +registry via `gather_usage()`. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Any, Optional, Dict + + +@dataclass +class Usage: + """Generic usage record for any billable resource. + + Represents accumulated cost and countable units for a component or + aggregated group. Grouping fields (`provider`, `category`, + `component_name`, `kind`) identify what scope the record covers. + When two records are summed, matching grouping fields are preserved; + mismatches become `None` (meaning "aggregated over"). + + Attributes: + cost: Total cost in USD. `None` means unknown/not reported. + units: Arbitrary countable units (e.g., ``{"api_calls": 3}``). + provider: Provider identifier (e.g., ``"anthropic"``, ``"bloomberg"``). + category: Registry category (e.g., ``"models"``, ``"tools"``). + component_name: Component name within category (e.g., ``"main_model"``). + kind: Component kind (e.g., ``"llm"``, ``"service"``, ``"local"``). 
+ + Example: + ```python + usage = Usage(cost=0.05, units={"api_calls": 1}, provider="bloomberg", kind="service") + + # Summing preserves matching fields + total = usage + Usage(cost=0.03, units={"api_calls": 2}, provider="bloomberg", kind="service") + assert total.cost == 0.08 + assert total.units == {"api_calls": 3} + assert total.provider == "bloomberg" + + # Mismatched fields become None + mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm") + assert mixed.provider is None # aggregated over + assert mixed.kind is None # aggregated over + ``` + """ + + cost: Optional[float] = None + units: Dict[str, int | float] = field(default_factory=dict) + provider: Optional[str] = None + category: Optional[str] = None + component_name: Optional[str] = None + kind: Optional[str] = None + + def __add__(self, other: Usage) -> Usage: + if not isinstance(other, Usage): + return NotImplemented + + # Sum costs: both known -> sum, either unknown -> None + if self.cost is not None and other.cost is not None: + cost = self.cost + other.cost + else: + cost = None + + # Sum units + units: Dict[str, int | float] = dict(self.units) + for key, value in other.units.items(): + units[key] = units.get(key, 0) + value + + # Grouping fields: preserve on match, None on mismatch + provider = self.provider if self.provider == other.provider else None + category = self.category if self.category == other.category else None + component_name = self.component_name if self.component_name == other.component_name else None + kind = self.kind if self.kind == other.kind else None + + return Usage( + cost=cost, + units=units, + provider=provider, + category=category, + component_name=component_name, + kind=kind, + ) + + def __radd__(self, other: object) -> Usage: + """Support sum() by handling 0 + Usage.""" + if other == 0: + return self + if isinstance(other, Usage): + return other.__add__(self) + return NotImplemented + + def to_dict(self) -> Dict[str, Any]: + """Serialize to a 
JSON-compatible dictionary.""" + return { + "cost": self.cost, + "units": dict(self.units), + "provider": self.provider, + "category": self.category, + "component_name": self.component_name, + "kind": self.kind, + } + + +@dataclass +class TokenUsage(Usage): + """LLM-specific usage record with token counts. + + Extends `Usage` with token fields reported by LLM providers. Use + `from_chat_response_usage()` to create from the dict returned by + model adapters. + + Attributes: + input_tokens: Number of input/prompt tokens. + output_tokens: Number of output/completion tokens. + total_tokens: Total tokens (input + output). + cached_input_tokens: Tokens served from cache (Anthropic ``cache_read_input_tokens``, + OpenAI ``cached_tokens``). + reasoning_tokens: Tokens used for reasoning (OpenAI ``reasoning_tokens``, + Google ``thoughts_token_count``). + audio_tokens: Tokens for audio processing (OpenAI). + + Example: + ```python + token_usage = TokenUsage.from_chat_response_usage({ + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + }) + assert token_usage.input_tokens == 100 + ``` + """ + + input_tokens: int = 0 + output_tokens: int = 0 + total_tokens: int = 0 + cached_input_tokens: int = 0 + reasoning_tokens: int = 0 + audio_tokens: int = 0 + + def __add__(self, other: Usage) -> Usage: + base = super().__add__(other) + if not isinstance(base, Usage): + return NotImplemented + + if isinstance(other, TokenUsage): + return TokenUsage( + cost=base.cost, + units=base.units, + provider=base.provider, + category=base.category, + component_name=base.component_name, + kind=base.kind, + input_tokens=self.input_tokens + other.input_tokens, + output_tokens=self.output_tokens + other.output_tokens, + total_tokens=self.total_tokens + other.total_tokens, + cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens, + reasoning_tokens=self.reasoning_tokens + other.reasoning_tokens, + audio_tokens=self.audio_tokens + other.audio_tokens, + ) + + # Adding 
TokenUsage + plain Usage: preserve token fields from self
+        return TokenUsage(
+            cost=base.cost,
+            units=base.units,
+            provider=base.provider,
+            category=base.category,
+            component_name=base.component_name,
+            kind=base.kind,
+            input_tokens=self.input_tokens,
+            output_tokens=self.output_tokens,
+            total_tokens=self.total_tokens,
+            cached_input_tokens=self.cached_input_tokens,
+            reasoning_tokens=self.reasoning_tokens,
+            audio_tokens=self.audio_tokens,
+        )
+
+    def __radd__(self, other: object) -> Usage:
+        """Support ``sum()`` and ``Usage + TokenUsage`` without dropping token fields.
+
+        Python only tries the reflected method first when the right operand's
+        class *overrides* it; inheriting ``Usage.__radd__`` is not enough.
+        Without this override, ``Usage() + TokenUsage(...)`` (e.g., the start
+        value of a ``sum()``) would dispatch to ``Usage.__add__`` and return a
+        plain ``Usage``, silently discarding all token counts.
+        """
+        if other == 0:
+            return self
+        if isinstance(other, Usage):
+            return self.__add__(other)
+        return NotImplemented
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to a JSON-compatible dictionary."""
+        return {
+            **super().to_dict(),
+            "input_tokens": self.input_tokens,
+            "output_tokens": self.output_tokens,
+            "total_tokens": self.total_tokens,
+            "cached_input_tokens": self.cached_input_tokens,
+            "reasoning_tokens": self.reasoning_tokens,
+            "audio_tokens": self.audio_tokens,
+        }
+
+    @classmethod
+    def from_chat_response_usage(
+        cls,
+        usage_dict: Dict[str, int],
+        *,
+        cost: Optional[float] = None,
+        provider: Optional[str] = None,
+        category: Optional[str] = None,
+        component_name: Optional[str] = None,
+        kind: str = "llm",
+    ) -> TokenUsage:
+        """Create a TokenUsage from a ChatResponse.usage dict.
+
+        Maps provider-specific key names to the canonical fields.
+
+        Args:
+            usage_dict: The usage dict from ``ChatResponse.usage``.
+            cost: Cost in USD if known (e.g., from provider-reported cost).
+            provider: Provider identifier.
+            category: Registry category.
+            component_name: Component name.
+            kind: Component kind, defaults to ``"llm"``.
+
+        Returns:
+            A TokenUsage instance with mapped fields.
+        """
+        return cls(
+            cost=cost,
+            provider=provider,
+            category=category,
+            component_name=component_name,
+            kind=kind,
+            input_tokens=usage_dict.get("input_tokens", 0),
+            output_tokens=usage_dict.get("output_tokens", 0),
+            total_tokens=usage_dict.get("total_tokens", 0),
+            cached_input_tokens=usage_dict.get("cached_input_tokens", 0),
+            reasoning_tokens=usage_dict.get("reasoning_tokens", 0),
+            audio_tokens=usage_dict.get("audio_tokens", 0),
+        )
+
+
+class UsageTrackableMixin:
+    """Mixin that provides usage tracking capability to any component.
+
+    Classes that inherit from UsageTrackableMixin can be registered with a
+    Benchmark instance and will have their usage automatically collected
+    by the registry via `collect_usage()`.
+
+    The `gather_usage()` method provides a default implementation that returns
+    an empty `Usage`. Subclasses should override this to return their
+    accumulated usage data.
+
+    How to use:
+        For custom components that incur billable costs, inherit from
+        UsageTrackableMixin and override `gather_usage()`:
+
+        ```python
+        class MyPaidService(TraceableMixin, UsageTrackableMixin):
+            def __init__(self):
+                super().__init__()  # initialize cooperating mixin state
+                self._usage_records: list[Usage] = []
+
+            def call_api(self, query):
+                result = api.call(query)
+                self._usage_records.append(Usage(
+                    cost=result.cost,
+                    units={"api_calls": 1},
+                ))
+                return result
+
+            def gather_usage(self) -> Usage:
+                return sum(self._usage_records, Usage())
+        ```
+
+        Then register it with your benchmark:
+
+        ```python
+        service = MyPaidService()
+        benchmark.register("tools", "my_service", service)
+        ```
+
+    Thread Safety:
+        Usage collection happens synchronously in the main thread after
+        task execution completes. Components should use thread-safe data
+        structures when accumulating usage during concurrent execution,
+        but `gather_usage()` itself is called sequentially.
+    """
+
+    def gather_usage(self) -> Usage:
+        """Gather accumulated usage from this component.

        Provides a default implementation that returns an empty Usage.
        Subclasses should override this to return their accumulated
        usage data.

        Returns:
            Accumulated usage for this component.

        How to use:
            Override this method to return your component's usage:

            ```python
            def gather_usage(self) -> Usage:
                return sum(self._usage_records, Usage())
            ```
        """
        return Usage()
diff --git a/maseval/interface/cost.py b/maseval/interface/cost.py
new file mode 100644
index 00000000..178b0cc6
--- /dev/null
+++ b/maseval/interface/cost.py
@@ -0,0 +1,102 @@
"""Cost calculators that depend on optional third-party packages.

This module provides ``LiteLLMCostCalculator``, which uses LiteLLM's
bundled model pricing database to compute cost from token counts.

Requires: ``pip install litellm``
"""

from __future__ import annotations

from typing import Any, Dict, Optional

from maseval.core.cost import CostCalculator  # noqa: F401 — re-export protocol
from maseval.core.usage import TokenUsage


class LiteLLMCostCalculator:
    """Cost calculator using LiteLLM's bundled pricing database.

    LiteLLM maintains a comprehensive `model_prices_and_context_window.json
    <https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json>`_
    that covers most major LLM providers. This calculator delegates to
    ``litellm.cost_per_token`` for per-token rates and computes the total.

    This is the recommended calculator for most users — it covers OpenAI,
    Anthropic, Google, Mistral, Cohere, and many more without requiring
    manual pricing tables.

    Note:
        If you're already using the ``LiteLLMModelAdapter``, it extracts
        provider-reported cost from ``response._hidden_params.response_cost``
        automatically. This calculator is useful as a fallback when using
        other adapters (OpenAI, Anthropic, Google) directly.
+ + Example: + ```python + from maseval.interface.cost import LiteLLMCostCalculator + from maseval.interface.inference import OpenAIModelAdapter + + calculator = LiteLLMCostCalculator() + model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) + + # Cost is now computed automatically after each chat() call + response = model.chat([{"role": "user", "content": "Hello"}]) + print(model.gather_usage().cost) # e.g., 0.00123 + ``` + """ + + def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None): + """Initialize the LiteLLM cost calculator. + + Args: + custom_pricing: Optional overrides for specific models. Keys are + model IDs, values are dicts with ``"input_cost_per_token"`` + and ``"output_cost_per_token"``. These take precedence over + LiteLLM's built-in pricing. + """ + try: + import litellm # noqa: F401 + except ImportError as e: + raise ImportError("LiteLLMCostCalculator requires litellm. Install it with: pip install litellm") from e + + self._custom_pricing = custom_pricing or {} + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost using LiteLLM's pricing database. + + Args: + usage: Token usage from the call. + model_id: The model identifier (must match LiteLLM's naming). + + Returns: + Cost in USD, or ``None`` if LiteLLM doesn't have pricing for + this model and no custom pricing was provided. 
+ """ + # Check custom overrides first + if model_id in self._custom_pricing: + rates = self._custom_pricing[model_id] + input_cost = rates.get("input_cost_per_token", 0.0) * usage.input_tokens + output_cost = rates.get("output_cost_per_token", 0.0) * usage.output_tokens + return input_cost + output_cost + + # Fall back to LiteLLM's built-in pricing + try: + import litellm + + input_cost, output_cost = litellm.cost_per_token( + model=model_id, + prompt_tokens=usage.input_tokens, + completion_tokens=usage.output_tokens, + ) + return input_cost + output_cost + except Exception: + # Model not in LiteLLM's pricing database + return None + + def gather_config(self) -> Dict[str, Any]: + """Return calculator configuration for reproducibility.""" + return { + "type": type(self).__name__, + "custom_pricing": dict(self._custom_pricing) if self._custom_pricing else None, + } diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 1c6389ea..5e0c92d4 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -52,6 +52,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator from maseval.core.seeding import SeedingError @@ -76,6 +77,7 @@ def __init__( default_generation_params: Optional[Dict[str, Any]] = None, max_tokens: int = 4096, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize Anthropic model adapter. @@ -88,6 +90,8 @@ def __init__( parameter. Default is 4096. seed: Seed for deterministic generation. Note: Anthropic does NOT support seeding. Providing a seed will raise SeedingError. + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. Raises: SeedingError: If seed is provided (Anthropic doesn't support seeding). 
@@ -98,7 +102,7 @@ def __init__( f"Model '{model_id}' cannot use seed={seed}. " f"Remove the seed parameter or use a provider that supports seeding." ) - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -344,6 +348,10 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(response.usage, "output_tokens", 0), "total_tokens": (getattr(response.usage, "input_tokens", 0) + getattr(response.usage, "output_tokens", 0)), } + # Provider-specific detail + cached = getattr(response.usage, "cache_read_input_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/google_genai.py b/maseval/interface/inference/google_genai.py index 8a38a281..8ceb466a 100644 --- a/maseval/interface/inference/google_genai.py +++ b/maseval/interface/inference/google_genai.py @@ -47,6 +47,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class GoogleGenAIModelAdapter(ModelAdapter): @@ -65,6 +66,7 @@ def __init__( model_id: str, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize Google GenAI model adapter. @@ -74,8 +76,10 @@ def __init__( default_generation_params: Default parameters for all calls. Common parameters: temperature, max_output_tokens, top_p. seed: Seed for deterministic generation. Google GenAI supports this. + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -291,6 +295,10 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(um, "candidates_token_count", 0), "total_tokens": getattr(um, "total_token_count", 0), } + # Provider-specific detail + thoughts = getattr(um, "thoughts_token_count", 0) + if thoughts: + usage["reasoning_tokens"] = thoughts # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index 45fac7e8..f28cc293 100644 --- a/maseval/interface/inference/huggingface.py +++ b/maseval/interface/inference/huggingface.py @@ -34,6 +34,7 @@ from typing import Any, Optional, Dict, List, Callable, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class ToolCallingNotSupportedError(Exception): @@ -65,6 +66,7 @@ def __init__( model_id: Optional[str] = None, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize HuggingFace model adapter. @@ -78,8 +80,10 @@ def __init__( Common parameters: max_new_tokens, temperature, top_p, do_sample. seed: Seed for deterministic generation. Sets the random seed before each generation call using transformers.set_seed(). + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._model = model self._model_id = model_id or getattr(model, "name_or_path", "huggingface:unknown") self._default_generation_params = default_generation_params or {} diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index ed932247..a13fcd6d 100644 --- a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -44,6 +44,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class LiteLLMModelAdapter(ModelAdapter): @@ -70,6 +71,7 @@ def __init__( api_key: Optional[str] = None, api_base: Optional[str] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize LiteLLM model adapter. @@ -87,8 +89,12 @@ def __init__( api_base: Custom API base URL for self-hosted or Azure endpoints. seed: Seed for deterministic generation. LiteLLM passes this to the underlying provider. Note: Not all providers support seeding. + cost_calculator: Optional cost calculator for computing cost from + token counts. Note: LiteLLM already reports cost via + ``response._hidden_params.response_cost`` for most models, + so a calculator is only needed as a fallback or override. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._model_id = model_id self._default_generation_params = default_generation_params or {} self._api_key = api_key @@ -176,6 +182,23 @@ def _chat_impl( "output_tokens": getattr(response.usage, "completion_tokens", 0), "total_tokens": getattr(response.usage, "total_tokens", 0), } + # Provider-specific detail + completion_details = getattr(response.usage, "completion_tokens_details", None) + if completion_details: + reasoning = getattr(completion_details, "reasoning_tokens", 0) + if reasoning: + usage["reasoning_tokens"] = reasoning + prompt_details = getattr(response.usage, "prompt_tokens_details", None) + if prompt_details: + cached = getattr(prompt_details, "cached_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached + # LiteLLM provider-reported cost + hidden = getattr(response, "_hidden_params", None) + if hidden and isinstance(hidden, dict): + cost = hidden.get("response_cost") + if isinstance(cost, (int, float)): + usage["cost"] = cost return ChatResponse( content=message.content, diff --git a/maseval/interface/inference/openai.py b/maseval/interface/inference/openai.py index 5855b12d..fd31fd44 100644 --- a/maseval/interface/inference/openai.py +++ b/maseval/interface/inference/openai.py @@ -50,6 +50,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class OpenAIModelAdapter(ModelAdapter): @@ -70,6 +71,7 @@ def __init__( model_id: str, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize OpenAI model adapter. @@ -81,8 +83,10 @@ def __init__( Common parameters: temperature, max_tokens, top_p. seed: Seed for deterministic generation. OpenAI supports this natively. Note: Determinism is best-effort, not guaranteed by OpenAI. 
+ cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. """ - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -209,6 +213,20 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(response.usage, "completion_tokens", 0), "total_tokens": getattr(response.usage, "total_tokens", 0), } + # Provider-specific detail + completion_details = getattr(response.usage, "completion_tokens_details", None) + if completion_details: + reasoning = getattr(completion_details, "reasoning_tokens", 0) + if reasoning: + usage["reasoning_tokens"] = reasoning + audio = getattr(completion_details, "audio_tokens", 0) + if audio: + usage["audio_tokens"] = audio + prompt_details = getattr(response.usage, "prompt_tokens_details", None) + if prompt_details: + cached = getattr(prompt_details, "cached_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached return ChatResponse( content=message.content, diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md index f9b45133..91683ee0 100644 --- a/usage_tracking/PLAN.md +++ b/usage_tracking/PLAN.md @@ -30,13 +30,15 @@ Generic usage record for any billable resource. Stored as a simple dataclass. 
``` Usage - cost: Optional[float] # Total cost in USD (None = unknown) - cost_details: Dict[str, float] # Breakdown (e.g., {"input": 0.01, "output": 0.03}) - units: Dict[str, int | float] # Arbitrary countable units (e.g., {"api_calls": 3, "bytes": 1024}) - metadata: Dict[str, Any] # Provider-specific extras + cost: Optional[float] # Total cost in USD (None = unknown) + units: Dict[str, int | float] # Countable units (e.g., {"api_calls": 3, "bytes": 1024}) + provider: Optional[str] # e.g., "anthropic", "openai", "bloomberg" + category: Optional[str] # e.g., "models", "evaluator_models", "tools" + component_name: Optional[str] # e.g., "main_model", "judge", "bloomberg_api" + kind: Optional[str] # e.g., "llm", "service", "local" ``` -Supports `__add__` to sum two records (costs sum if both known, else None; units sum; metadata merges). +Supports `__add__`: costs sum (if both known, else None), units sum. Grouping fields (`provider`, `category`, `component_name`, `kind`) are preserved when they match, set to `None` on mismatch. `None` means "aggregated over" — e.g., `provider=None, category="models"` represents all models summed across providers. A fully `None` grouping is a grand total. ### `TokenUsage(Usage)` (LLM-specific) @@ -289,29 +291,82 @@ These already hold a `ModelAdapter`. 
Their model's usage is collected automatica | File | Action | Content | |------|--------|---------| | `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` | +| `maseval/core/cost.py` | **Create** | `CostCalculator` protocol, `StaticPricingCalculator` | | `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property | -| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()` | +| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()`, accept `cost_calculator` param | | `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict | | `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility | -| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens` | -| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens` | -| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens` | -| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost` | -| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `UsageReporter` | -| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation | +| `maseval/interface/cost.py` | **Create** | `LiteLLMCostCalculator` (optional `litellm` dependency) | +| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens`; accept `cost_calculator` | +| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens`; accept 
`cost_calculator` | +| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens`; accept `cost_calculator` | +| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost`; accept `cost_calculator` | +| `maseval/interface/inference/huggingface.py` | **Edit** | Accept `cost_calculator` | +| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `CostCalculator`, `StaticPricingCalculator`, `UsageReporter` | +| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation, cost calculators | No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`. --- +## Cost Calculation + +Most LLM APIs return token counts but **not** cost. Cost calculation is a client-side concern. + +### CostCalculator protocol + +A `CostCalculator` is a simple protocol with one method: + +```python +class CostCalculator(Protocol): + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: ... +``` + +`ModelAdapter` accepts an optional `cost_calculator` parameter. After each `chat()` call, if the provider didn't report cost and a calculator is present, the calculator fills in `TokenUsage.cost`. Provider-reported cost always takes precedence. + +### Built-in implementations + +| Calculator | Location | Dependencies | Use case | +|-----------|----------|-------------|----------| +| `StaticPricingCalculator` | `maseval.core.cost` | None | User-supplied per-model rates. Supports custom units (USD, EUR, credits). | +| `LiteLLMCostCalculator` | `maseval.interface.cost` | `litellm` | Automatic pricing via LiteLLM's bundled model database. Covers OpenAI, Anthropic, Google, Mistral, etc. | + +### Cost flow (priority order) + +1. **Provider-reported cost** — e.g., LiteLLM's `response._hidden_params.response_cost`. Set directly in `ChatResponse.usage["cost"]`. +2. 
**CostCalculator** — if no provider cost, `ModelAdapter.chat()` calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `Usage.cost` stays `None`. + +### Examples + +```python +# Static pricing for a university cluster (credits per token) +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, +}) + +# Automatic pricing via LiteLLM's database +from maseval.interface.cost import LiteLLMCostCalculator +calculator = LiteLLMCostCalculator() + +# Pass to any model adapter +model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) +``` + +### Non-LLM components + +Non-LLM components (tools, environments) set cost directly in their `gather_usage()` implementation — there is no calculator involvement. Each component knows its own billing model. + +--- + ## Non-goals -- **Hardcoded pricing tables** — prices change too often; user-supplied or provider-reported only. +- **Hardcoded pricing tables** — prices change too often; delegated to LiteLLM or user-supplied. - **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now. - **Billing integration** — no webhook/billing system integration. - **Streaming usage** — not supported yet (usage is captured after completion). +- **Currency conversion** — `Usage.cost` is a bare float in whatever unit the calculator uses. Mixing units in one benchmark is a user error. ## Open Questions -1. **Pricing config format**: Should pricing be passed to `ModelAdapter.__init__` as a new param, or set externally after construction? Leaning toward a `pricing` kwarg on adapter init for ergonomics. When `pricing` is provided and a `TokenUsage` record has `cost=None`, cost is computed from `pricing["input"] * input_tokens + pricing["output"] * output_tokens`. -2. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. 
+1. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. From dd4864f76f6253df36cb42e04daafd6d3831cc2c Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 18:23:35 +0100 Subject: [PATCH 03/19] updated cost tracking --- maseval/interface/cost.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/maseval/interface/cost.py b/maseval/interface/cost.py index 178b0cc6..77d4f48d 100644 --- a/maseval/interface/cost.py +++ b/maseval/interface/cost.py @@ -46,7 +46,11 @@ class LiteLLMCostCalculator: ``` """ - def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None): + def __init__( + self, + custom_pricing: Optional[Dict[str, Dict[str, float]]] = None, + model_id_map: Optional[Dict[str, str]] = None, + ): """Initialize the LiteLLM cost calculator. Args: @@ -54,6 +58,18 @@ def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None) model IDs, values are dicts with ``"input_cost_per_token"`` and ``"output_cost_per_token"``. These take precedence over LiteLLM's built-in pricing. + model_id_map: Optional mapping from adapter model IDs to LiteLLM + model IDs. Use this when your adapter's ``model_id`` doesn't + match LiteLLM's naming convention — e.g., when using Google's + OpenAI-compatible endpoint where the adapter sees + ``"gemini-2.0-flash"`` but LiteLLM expects + ``"gemini/gemini-2.0-flash"``. 
+ + Example:: + + LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + }) """ try: import litellm # noqa: F401 From 4ab0efe80a263879586bda122c0ca81479baee65 Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 19:55:59 +0100 Subject: [PATCH 04/19] updated litellm cost calculator --- docs/guides/index.md | 1 + docs/guides/usage-tracking.md | 309 ++++++++++++++++++++ docs/reference/usage.md | 31 ++ maseval/__init__.py | 2 +- maseval/core/cost.py | 132 --------- maseval/core/model.py | 3 +- maseval/core/usage.py | 128 +++++++- maseval/interface/inference/anthropic.py | 2 +- maseval/interface/inference/google_genai.py | 2 +- maseval/interface/inference/huggingface.py | 2 +- maseval/interface/inference/litellm.py | 2 +- maseval/interface/inference/openai.py | 2 +- maseval/interface/{cost.py => usage.py} | 16 +- mkdocs.yml | 2 + 14 files changed, 486 insertions(+), 148 deletions(-) create mode 100644 docs/guides/usage-tracking.md create mode 100644 docs/reference/usage.md delete mode 100644 maseval/core/cost.py rename maseval/interface/{cost.py => usage.py} (87%) diff --git a/docs/guides/index.md b/docs/guides/index.md index f659e5f4..9ff77ba2 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices. 
| [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility | | [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures | | [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds | +| [Usage & Cost Tracking](usage-tracking.md) | Track token usage and compute cost across providers | diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md new file mode 100644 index 00000000..6b249a05 --- /dev/null +++ b/docs/guides/usage-tracking.md @@ -0,0 +1,309 @@ +# Usage & Cost Tracking + +## Overview + +MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for: + +- **Cost control**: Track how much each benchmark run costs across providers +- **Budgeting**: Compare cost across models, tasks, and components +- **Billing**: Support custom credit systems (university clusters, internal APIs) +- **Analysis**: Understand token usage patterns per task, agent, or model + +!!! info "Usage vs Cost" + + **Usage** = Token counts and arbitrary resource units (API calls, data points, etc.) + + **Cost** = Monetary value computed from usage (USD, EUR, credits, etc.) + + Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator. + +## Core Concepts + +**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata. + +**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.). + +**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`. + +**`CostCalculator`**: Protocol for pluggable cost computation from token counts. + +## Automatic LLM Usage Tracking + +All `ModelAdapter` subclasses track token usage automatically. 
No configuration needed — every `chat()` call records a `TokenUsage` entry internally. + +```python +from maseval.interface.inference import OpenAIModelAdapter + +model = OpenAIModelAdapter(client=client, model_id="gpt-4") + +# Make some calls +model.chat([{"role": "user", "content": "Hello"}]) +model.chat([{"role": "user", "content": "How are you?"}]) + +# Inspect accumulated usage +usage = model.gather_usage() +print(usage.input_tokens) # e.g., 25 +print(usage.output_tokens) # e.g., 42 +print(usage.cost) # None (no cost calculator configured) +``` + +### In Benchmarks + +Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key: + +```python +results = benchmark.run() + +for report in results: + print(f"Task {report['task_id']}: {report['usage']}") +``` + +Live running totals are available during execution: + +```python +benchmark.usage # -> Usage (grand total across all tasks) +benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) +``` + +## Cost Calculation + +Most LLM APIs return token counts but not cost. Cost is a client-side concern. MASEval provides two built-in cost calculators and a protocol for custom ones. + +### Cost Priority + +When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order: + +1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `Usage.cost` stays `None`. + +### StaticPricingCalculator + +Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`. 
+ +```python +from maseval import StaticPricingCalculator + +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, +}) + +model = OpenAIModelAdapter( + client=client, + model_id="gpt-4", + cost_calculator=calculator, +) + +response = model.chat([{"role": "user", "content": "Hello"}]) +print(model.gather_usage().cost) # e.g., 0.00234 +``` + +Pricing is per token (not per 1K or 1M). Cached input tokens are handled automatically — set a `"cached_input"` rate to differentiate: + +```python +calculator = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.000003, + "output": 0.000015, + "cached_input": 0.0000003, # 10x cheaper for cached tokens + }, +}) +``` + +For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents: + +```python +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token +}) +``` + +### LiteLLMCostCalculator + +Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more. + +```python +from maseval.interface.usage import LiteLLMCostCalculator + +calculator = LiteLLMCostCalculator() + +model = OpenAIModelAdapter( + client=client, + model_id="gpt-4", + cost_calculator=calculator, +) +``` + +!!! tip "LiteLLMModelAdapter already reports cost" + + If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
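Tying the pieces together, the three-step priority order can be sketched as a small standalone resolver. Everything below is illustrative: `resolve_cost` and `FlatRateCalculator` are hypothetical names, plain dicts stand in for `TokenUsage`, and the real resolution lives inside `ModelAdapter.chat()`.

```python
from typing import Dict, Optional


class FlatRateCalculator:
    """Toy calculator charging one flat per-token rate (illustration only)."""

    def __init__(self, rate_per_token: float) -> None:
        self.rate = rate_per_token

    def calculate_cost(self, usage: Dict[str, int], model_id: str) -> Optional[float]:
        # Same shape as the CostCalculator protocol, with a dict standing in for TokenUsage.
        return self.rate * (usage["input_tokens"] + usage["output_tokens"])


def resolve_cost(provider_cost, calculator, usage, model_id):
    if provider_cost is not None:
        return provider_cost  # 1. provider-reported cost always wins
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)  # 2. fall back to a calculator
    return None  # 3. otherwise cost stays unknown


usage = {"input_tokens": 100, "output_tokens": 50}
calc = FlatRateCalculator(rate_per_token=0.5)

resolve_cost(0.42, calc, usage, "gpt-4")  # -> 0.42 (provider-reported wins)
resolve_cost(None, calc, usage, "gpt-4")  # -> 75.0 (150 tokens at 0.5 each)
resolve_cost(None, None, usage, "gpt-4")  # -> None
```

This also shows why a calculator attached to `LiteLLMModelAdapter` rarely fires: step 1 usually succeeds there.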
+ +#### Custom Pricing Overrides + +Override pricing for specific models while using LiteLLM's database for the rest: + +```python +calculator = LiteLLMCostCalculator(custom_pricing={ + "my-finetuned-gpt4": { + "input_cost_per_token": 0.00006, + "output_cost_per_token": 0.00012, + }, +}) +``` + +#### Model ID Remapping + +When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap: + +```python +calculator = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + "my-custom-gpt4": "gpt-4", +}) +``` + +The map is applied before both custom pricing and LiteLLM lookup. + +### Custom Cost Calculator + +Implement the `CostCalculator` protocol for custom pricing logic: + +```python +from maseval import CostCalculator, TokenUsage +from typing import Optional + +class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING_TABLE.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens +``` + +The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model. + +### Sharing Calculators Across Adapters + +A single calculator instance can be shared across multiple model adapters. 
The `model_id` is passed on each call, so the calculator can look up the right pricing:

```python
calculator = StaticPricingCalculator({
    "gpt-4": {"input": 0.00003, "output": 0.00006},
    "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
})

model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)
model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator)
```

## Non-LLM Usage Tracking

Tools, environments, and other components can track usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`:

```python
from maseval import Usage, UsageTrackableMixin

class BloombergEnvironment(Environment, UsageTrackableMixin):
    def __init__(self, task_data):
        super().__init__(task_data)
        self._usage_records = []

    def _call_bloomberg(self, query):
        result = bloomberg_client.query(query)
        self._usage_records.append(Usage(
            cost=result.billed_amount,
            units={"api_calls": 1, "data_points": result.count},
            provider="bloomberg",
            kind="service",
        ))
        return result

    def gather_usage(self) -> Usage:
        if not self._usage_records:
            return Usage()
        return sum(self._usage_records, Usage())
```

Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model.
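When records like these are summed, the `Usage` addition rules apply: costs sum only when both sides are known, units sum key-wise, and grouping fields collapse to `None` on mismatch. A minimal MASEval-independent sketch of those rules, with a single grouping field for brevity (the real class carries several):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Usage:
    """Stripped-down stand-in for maseval's Usage, showing only the addition rules."""
    cost: Optional[float] = None
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None  # one grouping field; the real class has several

    def __add__(self, other: "Usage") -> "Usage":
        # Costs sum only when both sides are known; otherwise the total is unknown.
        if self.cost is not None and other.cost is not None:
            cost = self.cost + other.cost
        else:
            cost = None
        # Units sum key-wise.
        units = dict(self.units)
        for key, value in other.units.items():
            units[key] = units.get(key, 0) + value
        # Grouping fields survive a match, collapse to None ("aggregated over") otherwise.
        provider = self.provider if self.provider == other.provider else None
        return Usage(cost=cost, units=units, provider=provider)


a = Usage(cost=0.25, units={"api_calls": 1}, provider="bloomberg")
b = Usage(cost=0.75, units={"api_calls": 2}, provider="bloomberg")
c = Usage(cost=None, units={"api_calls": 1}, provider="openai")

total = a + b        # cost=1.0, units={"api_calls": 3}, provider="bloomberg"
unknown = total + c  # cost=None (one side unknown), provider=None (mismatch)
```

This collapsing behavior is what makes a grand total, with every grouping field `None`, well-defined.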
+ +## Post-hoc Analysis with UsageReporter + +`UsageReporter` provides sliced analysis across all benchmark reports: + +```python +from maseval import UsageReporter + +reporter = UsageReporter.from_reports(benchmark.reports) + +# Grand total +total = reporter.total() +print(f"Total cost: ${total.cost:.4f}") +print(f"Total tokens: {total.input_tokens + total.output_tokens}") + +# Per-task breakdown +for task_id, usage in reporter.by_task().items(): + print(f" {task_id}: ${usage.cost:.4f}") + +# Per-component breakdown +for component, usage in reporter.by_component().items(): + print(f" {component}: ${usage.cost:.4f}") + +# Full nested summary dict +summary = reporter.summary() +``` + +## Usage Data Model + +### Usage + +Generic record for any billable resource: + +| Field | Type | Description | +|-------|------|-------------| +| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. | +| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). | +| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). | +| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). | +| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). | +| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). | + +`Usage` supports addition: costs sum (both known) or become `None` (either unknown), units sum, grouping fields are preserved on match or set to `None` on mismatch. + +### TokenUsage + +Extends `Usage` with LLM-specific token counts: + +| Field | Type | Description | +|-------|------|-------------| +| `input_tokens` | `int` | Input/prompt tokens. | +| `output_tokens` | `int` | Output/completion tokens. | +| `total_tokens` | `int` | Total tokens. | +| `cached_input_tokens` | `int` | Tokens served from cache. | +| `reasoning_tokens` | `int` | Reasoning/thinking tokens. | +| `audio_tokens` | `int` | Audio processing tokens. 
| + +## Evaluator Usage + +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically: + +```python +class MyBenchmark(Benchmark): + def setup_evaluators(self, task, environment): + judge_model = OpenAIModelAdapter(client=client, model_id="gpt-4") + self.register("evaluator_models", "judge", judge_model) + return [MyLLMEvaluator(judge_model)] +``` + +The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage. + +## Tips + +**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates. + +**For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming. + +**For failed tasks**: Usage is collected before error status is determined, so partial usage from failed tasks is still tracked. + +**For live monitoring**: Access `benchmark.usage` during execution to check running totals. diff --git a/docs/reference/usage.md b/docs/reference/usage.md new file mode 100644 index 00000000..2326aaef --- /dev/null +++ b/docs/reference/usage.md @@ -0,0 +1,31 @@ +# Usage & Cost Tracking + +Usage and cost tracking provides data classes for recording resource consumption, a mixin for automatic collection, and pluggable cost calculators. + +See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage patterns and examples. 
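For orientation before the API listings, the arithmetic implemented by `StaticPricingCalculator` (flat per-token rates, with cached input billed at its own optional rate) reduces to a few lines. The `static_cost` helper below is a hypothetical sketch of that formula, not part of the maseval API:

```python
from typing import Dict, Optional


def static_cost(
    rates: Optional[Dict[str, float]],
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> Optional[float]:
    # Mirrors StaticPricingCalculator.calculate_cost: rates are per token
    # (not per 1K/1M tokens); unknown models yield None.
    if rates is None:
        return None
    input_rate = rates.get("input", 0.0)
    output_rate = rates.get("output", 0.0)
    cached_rate = rates.get("cached_input", input_rate)  # defaults to the input rate
    non_cached_input = max(0, input_tokens - cached_input_tokens)
    return (
        non_cached_input * input_rate
        + cached_input_tokens * cached_rate
        + output_tokens * output_rate
    )


# 1,000 input tokens (400 of them cached) and 500 output tokens:
cost = static_cost({"input": 0.00003, "output": 0.00006}, 1000, 500, 400)
print(round(cost, 6))  # 0.06
```

A cheaper `"cached_input"` rate then affects only the cached share: `static_cost({"input": 0.00003, "output": 0.00006, "cached_input": 0.000003}, 1000, 0, 1000)` bills all 1,000 input tokens at the cached rate.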
+ +## Core + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/usage.py){ .md-source-file } + +::: maseval.core.usage.Usage + +::: maseval.core.usage.TokenUsage + +::: maseval.core.usage.UsageTrackableMixin + +::: maseval.core.usage.CostCalculator + +::: maseval.core.usage.StaticPricingCalculator + +## Reporting + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/reporting.py){ .md-source-file } + +::: maseval.core.reporting.UsageReporter + +## Interface + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/usage.py){ .md-source-file } + +::: maseval.interface.usage.LiteLLMCostCalculator diff --git a/maseval/__init__.py b/maseval/__init__.py index addedf3e..460bf1a9 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -38,7 +38,7 @@ from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin from .core.usage import Usage, TokenUsage, UsageTrackableMixin -from .core.cost import CostCalculator, StaticPricingCalculator +from .core.usage import CostCalculator, StaticPricingCalculator from .core.reporting import UsageReporter from .core.registry import ComponentRegistry from .core.context import TaskContext diff --git a/maseval/core/cost.py b/maseval/core/cost.py deleted file mode 100644 index a6eb8583..00000000 --- a/maseval/core/cost.py +++ /dev/null @@ -1,132 +0,0 @@ -"""Pluggable cost calculation for usage records. - -This module provides the ``CostCalculator`` protocol and a built-in -``StaticPricingCalculator`` that computes cost from token counts and -user-supplied pricing tables. For automatic pricing via LiteLLM's -bundled model database, see ``maseval.interface.cost``. 
- -Cost calculators are optional — if no calculator is provided to a -``ModelAdapter``, cost is only set when the provider reports it directly -(e.g., LiteLLM's ``response._hidden_params.response_cost``). -""" - -from __future__ import annotations - -from typing import Any, Dict, Optional, Protocol, runtime_checkable - -from .usage import TokenUsage - - -@runtime_checkable -class CostCalculator(Protocol): - """Protocol for computing cost from token usage. - - Implementations receive a ``TokenUsage`` and the model ID, and return - the cost in whatever unit the calculator declares (typically USD). - - Example: - ```python - class MyCostCalculator: - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - rate = MY_PRICING.get(model_id) - if rate is None: - return None - return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens - ``` - """ - - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - """Compute cost for a single chat call. - - Args: - usage: Token usage from the call. - model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). - - Returns: - Cost as a float, or ``None`` if pricing is unknown for this model. - """ - ... - - -class StaticPricingCalculator: - """Cost calculator using user-supplied per-model pricing. - - Pricing is specified as cost per token (not per 1K or 1M tokens). - If a model is not in the pricing table, ``calculate_cost`` returns ``None``. - - Args: - pricing: Dict mapping model IDs to their per-token rates. 
- Each value is a dict with keys: - - - ``"input"`` — cost per input token (required) - - ``"output"`` — cost per output token (required) - - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) - - Example: - ```python - calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, - }) - - model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) - ``` - - For university clusters or custom credit systems, the "cost" unit - is whatever the pricing values represent (credits, EUR, etc.): - - ```python - calculator = StaticPricingCalculator({ - "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token - }) - ``` - """ - - def __init__(self, pricing: Dict[str, Dict[str, float]]): - self._pricing = pricing - - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - """Compute cost from static per-token rates. - - Args: - usage: Token usage from the call. - model_id: The model identifier to look up in the pricing table. - - Returns: - Computed cost, or ``None`` if the model is not in the pricing table. - """ - rates = self._pricing.get(model_id) - if rates is None: - return None - - input_rate = rates.get("input", 0.0) - output_rate = rates.get("output", 0.0) - cached_rate = rates.get("cached_input", input_rate) - - # Non-cached input tokens = total input - cached - non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) - - cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate - - return cost - - def add_model(self, model_id: str, rates: Dict[str, float]) -> None: - """Add or update pricing for a model. - - Args: - model_id: The model identifier. - rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). 
- """ - self._pricing[model_id] = rates - - @property - def models(self) -> list[str]: - """List of model IDs with pricing configured.""" - return list(self._pricing.keys()) - - def gather_config(self) -> Dict[str, Any]: - """Return pricing configuration for reproducibility.""" - return { - "type": type(self).__name__, - "pricing": dict(self._pricing), - } diff --git a/maseval/core/model.py b/maseval/core/model.py index f33e48f4..110d9879 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -55,8 +55,7 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin -from .usage import Usage, TokenUsage, UsageTrackableMixin -from .cost import CostCalculator +from .usage import Usage, TokenUsage, UsageTrackableMixin, CostCalculator from .history import MessageHistory diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 78edeab9..aa2c2e08 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -1,19 +1,26 @@ """Core usage tracking infrastructure for API cost and resource monitoring. This module provides the `Usage` and `TokenUsage` data classes for recording -billable resource consumption, and the `UsageTrackableMixin` that enables -automatic usage collection through the component registry. +billable resource consumption, the `UsageTrackableMixin` that enables +automatic usage collection through the component registry, and pluggable +cost calculators (`CostCalculator`, `StaticPricingCalculator`) for translating +token counts into monetary cost. Usage tracking is a first-class collection axis alongside tracing (`TraceableMixin`) and configuration (`ConfigurableMixin`). Components that inherit `UsageTrackableMixin` have their usage automatically collected by the registry via `gather_usage()`. + +Cost calculators are optional — if no calculator is provided to a +``ModelAdapter``, cost is only set when the provider reports it directly +(e.g., LiteLLM's ``response._hidden_params.response_cost``). 
For automatic +pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. """ from __future__ import annotations from dataclasses import dataclass, field -from typing import Any, Optional, Dict +from typing import Any, Dict, Optional, Protocol, runtime_checkable @dataclass @@ -299,3 +306,118 @@ def gather_usage(self) -> Usage: ``` """ return Usage() + + +@runtime_checkable +class CostCalculator(Protocol): + """Protocol for computing cost from token usage. + + Implementations receive a ``TokenUsage`` and the model ID, and return + the cost in whatever unit the calculator declares (typically USD). + + Example: + ```python + class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens + ``` + """ + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost for a single chat call. + + Args: + usage: Token usage from the call. + model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). + + Returns: + Cost as a float, or ``None`` if pricing is unknown for this model. + """ + ... + + +class StaticPricingCalculator: + """Cost calculator using user-supplied per-model pricing. + + Pricing is specified as cost per token (not per 1K or 1M tokens). + If a model is not in the pricing table, ``calculate_cost`` returns ``None``. + + Args: + pricing: Dict mapping model IDs to their per-token rates. 
+ Each value is a dict with keys: + + - ``"input"`` — cost per input token (required) + - ``"output"`` — cost per output token (required) + - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + + Example: + ```python + calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, + }) + + model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) + ``` + + For university clusters or custom credit systems, the "cost" unit + is whatever the pricing values represent (credits, EUR, etc.): + + ```python + calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token + }) + ``` + """ + + def __init__(self, pricing: Dict[str, Dict[str, float]]): + self._pricing = pricing + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost from static per-token rates. + + Args: + usage: Token usage from the call. + model_id: The model identifier to look up in the pricing table. + + Returns: + Computed cost, or ``None`` if the model is not in the pricing table. + """ + rates = self._pricing.get(model_id) + if rates is None: + return None + + input_rate = rates.get("input", 0.0) + output_rate = rates.get("output", 0.0) + cached_rate = rates.get("cached_input", input_rate) + + # Non-cached input tokens = total input - cached + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + + cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + + return cost + + def add_model(self, model_id: str, rates: Dict[str, float]) -> None: + """Add or update pricing for a model. + + Args: + model_id: The model identifier. + rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). 
+ """ + self._pricing[model_id] = rates + + @property + def models(self) -> list[str]: + """List of model IDs with pricing configured.""" + return list(self._pricing.keys()) + + def gather_config(self) -> Dict[str, Any]: + """Return pricing configuration for reproducibility.""" + return { + "type": type(self).__name__, + "pricing": dict(self._pricing), + } diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 5e0c92d4..5c816e76 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -52,7 +52,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator from maseval.core.seeding import SeedingError diff --git a/maseval/interface/inference/google_genai.py b/maseval/interface/inference/google_genai.py index 8ceb466a..5bbf33ce 100644 --- a/maseval/interface/inference/google_genai.py +++ b/maseval/interface/inference/google_genai.py @@ -47,7 +47,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class GoogleGenAIModelAdapter(ModelAdapter): diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index f28cc293..9de0c7df 100644 --- a/maseval/interface/inference/huggingface.py +++ b/maseval/interface/inference/huggingface.py @@ -34,7 +34,7 @@ from typing import Any, Optional, Dict, List, Callable, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class ToolCallingNotSupportedError(Exception): diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index a13fcd6d..ce5385e7 100644 --- 
a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -44,7 +44,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class LiteLLMModelAdapter(ModelAdapter): diff --git a/maseval/interface/inference/openai.py b/maseval/interface/inference/openai.py index fd31fd44..ff0c4245 100644 --- a/maseval/interface/inference/openai.py +++ b/maseval/interface/inference/openai.py @@ -50,7 +50,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class OpenAIModelAdapter(ModelAdapter): diff --git a/maseval/interface/cost.py b/maseval/interface/usage.py similarity index 87% rename from maseval/interface/cost.py rename to maseval/interface/usage.py index 77d4f48d..87070f13 100644 --- a/maseval/interface/cost.py +++ b/maseval/interface/usage.py @@ -1,4 +1,4 @@ -"""Cost calculators that depend on optional third-party packages. +"""Usage and cost utilities that depend on optional third-party packages. This module provides ``LiteLLMCostCalculator``, which uses LiteLLM's bundled model pricing database to compute cost from token counts. 
@@ -10,8 +10,7 @@ from typing import Any, Dict, Optional -from maseval.core.cost import CostCalculator # noqa: F401 — re-export protocol -from maseval.core.usage import TokenUsage +from maseval.core.usage import CostCalculator, TokenUsage # noqa: F401 — re-export protocol class LiteLLMCostCalculator: @@ -34,7 +33,7 @@ class LiteLLMCostCalculator: Example: ```python - from maseval.interface.cost import LiteLLMCostCalculator + from maseval.interface.usage import LiteLLMCostCalculator from maseval.interface.inference import OpenAIModelAdapter calculator = LiteLLMCostCalculator() @@ -77,18 +76,24 @@ def __init__( raise ImportError("LiteLLMCostCalculator requires litellm. Install it with: pip install litellm") from e self._custom_pricing = custom_pricing or {} + self._model_id_map = model_id_map or {} def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: """Compute cost using LiteLLM's pricing database. Args: usage: Token usage from the call. - model_id: The model identifier (must match LiteLLM's naming). + model_id: The model identifier. Remapped via ``model_id_map`` + if configured, then looked up in custom pricing and + LiteLLM's database. Returns: Cost in USD, or ``None`` if LiteLLM doesn't have pricing for this model and no custom pricing was provided. 
""" + # Remap model ID if configured + model_id = self._model_id_map.get(model_id, model_id) + # Check custom overrides first if model_id in self._custom_pricing: rates = self._custom_pricing[model_id] @@ -115,4 +120,5 @@ def gather_config(self) -> Dict[str, Any]: return { "type": type(self).__name__, "custom_pricing": dict(self._custom_pricing) if self._custom_pricing else None, + "model_id_map": dict(self._model_id_map) if self._model_id_map else None, } diff --git a/mkdocs.yml b/mkdocs.yml index 4b489f50..6cbce841 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -95,6 +95,7 @@ nav: - Configuration Gathering: guides/config-gathering.md - Exception Handling: guides/exception-handling.md - Seeding: guides/seeding.md + - Usage & Cost Tracking: guides/usage-tracking.md - Examples: - examples/index.md - Tiny Tutorial: examples/tutorial.ipynb @@ -113,6 +114,7 @@ nav: - Seeding: reference/seeding.md - Simulator: reference/simulator.md - Tasks: reference/task.md + - Usage & Cost: reference/usage.md - User: reference/user.md - Utilities: reference/utilities.md - Interface: From afe16a71488802a5fde38330e50e02a22b9b7728 Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 14:30:09 +0100 Subject: [PATCH 05/19] - updated examples - moved reporting to usage --- CHANGELOG.md | 16 +- docs/reference/usage.md | 6 +- .../five_a_day_benchmark.ipynb | 69 ++------ .../five_a_day_benchmark.py | 22 ++- maseval/__init__.py | 3 +- maseval/core/reporting.py | 148 ------------------ maseval/core/usage.py | 144 ++++++++++++++++- 7 files changed, 192 insertions(+), 216 deletions(-) delete mode 100644 maseval/core/reporting.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 6f74dab6..98a6f0d3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** +- Usage and cost tracking as a first-class collection axis alongside tracing and configuration. 
`Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) +- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) +- `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) +- `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) +- Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) +- `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. (PR: #45) + - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42) - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. 
(PR: #42) @@ -39,10 +46,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Examples** +- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45) + - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34) - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28) - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26) +**Documentation** + +- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) covering automatic LLM tracking, cost calculators, non-LLM usage, post-hoc analysis with `UsageReporter`, and the data model. (PR: #45) +- Usage & Cost reference page (`docs/reference/usage.md`) with API documentation for all usage and cost classes. (PR: #45) + **Core** - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) @@ -108,8 +122,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - `LangGraphUser` → `LangGraphLLMUser` - `LlamaIndexUser` → `LlamaIndexLLMUser` -**Documentation** - - All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. 
(PR: #39) **Testing** diff --git a/docs/reference/usage.md b/docs/reference/usage.md index 2326aaef..460769bf 100644 --- a/docs/reference/usage.md +++ b/docs/reference/usage.md @@ -18,11 +18,7 @@ See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage pat ::: maseval.core.usage.StaticPricingCalculator -## Reporting - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/reporting.py){ .md-source-file } - -::: maseval.core.reporting.UsageReporter +::: maseval.core.usage.UsageReporter ## Interface diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb index 1eebd71f..2e4a6375 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb @@ -651,64 +651,25 @@ " print(f\"{k:<35} {v}\")" ] }, + { + "cell_type": "markdown", + "id": "tspsj2zzdyo", + "source": "### 2.7 Usage & Cost Tracking\n\nMASEval automatically tracks token usage for every LLM call made during benchmark execution. Each report includes a `\"usage\"` key with per-component breakdowns, and the benchmark maintains running totals across all tasks.\n\nFor cost estimation, pass a `CostCalculator` to your model adapters. MASEval ships two built-in calculators:\n\n- **`StaticPricingCalculator`** — user-supplied per-token rates (no dependencies)\n- **`LiteLLMCostCalculator`** — automatic pricing via LiteLLM's model database (requires `litellm`)\n\nSince this benchmark uses smolagents with LiteLLM models (which don't go through MASEval's `ModelAdapter`), token usage is tracked at the tool level. 
In benchmarks that use MASEval's model adapters directly, token-level usage and cost are captured automatically.", + "metadata": {} + }, + { + "cell_type": "code", + "id": "amrylkbxkb7", + "source": "from maseval import UsageReporter\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total cost: {f'${total.cost:.6f}' if total.cost is not None else 'N/A (no cost calculator)'}\")\nprint(f\" Total units: {dict(total.units) if total.units else '{}'}\")\nprint()\n\n# Per-component breakdown\nprint(\"Per-Component Breakdown\")\nprint(\"-\" * 60)\nfor component_key, usage in benchmark.usage_by_component.items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n units_str = dict(usage.units) if usage.units else \"\"\n print(f\" {component_key:<35} cost={cost_str} units={units_str}\")\nprint()\n\n# --- Post-hoc analysis with UsageReporter ---\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n print(f\" {task_id:<35} cost={cost_str}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", + "metadata": {}, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "id": "080f6216", "metadata": {}, - "source": [ - "## Summary and Key Takeaways\n", - "\n", - "### What You've Learned\n", - "\n", - "You now understand how to build production agent benchmarks with MASEval:\n", - "\n", - "#### Part 1: Multi-Agent Systems\n", - "- **Model creation** with LiteLLM for framework compatibility\n", - "- **Framework-agnostic tools** that convert to any agent library\n", - "- **Multi-agent architecture** with orchestrators and specialists\n", - "- **Tool state management** for realistic task environments\n", - "\n", 
- "#### Part 2: MASEval Framework\n", - "- **Task abstraction** packages queries, environments, and evaluation criteria\n", - "- **Environment class** creates tools and enables automatic tracing\n", - "- **Benchmark class** orchestrates evaluation across multiple tasks\n", - "- **Custom evaluators** for diverse evaluation approaches (unit tests, LLM judges, etc.)\n", - "- **Automatic tracing** captures all tool calls and agent interactions\n", - "\n", - "### Key Design Patterns\n", - "\n", - "1. **Separation of Concerns**:\n", - " - Tasks define WHAT to evaluate\n", - " - Environments provides a world in which the agents act (tools and state)\n", - " - Benchmarks orchestrate WHEN and WHERE\n", - " - Evaluators determine SUCCESS\n", - "\n", - "2. **Framework Agnostic**:\n", - " - Same tasks work with smolagents, LangGraph, LlamaIndex\n", - " - Tools convert automatically to framework-specific formats\n", - " - Easy to compare frameworks on identical tasks\n", - "\n", - "3. **Reproducibility**:\n", - " - Seeds derived systematically from task_id + agent_id\n", - " - All parameters logged automatically\n", - " - Results saved in structured JSONL format\n", - "\n", - "## Next Steps\n", - "\n", - "1. **Explore evaluators** — Check `evaluators/` for different evaluation strategies\n", - "2. **Try single-agent mode** — Load `data/singleagent.json` to compare architectures\n", - "3. **Run from CLI** — Use `five_a_day_benchmark.py` for scripted runs with different frameworks\n", - "4. **Add custom tasks** — Create your own task definitions and evaluators\n", - "5. 
**Compare frameworks** — Run the same benchmark with LangGraph or LlamaIndex\n", - "\n", - "## Resources\n", - "\n", - "- [MASEval Documentation](https://github.com/parameterlab/MASEval)\n", - "- Example code: [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark)\n", - "- Example data: [`examples/five_a_day_benchmark/data/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/data)\n", - "- Tool implementations: [`examples/five_a_day_benchmark/tools/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/tools)\n", - "- Evaluator implementations: [`examples/five_a_day_benchmark/evaluators/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/evaluators)" - ] + "source": "## Summary and Key Takeaways\n\n### What You've Learned\n\nYou now understand how to build production agent benchmarks with MASEval:\n\n#### Part 1: Multi-Agent Systems\n- **Model creation** with LiteLLM for framework compatibility\n- **Framework-agnostic tools** that convert to any agent library\n- **Multi-agent architecture** with orchestrators and specialists\n- **Tool state management** for realistic task environments\n\n#### Part 2: MASEval Framework\n- **Task abstraction** packages queries, environments, and evaluation criteria\n- **Environment class** creates tools and enables automatic tracing\n- **Benchmark class** orchestrates evaluation across multiple tasks\n- **Custom evaluators** for diverse evaluation approaches (unit tests, LLM judges, etc.)\n- **Automatic tracing** captures all tool calls and agent interactions\n- **Usage & cost tracking** monitors token consumption and computes cost across providers\n\n### Key Design Patterns\n\n1. 
**Separation of Concerns**:\n - Tasks define WHAT to evaluate\n - Environments provide a world in which the agents act (tools and state)\n - Benchmarks orchestrate WHEN and WHERE\n - Evaluators determine SUCCESS\n\n2. **Framework Agnostic**:\n - Same tasks work with smolagents, LangGraph, LlamaIndex\n - Tools convert automatically to framework-specific formats\n - Easy to compare frameworks on identical tasks\n\n3. **Reproducibility**:\n - Seeds derived systematically from task_id + agent_id\n - All parameters logged automatically\n - Results saved in structured JSONL format\n\n## Next Steps\n\n1. **Explore evaluators** — Check `evaluators/` for different evaluation strategies\n2. **Try single-agent mode** — Load `data/singleagent.json` to compare architectures\n3. **Run from CLI** — Use `five_a_day_benchmark.py` for scripted runs with different frameworks\n4. **Add custom tasks** — Create your own task definitions and evaluators\n5. **Compare frameworks** — Run the same benchmark with LangGraph or LlamaIndex\n\n## Resources\n\n- [MASEval Documentation](https://github.com/parameterlab/MASEval)\n- Example code: [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark)\n- Example data: [`examples/five_a_day_benchmark/data/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/data)\n- Tool implementations: [`examples/five_a_day_benchmark/tools/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/tools)\n- Evaluator implementations: [`examples/five_a_day_benchmark/evaluators/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/evaluators)" } ], "metadata": { diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py index 6972a203..a3972a9c 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ 
b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -26,7 +26,7 @@ from utils import sanitize_name # type: ignore[unresolved-import] -from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue, AgentAdapter, ModelAdapter, SeedGenerator +from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue, AgentAdapter, ModelAdapter, SeedGenerator, UsageReporter from maseval.core.callbacks.result_logger import FileResultLogger # Import tool implementations @@ -960,6 +960,26 @@ def load_benchmark_data( ) results = benchmark.run(tasks=tasks, agent_data=agent_configs) + # --- Usage summary --- + print("\n--- Usage Summary ---") + total = benchmark.usage + cost_str = f"${total.cost:.6f}" if total.cost is not None else "N/A (no cost calculator)" + print(f"Total cost: {cost_str}") + + if benchmark.usage_by_component: + print("\nPer-component:") + for key, usage in benchmark.usage_by_component.items(): + c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" + print(f" {key:<35} cost={c} units={dict(usage.units) if usage.units else '{}'}") + + reporter = UsageReporter.from_reports(results) + by_task = reporter.by_task() + if by_task: + print("\nPer-task:") + for task_id, usage in by_task.items(): + c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" + print(f" {task_id:<35} cost={c}") + print("\n--- Benchmark Complete ---") print(f"Total tasks: {len(tasks)}") print(f"Results saved to: {logger.output_dir}") diff --git a/maseval/__init__.py b/maseval/__init__.py index 460bf1a9..c50012ac 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -38,8 +38,7 @@ from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin from .core.usage import Usage, TokenUsage, UsageTrackableMixin -from .core.usage import CostCalculator, StaticPricingCalculator -from .core.reporting import UsageReporter +from .core.usage import CostCalculator, StaticPricingCalculator, UsageReporter from .core.registry import 
ComponentRegistry from .core.context import TaskContext from .core.exceptions import ( diff --git a/maseval/core/reporting.py b/maseval/core/reporting.py deleted file mode 100644 index 2465fba0..00000000 --- a/maseval/core/reporting.py +++ /dev/null @@ -1,148 +0,0 @@ -"""Post-hoc usage reporting utilities. - -This module provides ``UsageReporter`` for slicing and analyzing usage data -from benchmark reports. Unlike the registry's live aggregates (which provide -running totals), the reporter can slice by task since it sees the full report -list with task IDs. -""" - -from __future__ import annotations - -from typing import Any, Dict, List - -from .usage import Usage, TokenUsage - - -class UsageReporter: - """Post-hoc utility for analyzing usage across benchmark reports. - - Walks ``report["usage"]`` across all reports to produce breakdowns - by task, component, model, etc. - - Example: - ```python - reporter = UsageReporter.from_reports(benchmark.reports) - print(reporter.total()) - print(reporter.by_task()) - print(reporter.by_component()) - ``` - """ - - def __init__(self, entries: List[Dict[str, Any]]): - """Initialize with raw entries extracted from reports. - - Args: - entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, - and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). - """ - self._entries = entries - - @staticmethod - def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: - """Create a UsageReporter from benchmark reports. - - Args: - reports: The ``benchmark.reports`` list. - - Returns: - A UsageReporter ready for analysis. 
- """ - entries = [] - for report in reports: - usage_data = report.get("usage") - if not usage_data or "error" in usage_data: - continue - - usage_items = [] - for category, value in usage_data.items(): - if category == "metadata": - continue - if isinstance(value, dict) and "cost" in value: - # Direct value (environment/user) — it's a usage dict - usage_items.append((category, value)) - elif isinstance(value, dict): - # Category dict with component names as keys - for comp_name, comp_usage in value.items(): - if isinstance(comp_usage, dict) and "error" not in comp_usage: - usage_items.append((f"{category}:{comp_name}", comp_usage)) - - entries.append( - { - "task_id": report.get("task_id"), - "repeat_idx": report.get("repeat_idx"), - "usage_items": usage_items, - } - ) - - return UsageReporter(entries) - - @staticmethod - def _usage_from_dict(d: Dict[str, Any]) -> Usage: - """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" - has_tokens = "input_tokens" in d - if has_tokens: - return TokenUsage( - cost=d.get("cost"), - units=d.get("units", {}), - provider=d.get("provider"), - category=d.get("category"), - component_name=d.get("component_name"), - kind=d.get("kind"), - input_tokens=d.get("input_tokens", 0), - output_tokens=d.get("output_tokens", 0), - total_tokens=d.get("total_tokens", 0), - cached_input_tokens=d.get("cached_input_tokens", 0), - reasoning_tokens=d.get("reasoning_tokens", 0), - audio_tokens=d.get("audio_tokens", 0), - ) - return Usage( - cost=d.get("cost"), - units=d.get("units", {}), - provider=d.get("provider"), - category=d.get("category"), - component_name=d.get("component_name"), - kind=d.get("kind"), - ) - - def by_task(self) -> Dict[str, Usage]: - """Aggregate usage by task_id across all repetitions.""" - result: Dict[str, Usage] = {} - for entry in self._entries: - task_id = entry["task_id"] - for _key, usage_dict in entry["usage_items"]: - usage = self._usage_from_dict(usage_dict) - if task_id in result: - result[task_id] = 
result[task_id] + usage - else: - result[task_id] = usage - return result - - def by_component(self) -> Dict[str, Usage]: - """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" - result: Dict[str, Usage] = {} - for entry in self._entries: - for key, usage_dict in entry["usage_items"]: - usage = self._usage_from_dict(usage_dict) - if key in result: - result[key] = result[key] + usage - else: - result[key] = usage - return result - - def total(self) -> Usage: - """Grand total across all tasks and components.""" - all_usages = [] - for entry in self._entries: - for _key, usage_dict in entry["usage_items"]: - all_usages.append(self._usage_from_dict(usage_dict)) - if not all_usages: - return Usage() - return sum(all_usages, Usage()) - - def summary(self) -> Dict[str, Any]: - """Nested dict with all breakdowns.""" - return { - "total": self.total().to_dict(), - "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, - "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, - } diff --git a/maseval/core/usage.py b/maseval/core/usage.py index aa2c2e08..319d89bb 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -2,9 +2,10 @@ This module provides the `Usage` and `TokenUsage` data classes for recording billable resource consumption, the `UsageTrackableMixin` that enables -automatic usage collection through the component registry, and pluggable +automatic usage collection through the component registry, pluggable cost calculators (`CostCalculator`, `StaticPricingCalculator`) for translating -token counts into monetary cost. +token counts into monetary cost, and `UsageReporter` for post-hoc analysis +of usage data from benchmark reports. Usage tracking is a first-class collection axis alongside tracing (`TraceableMixin`) and configuration (`ConfigurableMixin`). 
Components that @@ -20,7 +21,7 @@ from __future__ import annotations from dataclasses import dataclass, field -from typing import Any, Dict, Optional, Protocol, runtime_checkable +from typing import Any, Dict, List, Optional, Protocol, runtime_checkable @dataclass @@ -411,7 +412,7 @@ def add_model(self, model_id: str, rates: Dict[str, float]) -> None: self._pricing[model_id] = rates @property - def models(self) -> list[str]: + def models(self) -> List[str]: """List of model IDs with pricing configured.""" return list(self._pricing.keys()) @@ -421,3 +422,138 @@ def gather_config(self) -> Dict[str, Any]: "type": type(self).__name__, "pricing": dict(self._pricing), } + + +class UsageReporter: + """Post-hoc utility for analyzing usage across benchmark reports. + + Walks ``report["usage"]`` across all reports to produce breakdowns + by task, component, model, etc. + + Example: + ```python + reporter = UsageReporter.from_reports(benchmark.reports) + print(reporter.total()) + print(reporter.by_task()) + print(reporter.by_component()) + ``` + """ + + def __init__(self, entries: List[Dict[str, Any]]): + """Initialize with raw entries extracted from reports. + + Args: + entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, + and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). + """ + self._entries = entries + + @staticmethod + def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: + """Create a UsageReporter from benchmark reports. + + Args: + reports: The ``benchmark.reports`` list. + + Returns: + A UsageReporter ready for analysis. 
+ """ + entries = [] + for report in reports: + usage_data = report.get("usage") + if not usage_data or "error" in usage_data: + continue + + usage_items = [] + for category, value in usage_data.items(): + if category == "metadata": + continue + if isinstance(value, dict) and "cost" in value: + # Direct value (environment/user) — it's a usage dict + usage_items.append((category, value)) + elif isinstance(value, dict): + # Category dict with component names as keys + for comp_name, comp_usage in value.items(): + if isinstance(comp_usage, dict) and "error" not in comp_usage: + usage_items.append((f"{category}:{comp_name}", comp_usage)) + + entries.append( + { + "task_id": report.get("task_id"), + "repeat_idx": report.get("repeat_idx"), + "usage_items": usage_items, + } + ) + + return UsageReporter(entries) + + @staticmethod + def _usage_from_dict(d: Dict[str, Any]) -> Usage: + """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" + has_tokens = "input_tokens" in d + if has_tokens: + return TokenUsage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + input_tokens=d.get("input_tokens", 0), + output_tokens=d.get("output_tokens", 0), + total_tokens=d.get("total_tokens", 0), + cached_input_tokens=d.get("cached_input_tokens", 0), + reasoning_tokens=d.get("reasoning_tokens", 0), + audio_tokens=d.get("audio_tokens", 0), + ) + return Usage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + ) + + def by_task(self) -> Dict[str, Usage]: + """Aggregate usage by task_id across all repetitions.""" + result: Dict[str, Usage] = {} + for entry in self._entries: + task_id = entry["task_id"] + for _key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if task_id in result: + result[task_id] = 
result[task_id] + usage + else: + result[task_id] = usage + return result + + def by_component(self) -> Dict[str, Usage]: + """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" + result: Dict[str, Usage] = {} + for entry in self._entries: + for key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if key in result: + result[key] = result[key] + usage + else: + result[key] = usage + return result + + def total(self) -> Usage: + """Grand total across all tasks and components.""" + all_usages = [] + for entry in self._entries: + for _key, usage_dict in entry["usage_items"]: + all_usages.append(self._usage_from_dict(usage_dict)) + if not all_usages: + return Usage() + return sum(all_usages, Usage()) + + def summary(self) -> Dict[str, Any]: + """Nested dict with all breakdowns.""" + return { + "total": self.total().to_dict(), + "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, + "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, + } From 13067b93c242abef1cfc587a760034b0bbc4e24f Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 16:49:30 +0100 Subject: [PATCH 06/19] updated tests and fixed bugs in cost tracking --- docs/reference/usage.md | 4 - maseval/core/model.py | 5 +- maseval/core/usage.py | 26 +- maseval/interface/inference/anthropic.py | 3 + maseval/interface/inference/litellm.py | 3 + maseval/interface/usage.py | 2 + tests/test_core/test_usage.py | 681 ++++++++++++++++++ .../test_api_contracts.py | 422 +++++++++++ 8 files changed, 1137 insertions(+), 9 deletions(-) create mode 100644 tests/test_core/test_usage.py diff --git a/docs/reference/usage.md b/docs/reference/usage.md index 460769bf..87bbcfe9 100644 --- a/docs/reference/usage.md +++ b/docs/reference/usage.md @@ -4,8 +4,6 @@ Usage and cost tracking provides data classes for recording resource consumption See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage patterns and examples. 
-## Core - [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/usage.py){ .md-source-file } ::: maseval.core.usage.Usage @@ -20,8 +18,6 @@ See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage pat ::: maseval.core.usage.UsageReporter -## Interface - [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/usage.py){ .md-source-file } ::: maseval.interface.usage.LiteLLMCostCalculator diff --git a/maseval/core/model.py b/maseval/core/model.py index 110d9879..4c4a7f6a 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -402,7 +402,10 @@ def gather_usage(self) -> Usage: """ if not self._usage_records: return Usage() - return sum(self._usage_records, Usage()) + result = self._usage_records[0] + for record in self._usage_records[1:]: + result = result + record + return result def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this model adapter. diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 319d89bb..2a53931c 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -130,6 +130,8 @@ class TokenUsage(Usage): total_tokens: Total tokens (input + output). cached_input_tokens: Tokens served from cache (Anthropic ``cache_read_input_tokens``, OpenAI ``cached_tokens``). + cache_creation_input_tokens: Tokens used to create a new cache entry + (Anthropic ``cache_creation_input_tokens``). Billed at a higher rate. reasoning_tokens: Tokens used for reasoning (OpenAI ``reasoning_tokens``, Google ``thoughts_token_count``). audio_tokens: Tokens for audio processing (OpenAI). 
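The extended addition semantics described above can be sanity-checked with a standalone sketch. This is a simplified stand-in dataclass mirroring the `TokenUsage.__add__` behavior this patch specifies (token fields sum; cost sums only when both operands report one), not the real maseval import:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for maseval's TokenUsage, mirroring the addition
# semantics in this patch: every token counter sums, and cost is only
# summed when both operands know their cost (otherwise it stays None).
@dataclass
class TokenUsageSketch:
    cost: Optional[float] = None
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0
    cached_input_tokens: int = 0
    cache_creation_input_tokens: int = 0

    def __add__(self, other: "TokenUsageSketch") -> "TokenUsageSketch":
        cost = (
            self.cost + other.cost
            if self.cost is not None and other.cost is not None
            else None
        )
        return TokenUsageSketch(
            cost=cost,
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            total_tokens=self.total_tokens + other.total_tokens,
            cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens,
            cache_creation_input_tokens=self.cache_creation_input_tokens
            + other.cache_creation_input_tokens,
        )

a = TokenUsageSketch(cost=0.10, input_tokens=100, output_tokens=50,
                     total_tokens=150, cache_creation_input_tokens=10)
b = TokenUsageSketch(cost=None, input_tokens=200, output_tokens=30,
                     total_tokens=230, cache_creation_input_tokens=5)
total = a + b
print(total.cache_creation_input_tokens)  # 15
print(total.cost)  # None — one unknown cost poisons the whole sum
```

The None-propagation is deliberate: a partially known total cost would be misleading, so any record without a cost makes the aggregate cost `None` while token counts stay exact.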
@@ -149,6 +151,7 @@ class TokenUsage(Usage): output_tokens: int = 0 total_tokens: int = 0 cached_input_tokens: int = 0 + cache_creation_input_tokens: int = 0 reasoning_tokens: int = 0 audio_tokens: int = 0 @@ -169,6 +172,7 @@ def __add__(self, other: Usage) -> Usage: output_tokens=self.output_tokens + other.output_tokens, total_tokens=self.total_tokens + other.total_tokens, cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens, + cache_creation_input_tokens=self.cache_creation_input_tokens + other.cache_creation_input_tokens, reasoning_tokens=self.reasoning_tokens + other.reasoning_tokens, audio_tokens=self.audio_tokens + other.audio_tokens, ) @@ -185,6 +189,7 @@ def __add__(self, other: Usage) -> Usage: output_tokens=self.output_tokens, total_tokens=self.total_tokens, cached_input_tokens=self.cached_input_tokens, + cache_creation_input_tokens=self.cache_creation_input_tokens, reasoning_tokens=self.reasoning_tokens, audio_tokens=self.audio_tokens, ) @@ -197,6 +202,7 @@ def to_dict(self) -> Dict[str, Any]: "output_tokens": self.output_tokens, "total_tokens": self.total_tokens, "cached_input_tokens": self.cached_input_tokens, + "cache_creation_input_tokens": self.cache_creation_input_tokens, "reasoning_tokens": self.reasoning_tokens, "audio_tokens": self.audio_tokens, } @@ -237,6 +243,7 @@ def from_chat_response_usage( output_tokens=usage_dict.get("output_tokens", 0), total_tokens=usage_dict.get("total_tokens", 0), cached_input_tokens=usage_dict.get("cached_input_tokens", 0), + cache_creation_input_tokens=usage_dict.get("cache_creation_input_tokens", 0), reasoning_tokens=usage_dict.get("reasoning_tokens", 0), audio_tokens=usage_dict.get("audio_tokens", 0), ) @@ -353,6 +360,7 @@ class StaticPricingCalculator: - ``"input"`` — cost per input token (required) - ``"output"`` — cost per output token (required) - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + - ``"cache_creation_input"`` — cost per cache 
creation token (optional, defaults to ``"input"`` rate) Example: ```python @@ -394,11 +402,17 @@ def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: input_rate = rates.get("input", 0.0) output_rate = rates.get("output", 0.0) cached_rate = rates.get("cached_input", input_rate) + cache_creation_rate = rates.get("cache_creation_input", input_rate) - # Non-cached input tokens = total input - cached - non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + # Non-cached input tokens = total input - cached - cache_creation + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens - usage.cache_creation_input_tokens) - cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + cost = ( + non_cached_input * input_rate + + usage.cached_input_tokens * cached_rate + + usage.cache_creation_input_tokens * cache_creation_rate + + usage.output_tokens * output_rate + ) return cost @@ -503,6 +517,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: output_tokens=d.get("output_tokens", 0), total_tokens=d.get("total_tokens", 0), cached_input_tokens=d.get("cached_input_tokens", 0), + cache_creation_input_tokens=d.get("cache_creation_input_tokens", 0), reasoning_tokens=d.get("reasoning_tokens", 0), audio_tokens=d.get("audio_tokens", 0), ) @@ -548,7 +563,10 @@ def total(self) -> Usage: all_usages.append(self._usage_from_dict(usage_dict)) if not all_usages: return Usage() - return sum(all_usages, Usage()) + result = all_usages[0] + for u in all_usages[1:]: + result = result + u + return result def summary(self) -> Dict[str, Any]: """Nested dict with all breakdowns.""" diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 5c816e76..dfd07579 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -352,6 +352,9 @@ def _parse_response(self, response: Any) -> ChatResponse: 
cached = getattr(response.usage, "cache_read_input_tokens", 0) if cached: usage["cached_input_tokens"] = cached + cache_creation = getattr(response.usage, "cache_creation_input_tokens", 0) + if cache_creation: + usage["cache_creation_input_tokens"] = cache_creation # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index ce5385e7..b12b618e 100644 --- a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -193,6 +193,9 @@ def _chat_impl( cached = getattr(prompt_details, "cached_tokens", 0) if cached: usage["cached_input_tokens"] = cached + cache_creation = getattr(prompt_details, "cache_creation_tokens", 0) + if cache_creation: + usage["cache_creation_input_tokens"] = cache_creation # LiteLLM provider-reported cost hidden = getattr(response, "_hidden_params", None) if hidden and isinstance(hidden, dict): diff --git a/maseval/interface/usage.py b/maseval/interface/usage.py index 87070f13..f5767e2d 100644 --- a/maseval/interface/usage.py +++ b/maseval/interface/usage.py @@ -109,6 +109,8 @@ def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: model=model_id, prompt_tokens=usage.input_tokens, completion_tokens=usage.output_tokens, + cache_read_input_tokens=usage.cached_input_tokens, + cache_creation_input_tokens=usage.cache_creation_input_tokens, ) return input_cost + output_cost except Exception: diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py new file mode 100644 index 00000000..fa6da9a6 --- /dev/null +++ b/tests/test_core/test_usage.py @@ -0,0 +1,681 @@ +"""Tests for usage tracking and cost calculation correctness. 
+ +Verifies that: +- TokenUsage arithmetic produces correct results +- StaticPricingCalculator computes exact expected costs +- LiteLLMCostCalculator passes the right parameters to litellm +- Full pipeline (adapter → TokenUsage → CostCalculator → cost) is correct +- UsageReporter aggregates correctly from report dicts +- Serialization roundtrips preserve all fields +""" + +import pytest + +from maseval.core.usage import ( + Usage, + TokenUsage, + StaticPricingCalculator, + UsageReporter, +) + +pytestmark = [pytest.mark.core] + + +# ============================================================================= +# TokenUsage — Construction & Serialization +# ============================================================================= + + +class TestTokenUsageConstruction: + """Verify TokenUsage fields map correctly from various sources.""" + + def test_from_chat_response_basic(self): + """Minimal usage dict maps to the right fields.""" + tu = TokenUsage.from_chat_response_usage( + {"input_tokens": 100, "output_tokens": 50, "total_tokens": 150} + ) + assert tu.input_tokens == 100 + assert tu.output_tokens == 50 + assert tu.total_tokens == 150 + assert tu.cached_input_tokens == 0 + assert tu.cache_creation_input_tokens == 0 + assert tu.reasoning_tokens == 0 + assert tu.audio_tokens == 0 + assert tu.cost is None + + def test_from_chat_response_all_fields(self): + """All optional fields are mapped when present.""" + tu = TokenUsage.from_chat_response_usage( + { + "input_tokens": 1000, + "output_tokens": 200, + "total_tokens": 1200, + "cached_input_tokens": 800, + "cache_creation_input_tokens": 50, + "reasoning_tokens": 100, + "audio_tokens": 10, + }, + cost=0.05, + provider="anthropic", + ) + assert tu.input_tokens == 1000 + assert tu.output_tokens == 200 + assert tu.cached_input_tokens == 800 + assert tu.cache_creation_input_tokens == 50 + assert tu.reasoning_tokens == 100 + assert tu.audio_tokens == 10 + assert tu.cost == 0.05 + assert tu.provider == "anthropic" + + 
def test_serialization_roundtrip(self): + """to_dict → from_dict preserves every field.""" + original = TokenUsage( + cost=0.123, + input_tokens=500, + output_tokens=100, + total_tokens=600, + cached_input_tokens=200, + cache_creation_input_tokens=50, + reasoning_tokens=80, + audio_tokens=5, + provider="openai", + category="models", + component_name="main_model", + kind="llm", + ) + d = original.to_dict() + + # Verify dict has all expected keys + assert d["input_tokens"] == 500 + assert d["output_tokens"] == 100 + assert d["total_tokens"] == 600 + assert d["cached_input_tokens"] == 200 + assert d["cache_creation_input_tokens"] == 50 + assert d["reasoning_tokens"] == 80 + assert d["audio_tokens"] == 5 + assert d["cost"] == 0.123 + assert d["provider"] == "openai" + assert d["category"] == "models" + assert d["component_name"] == "main_model" + assert d["kind"] == "llm" + + # Reconstruct via UsageReporter's deserialization path + reconstructed = UsageReporter._usage_from_dict(d) + assert isinstance(reconstructed, TokenUsage) + assert reconstructed.input_tokens == original.input_tokens + assert reconstructed.output_tokens == original.output_tokens + assert reconstructed.cached_input_tokens == original.cached_input_tokens + assert reconstructed.cache_creation_input_tokens == original.cache_creation_input_tokens + assert reconstructed.reasoning_tokens == original.reasoning_tokens + assert reconstructed.audio_tokens == original.audio_tokens + assert reconstructed.cost == original.cost + + +# ============================================================================= +# TokenUsage — Arithmetic +# ============================================================================= + + +class TestTokenUsageArithmetic: + """Verify addition produces mathematically correct results.""" + + def test_add_two_token_usages(self): + """All token fields and cost sum correctly.""" + a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150, cached_input_tokens=20, 
cache_creation_input_tokens=10) + b = TokenUsage(cost=0.05, input_tokens=200, output_tokens=30, total_tokens=230, cached_input_tokens=50, cache_creation_input_tokens=5) + total = a + b + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.15) + assert total.input_tokens == 300 + assert total.output_tokens == 80 + assert total.total_tokens == 380 + assert total.cached_input_tokens == 70 + assert total.cache_creation_input_tokens == 15 + + def test_sum_multiple(self): + """sum() over a list of TokenUsages works correctly.""" + records = [ + TokenUsage(cost=0.01, input_tokens=10, output_tokens=5, total_tokens=15), + TokenUsage(cost=0.02, input_tokens=20, output_tokens=10, total_tokens=30), + TokenUsage(cost=0.03, input_tokens=30, output_tokens=15, total_tokens=45), + ] + total = records[0] + for r in records[1:]: + total = total + r + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.06) + assert total.input_tokens == 60 + assert total.output_tokens == 30 + assert total.total_tokens == 90 + + def test_none_cost_propagates(self): + """If either cost is None, sum cost is None.""" + a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=None, input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.cost is None + # Token fields still sum correctly + assert total.input_tokens == 300 + assert total.output_tokens == 80 + + def test_grouping_fields_match(self): + """Matching grouping fields are preserved.""" + a = TokenUsage(cost=0.10, provider="anthropic", kind="llm", input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=0.05, provider="anthropic", kind="llm", input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.provider == "anthropic" + assert total.kind == "llm" + + def test_grouping_fields_mismatch(self): + """Mismatched grouping fields become None.""" + a = TokenUsage(cost=0.10, 
provider="anthropic", input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=0.05, provider="openai", input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.provider is None + + def test_add_token_usage_plus_plain_usage(self): + """TokenUsage + plain Usage preserves token fields from left operand.""" + token = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150, cached_input_tokens=20) + plain = Usage(cost=0.05, units={"api_calls": 1}) + total = token + plain + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.15) + assert total.input_tokens == 100 + assert total.cached_input_tokens == 20 + assert total.units == {"api_calls": 1} + + +# ============================================================================= +# StaticPricingCalculator — Cost Correctness +# ============================================================================= + + +class TestStaticPricingCalculator: + """Verify cost formulas with hand-calculated expected values.""" + + def test_basic_cost(self): + """Simple input + output cost with no caching. + + 100 input * $0.01 = $1.00 + 50 output * $0.02 = $1.00 + Total = $2.00 + """ + calc = StaticPricingCalculator({ + "test-model": {"input": 0.01, "output": 0.02}, + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + cost = calc.calculate_cost(usage, "test-model") + + assert cost == pytest.approx(2.00) + + def test_cached_input_tokens(self): + """Cached tokens use the cheaper rate. 
+ + input_tokens=1000, cached_input_tokens=800 + Non-cached: 200 * $0.003 = $0.60 + Cached: 800 * $0.0003 = $0.24 + Output: 100 * $0.015 = $1.50 + Total = $2.34 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + }) + usage = TokenUsage(input_tokens=1000, output_tokens=100, total_tokens=1100, cached_input_tokens=800) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(2.34) + + def test_cache_creation_tokens(self): + """Cache creation tokens use the higher rate. + + input_tokens=1000, cached_input_tokens=600, cache_creation_input_tokens=200 + Non-cached: (1000 - 600 - 200) = 200 * $0.003 = $0.60 + Cached: 600 * $0.0003 = $0.18 + Cache creation: 200 * $0.00375 = $0.75 + Output: 100 * $0.015 = $1.50 + Total = $3.03 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + "cache_creation_input": 0.00375, + }, + }) + usage = TokenUsage( + input_tokens=1000, + output_tokens=100, + total_tokens=1100, + cached_input_tokens=600, + cache_creation_input_tokens=200, + ) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(3.03) + + def test_cache_creation_defaults_to_input_rate(self): + """When cache_creation_input is not specified, it defaults to the input rate. 
+ + input_tokens=1000, cache_creation_input_tokens=200 + Non-cached: 800 * $0.003 = $2.40 + Cache creation: 200 * $0.003 = $0.60 (uses input rate) + Output: 100 * $0.015 = $1.50 + Total = $4.50 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, + }) + usage = TokenUsage( + input_tokens=1000, + output_tokens=100, + total_tokens=1100, + cache_creation_input_tokens=200, + ) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(4.50) + + def test_unknown_model_returns_none(self): + """Model not in pricing table returns None, not zero.""" + calc = StaticPricingCalculator({"gpt-4": {"input": 0.01, "output": 0.02}}) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + assert calc.calculate_cost(usage, "unknown-model") is None + + def test_zero_tokens(self): + """Zero tokens produces zero cost.""" + calc = StaticPricingCalculator({"m": {"input": 0.01, "output": 0.02}}) + usage = TokenUsage(input_tokens=0, output_tokens=0, total_tokens=0) + + assert calc.calculate_cost(usage, "m") == pytest.approx(0.0) + + def test_real_world_anthropic_pricing(self): + """Real Anthropic Sonnet 4 pricing: $3/$15 per 1M tokens. + + 500 input * $0.000003 = $0.0015 + 200 output * $0.000015 = $0.003 + Total = $0.0045 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, + }) + usage = TokenUsage(input_tokens=500, output_tokens=200, total_tokens=700) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(0.0045) + + def test_real_world_openai_pricing(self): + """Real GPT-4o pricing: $2.50/$10 per 1M tokens. 
+ + 1000 input * $0.0000025 = $0.0025 + 500 output * $0.000010 = $0.005 + Total = $0.0075 + """ + calc = StaticPricingCalculator({ + "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, + }) + usage = TokenUsage(input_tokens=1000, output_tokens=500, total_tokens=1500) + cost = calc.calculate_cost(usage, "gpt-4o") + + assert cost == pytest.approx(0.0075) + + +# ============================================================================= +# LiteLLMCostCalculator — Parameter Passing +# ============================================================================= + + +class TestLiteLLMCostCalculator: + """Verify LiteLLMCostCalculator passes the right params to litellm.""" + + def test_passes_cache_tokens_to_cost_per_token(self): + """Verify cache_read and cache_creation tokens are forwarded.""" + litellm = pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator() + usage = TokenUsage( + input_tokens=1000, + output_tokens=200, + total_tokens=1200, + cached_input_tokens=600, + cache_creation_input_tokens=100, + ) + + with patch("litellm.cost_per_token", return_value=(0.003, 0.006)) as mock_cpt: + cost = calc.calculate_cost(usage, "claude-sonnet-4-5-20250514") + + mock_cpt.assert_called_once_with( + model="claude-sonnet-4-5-20250514", + prompt_tokens=1000, + completion_tokens=200, + cache_read_input_tokens=600, + cache_creation_input_tokens=100, + ) + assert cost == pytest.approx(0.009) + + def test_model_id_map_remapping(self): + """model_id_map remaps before calling litellm.""" + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token", return_value=(0.001, 0.002)) as mock_cpt: + 
calc.calculate_cost(usage, "gemini-2.0-flash") + + # Verify it called with the remapped ID + assert mock_cpt.call_args.kwargs["model"] == "gemini/gemini-2.0-flash" + + def test_custom_pricing_overrides_litellm(self): + """custom_pricing takes precedence over litellm database. + + 100 input * $0.0001 = $0.01 + 50 output * $0.0002 = $0.01 + Total = $0.02 + """ + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator(custom_pricing={ + "my-model": { + "input_cost_per_token": 0.0001, + "output_cost_per_token": 0.0002, + }, + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token") as mock_cpt: + cost = calc.calculate_cost(usage, "my-model") + + # litellm.cost_per_token should NOT be called + mock_cpt.assert_not_called() + assert cost == pytest.approx(0.02) + + def test_unknown_model_returns_none(self): + """Model not in litellm's database returns None.""" + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator() + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token", side_effect=Exception("not found")): + cost = calc.calculate_cost(usage, "nonexistent-model-xyz") + + assert cost is None + + +# ============================================================================= +# Full Pipeline — DummyModelAdapter + CostCalculator +# ============================================================================= + + +class TestFullPipeline: + """End-to-end: adapter → TokenUsage → CostCalculator → gather_usage().cost. + + Uses DummyModelAdapter from conftest with known usage dicts and a + StaticPricingCalculator with known rates, then verifies the final cost + matches hand-calculated values. 
+    """
+
+    def test_basic_pipeline(self):
+        """Single chat call → correct cost on gather_usage().
+
+        100 input * $0.01 + 50 output * $0.02 = $2.00
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.input_tokens == 100
+        assert total.output_tokens == 50
+        assert total.cost == pytest.approx(2.00)
+
+    def test_pipeline_multiple_calls_accumulate(self):
+        """Multiple chat calls accumulate usage correctly.
+
+        Call 1: 100 input * $0.01 + 50 output * $0.02 = $2.00
+        Call 2: 100 input * $0.01 + 50 output * $0.02 = $2.00
+        Total = $4.00, 200 input, 100 output
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        adapter.chat([{"role": "user", "content": "World"}])
+        total = adapter.gather_usage()
+
+        assert total.input_tokens == 200
+        assert total.output_tokens == 100
+        assert total.cost == pytest.approx(4.00)
+
+    def test_pipeline_provider_cost_takes_precedence(self):
+        """Provider-reported cost wins over calculator.
+
+        Usage dict has cost=0.99 (provider-reported).
+        Calculator would compute $2.00.
+        Provider cost should win.
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150, "cost": 0.99},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.cost == pytest.approx(0.99)
+
+    def test_pipeline_no_calculator_no_provider_cost(self):
+        """Without calculator or provider cost, cost is None."""
+        from tests.conftest import DummyModelAdapter
+
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.input_tokens == 100
+        assert total.cost is None
+
+    def test_pipeline_with_cached_tokens(self):
+        """Pipeline correctly handles cached tokens in cost calculation.
+
+        input_tokens=1000, cached_input_tokens=800
+        Non-cached: 200 * $0.003 = $0.60
+        Cached: 800 * $0.0003 = $0.24
+        Output: 100 * $0.015 = $1.50
+        Total = $2.34
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "claude-sonnet-4-5": {
+                "input": 0.003,
+                "output": 0.015,
+                "cached_input": 0.0003,
+            },
+        })
+        adapter = DummyModelAdapter(
+            model_id="claude-sonnet-4-5",
+            usage={
+                "input_tokens": 1000,
+                "output_tokens": 100,
+                "total_tokens": 1100,
+                "cached_input_tokens": 800,
+            },
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.cached_input_tokens == 800
+        assert total.cost == pytest.approx(2.34)
+
+
+# =============================================================================
+# UsageReporter — Aggregation Correctness
+# =============================================================================
+
+
+class TestUsageReporter:
+    """Verify UsageReporter produces correct aggregations from report dicts."""
+
+    @pytest.fixture
+    def sample_reports(self):
+        """Two tasks, each with a model component."""
+        return [
+            {
+                "task_id": "task_1",
+                "repeat_idx": 0,
+                "usage": {
+                    "models": {
+                        "main_model": {
+                            "cost": 0.10,
+                            "input_tokens": 100,
+                            "output_tokens": 50,
+                            "total_tokens": 150,
+                            "cached_input_tokens": 0,
+                            "cache_creation_input_tokens": 0,
+                            "reasoning_tokens": 0,
+                            "audio_tokens": 0,
+                            "units": {},
+                            "provider": "openai",
+                            "category": "models",
+                            "component_name": "main_model",
+                            "kind": "llm",
+                        }
+                    }
+                },
+            },
+            {
+                "task_id": "task_2",
+                "repeat_idx": 0,
+                "usage": {
+                    "models": {
+                        "main_model": {
+                            "cost": 0.20,
+                            "input_tokens": 200,
+                            "output_tokens": 100,
+                            "total_tokens": 300,
+                            "cached_input_tokens": 50,
+                            "cache_creation_input_tokens": 0,
+                            "reasoning_tokens": 0,
+                            "audio_tokens": 0,
+                            "units": {},
+                            "provider": "openai",
+                            "category": "models",
+                            "component_name": "main_model",
+                            "kind": "llm",
+                        }
+                    }
+                },
+            },
+        ]
+
+    def test_total(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        total = reporter.total()
+
+        assert total.cost == pytest.approx(0.30)
+        assert total.input_tokens == 300
+        assert total.output_tokens == 150
+        assert total.cached_input_tokens == 50
+
+    def test_by_task(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        by_task = reporter.by_task()
+
+        assert len(by_task) == 2
+        assert by_task["task_1"].cost == pytest.approx(0.10)
+        assert by_task["task_1"].input_tokens == 100
+        assert by_task["task_2"].cost == pytest.approx(0.20)
+        assert by_task["task_2"].input_tokens == 200
+
+    def test_by_component(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        by_comp = reporter.by_component()
+
+        assert len(by_comp) == 1
+        assert "models:main_model" in by_comp
+        assert by_comp["models:main_model"].cost == pytest.approx(0.30)
+        assert by_comp["models:main_model"].input_tokens == 300
+
+    def test_summary_structure(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        summary = reporter.summary()
+
+        assert "total" in summary
+        assert "by_task" in summary
+        assert "by_component" in summary
+        assert summary["total"]["cost"] == pytest.approx(0.30)
+        assert summary["total"]["input_tokens"] == 300
+
+    def test_empty_reports(self):
+        reporter = UsageReporter.from_reports([])
+        total = reporter.total()
+
+        # Empty reports return a plain Usage with no cost
+        assert total.cost is None
+        assert isinstance(total, Usage)
+
+    def test_skips_error_reports(self):
+        reports = [
+            {
+                "task_id": "task_1",
+                "repeat_idx": 0,
+                "usage": {"error": "setup failed"},
+            },
+        ]
+        reporter = UsageReporter.from_reports(reports)
+        total = reporter.total()
+        assert total.cost is None
+        assert isinstance(total, Usage)
diff --git a/tests/test_interface/test_model_integration/test_api_contracts.py b/tests/test_interface/test_model_integration/test_api_contracts.py
index 022d3ec5..b32732b2 100644
--- a/tests/test_interface/test_model_integration/test_api_contracts.py
+++ b/tests/test_interface/test_model_integration/test_api_contracts.py
@@ -580,3 +580,425 @@ def test_tool_call_response(self):
         assert response.usage is not None
         assert response.usage["input_tokens"] == 82
         assert response.usage["output_tokens"] == 18
+
+
+# =============================================================================
+# Usage Extraction Contract Tests
+# =============================================================================
+#
+# These tests verify that each adapter correctly extracts ALL usage fields
+# (including cache tokens, reasoning tokens, provider cost) from realistic
+# API response payloads, and that the cost calculator produces correct costs.
+# =============================================================================
+
+
+# -- OpenAI usage-rich fixture ------------------------------------------------
+
+OPENAI_USAGE_RICH_RESPONSE = {
+    "id": "chatcmpl-usage-test",
+    "object": "chat.completion",
+    "created": 1700000000,
+    "model": "gpt-4o",
+    "system_fingerprint": "fp_usage_test",
+    "choices": [
+        {
+            "index": 0,
+            "message": {
+                "role": "assistant",
+                "content": "Hello!",
+                "refusal": None,
+            },
+            "logprobs": None,
+            "finish_reason": "stop",
+        }
+    ],
+    "usage": {
+        "prompt_tokens": 500,
+        "completion_tokens": 200,
+        "total_tokens": 700,
+        "prompt_tokens_details": {
+            "cached_tokens": 300,
+        },
+        "completion_tokens_details": {
+            "reasoning_tokens": 80,
+            "audio_tokens": 0,
+            "accepted_prediction_tokens": 0,
+            "rejected_prediction_tokens": 0,
+        },
+    },
+}
+
+
+# -- Anthropic usage-rich fixture --------------------------------------------
+
+ANTHROPIC_USAGE_RICH_RESPONSE = {
+    "id": "msg_usage_test",
+    "type": "message",
+    "role": "assistant",
+    "content": [{"type": "text", "text": "Hello!"}],
+    "model": "claude-sonnet-4-5-20250514",
+    "stop_reason": "end_turn",
+    "stop_sequence": None,
+    "usage": {
+        "input_tokens": 1000,
+        "output_tokens": 200,
+        "cache_read_input_tokens": 600,
+        "cache_creation_input_tokens": 100,
+    },
+}
+
+
+# -- Google usage-rich fixture -----------------------------------------------
+
+GOOGLE_USAGE_RICH_RESPONSE = {
+    "candidates": [
+        {
+            "content": {
+                "parts": [{"text": "Hello!"}],
+                "role": "model",
+            },
+            "finishReason": "STOP",
+        }
+    ],
+    "usageMetadata": {
+        "promptTokenCount": 500,
+        "candidatesTokenCount": 200,
+        "totalTokenCount": 700,
+        "thoughtsTokenCount": 120,
+    },
+    "modelVersion": "gemini-2.0-flash-thinking",
+}
+
+
+class TestOpenAIUsageExtraction:
+    """Verify OpenAI adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_cached_and_reasoning_tokens(self):
+        """Cached tokens and reasoning tokens are extracted from nested details."""
+        pytest.importorskip("openai")
+        from openai import OpenAI
+        from maseval.interface.inference.openai import OpenAIModelAdapter
+
+        respx.post("https://api.openai.com/v1/chat/completions").respond(
+            200, json=OPENAI_USAGE_RICH_RESPONSE
+        )
+
+        client = OpenAI(api_key="test-key-not-real")
+        adapter = OpenAIModelAdapter(client=client, model_id="gpt-4o")
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 500
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 700
+        assert response.usage["cached_input_tokens"] == 300
+        assert response.usage["reasoning_tokens"] == 80
+
+    @respx.mock
+    def test_cost_calculation_with_cached_tokens(self):
+        """Full pipeline: OpenAI adapter + StaticPricingCalculator with caching.
+
+        input_tokens=500, cached_input_tokens=300
+        Non-cached: 200 * $2.5e-6 = $0.0005
+        Cached: 300 * $1.25e-6 = $0.000375
+        Output: 200 * $10e-6 = $0.002
+        Total = $0.002875
+        """
+        pytest.importorskip("openai")
+        from openai import OpenAI
+        from maseval.interface.inference.openai import OpenAIModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.post("https://api.openai.com/v1/chat/completions").respond(
+            200, json=OPENAI_USAGE_RICH_RESPONSE
+        )
+
+        calc = StaticPricingCalculator({
+            "gpt-4o": {
+                "input": 2.5e-6,
+                "output": 10e-6,
+                "cached_input": 1.25e-6,
+            },
+        })
+
+        client = OpenAI(api_key="test-key-not-real")
+        adapter = OpenAIModelAdapter(
+            client=client, model_id="gpt-4o", cost_calculator=calc
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.input_tokens == 500
+        assert total.cached_input_tokens == 300
+        assert total.reasoning_tokens == 80
+        assert total.cost == pytest.approx(0.002875)
+
+
+class TestAnthropicUsageExtraction:
+    """Verify Anthropic adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_cache_read_and_creation_tokens(self):
+        """Both cache_read and cache_creation tokens are extracted."""
+        pytest.importorskip("anthropic")
+        from anthropic import Anthropic
+        from maseval.interface.inference.anthropic import AnthropicModelAdapter
+
+        respx.post("https://api.anthropic.com/v1/messages").respond(
+            200, json=ANTHROPIC_USAGE_RICH_RESPONSE
+        )
+
+        client = Anthropic(api_key="test-key-not-real")
+        adapter = AnthropicModelAdapter(
+            client=client, model_id="claude-sonnet-4-5-20250514"
+        )
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 1000
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 1200  # computed by adapter
+        assert response.usage["cached_input_tokens"] == 600
+        assert response.usage["cache_creation_input_tokens"] == 100
+
+    @respx.mock
+    def test_cost_calculation_with_cache_creation(self):
+        """Full pipeline: Anthropic adapter + StaticPricingCalculator with cache creation.
+
+        input_tokens=1000, cached=600, cache_creation=100
+        Non-cached: (1000 - 600 - 100) = 300 * $3e-6 = $0.0009
+        Cached: 600 * $0.3e-6 = $0.00018
+        Cache creation: 100 * $3.75e-6 = $0.000375
+        Output: 200 * $15e-6 = $0.003
+        Total = $0.004455
+        """
+        pytest.importorskip("anthropic")
+        from anthropic import Anthropic
+        from maseval.interface.inference.anthropic import AnthropicModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.post("https://api.anthropic.com/v1/messages").respond(
+            200, json=ANTHROPIC_USAGE_RICH_RESPONSE
+        )
+
+        calc = StaticPricingCalculator({
+            "claude-sonnet-4-5-20250514": {
+                "input": 3e-6,
+                "output": 15e-6,
+                "cached_input": 0.3e-6,
+                "cache_creation_input": 3.75e-6,
+            },
+        })
+
+        client = Anthropic(api_key="test-key-not-real")
+        adapter = AnthropicModelAdapter(
+            client=client,
+            model_id="claude-sonnet-4-5-20250514",
+            cost_calculator=calc,
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cached_input_tokens == 600
+        assert total.cache_creation_input_tokens == 100
+        assert total.cost == pytest.approx(0.004455)
+
+
+class TestGoogleGenAIUsageExtraction:
+    """Verify Google GenAI adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_thoughts_as_reasoning_tokens(self):
+        """Google's thoughtsTokenCount maps to reasoning_tokens."""
+        pytest.importorskip("google.genai")
+        from google import genai
+        from maseval.interface.inference.google_genai import GoogleGenAIModelAdapter
+
+        respx.route(
+            method="POST",
+            url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*",
+        ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE)
+
+        client = genai.Client(
+            api_key="test-key-not-real",
+            http_options={"api_version": "v1beta"},
+        )
+        adapter = GoogleGenAIModelAdapter(
+            client=client, model_id="gemini-2.0-flash-thinking"
+        )
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 500
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 700
+        assert response.usage["reasoning_tokens"] == 120
+
+    @respx.mock
+    def test_cost_calculation_basic(self):
+        """Full pipeline: Google adapter + StaticPricingCalculator.
+
+        500 input * $0.075e-6 = $0.0000375
+        200 output * $0.3e-6 = $0.00006
+        Total = $0.0000975
+        """
+        pytest.importorskip("google.genai")
+        from google import genai
+        from maseval.interface.inference.google_genai import GoogleGenAIModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.route(
+            method="POST",
+            url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*",
+        ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE)
+
+        calc = StaticPricingCalculator({
+            "gemini-2.0-flash-thinking": {
+                "input": 0.075e-6,
+                "output": 0.3e-6,
+            },
+        })
+
+        client = genai.Client(
+            api_key="test-key-not-real",
+            http_options={"api_version": "v1beta"},
+        )
+        adapter = GoogleGenAIModelAdapter(
+            client=client,
+            model_id="gemini-2.0-flash-thinking",
+            cost_calculator=calc,
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.reasoning_tokens == 120
+        assert total.cost == pytest.approx(0.0000975)
+
+
+class TestLiteLLMUsageExtraction:
+    """Verify LiteLLM adapter extracts all usage fields correctly."""
+
+    def test_extracts_cached_and_cache_creation_tokens(self):
+        """LiteLLM's prompt_tokens_details with cached_tokens and cache_creation_tokens."""
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+
+        mock_prompt_details = MagicMock()
+        mock_prompt_details.cached_tokens = 400
+        mock_prompt_details.cache_creation_tokens = 50
+
+        mock_completion_details = MagicMock()
+        mock_completion_details.reasoning_tokens = 60
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 800
+        mock_usage.completion_tokens = 150
+        mock_usage.total_tokens = 950
+        mock_usage.prompt_tokens_details = mock_prompt_details
+        mock_usage.completion_tokens_details = mock_completion_details
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "claude-sonnet-4-5-20250514"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {"response_cost": 0.0042}
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(model_id="claude-sonnet-4-5-20250514")
+            response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 800
+        assert response.usage["output_tokens"] == 150
+        assert response.usage["total_tokens"] == 950
+        assert response.usage["cached_input_tokens"] == 400
+        assert response.usage["cache_creation_input_tokens"] == 50
+        assert response.usage["reasoning_tokens"] == 60
+
+    def test_provider_cost_from_hidden_params(self):
+        """LiteLLM's _hidden_params.response_cost is extracted as provider cost.
+
+        Provider cost ($0.0042) should take precedence over calculator.
+        """
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 100
+        mock_usage.completion_tokens = 50
+        mock_usage.total_tokens = 150
+        mock_usage.prompt_tokens_details = None
+        mock_usage.completion_tokens_details = None
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "gpt-4o"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {"response_cost": 0.0042}
+
+        # Calculator would compute a different cost — provider should win
+        calc = StaticPricingCalculator({
+            "gpt-4o": {"input": 0.01, "output": 0.02},
+        })
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(
+                model_id="gpt-4o", cost_calculator=calc
+            )
+            adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cost == pytest.approx(0.0042)
+
+    def test_calculator_used_when_no_provider_cost(self):
+        """When _hidden_params has no cost, calculator is used.
+
+        100 input * $0.01 + 50 output * $0.02 = $2.00
+        """
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 100
+        mock_usage.completion_tokens = 50
+        mock_usage.total_tokens = 150
+        mock_usage.prompt_tokens_details = None
+        mock_usage.completion_tokens_details = None
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "gpt-4o"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {}
+
+        calc = StaticPricingCalculator({
+            "gpt-4o": {"input": 0.01, "output": 0.02},
+        })
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(
+                model_id="gpt-4o", cost_calculator=calc
+            )
+            adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cost == pytest.approx(2.00)

From 60f205aa0eecb1dd9401df6f83720e6502c91ea4 Mon Sep 17 00:00:00 2001
From: cemde
Date: Fri, 13 Mar 2026 17:56:01 +0100
Subject: [PATCH 07/19] upgraded testing

---
 maseval/core/registry.py                    |   9 +-
 .../test_benchmark/test_usage_collection.py |  97 ++++++++
 tests/test_core/test_registry.py            | 227 ++++++++++++++++++
 3 files changed, 331 insertions(+), 2 deletions(-)
 create mode 100644 tests/test_core/test_benchmark/test_usage_collection.py

diff --git a/maseval/core/registry.py b/maseval/core/registry.py
index e34fc972..cf2193e8 100644
--- a/maseval/core/registry.py
+++ b/maseval/core/registry.py
@@ -319,9 +319,14 @@ def collect_usage(self) -> Dict[str, Any]:
                 usage[category] = {}
             usage[category][comp_name] = usage_dict
 
-        # Accumulate into persistent aggregates (thread-safe)
+        # Accumulate into persistent aggregates (thread-safe).
+        # _usage_total starts as Usage(cost=None); adding to it would
+        # poison the cost (None + X = None). Assign directly on first use.
         with self._usage_lock:
-            self._usage_total = self._usage_total + component_usage
+            if self._usage_total.cost is None and not self._usage_total.units:
+                self._usage_total = component_usage
+            else:
+                self._usage_total = self._usage_total + component_usage
             if key in self._usage_by_component:
                 self._usage_by_component[key] = self._usage_by_component[key] + component_usage
             else:
diff --git a/tests/test_core/test_benchmark/test_usage_collection.py b/tests/test_core/test_benchmark/test_usage_collection.py
new file mode 100644
index 00000000..fdf2a74e
--- /dev/null
+++ b/tests/test_core/test_benchmark/test_usage_collection.py
@@ -0,0 +1,97 @@
+"""Test usage collection through the benchmark execution loop.
+
+These tests verify that benchmark.run() collects usage from registered
+model adapters and includes it in report dicts.
+"""
+
+import pytest
+from maseval import TaskQueue
+
+
+@pytest.mark.core
+class TestBenchmarkUsageCollection:
+    """Tests for usage collection during benchmark execution."""
+
+    def test_usage_in_report(self):
+        """Benchmark run includes a 'usage' key in each report."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        assert "usage" in reports[0]
+        usage = reports[0]["usage"]
+        assert "metadata" in usage
+        assert "models" in usage
+        assert "agents" in usage
+
+    def test_usage_has_correct_structure(self):
+        """Usage dict has the expected category keys and metadata."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        usage = reports[0]["usage"]
+        assert "metadata" in usage
+        assert "total_components" in usage["metadata"]
+        assert "timestamp" in usage["metadata"]
+
+    def test_model_with_usage_appears_in_report(self):
+        """A benchmark that overrides get_model_adapter still yields a usage section."""
+        from conftest import DummyModelAdapter, DummyBenchmark
+
+        class UsageBenchmark(DummyBenchmark):
+            def get_model_adapter(self, model_id, **kwargs):
+                return DummyModelAdapter(
+                    model_id=model_id,
+                    usage={
+                        "input_tokens": 100,
+                        "output_tokens": 50,
+                        "total_tokens": 150,
+                    },
+                )
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = UsageBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        # The DummyBenchmark doesn't register a model via register(), so
+        # the model's usage won't appear unless the benchmark hooks it up.
+        # This test verifies the usage structure exists.
+        assert "usage" in reports[0]
+
+    def test_usage_reported_for_each_task(self):
+        """Every task in a multi-task run produces a report containing usage."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([
+            {"query": "Task 1", "environment_data": {}},
+            {"query": "Task 2", "environment_data": {}},
+        ])
+        benchmark = DummyBenchmark()
+        benchmark.run(tasks, agent_data={"model": "test"})
+
+        # Both tasks should have produced reports with usage
+        assert len(benchmark.reports) == 2
+        assert "usage" in benchmark.reports[0]
+        assert "usage" in benchmark.reports[1]
+
+    def test_usage_property_returns_total(self):
+        """benchmark.usage returns the running total."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+        benchmark.run(tasks, agent_data={"model": "test"})
+
+        # usage property should return a Usage object (even if empty)
+        total = benchmark.usage
+        assert total is not None
+        # cost may be None if DummyModelAdapter doesn't provide usage
diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py
index 59408775..c62db645 100644
--- a/tests/test_core/test_registry.py
+++ b/tests/test_core/test_registry.py
@@ -256,3 +256,230 @@ def worker(worker_id: int):
     for worker_id, traces in results.items():
         assert f"agent_{worker_id}" in traces["agents"]
         assert len(traces["agents"]) == 1
+
+
+# ==================== Usage Tracking Tests ====================
+
+
+from maseval.core.usage import UsageTrackableMixin
+
+
+class UsageAwareComponent(TraceableMixin, UsageTrackableMixin):
+    """Component with both tracing and usage tracking."""
+
+    def __init__(self, cost: float = 0.0, input_tokens: int = 0, output_tokens: int = 0):
+        TraceableMixin.__init__(self)
+        self._cost = cost
+        self._input_tokens = input_tokens
+        self._output_tokens = output_tokens
+
+    def gather_traces(self) -> Dict[str, Any]:
+        return {"traced": True}
+
+    def gather_usage(self):
+        from maseval.core.usage import TokenUsage
+        return TokenUsage(
+            cost=self._cost,
+            input_tokens=self._input_tokens,
+            output_tokens=self._output_tokens,
+            total_tokens=self._input_tokens + self._output_tokens,
+        )
+
+
+class BrokenUsageComponent(TraceableMixin, UsageTrackableMixin):
+    """Component whose gather_usage raises an exception."""
+
+    def __init__(self):
+        TraceableMixin.__init__(self)
+
+    def gather_traces(self) -> Dict[str, Any]:
+        return {}
+
+    def gather_usage(self):
+        raise RuntimeError("Usage collection failed")
+
+
+@pytest.mark.core
+class TestRegistryUsageCollection:
+    """Tests for usage tracking through the component registry."""
+
+    def test_register_usage_trackable_component(self):
+        """UsageTrackableMixin component is registered in the usage registry."""
+        registry = ComponentRegistry()
+        component = UsageAwareComponent(cost=0.05, input_tokens=100, output_tokens=50)
+
+        registry.register("models", "main_model", component)
+
+        assert "models:main_model" in registry._usage_registry
+        assert registry._usage_registry["models:main_model"] is component
+
+    def test_non_usage_component_not_in_usage_registry(self):
+        """Components without UsageTrackableMixin are NOT in the usage registry."""
+        registry = ComponentRegistry()
+        component = MockTraceableComponent("test")
+
+        registry.register("agents", "my_agent", component)
+
+        assert "agents:my_agent" in registry._trace_registry
+        assert "agents:my_agent" not in registry._usage_registry
+
+    def test_collect_usage_basic(self):
+        """collect_usage returns structured dict with usage from registered components."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200)
+        registry.register("models", "main_model", model)
+
+        usage = registry.collect_usage()
+
+        assert "metadata" in usage
+        assert "models" in usage
+        assert "main_model" in usage["models"]
+
+        model_usage = usage["models"]["main_model"]
+        assert model_usage["cost"] == 0.10
+        assert model_usage["input_tokens"] == 500
+        assert model_usage["output_tokens"] == 200
+        assert model_usage["total_tokens"] == 700
+
+    def test_collect_usage_multiple_components(self):
+        """Multiple components across categories are all collected."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200)
+        tool = UsageAwareComponent(cost=0.05, input_tokens=0, output_tokens=0)
+
+        registry.register("models", "main_model", model)
+        registry.register("tools", "search_tool", tool)
+
+        usage = registry.collect_usage()
+
+        assert "main_model" in usage["models"]
+        assert "search_tool" in usage["tools"]
+        assert usage["models"]["main_model"]["cost"] == 0.10
+        assert usage["tools"]["search_tool"]["cost"] == 0.05
+
+    def test_collect_usage_injects_grouping_fields(self):
+        """Registry injects category and component_name into usage records."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+
+        usage = registry.collect_usage()
+
+        model_usage = usage["models"]["main_model"]
+        assert model_usage["category"] == "models"
+        assert model_usage["component_name"] == "main_model"
+
+    def test_total_usage_accumulates(self):
+        """total_usage property reflects accumulated usage across collect_usage calls."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+
+        # First collection
+        registry.collect_usage()
+        total1 = registry.total_usage
+        assert total1.cost == pytest.approx(0.10)
+
+        # Clear and re-register (simulates next repetition)
+        registry.clear()
+        model2 = UsageAwareComponent(cost=0.20, input_tokens=200, output_tokens=100)
+        registry.register("models", "main_model", model2)
+
+        # Second collection
+        registry.collect_usage()
+        total2 = registry.total_usage
+        assert total2.cost == pytest.approx(0.30)
+
+    def test_usage_by_component_accumulates(self):
+        """usage_by_component accumulates per key across repetitions."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+        registry.collect_usage()
+
+        # Clear and re-register for second repetition
+        registry.clear()
+        model2 = UsageAwareComponent(cost=0.20, input_tokens=200, output_tokens=100)
+        registry.register("models", "main_model", model2)
+        registry.collect_usage()
+
+        by_comp = registry.usage_by_component
+        assert "models:main_model" in by_comp
+
+        total = by_comp["models:main_model"]
+        assert total.input_tokens == 300
+        assert total.output_tokens == 150
+        assert total.cost == pytest.approx(0.30)
+
+    def test_usage_persists_across_clear(self):
+        """clear() does NOT reset total_usage or usage_by_component."""
+        registry = ComponentRegistry()
UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50) + registry.register("models", "main_model", model) + registry.collect_usage() + + # Clear only removes per-repetition state + registry.clear() + + assert registry.total_usage.cost == pytest.approx(0.10) + assert "models:main_model" in registry.usage_by_component + + def test_collect_usage_handles_error_gracefully(self): + """If gather_usage raises, the error is captured in the usage dict.""" + registry = ComponentRegistry() + broken = BrokenUsageComponent() + registry.register("models", "bad_model", broken) + + usage = registry.collect_usage() + + assert "bad_model" in usage["models"] + assert "error" in usage["models"]["bad_model"] + assert "RuntimeError" in usage["models"]["bad_model"]["error_type"] + + def test_collect_usage_empty_registry(self): + """collect_usage with no components returns empty structure.""" + registry = ComponentRegistry() + usage = registry.collect_usage() + + assert usage["metadata"]["total_components"] == 0 + assert usage["models"] == {} + assert usage["agents"] == {} From 742ffb5bfef2a9a6ce9e39fd9cb32ed0e532ad9b Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 18:13:50 +0100 Subject: [PATCH 08/19] upgraded testing --- tests/test_core/test_usage.py | 228 ++++++++++++++++++++++++++++++++++ 1 file changed, 228 insertions(+) diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index fa6da9a6..ca9b551d 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -679,3 +679,231 @@ def test_skips_error_reports(self): total = reporter.total() assert total.cost is None assert isinstance(total, Usage) + + def test_by_task_accumulates_repeats(self): + """by_task sums usage when a task_id appears in multiple reports.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "models": { + "m": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + 
"cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + } + }, + }, + { + "task_id": "task_1", + "repeat_idx": 1, + "usage": { + "models": { + "m": { + "cost": 0.20, + "input_tokens": 200, + "output_tokens": 100, + "total_tokens": 300, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + by_task = reporter.by_task() + + assert len(by_task) == 1 + assert by_task["task_1"].cost == pytest.approx(0.30) + assert by_task["task_1"].input_tokens == 300 + + def test_plain_usage_fallback(self): + """_usage_from_dict returns plain Usage when no token fields present.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "tools": { + "my_tool": { + "cost": 0.05, + "units": {"api_calls": 3}, + "provider": None, + "category": "tools", + "component_name": "my_tool", + "kind": "tool", + } + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.05) + assert isinstance(total, Usage) + assert not isinstance(total, TokenUsage) + + def test_metadata_key_skipped(self): + """The 'metadata' key in usage dicts is not treated as a component.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "metadata": {"timestamp": "2025-01-01", "total_components": 1}, + "models": { + "m": { + "cost": 0.10, + "input_tokens": 50, + "output_tokens": 25, + "total_tokens": 75, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + }, + }, + }, + ] + reporter = 
UsageReporter.from_reports(reports) + total = reporter.total() + + # Only the model's cost, metadata should not contribute + assert total.cost == pytest.approx(0.10) + assert total.input_tokens == 50 + + def test_skips_component_with_error(self): + """Components with error dicts are skipped, others still counted.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "models": { + "good_model": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "good_model", + "kind": "llm", + }, + "bad_model": { + "error": "Failed to gather usage", + "error_type": "RuntimeError", + }, + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.10) + assert total.input_tokens == 100 + + def test_environment_direct_usage(self): + """Environment/user usage (direct dicts with 'cost') are parsed.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "environment": { + "cost": 0.05, + "units": {"steps": 10}, + "provider": None, + "category": "environment", + "component_name": "env", + "kind": "env", + }, + "models": { + "m": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + }, + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.15) + + +# ============================================================================= +# StaticPricingCalculator — Utility Methods +# 
============================================================================= + + +class TestStaticPricingCalculatorUtilities: + """Tests for add_model, models property, and gather_config.""" + + def test_add_model(self): + calc = StaticPricingCalculator({}) + calc.add_model("new-model", {"input": 0.01, "output": 0.02}) + + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + cost = calc.calculate_cost(usage, "new-model") + assert cost == pytest.approx(2.00) + + def test_models_property(self): + calc = StaticPricingCalculator({ + "model-a": {"input": 0.01, "output": 0.02}, + "model-b": {"input": 0.001, "output": 0.002}, + }) + assert sorted(calc.models) == ["model-a", "model-b"] + + def test_gather_config(self): + pricing = {"model-a": {"input": 0.01, "output": 0.02}} + calc = StaticPricingCalculator(pricing) + config = calc.gather_config() + + assert config["type"] == "StaticPricingCalculator" + assert config["pricing"] == pricing From 6af219b9a0ead64b2b6b9a5d91bdb732a8e35d52 Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 18:48:02 +0100 Subject: [PATCH 09/19] fixed linting and tests --- maseval/core/model.py | 2 +- tests/conftest.py | 2 +- .../test_benchmark/test_usage_collection.py | 11 +- tests/test_core/test_registry.py | 11 +- tests/test_core/test_usage.py | 160 +++++++++++------- .../test_api_contracts.py | 104 ++++++------ 6 files changed, 162 insertions(+), 128 deletions(-) diff --git a/maseval/core/model.py b/maseval/core/model.py index 4c4a7f6a..e8bb0a3f 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -100,7 +100,7 @@ class ChatResponse: content: Optional[str] = None tool_calls: Optional[List[Dict[str, Any]]] = None role: str = "assistant" - usage: Optional[Dict[str, int]] = None + usage: Optional[Dict[str, Any]] = None model: Optional[str] = None stop_reason: Optional[str] = None diff --git a/tests/conftest.py b/tests/conftest.py index 6bd1cc54..00407458 100644 --- a/tests/conftest.py +++ 
b/tests/conftest.py @@ -48,7 +48,7 @@ def __init__( model_id: str = "test-model", responses: Optional[List[Optional[str]]] = None, tool_calls: Optional[List[Optional[List[Dict[str, Any]]]]] = None, - usage: Optional[Dict[str, int]] = None, + usage: Optional[Dict[str, Any]] = None, stop_reason: Optional[str] = None, seed: Optional[int] = None, ): diff --git a/tests/test_core/test_benchmark/test_usage_collection.py b/tests/test_core/test_benchmark/test_usage_collection.py index fdf2a74e..ac42a92a 100644 --- a/tests/test_core/test_benchmark/test_usage_collection.py +++ b/tests/test_core/test_benchmark/test_usage_collection.py @@ -6,7 +6,6 @@ import pytest from maseval import TaskQueue -from maseval.core.usage import StaticPricingCalculator @pytest.mark.core @@ -71,10 +70,12 @@ def test_usage_persists_across_task_repetitions(self): """Benchmark.usage accumulates across multiple tasks.""" from conftest import DummyBenchmark - tasks = TaskQueue.from_list([ - {"query": "Task 1", "environment_data": {}}, - {"query": "Task 2", "environment_data": {}}, - ]) + tasks = TaskQueue.from_list( + [ + {"query": "Task 1", "environment_data": {}}, + {"query": "Task 2", "environment_data": {}}, + ] + ) benchmark = DummyBenchmark() benchmark.run(tasks, agent_data={"model": "test"}) diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py index c62db645..17b30c5b 100644 --- a/tests/test_core/test_registry.py +++ b/tests/test_core/test_registry.py @@ -13,6 +13,7 @@ from maseval.core.registry import ComponentRegistry from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin +from maseval.core.usage import UsageTrackableMixin # ==================== Test Components ==================== @@ -276,6 +277,7 @@ def gather_traces(self) -> Dict[str, Any]: def gather_usage(self): from maseval.core.usage import TokenUsage + return TokenUsage( cost=self._cost, input_tokens=self._input_tokens, @@ -297,10 +299,6 @@ def gather_usage(self): 
raise RuntimeError("Usage collection failed") -# Ensure MockUsageComponent also inherits UsageTrackableMixin -from maseval.core.usage import UsageTrackableMixin - - class UsageAwareComponent(TraceableMixin, UsageTrackableMixin): """Component with both tracing and usage tracking.""" @@ -315,6 +313,7 @@ def gather_traces(self) -> Dict[str, Any]: def gather_usage(self): from maseval.core.usage import TokenUsage + return TokenUsage( cost=self._cost, input_tokens=self._input_tokens, @@ -362,7 +361,6 @@ def test_non_usage_component_not_in_usage_registry(self): def test_collect_usage_basic(self): """collect_usage returns structured dict with usage from registered components.""" - from maseval.core.usage import TokenUsage registry = ComponentRegistry() model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200) @@ -445,7 +443,10 @@ def test_usage_by_component_accumulates(self): by_comp = registry.usage_by_component assert "models:main_model" in by_comp + from maseval.core.usage import TokenUsage + total = by_comp["models:main_model"] + assert isinstance(total, TokenUsage) assert total.input_tokens == 300 assert total.output_tokens == 150 assert total.cost == pytest.approx(0.30) diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index ca9b551d..5348585b 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -31,9 +31,7 @@ class TestTokenUsageConstruction: def test_from_chat_response_basic(self): """Minimal usage dict maps to the right fields.""" - tu = TokenUsage.from_chat_response_usage( - {"input_tokens": 100, "output_tokens": 50, "total_tokens": 150} - ) + tu = TokenUsage.from_chat_response_usage({"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}) assert tu.input_tokens == 100 assert tu.output_tokens == 50 assert tu.total_tokens == 150 @@ -158,6 +156,7 @@ def test_none_cost_propagates(self): assert total.cost is None # Token fields still sum correctly + assert isinstance(total, TokenUsage) 
assert total.input_tokens == 300 assert total.output_tokens == 80 @@ -206,9 +205,11 @@ def test_basic_cost(self): 50 output * $0.02 = $1.00 Total = $2.00 """ - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) cost = calc.calculate_cost(usage, "test-model") @@ -223,13 +224,15 @@ def test_cached_input_tokens(self): Output: 100 * $0.015 = $1.50 Total = $2.34 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + } + ) usage = TokenUsage(input_tokens=1000, output_tokens=100, total_tokens=1100, cached_input_tokens=800) cost = calc.calculate_cost(usage, "claude-sonnet-4-5") @@ -245,14 +248,16 @@ def test_cache_creation_tokens(self): Output: 100 * $0.015 = $1.50 Total = $3.03 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - "cache_creation_input": 0.00375, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + "cache_creation_input": 0.00375, + }, + } + ) usage = TokenUsage( input_tokens=1000, output_tokens=100, @@ -273,9 +278,11 @@ def test_cache_creation_defaults_to_input_rate(self): Output: 100 * $0.015 = $1.50 Total = $4.50 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, + } + ) usage = TokenUsage( input_tokens=1000, output_tokens=100, @@ -307,9 +314,11 @@ def test_real_world_anthropic_pricing(self): 200 output * $0.000015 = $0.003 Total = $0.0045 
""" - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, + } + ) usage = TokenUsage(input_tokens=500, output_tokens=200, total_tokens=700) cost = calc.calculate_cost(usage, "claude-sonnet-4-5") @@ -322,9 +331,11 @@ def test_real_world_openai_pricing(self): 500 output * $0.000010 = $0.005 Total = $0.0075 """ - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, + } + ) usage = TokenUsage(input_tokens=1000, output_tokens=500, total_tokens=1500) cost = calc.calculate_cost(usage, "gpt-4o") @@ -341,7 +352,7 @@ class TestLiteLLMCostCalculator: def test_passes_cache_tokens_to_cost_per_token(self): """Verify cache_read and cache_creation tokens are forwarded.""" - litellm = pytest.importorskip("litellm") + pytest.importorskip("litellm") from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator @@ -372,9 +383,11 @@ def test_model_id_map_remapping(self): from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator - calc = LiteLLMCostCalculator(model_id_map={ - "gemini-2.0-flash": "gemini/gemini-2.0-flash", - }) + calc = LiteLLMCostCalculator( + model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) with patch("litellm.cost_per_token", return_value=(0.001, 0.002)) as mock_cpt: @@ -394,12 +407,14 @@ def test_custom_pricing_overrides_litellm(self): from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator - calc = LiteLLMCostCalculator(custom_pricing={ - "my-model": { - "input_cost_per_token": 0.0001, - "output_cost_per_token": 0.0002, - }, - }) + calc = LiteLLMCostCalculator( + custom_pricing={ + "my-model": { + "input_cost_per_token": 
0.0001, + "output_cost_per_token": 0.0002, + }, + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) with patch("litellm.cost_per_token") as mock_cpt: @@ -444,9 +459,11 @@ def test_basic_pipeline(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}, @@ -470,9 +487,11 @@ def test_pipeline_multiple_calls_accumulate(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}, @@ -483,6 +502,7 @@ def test_pipeline_multiple_calls_accumulate(self): adapter.chat([{"role": "user", "content": "World"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.input_tokens == 200 assert total.output_tokens == 100 assert total.cost == pytest.approx(4.00) @@ -496,9 +516,11 @@ def test_pipeline_provider_cost_takes_precedence(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150, "cost": 0.99}, @@ -522,6 +544,7 @@ def test_pipeline_no_calculator_no_provider_cost(self): adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.input_tokens == 100 
assert total.cost is None @@ -536,13 +559,15 @@ def test_pipeline_with_cached_tokens(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + } + ) adapter = DummyModelAdapter( model_id="claude-sonnet-4-5", usage={ @@ -557,6 +582,7 @@ def test_pipeline_with_cached_tokens(self): adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.cached_input_tokens == 800 assert total.cost == pytest.approx(2.34) @@ -625,6 +651,7 @@ def test_total(self, sample_reports): reporter = UsageReporter.from_reports(sample_reports) total = reporter.total() + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.30) assert total.input_tokens == 300 assert total.output_tokens == 150 @@ -635,8 +662,10 @@ def test_by_task(self, sample_reports): by_task = reporter.by_task() assert len(by_task) == 2 + assert isinstance(by_task["task_1"], TokenUsage) assert by_task["task_1"].cost == pytest.approx(0.10) assert by_task["task_1"].input_tokens == 100 + assert isinstance(by_task["task_2"], TokenUsage) assert by_task["task_2"].cost == pytest.approx(0.20) assert by_task["task_2"].input_tokens == 200 @@ -646,8 +675,10 @@ def test_by_component(self, sample_reports): assert len(by_comp) == 1 assert "models:main_model" in by_comp - assert by_comp["models:main_model"].cost == pytest.approx(0.30) - assert by_comp["models:main_model"].input_tokens == 300 + total = by_comp["models:main_model"] + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.30) + assert total.input_tokens == 300 def test_summary_structure(self, sample_reports): reporter = UsageReporter.from_reports(sample_reports) @@ -734,6 +765,7 @@ def 
test_by_task_accumulates_repeats(self): by_task = reporter.by_task() assert len(by_task) == 1 + assert isinstance(by_task["task_1"], TokenUsage) assert by_task["task_1"].cost == pytest.approx(0.30) assert by_task["task_1"].input_tokens == 300 @@ -796,6 +828,7 @@ def test_metadata_key_skipped(self): total = reporter.total() # Only the model's cost, metadata should not contribute + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.10) assert total.input_tokens == 50 @@ -833,6 +866,7 @@ def test_skips_component_with_error(self): reporter = UsageReporter.from_reports(reports) total = reporter.total() + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.10) assert total.input_tokens == 100 @@ -894,10 +928,12 @@ def test_add_model(self): assert cost == pytest.approx(2.00) def test_models_property(self): - calc = StaticPricingCalculator({ - "model-a": {"input": 0.01, "output": 0.02}, - "model-b": {"input": 0.001, "output": 0.002}, - }) + calc = StaticPricingCalculator( + { + "model-a": {"input": 0.01, "output": 0.02}, + "model-b": {"input": 0.001, "output": 0.002}, + } + ) assert sorted(calc.models) == ["model-a", "model-b"] def test_gather_config(self): diff --git a/tests/test_interface/test_model_integration/test_api_contracts.py b/tests/test_interface/test_model_integration/test_api_contracts.py index b32732b2..088d7057 100644 --- a/tests/test_interface/test_model_integration/test_api_contracts.py +++ b/tests/test_interface/test_model_integration/test_api_contracts.py @@ -680,14 +680,13 @@ def test_extracts_cached_and_reasoning_tokens(self): from openai import OpenAI from maseval.interface.inference.openai import OpenAIModelAdapter - respx.post("https://api.openai.com/v1/chat/completions").respond( - 200, json=OPENAI_USAGE_RICH_RESPONSE - ) + respx.post("https://api.openai.com/v1/chat/completions").respond(200, json=OPENAI_USAGE_RICH_RESPONSE) client = OpenAI(api_key="test-key-not-real") adapter = 
OpenAIModelAdapter(client=client, model_id="gpt-4o") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 500 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 700 @@ -709,22 +708,20 @@ def test_cost_calculation_with_cached_tokens(self): from maseval.interface.inference.openai import OpenAIModelAdapter from maseval.core.usage import StaticPricingCalculator, TokenUsage - respx.post("https://api.openai.com/v1/chat/completions").respond( - 200, json=OPENAI_USAGE_RICH_RESPONSE - ) + respx.post("https://api.openai.com/v1/chat/completions").respond(200, json=OPENAI_USAGE_RICH_RESPONSE) - calc = StaticPricingCalculator({ - "gpt-4o": { - "input": 2.5e-6, - "output": 10e-6, - "cached_input": 1.25e-6, - }, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": { + "input": 2.5e-6, + "output": 10e-6, + "cached_input": 1.25e-6, + }, + } + ) client = OpenAI(api_key="test-key-not-real") - adapter = OpenAIModelAdapter( - client=client, model_id="gpt-4o", cost_calculator=calc - ) + adapter = OpenAIModelAdapter(client=client, model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() @@ -745,16 +742,13 @@ def test_extracts_cache_read_and_creation_tokens(self): from anthropic import Anthropic from maseval.interface.inference.anthropic import AnthropicModelAdapter - respx.post("https://api.anthropic.com/v1/messages").respond( - 200, json=ANTHROPIC_USAGE_RICH_RESPONSE - ) + respx.post("https://api.anthropic.com/v1/messages").respond(200, json=ANTHROPIC_USAGE_RICH_RESPONSE) client = Anthropic(api_key="test-key-not-real") - adapter = AnthropicModelAdapter( - client=client, model_id="claude-sonnet-4-5-20250514" - ) + adapter = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5-20250514") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None 
assert response.usage["input_tokens"] == 1000 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 1200 # computed by adapter @@ -777,19 +771,19 @@ def test_cost_calculation_with_cache_creation(self): from maseval.interface.inference.anthropic import AnthropicModelAdapter from maseval.core.usage import StaticPricingCalculator, TokenUsage - respx.post("https://api.anthropic.com/v1/messages").respond( - 200, json=ANTHROPIC_USAGE_RICH_RESPONSE + respx.post("https://api.anthropic.com/v1/messages").respond(200, json=ANTHROPIC_USAGE_RICH_RESPONSE) + + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5-20250514": { + "input": 3e-6, + "output": 15e-6, + "cached_input": 0.3e-6, + "cache_creation_input": 3.75e-6, + }, + } ) - calc = StaticPricingCalculator({ - "claude-sonnet-4-5-20250514": { - "input": 3e-6, - "output": 15e-6, - "cached_input": 0.3e-6, - "cache_creation_input": 3.75e-6, - }, - }) - client = Anthropic(api_key="test-key-not-real") adapter = AnthropicModelAdapter( client=client, @@ -824,11 +818,10 @@ def test_extracts_thoughts_as_reasoning_tokens(self): api_key="test-key-not-real", http_options={"api_version": "v1beta"}, ) - adapter = GoogleGenAIModelAdapter( - client=client, model_id="gemini-2.0-flash-thinking" - ) + adapter = GoogleGenAIModelAdapter(client=client, model_id="gemini-2.0-flash-thinking") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 500 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 700 @@ -852,12 +845,14 @@ def test_cost_calculation_basic(self): url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*", ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE) - calc = StaticPricingCalculator({ - "gemini-2.0-flash-thinking": { - "input": 0.075e-6, - "output": 0.3e-6, - }, - }) + calc = StaticPricingCalculator( + { + "gemini-2.0-flash-thinking": { + "input": 
0.075e-6, + "output": 0.3e-6, + }, + } + ) client = genai.Client( api_key="test-key-not-real", @@ -913,6 +908,7 @@ def test_extracts_cached_and_cache_creation_tokens(self): adapter = LiteLLMModelAdapter(model_id="claude-sonnet-4-5-20250514") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 800 assert response.usage["output_tokens"] == 150 assert response.usage["total_tokens"] == 950 @@ -948,14 +944,14 @@ def test_provider_cost_from_hidden_params(self): mock_response._hidden_params = {"response_cost": 0.0042} # Calculator would compute a different cost — provider should win - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 0.01, "output": 0.02}, + } + ) with patch("litellm.completion", return_value=mock_response): - adapter = LiteLLMModelAdapter( - model_id="gpt-4o", cost_calculator=calc - ) + adapter = LiteLLMModelAdapter(model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() @@ -989,14 +985,14 @@ def test_calculator_used_when_no_provider_cost(self): mock_response.usage = mock_usage mock_response._hidden_params = {} - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 0.01, "output": 0.02}, + } + ) with patch("litellm.completion", return_value=mock_response): - adapter = LiteLLMModelAdapter( - model_id="gpt-4o", cost_calculator=calc - ) + adapter = LiteLLMModelAdapter(model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() From 3d772bc02096d382dc3a89e43ca1d23e8db323f1 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 20:30:08 +0100 Subject: [PATCH 10/19] added agent usage tracking --- maseval/core/agent.py | 3 +- 
maseval/interface/agents/camel.py | 69 +++-- maseval/interface/agents/langgraph.py | 83 ++++++ maseval/interface/agents/llamaindex.py | 30 ++ maseval/interface/agents/smolagents.py | 66 +++-- .../test_camel_integration.py | 173 ++++++++++- .../test_langgraph_integration.py | 136 +++++++++ .../test_llamaindex_integration.py | 207 +++++++++++++ .../test_smolagents_integration.py | 280 ++++++++++++++++-- 9 files changed, 974 insertions(+), 73 deletions(-) diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 97011527..e481843c 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -5,9 +5,10 @@ from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import UsageTrackableMixin -class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin): +class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Wraps an agent from any framework to provide a standard interface. This Adapter provides: diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index 6166d108..b6ccebba 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -19,6 +19,7 @@ from maseval import AgentAdapter, MessageHistory, LLMUser, User from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin +from maseval.core.usage import TokenUsage, Usage __all__ = [ "CamelAgentAdapter", @@ -135,9 +136,12 @@ class CamelAgentAdapter(AgentAdapter): for msg in agent_adapter.get_messages(): print(f"{msg['role']}: {msg['content']}") - # Gather execution traces with token usage and tool calls + # Gather aggregated usage + usage = agent_adapter.gather_usage() + print(f"Total tokens: {usage.total_tokens}") + + # Gather execution traces with tool call counts traces = agent_adapter.gather_traces() - print(f"Total tokens: {traces['total_tokens']}") print(f"Tool calls: {traces['total_tool_calls']}") # Gather configuration 
@@ -383,25 +387,16 @@ def _convert_base_message(self, msg) -> Dict[str, Any]: def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this CAMEL agent. - Extends the base class to include CAMEL-specific execution data - with aggregated statistics from all responses. + Extends the base class to include CAMEL-specific per-step execution + data. Aggregated usage totals are available via ``gather_usage()``. Returns: - Dictionary containing: - - Base traces (type, gathered_at, name, messages, logs, etc.) - - total_steps: Number of step() calls made - - total_input_tokens: Aggregated input tokens - - total_output_tokens: Aggregated output tokens - - total_tokens: Aggregated total tokens - - total_tool_calls: Total number of tool calls made - - last_terminated: Whether the last response indicated termination + Dictionary containing base traces plus step count, tool call count, + and termination status. """ base_traces = super().gather_traces() _check_camel_installed() - # Calculate aggregated statistics from responses - total_input_tokens = 0 - total_output_tokens = 0 total_tool_calls = 0 last_terminated = False @@ -411,24 +406,12 @@ def gather_traces(self) -> Dict[str, Any]: if hasattr(response, "info") and response.info: info = response.info - - # Aggregate token usage - if "usage" in info and isinstance(info["usage"], dict): - usage = info["usage"] - total_input_tokens += usage.get("prompt_tokens", 0) - total_output_tokens += usage.get("completion_tokens", 0) - - # Count tool calls if "tool_calls" in info and info["tool_calls"]: total_tool_calls += len(info["tool_calls"]) - # Add aggregated statistics base_traces.update( { "total_steps": len(self._responses), - "total_input_tokens": total_input_tokens, - "total_output_tokens": total_output_tokens, - "total_tokens": total_input_tokens + total_output_tokens, "total_tool_calls": total_tool_calls, "last_terminated": last_terminated, } @@ -436,6 +419,38 @@ def gather_traces(self) -> Dict[str, Any]: return 
base_traces + def gather_usage(self) -> Usage: + """Gather aggregated token usage across all CAMEL agent responses. + + Walks stored ``ChatAgentResponse`` objects and sums their + ``info["usage"]`` dicts (which contain ``prompt_tokens`` and + ``completion_tokens``). + + Returns: + Aggregated token usage, or empty ``Usage`` if no responses or no usage data. + """ + total_input = 0 + total_output = 0 + has_usage = False + + for response in self._responses: + if hasattr(response, "info") and response.info: + info = response.info + if "usage" in info and isinstance(info["usage"], dict): + usage_dict = info["usage"] + total_input += usage_dict.get("prompt_tokens", 0) + total_output += usage_dict.get("completion_tokens", 0) + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def gather_config(self) -> Dict[str, Any]: """Gather configuration from this CAMEL agent. diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py index 5831c81d..6749e549 100644 --- a/maseval/interface/agents/langgraph.py +++ b/maseval/interface/agents/langgraph.py @@ -9,6 +9,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["LangGraphAgentAdapter", "LangGraphLLMUser"] @@ -213,6 +214,88 @@ def gather_config(self) -> dict[str, Any]: return base_config + def gather_usage(self) -> Usage: + """Gather aggregated token usage from LangGraph message metadata. + + Walks messages from the last graph execution (or persistent state) + and sums their ``usage_metadata``, including detailed token breakdowns + for caching, reasoning, and audio tokens when available. + + Returns: + Aggregated token usage, or empty ``Usage`` if no messages or no usage data. 
+ """ + _check_langgraph_installed() + messages = self._get_usage_messages() + if not messages: + return Usage() + + total_input = 0 + total_output = 0 + cached_input = 0 + cache_creation_input = 0 + reasoning = 0 + audio_input = 0 + audio_output = 0 + has_usage = False + + for msg in messages: + if not (hasattr(msg, "usage_metadata") and msg.usage_metadata): + continue + meta = msg.usage_metadata + has_usage = True + + # Core counts — usage_metadata is a TypedDict (dict-like) + if isinstance(meta, dict): + total_input += meta.get("input_tokens", 0) + total_output += meta.get("output_tokens", 0) + # Detailed breakdowns (optional) + input_details = meta.get("input_token_details", {}) or {} + output_details = meta.get("output_token_details", {}) or {} + else: + total_input += getattr(meta, "input_tokens", 0) + total_output += getattr(meta, "output_tokens", 0) + input_details = getattr(meta, "input_token_details", {}) or {} + output_details = getattr(meta, "output_token_details", {}) or {} + + if isinstance(input_details, dict): + cached_input += input_details.get("cache_read", 0) + cache_creation_input += input_details.get("cache_creation", 0) + audio_input += input_details.get("audio", 0) + if isinstance(output_details, dict): + reasoning += output_details.get("reasoning", 0) + audio_output += output_details.get("audio", 0) + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + cached_input_tokens=cached_input, + cache_creation_input_tokens=cache_creation_input, + reasoning_tokens=reasoning, + audio_tokens=audio_input + audio_output, + ) + + def _get_usage_messages(self) -> list: + """Get messages for usage extraction, preferring persistent state.""" + # Try persistent state first + if self._langgraph_config and hasattr(self.agent, "get_state"): + try: + state = self.agent.get_state(self._langgraph_config) + messages = state.values.get("messages", []) + if 
messages: + return messages + except Exception: + pass + + # Fall back to cached result + if self._last_result and isinstance(self._last_result, dict): + return self._last_result.get("messages", []) + + return [] + def _run_agent(self, query: str) -> Any: _check_langgraph_installed() from langchain_core.messages import HumanMessage diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index facf3794..e0e063d4 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -10,6 +10,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["LlamaIndexAgentAdapter", "LlamaIndexLLMUser"] @@ -215,6 +216,35 @@ def gather_config(self) -> Dict[str, Any]: return base_config + def gather_usage(self) -> Usage: + """Gather aggregated token usage from LlamaIndex execution logs. + + Sums token counts recorded in ``self.logs`` during agent execution. + LlamaIndex does not provide built-in cumulative usage tracking, so + this aggregates per-call usage extracted from LLM responses. + + Returns: + Aggregated token usage, or empty ``Usage`` if no usage data was recorded. + """ + total_input = 0 + total_output = 0 + has_usage = False + + for log_entry in self.logs: + if "input_tokens" in log_entry or "output_tokens" in log_entry: + total_input += log_entry.get("input_tokens", 0) + total_output += log_entry.get("output_tokens", 0) + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def _run_agent(self, query: str) -> Any: """Run the LlamaIndex agent and cache execution state. 
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py index 9e4169ef..4fcfe1db 100644 --- a/maseval/interface/agents/smolagents.py +++ b/maseval/interface/agents/smolagents.py @@ -7,6 +7,7 @@ from typing import TYPE_CHECKING, Any, Dict, List from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["SmolAgentAdapter", "SmolAgentLLMUser"] @@ -67,9 +68,12 @@ class SmolAgentAdapter(AgentAdapter): for msg in agent_adapter.get_messages(): print(f"{msg['role']}: {msg['content']}") - # Gather execution traces with timing and token usage + # Gather aggregated usage + usage = agent_adapter.gather_usage() + print(f"Total tokens: {usage.total_tokens}") + + # Gather execution traces with timing traces = agent_adapter.gather_traces() - print(f"Total tokens: {traces['total_tokens']}") print(f"Total duration: {traces['total_duration_seconds']}s") # Use in benchmark @@ -254,13 +258,12 @@ def logs(self) -> List[Dict[str, Any]]: # type: ignore[override] def gather_traces(self) -> dict: """Gather traces including message history and monitoring data. - Extends the base class to include smolagents' built-in monitoring data: - - Token usage (input, output, total) per step and aggregated - - Timing/duration per step and aggregated - - Step-level details including actions and observations + Extends the base class to include smolagents' per-step monitoring data + (token usage, timing, actions, observations). Aggregated usage totals + are available via ``gather_usage()``. Returns: - Dict containing messages and monitoring statistics + Dict containing messages and per-step monitoring statistics. 
""" base_logs = super().gather_traces() _check_smolagents_installed() @@ -268,17 +271,14 @@ def gather_traces(self) -> dict: # Extract monitoring data from agent's memory steps if hasattr(self.agent, "memory") and hasattr(self.agent.memory, "steps"): steps_stats = [] - total_input_tokens = 0 - total_output_tokens = 0 total_duration = 0.0 - # Import ActionStep for type checking from smolagents.memory import ActionStep, PlanningStep for step in self.agent.memory.steps: # Process ActionStep and PlanningStep (both have token_usage and timing) if isinstance(step, (ActionStep, PlanningStep)): - step_info = { + step_info: Dict[str, Any] = { "step_number": getattr(step, "step_number", None), } @@ -288,13 +288,11 @@ def gather_traces(self) -> dict: if step.timing.duration is not None: total_duration += step.timing.duration - # Add token usage information + # Add per-step token usage if hasattr(step, "token_usage") and step.token_usage: step_info["input_tokens"] = step.token_usage.input_tokens step_info["output_tokens"] = step.token_usage.output_tokens step_info["total_tokens"] = step.token_usage.total_tokens - total_input_tokens += step.token_usage.input_tokens - total_output_tokens += step.token_usage.output_tokens # Add action details for ActionStep if isinstance(step, ActionStep): @@ -312,13 +310,9 @@ def gather_traces(self) -> dict: steps_stats.append(step_info) - # Add aggregated statistics base_logs.update( { "total_steps": len(steps_stats), - "total_input_tokens": total_input_tokens, - "total_output_tokens": total_output_tokens, - "total_tokens": total_input_tokens + total_output_tokens, "total_duration_seconds": total_duration, "steps_detail": steps_stats, } @@ -326,6 +320,42 @@ def gather_traces(self) -> dict: return base_logs + def gather_usage(self) -> Usage: + """Gather aggregated token usage across all agent steps. + + Walks smolagents' memory steps (ActionStep and PlanningStep) and sums + their ``token_usage`` into a single ``TokenUsage``. 
+ + Returns: + Aggregated token usage, or empty ``Usage`` if no steps or no usage data. + """ + _check_smolagents_installed() + + if not (hasattr(self.agent, "memory") and hasattr(self.agent.memory, "steps")): + return Usage() + + from smolagents.memory import ActionStep, PlanningStep + + total_input = 0 + total_output = 0 + has_usage = False + + for step in self.agent.memory.steps: + if isinstance(step, (ActionStep, PlanningStep)): + if hasattr(step, "token_usage") and step.token_usage: + total_input += step.token_usage.input_tokens + total_output += step.token_usage.output_tokens + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def gather_config(self) -> dict[str, Any]: """Gather configuration from this SmolAgent. diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index 46c55620..86be5722 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -84,13 +84,11 @@ def test_camel_adapter_gather_traces_with_response(): traces = adapter.gather_traces() - # New API uses last_terminated and aggregated stats + # New API uses last_terminated and aggregated stats (token totals moved to gather_usage) assert "last_terminated" in traces assert traces["last_terminated"] is True assert "total_steps" in traces assert traces["total_steps"] == 1 - assert "total_tokens" in traces - assert traces["total_tokens"] == 15 def test_camel_adapter_gather_config_basic(): @@ -1079,3 +1077,172 @@ def test_workforce_tracer_truncates_long_content(): assert len(traces["completed_tasks"][0]["content"]) == 200 assert len(traces["completed_tasks"][0]["result"]) == 200 + + +# ============================================================================= +# 
gather_usage() Tests +# ============================================================================= + + +def test_camel_adapter_gather_usage_with_responses(): + """Test that gather_usage() aggregates token usage across CAMEL responses.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + # Simulate responses with usage data + adapter._responses.append( + MockCamelResponse( + content="Response 1", + terminated=False, + info={"usage": {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150}}, + ) + ) + adapter._responses.append( + MockCamelResponse( + content="Response 2", + terminated=True, + info={"usage": {"prompt_tokens": 200, "completion_tokens": 80, "total_tokens": 280}}, + ) + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 300 # 100 + 200 + assert usage.output_tokens == 130 # 50 + 80 + assert usage.total_tokens == 430 + + +def test_camel_adapter_gather_usage_no_responses(): + """Test that gather_usage() returns empty Usage when no responses exist.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_camel_adapter_gather_usage_responses_without_usage(): + """Test that gather_usage() handles responses without usage info.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + # Response without info or usage + 
adapter._responses.append(MockCamelResponse(content="Response", terminated=True, info={})) + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +# ============================================================================= +# End-to-End Usage Collection Tests +# (real ChatAgent + StubModel execution, not pre-populated mock data) +# ============================================================================= + + +def test_e2e_camel_gather_usage_single_step(): + """Run a real CAMEL ChatAgent with StubModel, verify gather_usage() returns real token counts.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + # StubModel returns CompletionUsage(prompt_tokens=10, completion_tokens=10) + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are a helpful assistant.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + result = adapter.run("What is 2+2?") + + assert result == "Lorem Ipsum" + assert len(adapter._responses) == 1 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 10 + assert usage.output_tokens == 10 + assert usage.total_tokens == 20 + + +def test_e2e_camel_gather_usage_multi_step(): + """Run a real CAMEL ChatAgent multiple times, verify usage aggregation.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + adapter.run("First 
query") + adapter.run("Second query") + adapter.run("Third query") + + assert len(adapter._responses) == 3 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 30 # 10 * 3 + assert usage.output_tokens == 30 # 10 * 3 + assert usage.total_tokens == 60 # 20 * 3 + + +def test_e2e_camel_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the StubModel pipeline + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens > 0 + assert usage_after.output_tokens > 0 + + +def test_e2e_camel_logs_contain_usage(): + """Verify adapter.logs also contain usage data from real execution.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + adapter.run("Tell me something") + + assert len(adapter.logs) == 1 + assert adapter.logs[0]["status"] == "success" + assert adapter.logs[0]["input_tokens"] == 10 + assert adapter.logs[0]["output_tokens"] == 10 + 
assert adapter.logs[0]["total_tokens"] == 20 diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index c972a464..b39c4791 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -240,3 +240,139 @@ def agent_node(state: State) -> State: assert log_entry.get("input_tokens") in [None, 0] assert log_entry.get("output_tokens") in [None, 0] assert log_entry.get("total_tokens") in [None, 0] + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_langgraph_adapter_gather_usage_with_metadata(): + """Test that gather_usage() extracts token usage from message metadata.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages = state["messages"] + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata( + input_tokens=150, + output_tokens=60, + total_tokens=210, + ), + ) + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 150 + assert 
usage.output_tokens == 60 + assert usage.total_tokens == 210 + + +def test_langgraph_adapter_gather_usage_with_token_details(): + """Test that gather_usage() extracts detailed token breakdowns.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages = state["messages"] + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata( + input_tokens=200, + output_tokens=100, + total_tokens=300, + input_token_details={"cache_read": 50, "cache_creation": 30}, + output_token_details={"reasoning": 40}, + ), + ) + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 200 + assert usage.output_tokens == 100 + assert usage.cached_input_tokens == 50 + assert usage.cache_creation_input_tokens == 30 + assert usage.reasoning_tokens == 40 + + +def test_langgraph_adapter_gather_usage_no_metadata(): + """Test that gather_usage() returns empty Usage when no usage_metadata.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import Usage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages 
= state["messages"] + response = AIMessage(content="Response") # No usage_metadata + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_langgraph_adapter_gather_usage_before_run(): + """Test that gather_usage() returns empty Usage before any run.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LangGraphAgentAdapter(agent_instance=Mock(), name="test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index 52779f65..49998292 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -393,3 +393,210 @@ def test_llamaindex_adapter_error_logging(): assert log_entry["error"] == "Test error" assert log_entry["error_type"] == "ValueError" assert "duration_seconds" in log_entry + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_llamaindex_adapter_gather_usage_with_logs(): + """Test that gather_usage() aggregates token usage from execution logs.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from 
unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + # Simulate logs with token usage (as populated by _run_agent) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:00", + "query": "Query 1", + "status": "success", + "duration_seconds": 1.0, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + } + ) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:01", + "query": "Query 2", + "status": "success", + "duration_seconds": 0.5, + "input_tokens": 200, + "output_tokens": 80, + "total_tokens": 280, + } + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 300 # 100 + 200 + assert usage.output_tokens == 130 # 50 + 80 + assert usage.total_tokens == 430 + + +def test_llamaindex_adapter_gather_usage_no_logs(): + """Test that gather_usage() returns empty Usage with no logs.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_llamaindex_adapter_gather_usage_logs_without_tokens(): + """Test that gather_usage() returns empty Usage when logs have no token fields.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + # Log without token fields (error case, or model didn't report usage) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:00", + "query": "Query", + "status": "error", + "duration_seconds": 0.1, + "error": "Something went wrong", + } + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +# 
============================================================================= +# End-to-End Usage Collection Tests +# (real ReActAgent + AgentWorkflow execution, not pre-populated mock data) +# ============================================================================= + + +def _make_llamaindex_adapter(prompt_tokens_per_call: int = 10, completion_tokens_per_call: int = 20): + """Create adapter wrapping a ReActAgent + AgentWorkflow with a mock LLM that reports usage. + + The mock LLM returns CompletionResponse with token usage in raw, using + ReAct format ("Thought: ... Answer: ...") so the output parser works. + """ + from types import SimpleNamespace + from typing import Any + from llama_index.core.base.llms.types import CompletionResponse, CompletionResponseGen + from llama_index.core.llms.custom import CustomLLM + from llama_index.core.llms import LLMMetadata + from llama_index.core.agent.workflow.react_agent import ReActAgent + from llama_index.core.agent.workflow import AgentWorkflow + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + + pt = prompt_tokens_per_call + ct = completion_tokens_per_call + + class _MockLLMWithUsage(CustomLLM): + prompt_tokens: int = pt + completion_tokens: int = ct + + @property + def metadata(self) -> LLMMetadata: + return LLMMetadata(num_output=256) + + def complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponse: + usage = SimpleNamespace( + prompt_tokens=self.prompt_tokens, + completion_tokens=self.completion_tokens, + total_tokens=self.prompt_tokens + self.completion_tokens, + ) + return CompletionResponse( + text="Thought: I can answer directly.\nAnswer: Mock answer.", + raw=SimpleNamespace(usage=usage), + ) + + def stream_complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponseGen: + raise NotImplementedError + + llm = _MockLLMWithUsage() + agent = ReActAgent(name="test_agent", description="test", llm=llm, tools=[], streaming=False) + 
workflow = AgentWorkflow(agents=[agent], root_agent="test_agent") + return LlamaIndexAgentAdapter(workflow, "test_agent") + + +def test_e2e_llamaindex_gather_usage_single_run(): + """Run a real ReActAgent → adapter.run() → gather_usage() returns real token counts.""" + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=10, completion_tokens_per_call=20) + + result = adapter.run("Hello?") + assert isinstance(result, str) + assert len(result) > 0 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 10 + assert usage.output_tokens == 20 + assert usage.total_tokens == 30 + + +def test_e2e_llamaindex_gather_usage_accumulates(): + """Multiple adapter.run() calls accumulate in gather_usage().""" + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=15, completion_tokens_per_call=25) + + adapter.run("First query") + adapter.run("Second query") + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 30 # 15 + 15 + assert usage.output_tokens == 50 # 25 + 25 + assert usage.total_tokens == 80 + + +def test_e2e_llamaindex_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=50, completion_tokens_per_call=100) + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the LLM + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens == 50 + assert usage_after.output_tokens == 100 + assert 
usage_after.total_tokens == 150 + + +def test_e2e_llamaindex_logs_populated_by_real_execution(): + """Verify adapter.logs is populated by _run_agent, not manually.""" + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=50, completion_tokens_per_call=100) + + assert len(adapter.logs) == 0 + + adapter.run("Test query") + + assert len(adapter.logs) == 1 + log = adapter.logs[0] + assert log["status"] == "success" + assert log["input_tokens"] == 50 + assert log["output_tokens"] == 100 + assert log["total_tokens"] == 150 + assert "timestamp" in log + assert "duration_seconds" in log diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index d3c40400..cc605fa0 100644 --- a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -113,19 +113,10 @@ def test_smolagents_adapter_gather_traces_with_monitoring(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics + # Verify aggregated statistics (token totals moved to gather_usage) assert "total_steps" in traces assert traces["total_steps"] == 2 - assert "total_input_tokens" in traces - assert traces["total_input_tokens"] == 300 # 100 + 200 - - assert "total_output_tokens" in traces - assert traces["total_output_tokens"] == 150 # 50 + 100 - - assert "total_tokens" in traces - assert traces["total_tokens"] == 450 # 300 + 150 - assert "total_duration_seconds" in traces assert traces["total_duration_seconds"] == pytest.approx(1.2, abs=0.01) # 0.5 + 0.7 @@ -171,19 +162,10 @@ def test_smolagents_adapter_gather_traces_without_monitoring(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics show zero usage + # Verify aggregated statistics (token totals moved to gather_usage) assert "total_steps" in traces assert traces["total_steps"] 
== 0 - assert "total_input_tokens" in traces - assert traces["total_input_tokens"] == 0 - - assert "total_output_tokens" in traces - assert traces["total_output_tokens"] == 0 - - assert "total_tokens" in traces - assert traces["total_tokens"] == 0 - assert "total_duration_seconds" in traces assert traces["total_duration_seconds"] == 0.0 @@ -224,11 +206,8 @@ def test_smolagents_adapter_gather_traces_with_planning_step(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics + # Verify aggregated statistics (token totals moved to gather_usage) assert traces["total_steps"] == 1 - assert traces["total_input_tokens"] == 500 - assert traces["total_output_tokens"] == 200 - assert traces["total_tokens"] == 700 assert traces["total_duration_seconds"] == pytest.approx(1.0, abs=0.01) # Verify step details @@ -478,3 +457,256 @@ def test_smolagents_adapter_logs_empty_when_no_steps(): # Should be empty assert isinstance(logs, list) assert len(logs) == 0 + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_smolagents_adapter_gather_usage_with_steps(): + """Test that gather_usage() aggregates token usage across all memory steps.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from smolagents.memory import ActionStep, PlanningStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from smolagents.models import ChatMessage, MessageRole + from unittest.mock import Mock + import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + + start = time.time() + + # ActionStep with usage + step1 = ActionStep( + step_number=1, + timing=Timing(start_time=start, end_time=start + 0.5), + observations_images=[], + ) + step1.token_usage = TokenUsage(input_tokens=100, 
output_tokens=50)
+    mock_agent.memory.steps.append(step1)
+
+    # PlanningStep with usage
+    step2 = PlanningStep(
+        timing=Timing(start_time=start + 0.5, end_time=start + 1.0),
+        model_input_messages=[],
+        model_output_message=ChatMessage(role=MessageRole.ASSISTANT, content="Plan"),
+        plan="My plan",
+    )
+    step2.token_usage = TokenUsage(input_tokens=200, output_tokens=80)
+    mock_agent.memory.steps.append(step2)
+
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+    usage = adapter.gather_usage()
+
+    assert isinstance(usage, MasevalTokenUsage)
+    assert usage.input_tokens == 300  # 100 + 200
+    assert usage.output_tokens == 130  # 50 + 80
+    assert usage.total_tokens == 430
+
+
+def test_smolagents_adapter_gather_usage_no_steps():
+    """Test that gather_usage() returns empty Usage when no steps exist."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import Usage
+    from smolagents.memory import AgentMemory
+    from unittest.mock import Mock
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+    usage = adapter.gather_usage()
+
+    assert isinstance(usage, Usage)
+    assert usage.cost is None
+    # Plain Usage carries no token fields; if any are present they must be zero
+    assert getattr(usage, "input_tokens", 0) == 0
+
+
+def test_smolagents_adapter_gather_usage_steps_without_token_usage():
+    """Test that gather_usage() returns empty Usage when steps have no token_usage."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import Usage
+    from smolagents.memory import ActionStep, AgentMemory
+    from smolagents.monitoring import Timing
+    from unittest.mock import Mock
+    import time
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+
+    start = time.time()
+    step = 
ActionStep(
+ step_number=1,
+ timing=Timing(start_time=start, end_time=start + 0.5),
+ observations_images=[],
+ )
+ # No token_usage set — defaults to None
+ mock_agent.memory.steps.append(step)
+ mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+ adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+ usage = adapter.gather_usage()
+
+ # Should return plain Usage (not TokenUsage) since no usage data
+ assert isinstance(usage, Usage)
+ assert usage.cost is None
+
+
+# =============================================================================
+# End-to-End Usage Collection Tests
+# (real ToolCallingAgent execution, not pre-populated mock data)
+# =============================================================================
+
+
+class _FakeModelForUsageTest:
+ """Deterministic fake model that returns canned responses with token usage.
+
+ Not a real smolagents Model subclass; duck-types the same generate() signature.
+ Uses a list of responses; cycles the last one if calls exceed the list. 
+ """ + + def __init__(self, responses=None): + from smolagents.models import ( + ChatMessage, + ChatMessageToolCall, + ChatMessageToolCallFunction, + MessageRole, + ) + from smolagents.monitoring import TokenUsage + + self.model_id = "fake-model-for-test" + self._call_count = 0 + self._responses = responses or [ + ChatMessage( + role=MessageRole.ASSISTANT, + content="Here is the answer.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction( + name="final_answer", + arguments={"answer": "42"}, + ), + id="call_001", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=150, output_tokens=30), + ) + ] + + def generate(self, messages, stop_sequences=None, response_format=None, tools_to_call_from=None, **kwargs): + idx = min(self._call_count, len(self._responses) - 1) + self._call_count += 1 + return self._responses[idx] + + +def test_e2e_smolagents_gather_usage_single_step(): + """Run a real ToolCallingAgent → adapter.run() → gather_usage() returns real token counts.""" + from smolagents import ToolCallingAgent + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + agent = ToolCallingAgent(tools=[], model=_FakeModelForUsageTest(), max_steps=3, verbosity_level=0) + adapter = SmolAgentAdapter(agent_instance=agent, name="test_agent") + + result = adapter.run("What is the meaning of life?") + assert result == "42" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 150 + assert usage.output_tokens == 30 + assert usage.total_tokens == 180 + + +def test_e2e_smolagents_gather_usage_multi_step(): + """Run a real agent through tool call + final answer, verify usage aggregation.""" + from smolagents import ToolCallingAgent + from smolagents.models import ( + ChatMessage, + ChatMessageToolCall, + ChatMessageToolCallFunction, + MessageRole, + ) + from smolagents.monitoring import TokenUsage + from 
smolagents.tools import Tool + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + class AddTool(Tool): + name = "add_numbers" + description = "Adds two numbers" + inputs = {"a": {"type": "number", "description": "First"}, "b": {"type": "number", "description": "Second"}} + output_type = "number" + + def forward(self, a, b): + return a + b + + responses = [ + ChatMessage( + role=MessageRole.ASSISTANT, + content="Let me add.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction(name="add_numbers", arguments={"a": 20, "b": 22}), + id="call_001", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=200, output_tokens=40), + ), + ChatMessage( + role=MessageRole.ASSISTANT, + content="The sum is 42.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction(name="final_answer", arguments={"answer": "42"}), + id="call_002", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=350, output_tokens=20), + ), + ] + + agent = ToolCallingAgent(tools=[AddTool()], model=_FakeModelForUsageTest(responses), max_steps=5, verbosity_level=0) + adapter = SmolAgentAdapter(agent_instance=agent, name="test_agent") + + result = adapter.run("What is 20 + 22?") + assert result == "42" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 550 # 200 + 350 + assert usage.output_tokens == 60 # 40 + 20 + assert usage.total_tokens == 610 + + +def test_e2e_smolagents_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from smolagents import ToolCallingAgent + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + agent = ToolCallingAgent(tools=[], model=_FakeModelForUsageTest(), max_steps=3, verbosity_level=0) + adapter = 
SmolAgentAdapter(agent_instance=agent, name="test_agent") + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the model + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens > 0 + assert usage_after.output_tokens > 0 From 8b9a374f5666c524135dc11df3d21342de600354 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 20:37:32 +0100 Subject: [PATCH 11/19] updated usage tracking guide --- docs/guides/usage-tracking.md | 260 +++++++++++++++------------------- 1 file changed, 112 insertions(+), 148 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index 6b249a05..fba1915e 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -2,12 +2,7 @@ ## Overview -MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for: - -- **Cost control**: Track how much each benchmark run costs across providers -- **Budgeting**: Compare cost across models, tasks, and components -- **Billing**: Support custom credit systems (university clusters, internal APIs) -- **Analysis**: Understand token usage patterns per task, agent, or model +MASEval tracks how much each benchmark run consumes — tokens, API calls, dollars — so you can compare models, stay within budget, and explain where money went. !!! info "Usage vs Cost" @@ -17,39 +12,52 @@ MASEval provides first-class usage and cost tracking to monitor resource consump Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator. -## Core Concepts - -**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata. 
+## What Gets Tracked Automatically -**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.). +**Model adapters** track every `chat()` call — input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`. +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures — memory steps, response metadata, message annotations, or execution logs respectively. -**`CostCalculator`**: Protocol for pluggable cost computation from token counts. +**Benchmarks** collect usage from all registered components after each task and include it in reports. -## Automatic LLM Usage Tracking +## Getting Started -All `ModelAdapter` subclasses track token usage automatically. No configuration needed — every `chat()` call records a `TokenUsage` entry internally. +### Reading Model Usage ```python from maseval.interface.inference import OpenAIModelAdapter model = OpenAIModelAdapter(client=client, model_id="gpt-4") -# Make some calls model.chat([{"role": "user", "content": "Hello"}]) model.chat([{"role": "user", "content": "How are you?"}]) -# Inspect accumulated usage +# Accumulated usage across both calls usage = model.gather_usage() -print(usage.input_tokens) # e.g., 25 -print(usage.output_tokens) # e.g., 42 -print(usage.cost) # None (no cost calculator configured) +print(f"{usage.input_tokens} in, {usage.output_tokens} out") +print(f"Cost: ${usage.cost}") # None if no cost calculator configured +``` + +### Reading Agent Usage + +Agent adapters expose the same `gather_usage()` interface. 
Each adapter knows how to extract usage from its framework's internals: + +```python +from maseval.interface.agents import SmolAgentAdapter + +adapter = SmolAgentAdapter(agent, name="researcher") +adapter.run("What's the capital of France?") + +# Usage is aggregated from the agent's memory steps +usage = adapter.gather_usage() +print(f"{usage.input_tokens} in, {usage.output_tokens} out") ``` +This works across all supported frameworks — smolagents, CAMEL, LangGraph, and LlamaIndex. The adapter handles the framework-specific extraction; you always call `gather_usage()`. + ### In Benchmarks -Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key: +Usage is collected automatically alongside traces and configs after each task. Each report includes a `"usage"` key: ```python results = benchmark.run() @@ -65,29 +73,18 @@ benchmark.usage # -> Usage (grand total across all tasks) benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) ``` -## Cost Calculation - -Most LLM APIs return token counts but not cost. Cost is a client-side concern. MASEval provides two built-in cost calculators and a protocol for custom ones. - -### Cost Priority - -When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order: +## Adding Cost Tracking -1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. -2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. -3. **None** — if neither source provides cost, `Usage.cost` stays `None`. +Most LLM APIs return token counts but not cost. MASEval provides two built-in cost calculators. -### StaticPricingCalculator +### Quick Start: LiteLLM Pricing -Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`. 
+The easiest path — uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: ```python -from maseval import StaticPricingCalculator +from maseval.interface.usage import LiteLLMCostCalculator -calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, -}) +calculator = LiteLLMCostCalculator() model = OpenAIModelAdapter( client=client, @@ -96,78 +93,90 @@ model = OpenAIModelAdapter( ) response = model.chat([{"role": "user", "content": "Hello"}]) -print(model.gather_usage().cost) # e.g., 0.00234 +print(f"Cost: ${model.gather_usage().cost:.4f}") ``` -Pricing is per token (not per 1K or 1M). Cached input tokens are handled automatically — set a `"cached_input"` rate to differentiate: +!!! tip "LiteLLMModelAdapter already reports cost" + + If you're using `LiteLLMModelAdapter`, it extracts provider-reported cost automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
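Whichever calculator you attach, the fallback behavior is the same: provider-reported cost wins, then the calculator, and otherwise cost stays unknown. The following stdlib-only sketch illustrates that precedence; the names `resolve_cost`, `PerTokenCalculator`, and `Tokens` are invented for this example and are not MASEval APIs:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tokens:
    """Minimal stand-in for a TokenUsage record."""
    input_tokens: int = 0
    output_tokens: int = 0


class PerTokenCalculator:
    """Toy calculator: per-token rates keyed by model id (illustrative only)."""

    def __init__(self, pricing: Dict[str, Dict[str, float]]):
        self.pricing = pricing

    def calculate_cost(self, usage: Tokens, model_id: str) -> Optional[float]:
        rate = self.pricing.get(model_id)
        if rate is None:
            return None  # no pricing known for this model
        return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens


def resolve_cost(provider_cost, calculator, usage, model_id):
    # 1. Provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise fall back to a configured calculator.
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)
    # 3. Neither source available: cost stays unknown.
    return None


calc = PerTokenCalculator({"gpt-4": {"input": 0.00003, "output": 0.00006}})
usage = Tokens(input_tokens=100, output_tokens=50)
print(resolve_cost(0.0123, calc, usage, "gpt-4"))  # provider cost wins
print(resolve_cost(None, calc, usage, "gpt-4"))    # calculator: 0.003 + 0.003
print(resolve_cost(None, None, usage, "gpt-4"))    # None
```

Note that the calculator returns `None` for unknown models, which propagates cleanly through the fallback chain.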
+ +If your model ID doesn't match LiteLLM's naming (e.g., using Google's OpenAI-compatible endpoint), remap it: ```python -calculator = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.000003, - "output": 0.000015, - "cached_input": 0.0000003, # 10x cheaper for cached tokens - }, +calculator = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", }) ``` -For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents: +You can also override pricing for specific models while using LiteLLM's database for the rest: ```python -calculator = StaticPricingCalculator({ - "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token +calculator = LiteLLMCostCalculator(custom_pricing={ + "my-finetuned-gpt4": { + "input_cost_per_token": 0.00006, + "output_cost_per_token": 0.00012, + }, }) ``` -### LiteLLMCostCalculator +### Manual Pricing -Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more. +When you know your rates, use `StaticPricingCalculator` — zero dependencies, fully explicit: ```python -from maseval.interface.usage import LiteLLMCostCalculator - -calculator = LiteLLMCostCalculator() +from maseval import StaticPricingCalculator -model = OpenAIModelAdapter( - client=client, - model_id="gpt-4", - cost_calculator=calculator, -) +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, +}) ``` -!!! tip "LiteLLMModelAdapter already reports cost" - - If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
+Pricing is **per token** (not per 1K or 1M). For cached tokens, add a `"cached_input"` rate: -#### Custom Pricing Overrides +```python +calculator = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.000003, + "output": 0.000015, + "cached_input": 0.0000003, # 10x cheaper + }, +}) +``` -Override pricing for specific models while using LiteLLM's database for the rest: +The cost unit is whatever your pricing represents — USD, EUR, university credits: ```python -calculator = LiteLLMCostCalculator(custom_pricing={ - "my-finetuned-gpt4": { - "input_cost_per_token": 0.00006, - "output_cost_per_token": 0.00012, - }, +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token }) ``` -#### Model ID Remapping +### Sharing a Calculator Across Models -When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap: +A single calculator instance works for multiple model adapters — the `model_id` is passed on each cost computation: ```python -calculator = LiteLLMCostCalculator(model_id_map={ - "gemini-2.0-flash": "gemini/gemini-2.0-flash", - "my-custom-gpt4": "gpt-4", +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, }) + +model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) +model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator) ``` -The map is applied before both custom pricing and LiteLLM lookup. +### How Cost Is Resolved + +When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: + +1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. 
**CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `usage.cost` stays `None`. -### Custom Cost Calculator +### Writing a Custom Calculator -Implement the `CostCalculator` protocol for custom pricing logic: +Implement the `CostCalculator` protocol — a single method: ```python from maseval import CostCalculator, TokenUsage @@ -181,29 +190,40 @@ class MyCostCalculator: return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens ``` -The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model. +Return `None` if you don't have pricing for the given model. -### Sharing Calculators Across Adapters +## Post-hoc Analysis -A single calculator instance can be shared across multiple model adapters. The `model_id` is passed on each call, so the calculator can look up the right pricing: +After a benchmark completes, `UsageReporter` lets you slice usage by task, component, or both: ```python -calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, -}) +from maseval import UsageReporter -model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) -model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator) +reporter = UsageReporter.from_reports(benchmark.reports) + +# Grand total +total = reporter.total() +print(f"Total cost: ${total.cost:.4f}") +print(f"Total tokens: {total.input_tokens + total.output_tokens}") + +# Where did the money go? +for component, usage in reporter.by_component().items(): + print(f" {component}: ${usage.cost:.4f}") + +# Which tasks were expensive? 
+for task_id, usage in reporter.by_task().items(): + print(f" {task_id}: ${usage.cost:.4f}") + +# Full nested summary dict +summary = reporter.summary() ``` -## Non-LLM Usage Tracking +## Tracking Non-LLM Resources -Tools, environments, and other components can track usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`: +Tools, environments, and other components can track arbitrary usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`. Here's an example for a paid API: ```python from maseval import Usage, UsageTrackableMixin -from maseval.core.tracing import TraceableMixin class BloombergEnvironment(Environment, UsageTrackableMixin): def __init__(self, task_data): @@ -226,67 +246,11 @@ class BloombergEnvironment(Environment, UsageTrackableMixin): return sum(self._usage_records, Usage()) ``` -Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model. - -## Post-hoc Analysis with UsageReporter - -`UsageReporter` provides sliced analysis across all benchmark reports: - -```python -from maseval import UsageReporter - -reporter = UsageReporter.from_reports(benchmark.reports) - -# Grand total -total = reporter.total() -print(f"Total cost: ${total.cost:.4f}") -print(f"Total tokens: {total.input_tokens + total.output_tokens}") - -# Per-task breakdown -for task_id, usage in reporter.by_task().items(): - print(f" {task_id}: ${usage.cost:.4f}") - -# Per-component breakdown -for component, usage in reporter.by_component().items(): - print(f" {component}: ${usage.cost:.4f}") - -# Full nested summary dict -summary = reporter.summary() -``` - -## Usage Data Model - -### Usage - -Generic record for any billable resource: - -| Field | Type | Description | -|-------|------|-------------| -| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. | -| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). 
| -| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). | -| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). | -| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). | -| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). | - -`Usage` supports addition: costs sum (both known) or become `None` (either unknown), units sum, grouping fields are preserved on match or set to `None` on mismatch. - -### TokenUsage - -Extends `Usage` with LLM-specific token counts: - -| Field | Type | Description | -|-------|------|-------------| -| `input_tokens` | `int` | Input/prompt tokens. | -| `output_tokens` | `int` | Output/completion tokens. | -| `total_tokens` | `int` | Total tokens. | -| `cached_input_tokens` | `int` | Tokens served from cache. | -| `reasoning_tokens` | `int` | Reasoning/thinking tokens. | -| `audio_tokens` | `int` | Audio processing tokens. | +Non-LLM components set cost directly — there is no calculator involvement. Each component knows its own billing model. ## Evaluator Usage -Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically: +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register it in the benchmark and its usage is collected separately from agent usage: ```python class MyBenchmark(Benchmark): @@ -296,11 +260,11 @@ class MyBenchmark(Benchmark): return [MyLLMEvaluator(judge_model)] ``` -The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage. +The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, so you can distinguish evaluation cost from agent cost. ## Tips -**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates. 
+**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` when you know your rates. **For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming. From 276ee662449e63d59d080fa961d21869ef4977c0 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:23:06 +0100 Subject: [PATCH 12/19] fixed tests and usage tracking --- docs/guides/usage-tracking.md | 86 ++++++++++++++++--- maseval/core/model.py | 13 +-- maseval/core/registry.py | 7 +- maseval/core/usage.py | 59 ++++++++----- tests/test_core/test_usage.py | 25 ++++-- .../test_camel_integration.py | 4 +- .../test_langgraph_integration.py | 4 +- .../test_llamaindex_integration.py | 4 +- .../test_smolagents_integration.py | 4 +- 9 files changed, 144 insertions(+), 62 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index fba1915e..b59005c4 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -2,7 +2,7 @@ ## Overview -MASEval tracks how much each benchmark run consumes — tokens, API calls, dollars — so you can compare models, stay within budget, and explain where money went. +MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) so you can compare models, stay within budget, and explain where money went. !!! info "Usage vs Cost" @@ -14,9 +14,9 @@ MASEval tracks how much each benchmark run consumes — tokens, API calls, dolla ## What Gets Tracked Automatically -**Model adapters** track every `chat()` call — input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. +**Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. 
Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures — memory steps, response metadata, message annotations, or execution logs respectively. +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -35,7 +35,7 @@ model.chat([{"role": "user", "content": "How are you?"}]) # Accumulated usage across both calls usage = model.gather_usage() print(f"{usage.input_tokens} in, {usage.output_tokens} out") -print(f"Cost: ${usage.cost}") # None if no cost calculator configured +print(f"Cost: ${usage.cost}") # $0.0 if no cost calculator configured ``` ### Reading Agent Usage @@ -53,7 +53,7 @@ usage = adapter.gather_usage() print(f"{usage.input_tokens} in, {usage.output_tokens} out") ``` -This works across all supported frameworks — smolagents, CAMEL, LangGraph, and LlamaIndex. The adapter handles the framework-specific extraction; you always call `gather_usage()`. +This works across all supported frameworks (smolagents, CAMEL, LangGraph, and LlamaIndex). The adapter handles the framework-specific extraction; you always call `gather_usage()`. ### In Benchmarks @@ -79,7 +79,7 @@ Most LLM APIs return token counts but not cost. MASEval provides two built-in co ### Quick Start: LiteLLM Pricing -The easiest path — uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: +The easiest path. 
Uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: ```python from maseval.interface.usage import LiteLLMCostCalculator @@ -121,7 +121,7 @@ calculator = LiteLLMCostCalculator(custom_pricing={ ### Manual Pricing -When you know your rates, use `StaticPricingCalculator` — zero dependencies, fully explicit: +When you know your rates, use `StaticPricingCalculator`. Zero dependencies, fully explicit: ```python from maseval import StaticPricingCalculator @@ -144,7 +144,7 @@ calculator = StaticPricingCalculator({ }) ``` -The cost unit is whatever your pricing represents — USD, EUR, university credits: +The cost unit is whatever your pricing represents (USD, EUR, university credits): ```python calculator = StaticPricingCalculator({ @@ -154,7 +154,7 @@ calculator = StaticPricingCalculator({ ### Sharing a Calculator Across Models -A single calculator instance works for multiple model adapters — the `model_id` is passed on each cost computation: +A single calculator instance works for multiple model adapters. The `model_id` is passed on each cost computation: ```python calculator = StaticPricingCalculator({ @@ -170,13 +170,13 @@ model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cos When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: -1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. -2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. -3. **None** — if neither source provides cost, `usage.cost` stays `None`. +1. **Provider-reported cost**: e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. **CostCalculator**: if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. 
**Zero**: if neither source provides cost, `usage.cost` stays `0.0`. ### Writing a Custom Calculator -Implement the `CostCalculator` protocol — a single method: +Implement the `CostCalculator` protocol (a single method): ```python from maseval import CostCalculator, TokenUsage @@ -218,6 +218,64 @@ for task_id, usage in reporter.by_task().items(): summary = reporter.summary() ``` +## How Usage Addition Works + +`Usage` records can be added together with `+` or `sum()`. Understanding how fields combine helps you interpret aggregated results. + +### Cost + +`cost` defaults to `0.0`. Addition is straightforward numeric addition: + +```python +from maseval import Usage + +a = Usage(cost=0.05) +b = Usage(cost=0.03) +a + b # cost=0.08 + +# Components without cost tracking default to 0.0, so they don't affect the total +agent_usage = Usage() # cost=0.0 (default) +model_usage = Usage(cost=0.12) +agent_usage + model_usage # cost=0.12 + +# sum() works with Usage() as the starting value +records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)] +sum(records, Usage()) # cost=0.35 +``` + +### Units + +`units` dicts are merged by key. Matching keys are summed, new keys are added: + +```python +a = Usage(units={"api_calls": 3, "data_points": 100}) +b = Usage(units={"api_calls": 2, "images": 5}) +total = a + b +# total.units == {"api_calls": 5, "data_points": 100, "images": 5} +``` + +### Grouping Fields + +`provider`, `category`, `component_name`, and `kind` track where a record came from. 
When two records are added: + +- **Same value** → preserved +- **Different values** → becomes `None` (meaning "aggregated across multiple") + +```python +a = Usage(cost=0.05, provider="openai", kind="llm") +b = Usage(cost=0.03, provider="openai", kind="llm") +total = a + b +# total.provider == "openai" (both match) +# total.kind == "llm" (both match) + +c = Usage(cost=0.10, provider="anthropic", kind="llm") +mixed = a + c +# mixed.provider is None (openai ≠ anthropic → aggregated over) +# mixed.kind == "llm" (both match) +``` + +This lets you tell at a glance whether a summed record came from one source or many. + ## Tracking Non-LLM Resources Tools, environments, and other components can track arbitrary usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`. Here's an example for a paid API: @@ -246,7 +304,7 @@ class BloombergEnvironment(Environment, UsageTrackableMixin): return sum(self._usage_records, Usage()) ``` -Non-LLM components set cost directly — there is no calculator involvement. Each component knows its own billing model. +Non-LLM components set cost directly. There is no calculator involvement; each component knows its own billing model. 
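The `sum(self._usage_records, Usage())` pattern above works because of the addition rules described earlier. Here is a standalone sketch of those rules — a deliberately simplified reimplementation for illustration (MASEval's actual `Usage` has more fields and handles tokens as well):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Usage:
    """Simplified Usage record: costs sum, units merge, grouping collapses."""
    cost: float = 0.0
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None

    def __add__(self, other: "Usage") -> "Usage":
        # Units merge by key: matching keys sum, new keys are added.
        merged = dict(self.units)
        for key, value in other.units.items():
            merged[key] = merged.get(key, 0) + value
        # Grouping fields survive only when both sides agree.
        provider = self.provider if self.provider == other.provider else None
        # Costs add numerically; the 0.0 default keeps Usage() harmless as a term.
        return Usage(cost=self.cost + other.cost, units=merged, provider=provider)


a = Usage(cost=0.05, units={"api_calls": 3}, provider="openai")
b = Usage(cost=0.03, units={"api_calls": 2, "bytes": 1024}, provider="openai")
c = Usage(cost=0.10, provider="anthropic")

print((a + b).units)     # {'api_calls': 5, 'bytes': 1024}
print((a + b).provider)  # openai
print((a + c).provider)  # None (mixed providers)
```

One simplification to be aware of: this sketch collapses `provider` to `None` whenever the two sides differ, including when one side is an empty `Usage()` start value.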
## Evaluator Usage diff --git a/maseval/core/model.py b/maseval/core/model.py index e8bb0a3f..86399d7f 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -308,12 +308,15 @@ def chat( # Record token usage if available if result.usage: - cost = result.usage.get("cost") if isinstance(result.usage.get("cost"), (int, float)) else None + raw_cost = result.usage.get("cost") + cost = raw_cost if isinstance(raw_cost, (int, float)) else 0.0 token_usage = TokenUsage.from_chat_response_usage(result.usage, cost=cost, kind="llm") # If no provider-reported cost, try the cost calculator - if token_usage.cost is None and self._cost_calculator is not None: - token_usage.cost = self._cost_calculator.calculate_cost(token_usage, self.model_id) + if token_usage.cost == 0.0 and self._cost_calculator is not None: + calculated = self._cost_calculator.calculate_cost(token_usage, self.model_id) + if calculated is not None: + token_usage.cost = calculated self._usage_records.append(token_usage) @@ -398,10 +401,10 @@ def gather_usage(self) -> Usage: """Gather accumulated token usage from all chat calls. Returns: - Summed TokenUsage across all calls, or empty Usage if no calls were made. + Summed TokenUsage across all calls, or empty TokenUsage if no calls were made. """ if not self._usage_records: - return Usage() + return TokenUsage() result = self._usage_records[0] for record in self._usage_records[1:]: result = result + record diff --git a/maseval/core/registry.py b/maseval/core/registry.py index cf2193e8..d5e3813c 100644 --- a/maseval/core/registry.py +++ b/maseval/core/registry.py @@ -320,13 +320,8 @@ def collect_usage(self) -> Dict[str, Any]: usage[category][comp_name] = usage_dict # Accumulate into persistent aggregates (thread-safe). - # _usage_total starts as Usage(cost=None); adding to it would - # poison the cost (None + X = None). Assign directly on first use. 
with self._usage_lock: - if self._usage_total.cost is None and not self._usage_total.units: - self._usage_total = component_usage - else: - self._usage_total = self._usage_total + component_usage + self._usage_total = self._usage_total + component_usage if key in self._usage_by_component: self._usage_by_component[key] = self._usage_by_component[key] + component_usage else: diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 2a53931c..651e4d6d 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -12,10 +12,13 @@ inherit `UsageTrackableMixin` have their usage automatically collected by the registry via `gather_usage()`. -Cost calculators are optional — if no calculator is provided to a -``ModelAdapter``, cost is only set when the provider reports it directly -(e.g., LiteLLM's ``response._hidden_params.response_cost``). For automatic -pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. +``Usage.cost`` defaults to ``0.0``, so ``Usage()`` works as a starting value +for accumulation (e.g., ``sum(records, Usage())``). Cost calculators are +optional — if no calculator is provided to a ``ModelAdapter``, cost stays +at ``0.0`` unless the provider reports it directly (e.g., LiteLLM's +``response._hidden_params.response_cost``). +For automatic pricing via LiteLLM's bundled model database, see +``maseval.interface.usage``. """ from __future__ import annotations @@ -29,13 +32,26 @@ class Usage: """Generic usage record for any billable resource. Represents accumulated cost and countable units for a component or - aggregated group. Grouping fields (`provider`, `category`, - `component_name`, `kind`) identify what scope the record covers. - When two records are summed, matching grouping fields are preserved; - mismatches become `None` (meaning "aggregated over"). + aggregated group. All fields default to zero, so ``Usage()`` can be + used as a starting value for accumulation with ``+`` and ``sum()``. 
+ + Note: + ``cost`` defaults to ``0.0``. This means adding a ``Usage()`` + to another record never changes the cost: + ``Usage() + Usage(cost=0.05)`` gives ``cost=0.05``. + Components that track cost start at ``0.0`` and accumulate upward. + Components that *do not* track cost (e.g., agent adapters that only + count tokens) also default to ``0.0`` — their cost simply has no + effect when summed with components that do report cost. + + Grouping fields (``provider``, ``category``, ``component_name``, ``kind``) + identify what scope the record covers. When two records are summed, + matching grouping fields are preserved; mismatches become ``None`` + (meaning "aggregated over"). Attributes: - cost: Total cost in USD. `None` means unknown/not reported. + cost: Total cost in USD (or whatever unit your calculator uses). + Defaults to ``0.0``. units: Arbitrary countable units (e.g., ``{"api_calls": 3}``). provider: Provider identifier (e.g., ``"anthropic"``, ``"bloomberg"``). category: Registry category (e.g., ``"models"``, ``"tools"``). 
@@ -52,14 +68,21 @@ class Usage: assert total.units == {"api_calls": 3} assert total.provider == "bloomberg" - # Mismatched fields become None + # Usage() is the zero element + assert (usage + Usage()).cost == 0.05 + + # Accumulate with sum() + records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)] + assert sum(records, Usage()).cost == 0.35 + + # Mismatched grouping fields become None mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm") assert mixed.provider is None # aggregated over assert mixed.kind is None # aggregated over ``` """ - cost: Optional[float] = None + cost: float = 0.0 units: Dict[str, int | float] = field(default_factory=dict) provider: Optional[str] = None category: Optional[str] = None @@ -70,11 +93,7 @@ def __add__(self, other: Usage) -> Usage: if not isinstance(other, Usage): return NotImplemented - # Sum costs: both known -> sum, either unknown -> None - if self.cost is not None and other.cost is not None: - cost = self.cost + other.cost - else: - cost = None + cost = self.cost + other.cost # Sum units units: Dict[str, int | float] = dict(self.units) @@ -212,7 +231,7 @@ def from_chat_response_usage( cls, usage_dict: Dict[str, int], *, - cost: Optional[float] = None, + cost: float = 0.0, provider: Optional[str] = None, category: Optional[str] = None, component_name: Optional[str] = None, @@ -224,7 +243,7 @@ def from_chat_response_usage( Args: usage_dict: The usage dict from ``ChatResponse.usage``. - cost: Cost in USD if known (e.g., from provider-reported cost). + cost: Cost in USD (e.g., from provider-reported cost). Defaults to ``0.0``. provider: Provider identifier. category: Registry category. component_name: Component name. 
@@ -507,7 +526,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: has_tokens = "input_tokens" in d if has_tokens: return TokenUsage( - cost=d.get("cost"), + cost=d.get("cost", 0.0), units=d.get("units", {}), provider=d.get("provider"), category=d.get("category"), @@ -522,7 +541,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: audio_tokens=d.get("audio_tokens", 0), ) return Usage( - cost=d.get("cost"), + cost=d.get("cost", 0.0), units=d.get("units", {}), provider=d.get("provider"), category=d.get("category"), diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index 5348585b..383aa0f1 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -39,7 +39,7 @@ def test_from_chat_response_basic(self): assert tu.cache_creation_input_tokens == 0 assert tu.reasoning_tokens == 0 assert tu.audio_tokens == 0 - assert tu.cost is None + assert tu.cost == 0.0 def test_from_chat_response_all_fields(self): """All optional fields are mapped when present.""" @@ -148,18 +148,26 @@ def test_sum_multiple(self): assert total.output_tokens == 30 assert total.total_tokens == 90 - def test_none_cost_propagates(self): - """If either cost is None, sum cost is None.""" + def test_zero_cost_preserves_known(self): + """Adding a zero-cost usage preserves the known cost.""" a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150) - b = TokenUsage(cost=None, input_tokens=200, output_tokens=30, total_tokens=230) + b = TokenUsage(input_tokens=200, output_tokens=30, total_tokens=230) total = a + b - assert total.cost is None + assert total.cost == pytest.approx(0.10) # Token fields still sum correctly assert isinstance(total, TokenUsage) assert total.input_tokens == 300 assert total.output_tokens == 80 + def test_both_zero_cost_stays_zero(self): + """Summing two zero-cost usages gives zero cost.""" + a = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(input_tokens=200, output_tokens=30, 
total_tokens=230) + total = a + b + + assert total.cost == 0.0 + def test_grouping_fields_match(self): """Matching grouping fields are preserved.""" a = TokenUsage(cost=0.10, provider="anthropic", kind="llm", input_tokens=100, output_tokens=50, total_tokens=150) @@ -546,7 +554,7 @@ def test_pipeline_no_calculator_no_provider_cost(self): assert isinstance(total, TokenUsage) assert total.input_tokens == 100 - assert total.cost is None + assert total.cost == 0.0 def test_pipeline_with_cached_tokens(self): """Pipeline correctly handles cached tokens in cost calculation. @@ -694,8 +702,7 @@ def test_empty_reports(self): reporter = UsageReporter.from_reports([]) total = reporter.total() - # Empty reports return a plain Usage with no cost - assert total.cost is None + assert total.cost == 0.0 assert isinstance(total, Usage) def test_skips_error_reports(self): @@ -708,7 +715,7 @@ def test_skips_error_reports(self): ] reporter = UsageReporter.from_reports(reports) total = reporter.total() - assert total.cost is None + assert total.cost == 0.0 assert isinstance(total, Usage) def test_by_task_accumulates_repeats(self): diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index 86be5722..af5726f7 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -1127,7 +1127,7 @@ def test_camel_adapter_gather_usage_no_responses(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_camel_adapter_gather_usage_responses_without_usage(): @@ -1144,7 +1144,7 @@ def test_camel_adapter_gather_usage_responses_without_usage(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # 
============================================================================= diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index b39c4791..228d9758 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -361,7 +361,7 @@ def agent_node(state: State) -> State: usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_langgraph_adapter_gather_usage_before_run(): @@ -375,4 +375,4 @@ def test_langgraph_adapter_gather_usage_before_run(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index 49998292..b738f7d3 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -451,7 +451,7 @@ def test_llamaindex_adapter_gather_usage_no_logs(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_llamaindex_adapter_gather_usage_logs_without_tokens(): @@ -476,7 +476,7 @@ def test_llamaindex_adapter_gather_usage_logs_without_tokens(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # ============================================================================= diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index cc605fa0..9b8eaba4 100644 --- 
a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -525,7 +525,7 @@ def test_smolagents_adapter_gather_usage_no_steps(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 assert usage.input_tokens == 0 if hasattr(usage, "input_tokens") else True @@ -557,7 +557,7 @@ def test_smolagents_adapter_gather_usage_steps_without_token_usage(): # Should return plain Usage (not TokenUsage) since no usage data assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # ============================================================================= From f1565b55078767dfc97223a03d04c883dc401c43 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:40:15 +0100 Subject: [PATCH 13/19] updated example --- .../five_a_day_benchmark.ipynb | 2 +- .../five_a_day_benchmark.py | 39 +++++++++++++++---- 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb index 2e4a6375..457c9a6b 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb @@ -660,7 +660,7 @@ { "cell_type": "code", "id": "amrylkbxkb7", - "source": "from maseval import UsageReporter\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total cost: {f'${total.cost:.6f}' if total.cost is not None else 'N/A (no cost calculator)'}\")\nprint(f\" Total units: {dict(total.units) if total.units else '{}'}\")\nprint()\n\n# Per-component breakdown\nprint(\"Per-Component Breakdown\")\nprint(\"-\" * 60)\nfor component_key, usage in benchmark.usage_by_component.items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else 
\"N/A\"\n units_str = dict(usage.units) if usage.units else \"\"\n print(f\" {component_key:<35} cost={cost_str} units={units_str}\")\nprint()\n\n# --- Post-hoc analysis with UsageReporter ---\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n print(f\" {task_id:<35} cost={cost_str}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", + "source": "from collections import defaultdict\nfrom maseval import UsageReporter, TokenUsage\n\n\ndef _fmt_usage(usage):\n \"\"\"Format a Usage record for display.\"\"\"\n parts = [f\"cost=${usage.cost:.6f}\"]\n if isinstance(usage, TokenUsage):\n parts.append(f\"in={usage.input_tokens} out={usage.output_tokens}\")\n if usage.units:\n parts.append(f\"units={dict(usage.units)}\")\n return \" \".join(parts)\n\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total: {_fmt_usage(total)}\")\n\n# Group components by category\nby_category = defaultdict(dict)\nfor key, usage in benchmark.usage_by_component.items():\n category, name = key.split(\":\", 1)\n by_category[category][name] = usage\n\nfor category in [\"agents\", \"models\", \"tools\", \"simulators\", \"callbacks\"]:\n if category not in by_category:\n continue\n print(f\"\\n{category.capitalize()}:\")\n for name, usage in by_category[category].items():\n print(f\" {name:<35} {_fmt_usage(usage)}\")\n\n# Print any remaining categories not in the standard list\nfor category, components in by_category.items():\n if category in {\"agents\", \"models\", \"tools\", \"simulators\", \"callbacks\"}:\n continue\n print(f\"\\n{category.capitalize()}:\")\n for name, usage in components.items():\n print(f\" {name:<35} {_fmt_usage(usage)}\")\n\n# --- Post-hoc 
analysis with UsageReporter ---\nprint()\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n print(f\" {task_id:<35} {_fmt_usage(usage)}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", "metadata": {}, "execution_count": null, "outputs": [] diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index a3972a9c..0b0e78de 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -961,24 +961,49 @@ def load_benchmark_data( results = benchmark.run(tasks=tasks, agent_data=agent_configs) # --- Usage summary --- + from collections import defaultdict + from maseval import TokenUsage + + def _fmt_usage(usage): + parts = [f"cost=${usage.cost:.6f}"] + if isinstance(usage, TokenUsage): + parts.append(f"in={usage.input_tokens} out={usage.output_tokens}") + if usage.units: + parts.append(f"units={dict(usage.units)}") + return " ".join(parts) + print("\n--- Usage Summary ---") total = benchmark.usage - cost_str = f"${total.cost:.6f}" if total.cost is not None else "N/A (no cost calculator)" - print(f"Total cost: {cost_str}") + print(f"Total: {_fmt_usage(total)}") + # Group components by category if benchmark.usage_by_component: - print("\nPer-component:") + by_category: dict[str, dict[str, object]] = defaultdict(dict) for key, usage in benchmark.usage_by_component.items(): - c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" - print(f" {key:<35} cost={c} units={dict(usage.units) if usage.units else '{}'}") + category, name = key.split(":", 1) + by_category[category][name] = usage + + for category in ["agents", "models", "tools", "simulators", "callbacks"]: + if category not in by_category: + continue + print(f"\n{category.capitalize()}:") + for name, usage in 
by_category[category].items(): + print(f" {name:<35} {_fmt_usage(usage)}") + + # Print any remaining categories not in the standard list + for category, components in by_category.items(): + if category in {"agents", "models", "tools", "simulators", "callbacks"}: + continue + print(f"\n{category.capitalize()}:") + for name, usage in components.items(): + print(f" {name:<35} {_fmt_usage(usage)}") reporter = UsageReporter.from_reports(results) by_task = reporter.by_task() if by_task: print("\nPer-task:") for task_id, usage in by_task.items(): - c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" - print(f" {task_id:<35} cost={c}") + print(f" {task_id:<35} {_fmt_usage(usage)}") print("\n--- Benchmark Complete ---") print(f"Total tasks: {len(tasks)}") From 15687d9fdf3238a7c21df62272245a437769692f Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:41:49 +0100 Subject: [PATCH 14/19] added dependency for restricted python --- pyproject.toml | 1 + uv.lock | 12 ++++++++++++ 2 files changed, 13 insertions(+) diff --git a/pyproject.toml b/pyproject.toml index 51227d46..a352a908 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -121,6 +121,7 @@ examples = [ "ipykernel>=6.0.0", "ipywidgets>=8.0.0", "accelerate>=1.11.0", + "RestrictedPython>=7.0", ] # Complete installation with absolutely everything (uses self-reference for DRY) diff --git a/uv.lock b/uv.lock index 689debe3..229b4bc8 100644 --- a/uv.lock +++ b/uv.lock @@ -3490,6 +3490,7 @@ all = [ { name = "python-dotenv" }, { name = "pyyaml" }, { name = "requests" }, + { name = "restrictedpython" }, { name = "ruamel-yaml" }, { name = "scikit-learn", version = "1.7.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "scikit-learn", version = "1.8.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, @@ -3567,6 +3568,7 @@ examples = [ { name = "meta-agents-research-environments" }, { name = "openai" }, { name 
= "python-dotenv" }, + { name = "restrictedpython" }, { name = "scikit-learn", version = "1.7.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "scikit-learn", version = "1.8.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, { name = "scipy", version = "1.15.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, @@ -3712,6 +3714,7 @@ requires-dist = [ { name = "python-dotenv", marker = "extra == 'examples'", specifier = ">=1.0.0" }, { name = "pyyaml", marker = "extra == 'multiagentbench'", specifier = ">=6.0" }, { name = "requests", marker = "extra == 'multiagentbench'", specifier = ">=2.28.0" }, + { name = "restrictedpython", marker = "extra == 'examples'", specifier = ">=7.0" }, { name = "rich", specifier = ">=14.1.0" }, { name = "ruamel-yaml", marker = "extra == 'multiagentbench'", specifier = ">=0.17.0" }, { name = "scikit-learn", marker = "extra == 'disco'", specifier = ">=1.7.2" }, @@ -6558,6 +6561,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/8e/67/afbb0978d5399bc9ea200f1d4489a23c9a1dad4eee6376242b8182389c79/respx-0.22.0-py2.py3-none-any.whl", hash = "sha256:631128d4c9aba15e56903fb5f66fb1eff412ce28dd387ca3a81339e52dbd3ad0", size = 25127, upload-time = "2024-12-19T22:33:57.837Z" }, ] +[[package]] +name = "restrictedpython" +version = "8.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/5f/1c/aec08bcb4ab14a1521579fbe21ceff2a634bb1f737f11cf7f9c8bb96e680/restrictedpython-8.1.tar.gz", hash = "sha256:4a69304aceacf6bee74bdf153c728221d4e3109b39acbfe00b3494927080d898", size = 838331, upload-time = "2025-10-19T14:11:32.531Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/1a/c0/3848f4006f7e164ee20833ca984067e4b3fc99fe7f1dfa88b4927e681299/restrictedpython-8.1-py3-none-any.whl", hash = 
"sha256:4769449c6cdb10f2071649ba386902befff0eff2a8fd6217989fa7b16aeae926", size = 27651, upload-time = "2025-10-19T14:11:30.201Z" }, +] + [[package]] name = "rfc3339-validator" version = "0.1.4" From 4b235d45ece5a1d2dc3fc01f8d5b34b711f3096d Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 01:32:17 +0100 Subject: [PATCH 15/19] fixed cost tracking of agent frameworks --- CHANGELOG.md | 1 + docs/guides/usage-tracking.md | 47 +- .../five_a_day_benchmark.py | 2 +- maseval/core/agent.py | 99 +++- maseval/interface/agents/camel.py | 43 +- maseval/interface/agents/langgraph.py | 43 +- maseval/interface/agents/llamaindex.py | 45 +- maseval/interface/agents/smolagents.py | 45 +- .../test_camel_integration.py | 65 +++ .../test_langgraph_integration.py | 74 +++ .../test_llamaindex_integration.py | 47 ++ .../test_smolagents_integration.py | 111 ++++ usage_tracking/PLAN.md | 372 ------------- usage_tracking/api_usage_results.json | 523 ------------------ usage_tracking/api_usage_test.py | 154 ------ 15 files changed, 606 insertions(+), 1065 deletions(-) delete mode 100644 usage_tracking/PLAN.md delete mode 100644 usage_tracking/api_usage_results.json delete mode 100644 usage_tracking/api_usage_test.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 98a6f0d3..61e35d06 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Usage and cost tracking as a first-class collection axis alongside tracing and configuration. `Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) - Pluggable cost calculation via `CostCalculator` protocol. 
`StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) - `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) +- Cost calculation for agent adapters. `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both the model ID and cost calculator are auto-detected (model ID from the framework's agent object, calculator via `LiteLLMCostCalculator` if litellm is installed). For LangGraph, `model_id` must be passed explicitly since graphs can contain multiple models. Explicit `cost_calculator` and `model_id` always override auto-detection. (PR: #45) - `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) - Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) - `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. 
(PR: #45) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index b59005c4..b2adc5c8 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -16,7 +16,7 @@ MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) **Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -55,6 +55,51 @@ print(f"{usage.input_tokens} in, {usage.output_tokens} out") This works across all supported frameworks (smolagents, CAMEL, LangGraph, and LlamaIndex). The adapter handles the framework-specific extraction; you always call `gather_usage()`. +### Agent Cost Tracking + +Agent adapters auto-detect cost when possible. For smolagents, CAMEL, and LlamaIndex, the adapter reads the model ID from the framework's agent object and uses `LiteLLMCostCalculator` if litellm is installed. 
No configuration needed: + +```python +# Cost tracking works automatically if litellm is installed +adapter = SmolAgentAdapter(agent, name="researcher") +adapter.run("What's the capital of France?") +print(f"Cost: ${adapter.gather_usage().cost:.4f}") +``` + +For **LangGraph**, the model ID cannot be auto-detected because a graph can contain multiple models across its nodes. Pass `model_id` explicitly: + +```python +adapter = LangGraphAgentAdapter( + compiled_graph, "agent", + model_id="gpt-4o-mini", # Required for cost tracking +) +``` + +To override auto-detection or use custom pricing, pass `cost_calculator` and/or `model_id`: + +```python +from maseval import StaticPricingCalculator + +calculator = StaticPricingCalculator({ + "my-model": {"input": 0.001, "output": 0.002}, +}) + +adapter = SmolAgentAdapter( + agent, name="researcher", + cost_calculator=calculator, + model_id="my-model", +) +``` + +| Framework | Model ID | Cost Calculator | +|-----------|----------|-----------------| +| smolagents | Auto (`agent.model.model_id`) | Auto (`LiteLLMCostCalculator`) | +| CAMEL | Auto (`agent.model_backend.model_type`) | Auto (`LiteLLMCostCalculator`) | +| LlamaIndex | Auto (`agent.llm.metadata.model_name`) | Auto (`LiteLLMCostCalculator`) | +| LangGraph | **Manual** (`model_id=...`) | Auto (`LiteLLMCostCalculator`) | + +If litellm is not installed, auto-creation of the calculator is skipped and cost stays at `0.0`. Tokens are always tracked regardless. + ### In Benchmarks Usage is collected automatically alongside traces and configs after each task. 
Each report includes a `"usage"` key: diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index 0b0e78de..77bc00ee 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -967,7 +967,7 @@ def load_benchmark_data( def _fmt_usage(usage): parts = [f"cost=${usage.cost:.6f}"] if isinstance(usage, TokenUsage): - parts.append(f"in={usage.input_tokens} out={usage.output_tokens}") + parts.append(f"in={usage.input_tokens:>10,} out={usage.output_tokens:>10,}") if usage.units: parts.append(f"units={dict(usage.units)}") return " ".join(parts) diff --git a/maseval/core/agent.py b/maseval/core/agent.py index e481843c..19b68f0e 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -5,25 +5,56 @@ from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin -from .usage import UsageTrackableMixin +from .usage import Usage, TokenUsage, UsageTrackableMixin, CostCalculator class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Wraps an agent from any framework to provide a standard interface. This Adapter provides: + - Unified execution interface via `run()` - Callback hooks for monitoring - Message history management via getter/setter - Framework-agnostic tracing + - Automatic cost calculation from token usage (when a cost calculator is available) + + Cost Tracking: + Agent adapters track token usage from the underlying framework. To also + compute cost, you can pass a ``cost_calculator`` and optionally a ``model_id``. + + Most framework adapters auto-detect both the model ID (from the framework's + agent object) and the cost calculator (using ``LiteLLMCostCalculator`` if + litellm is installed). This means cost tracking often works with zero + configuration. 
+ + To override or disable auto-detection, pass explicit values:: + + adapter = SmolAgentAdapter( + agent, name="researcher", + cost_calculator=StaticPricingCalculator({...}), + model_id="my-custom-model", + ) + + Pass ``cost_calculator=None`` explicitly to disable cost calculation + even when auto-detection would otherwise enable it. """ - def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[AgentCallback]] = None): + def __init__( + self, + agent_instance: Any, + name: str, + callbacks: Optional[List[AgentCallback]] = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, + ): self.agent = agent_instance self.name = name self.callbacks = callbacks or [] self.messages: Optional[MessageHistory] = None self.logs: List[Dict[str, Any]] = [] + self._cost_calculator = cost_calculator + self._model_id = model_id def run(self, query: str) -> Any: """Executes the agent and returns the result.""" @@ -105,6 +136,70 @@ def get_messages(self) -> MessageHistory: """ return self.messages if self.messages is not None else MessageHistory() + def gather_usage(self) -> Usage: + """Gather usage with automatic cost calculation. + + Calls ``_gather_usage()`` for raw token counts, then applies + the cost calculator if one is available and cost is still ``0.0``. + + The ``model_id`` used for cost calculation is resolved in order: + + 1. Explicit ``model_id`` passed to ``__init__`` + 2. Auto-detected from the framework agent via ``_resolve_model_id()`` + + Subclasses should override ``_gather_usage()`` (not this method) + to provide framework-specific token extraction. + + Returns: + Usage (or TokenUsage) with cost filled in when possible. 
+ """ + usage = self._gather_usage() + if isinstance(usage, TokenUsage) and usage.cost == 0.0: + calculator = self._resolve_cost_calculator() + if calculator is not None: + mid = self._model_id or self._resolve_model_id() + if mid: + cost = calculator.calculate_cost(usage, mid) + if cost is not None: + usage.cost = cost + return usage + + def _gather_usage(self) -> Usage: + """Gather raw token usage from the framework. + + Override this in subclasses to extract token counts from the + framework's native data structures. + + Returns: + Usage or TokenUsage with token counts (cost may be 0.0). + """ + return Usage() + + def _resolve_model_id(self) -> Optional[str]: + """Auto-detect the model ID from the framework agent. + + Override in subclasses to extract the model identifier from + the framework's agent object (e.g., ``self.agent.model.model_id`` + for smolagents). + + Returns: + Model ID string, or ``None`` if not detectable. + """ + return None + + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: + """Resolve the cost calculator to use. + + Returns the explicit calculator if one was provided, otherwise + returns ``None``. Framework-specific subclasses can override this + to auto-create a calculator (e.g., ``LiteLLMCostCalculator``) + when the required dependencies are available. + + Returns: + A CostCalculator, or ``None`` if cost calculation is not available. + """ + return self._cost_calculator + def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this agent. 
diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index b6ccebba..1c440687 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -175,7 +175,9 @@ class CamelAgentAdapter(AgentAdapter): camel-ai to be installed: `pip install maseval[camel]` """ - def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None): + def __init__( + self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, cost_calculator: Any = None, model_id: Optional[str] = None + ): """Initialize the CAMEL adapter. Note: We don't call super().__init__() to avoid initializing self.logs as a list, @@ -185,11 +187,19 @@ def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any] agent_instance: CAMEL ChatAgent instance name: Agent name for identification callbacks: Optional list of AgentCallback instances + cost_calculator: Optional cost calculator. If not provided, a + ``LiteLLMCostCalculator`` is created automatically when litellm + is available. + model_id: Optional model ID for cost calculation. If not provided, + auto-detected from ``agent.model_backend.model_type``. """ self.agent = agent_instance self.name = name self.callbacks = callbacks or [] self.messages = None + self._cost_calculator = cost_calculator + self._model_id = model_id + self._auto_calculator = None # Lazy-initialized # Store responses from each step() call self._responses: List[Any] = [] # Store errors that occur during execution (for comprehensive logging) @@ -419,7 +429,36 @@ def gather_traces(self) -> Dict[str, Any]: return base_traces - def gather_usage(self) -> Usage: + def _resolve_model_id(self): + """Auto-detect model ID from CAMEL agent. + + CAMEL's ChatAgent stores the model backend in ``model_backend`` + (or ``model`` for older versions). The backend has a ``model_type`` + enum whose ``.value`` is the model ID string. 
+        """
+        try:
+            backend = getattr(self.agent, "model_backend", None) or getattr(self.agent, "model", None)
+            if backend is not None and hasattr(backend, "model_type"):
+                model_type = backend.model_type
+                return model_type.value if hasattr(model_type, "value") else str(model_type)
+        except Exception:
+            pass
+        return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage across all CAMEL agent responses.
 
         Walks stored ``ChatAgentResponse`` objects and sums their
diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py
index 6749e549..a2e9fd06 100644
--- a/maseval/interface/agents/langgraph.py
+++ b/maseval/interface/agents/langgraph.py
@@ -117,19 +117,39 @@ def chatbot(state: MessagesState):
     langgraph to be installed: `pip install maseval[langgraph]`
     """
 
-    def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, config: Optional[Dict[str, Any]] = None):
+    def __init__(
+        self,
+        agent_instance: Any,
+        name: str,
+        callbacks: Optional[List[Any]] = None,
+        config: Optional[Dict[str, Any]] = None,
+        cost_calculator: Any = None,
+        model_id: Optional[str] = None,
+    ):
         """Initialize the LangGraph adapter.

        Args:
            agent_instance: Compiled LangGraph graph
            name: Agent name
            callbacks: Optional list of callbacks
-            config: Optional LangGraph config dict (for stateful graphs with checkpointer)
-                Should include 'configurable': {'thread_id': '...'} for persistent state
+            config: Optional LangGraph config dict (for stateful graphs with checkpointer).
+                Should include ``'configurable': {'thread_id': '...'}`` for persistent state.
+            cost_calculator: Optional cost calculator. If not provided, a
+                ``LiteLLMCostCalculator`` is created automatically when litellm
+                is available.
+            model_id: Model ID for cost calculation. LangGraph graphs can contain
+                multiple models across nodes, so the model ID cannot be auto-detected.
+                Pass the primary model's ID here to enable cost tracking::
+
+                    LangGraphAgentAdapter(
+                        graph, "agent",
+                        model_id="gpt-4o-mini",
+                    )
        """
-        super().__init__(agent_instance, name, callbacks)
+        super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id)
         self._langgraph_config = config
         self._last_result = None
+        self._auto_calculator = None  # Lazy-initialized
 
     def get_messages(self) -> MessageHistory:
         """Get message history from LangGraph.
@@ -214,7 +234,20 @@ def gather_config(self) -> dict[str, Any]:
 
         return base_config
 
-    def gather_usage(self) -> Usage:
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage from LangGraph message metadata.
Walks messages from the last graph execution (or persistent state) diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index e0e063d4..5c1de402 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -118,6 +118,8 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, max_iterations: Optional[int] = None, + cost_calculator: Any = None, + model_id: Optional[str] = None, ): """Initialize the LlamaIndex adapter. @@ -131,11 +133,17 @@ def __init__( passing max_steps to it is silently swallowed by **kwargs. The actual iteration limit must be passed here so the adapter forwards it to AgentWorkflow.run(max_iterations=...). + cost_calculator: Optional cost calculator. If not provided, a + ``LiteLLMCostCalculator`` is created automatically when litellm + is available. + model_id: Optional model ID for cost calculation. If not provided, + auto-detected from ``agent.llm.metadata.model_name``. """ - super().__init__(agent_instance, name, callbacks) + super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id) self._last_result = None self._message_cache: List[Dict[str, Any]] = [] self._max_iterations = max_iterations + self._auto_calculator = None # Lazy-initialized def get_messages(self) -> MessageHistory: """Get message history from LlamaIndex. @@ -216,7 +224,40 @@ def gather_config(self) -> Dict[str, Any]: return base_config - def gather_usage(self) -> Usage: + def _resolve_model_id(self): + """Auto-detect model ID from LlamaIndex agent. + + LlamaIndex agents store their LLM in ``self.llm``, which has a + ``metadata`` property exposing ``LLMMetadata.model_name``. 
+        """
+        try:
+            return self.agent.llm.metadata.model_name
+        except AttributeError:
+            pass
+        # For AgentWorkflow, try the first agent's LLM
+        try:
+            if hasattr(self.agent, "agents"):
+                for agent in self.agent.agents:
+                    if hasattr(agent, "llm"):
+                        return agent.llm.metadata.model_name
+        except AttributeError:
+            pass
+        return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage from LlamaIndex execution logs.
 
         Sums token counts recorded in ``self.logs`` during agent execution.
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py
index 4fcfe1db..d0ae1905 100644
--- a/maseval/interface/agents/smolagents.py
+++ b/maseval/interface/agents/smolagents.py
@@ -4,7 +4,7 @@
     pip install maseval[smolagents]
 """
 
-from typing import TYPE_CHECKING, Any, Dict, List
+from typing import TYPE_CHECKING, Any, Dict, List, Optional
 
 from maseval import AgentAdapter, MessageHistory, LLMUser
 from maseval.core.usage import TokenUsage, Usage
@@ -102,16 +102,29 @@ class SmolAgentAdapter(AgentAdapter):
     smolagents to be installed: `pip install maseval[smolagents]`
     """
 
-    def __init__(self, agent_instance, name: str, callbacks=None):
+    def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_calculator: Any = None, model_id: Optional[str] = None):
         """Initialize the Smolagent adapter.

        Note: We don't call super().__init__() to avoid initializing self.logs as a list,
        since we override it as a property that dynamically fetches from agent.memory.
+
+        Args:
+            agent_instance: smolagents MultiStepAgent or similar
+            name: Agent name for identification
+            callbacks: Optional list of AgentCallback instances
+            cost_calculator: Optional cost calculator. If not provided, a
+                ``LiteLLMCostCalculator`` is created automatically when litellm
+                is available.
+            model_id: Optional model ID for cost calculation. If not provided,
+                auto-detected from ``agent.model.model_id``.
        """
        self.agent = agent_instance
        self.name = name
        self.callbacks = callbacks or []
        self.messages = None
+        self._cost_calculator = cost_calculator
+        self._model_id = model_id
+        self._auto_calculator = None  # Lazy-initialized
 
     @property
     def logs(self) -> List[Dict[str, Any]]:  # type: ignore[override]
@@ -320,7 +333,33 @@ def gather_traces(self) -> dict:
 
         return base_logs
 
-    def gather_usage(self) -> Usage:
+    def _resolve_model_id(self):
+        """Auto-detect model ID from smolagents agent.
+
+        All smolagents model classes (LiteLLMModel, OpenAIServerModel,
+        TransformersModel, etc.) inherit from ``Model`` which stores
+        ``model_id`` on the instance.
+        """
+        try:
+            return self.agent.model.model_id
+        except AttributeError:
+            return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        # Lazy auto-create: try LiteLLMCostCalculator once
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False  # Sentinel: don't retry
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage across all agent steps.
Walks smolagents' memory steps (ActionStep and PlanningStep) and sums diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index af5726f7..263c6b66 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -1228,6 +1228,71 @@ def test_e2e_camel_gather_usage_empty_before_run(): assert usage_after.output_tokens > 0 +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_camel_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock() + mock_response = Mock() + mock_response.info = {"usage": {"prompt_tokens": 1000, "completion_tokens": 500}} + mock_response.terminated = False + mock_response.msgs = [Mock(content="response")] + + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + adapter._responses = [mock_response] + + calculator = StaticPricingCalculator({"gpt-4o-mini": {"input": 0.00001, "output": 0.00002}}) + adapter._cost_calculator = calculator + adapter._model_id = "gpt-4o-mini" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00001 + 500 * 0.00002) + + +def test_camel_adapter_resolve_model_id(): + """Test that _resolve_model_id() reads from agent.model_backend.model_type.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.model_backend.model_type.value = "gpt-4o-mini" + + 
adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() == "gpt-4o-mini" + + +def test_camel_adapter_cost_auto_detect_model_id(): + """Test that cost calculation works with auto-detected model_id.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.model_backend.model_type.value = "gpt-4o" + + mock_response = Mock() + mock_response.info = {"usage": {"prompt_tokens": 100, "completion_tokens": 50}} + mock_response.terminated = False + + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + adapter._responses = [mock_response] + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.001, "output": 0.002}}) + adapter._cost_calculator = calculator + # model_id not set — should be auto-detected + + usage = adapter.gather_usage() + assert usage.cost == pytest.approx(100 * 0.001 + 50 * 0.002) + + def test_e2e_camel_logs_contain_usage(): """Verify adapter.logs also contain usage data from real execution.""" from camel.agents import ChatAgent diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index 228d9758..4a4d5e1b 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -376,3 +376,77 @@ def test_langgraph_adapter_gather_usage_before_run(): assert isinstance(usage, Usage) assert usage.cost == 0.0 + + +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_langgraph_adapter_cost_with_explicit_model_id(): + """Test that passing model_id + calculator computes cost for LangGraph.""" + from 
maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata(input_tokens=1000, output_tokens=500, total_tokens=1500), + ) + return {"messages": state["messages"] + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.00001, "output": 0.00002}}) + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test", model_id="gpt-4o", cost_calculator=calculator) + adapter.run("Test") + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00001 + 500 * 0.00002) + + +def test_langgraph_adapter_no_cost_without_model_id(): + """Test that LangGraph adapter cannot auto-detect model_id (by design).""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata(input_tokens=100, output_tokens=50, total_tokens=150), + ) + return {"messages": state["messages"] + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + 
graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.001, "output": 0.002}}) + # No model_id passed — cost should stay 0.0 despite calculator being available + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test", cost_calculator=calculator) + adapter.run("Test") + + usage = adapter.gather_usage() + assert usage.cost == 0.0 + assert usage.input_tokens == 100 diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index b738f7d3..cc9c701b 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -600,3 +600,50 @@ def test_e2e_llamaindex_logs_populated_by_real_execution(): assert log["total_tokens"] == 150 assert "timestamp" in log assert "duration_seconds" in log + + +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_llamaindex_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock(spec=[]) + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + # Simulate logs from a run + adapter.logs = [{"input_tokens": 1000, "output_tokens": 500, "status": "success"}] + + calculator = StaticPricingCalculator({"gpt-4": {"input": 0.00003, "output": 0.00006}}) + adapter._cost_calculator = calculator + adapter._model_id = "gpt-4" + + usage = adapter.gather_usage() + 
assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00003 + 500 * 0.00006) + + +def test_llamaindex_adapter_resolve_model_id(): + """Test that _resolve_model_id() reads from agent.llm.metadata.model_name.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.llm.metadata.model_name = "gpt-4o-mini" + + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() == "gpt-4o-mini" + + +def test_llamaindex_adapter_resolve_model_id_missing(): + """Test that _resolve_model_id() returns None when agent has no llm.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock(spec=[]) + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() is None diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index 9b8eaba4..562107ac 100644 --- a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -710,3 +710,114 @@ def test_e2e_smolagents_gather_usage_empty_before_run(): assert isinstance(usage_after, MasevalTokenUsage) assert usage_after.input_tokens > 0 assert usage_after.output_tokens > 0 + + +# --- Cost calculation tests --- + + +def test_smolagents_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from smolagents.memory import ActionStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from unittest.mock import Mock + 
import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + mock_agent.model.model_id = "gpt-4o-mini" + + start = time.time() + step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[]) + step.token_usage = TokenUsage(input_tokens=1000, output_tokens=500) + mock_agent.memory.steps.append(step) + + calculator = StaticPricingCalculator({"gpt-4o-mini": {"input": 0.00001, "output": 0.00002}}) + + adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent", cost_calculator=calculator) + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 1000 + assert usage.output_tokens == 500 + # Cost = 1000 * 0.00001 + 500 * 0.00002 = 0.01 + 0.01 = 0.02 + assert usage.cost == pytest.approx(0.02) + + +def test_smolagents_adapter_cost_with_explicit_model_id(): + """Test that explicit model_id overrides auto-detected one.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from smolagents.memory import ActionStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from unittest.mock import Mock + import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + mock_agent.model.model_id = "wrong-model" # Auto-detected, but overridden + + start = time.time() + step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[]) + step.token_usage = TokenUsage(input_tokens=100, output_tokens=50) + mock_agent.memory.steps.append(step) + + calculator = StaticPricingCalculator({"my-model": {"input": 0.001, "output": 0.002}}) + + adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test", cost_calculator=calculator, model_id="my-model") + usage = adapter.gather_usage() + + # Should use "my-model" pricing, not "wrong-model" + assert usage.cost == pytest.approx(0.001 * 
100 + 0.002 * 50)
+
+
+def test_smolagents_adapter_resolve_model_id():
+    """Test that _resolve_model_id() reads from agent.model.model_id."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from unittest.mock import Mock
+
+    mock_agent = Mock()
+    mock_agent.model.model_id = "gpt-4o"
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+    assert adapter._resolve_model_id() == "gpt-4o"
+
+
+def test_smolagents_adapter_resolve_model_id_missing():
+    """Test that _resolve_model_id() returns None when model has no model_id."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from unittest.mock import Mock
+
+    mock_agent = Mock(spec=[])  # No attributes at all
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+    assert adapter._resolve_model_id() is None
+
+
+def test_smolagents_adapter_no_cost_without_calculator():
+    """Test that cost stays 0.0 when no calculator is available and auto-create fails."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import TokenUsage as MasevalTokenUsage
+    from smolagents.memory import ActionStep, AgentMemory
+    from smolagents.monitoring import TokenUsage, Timing
+    from unittest.mock import Mock, patch
+    import time
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+    mock_agent.model.model_id = "some-model"
+
+    start = time.time()
+    step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[])
+    step.token_usage = TokenUsage(input_tokens=100, output_tokens=50)
+    mock_agent.memory.steps.append(step)
+
+    # Force _resolve_cost_calculator to return None, as if litellm were not installed
+    with patch("maseval.interface.agents.smolagents.SmolAgentAdapter._resolve_cost_calculator", return_value=None):
+        adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+        usage = adapter.gather_usage()
+
+    
assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == 0.0 + assert usage.input_tokens == 100 diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md deleted file mode 100644 index 91683ee0..00000000 --- a/usage_tracking/PLAN.md +++ /dev/null @@ -1,372 +0,0 @@ -# Usage & Cost Tracking — Implementation Plan - -## Motivation - -Benchmarking multi-agent systems incurs real costs: LLM API calls (the primary driver), but also external service calls (e.g., Bloomberg data API, geocoding services, paid search APIs). MASEval currently extracts basic token counts into `ChatResponse.usage` but does not persist, enrich, or aggregate this data. We want first-class usage tracking that: - -- Captures token usage and cost per LLM call with provider-specific detail -- Supports non-token costs (external service calls billed per-request or per-unit) -- Aggregates across provider, task, component role, and total -- Is queryable live during benchmark execution (not just post-hoc) -- Captures usage even for failed tasks -- Requires zero changes from benchmark implementers for the common LLM case - -## Design Principles - -1. **LLM-first, not LLM-only.** The base abstraction is generic (cost + arbitrary units), with an LLM-specific subclass that adds token semantics. -2. **No hardcoded prices.** Pricing changes constantly. Users supply pricing or rely on provider-reported cost (e.g., OpenRouter). If neither is available, cost is `None`. -3. **Automatic for models, opt-in for tools.** ModelAdapter tracks usage automatically via the base `chat()` method. Tool/environment authors opt in via `UsageTrackableMixin`. -4. **Non-breaking.** `ChatResponse.usage` stays a `Dict[str, int]` with additional optional keys. Existing code that reads `usage["input_tokens"]` continues to work. -5. **First-class collection axis.** Usage is collected via `gather_usage()` / `collect_usage()`, parallel to `gather_traces()` / `collect_traces()` and `gather_config()` / `collect_configs()`. 
It is not embedded inside traces. -6. **Live queryable.** The registry maintains a running usage total across repetitions, queryable at any time via `benchmark.usage`. - ---- - -## Data Model - -### `Usage` (base) - -Generic usage record for any billable resource. Stored as a simple dataclass. - -``` -Usage - cost: Optional[float] # Total cost in USD (None = unknown) - units: Dict[str, int | float] # Countable units (e.g., {"api_calls": 3, "bytes": 1024}) - provider: Optional[str] # e.g., "anthropic", "openai", "bloomberg" - category: Optional[str] # e.g., "models", "evaluator_models", "tools" - component_name: Optional[str] # e.g., "main_model", "judge", "bloomberg_api" - kind: Optional[str] # e.g., "llm", "service", "local" -``` - -Supports `__add__`: costs sum (if both known, else None), units sum. Grouping fields (`provider`, `category`, `component_name`, `kind`) are preserved when they match, set to `None` on mismatch. `None` means "aggregated over" — e.g., `provider=None, category="models"` represents all models summed across providers. A fully `None` grouping is a grand total. - -### `TokenUsage(Usage)` (LLM-specific) - -Extends `Usage` with token fields that every LLM provider reports. - -``` -TokenUsage(Usage) - input_tokens: int - output_tokens: int - total_tokens: int - # Optional provider-specific detail - cached_input_tokens: int # Anthropic cache_read, OpenAI cached_tokens - reasoning_tokens: int # OpenAI reasoning, Google thoughts - audio_tokens: int # OpenAI audio -``` - -`TokenUsage.__add__` sums all token fields plus delegates to `Usage.__add__` for cost/units. - -Class method `TokenUsage.from_chat_response_usage(usage_dict) -> TokenUsage` maps the dict returned by adapters today into a `TokenUsage` instance, handling provider-specific key names. - ---- - -## UsageTrackableMixin - -Follows the established mixin pattern (`TraceableMixin`, `ConfigurableMixin`). 
Any component that inherits `UsageTrackableMixin` will have its usage automatically collected by the registry when registered. - -```python -class UsageTrackableMixin: - """Mixin that provides usage tracking capability to any component.""" - - def gather_usage(self) -> Usage: - """Return accumulated usage for this component. - - Subclasses must override this to return their accumulated Usage. - Base implementation returns an empty Usage. - """ - return Usage() -``` - -Components internally accumulate `Usage` records however they see fit (typically a list + sum). The mixin only defines the collection protocol — `gather_usage() -> Usage`. - -### Usage in components - -**ModelAdapter** (automatic): - -```python -class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): - def __init__(self, seed=None): - super().__init__() - self._usage_records: List[Usage] = [] - - def chat(self, messages, ...): - response = self._chat_impl(messages, ...) - if response.usage: - self._usage_records.append( - TokenUsage.from_chat_response_usage(response.usage) - ) - return response - - def gather_usage(self) -> Usage: - if not self._usage_records: - return Usage() - return sum(self._usage_records[1:], self._usage_records[0]) -``` - -**Non-model components** (opt-in): - -```python -class BloombergEnvironment(Environment, UsageTrackableMixin): - def __init__(self, task_data): - super().__init__(task_data) - self._usage_records: List[Usage] = [] - - def _call_bloomberg(self, query): - result = bloomberg_client.query(query) - self._usage_records.append(Usage( - cost=result.billed_amount, - units={"api_calls": 1, "data_points": result.count}, - )) - return result - - def gather_usage(self) -> Usage: - if not self._usage_records: - return Usage() - return sum(self._usage_records[1:], self._usage_records[0]) -``` - ---- - -## Registry Integration - -The `ComponentRegistry` gains a third collection axis for usage, parallel to traces and configs. 
- -### Per-repetition collection - -`collect_usage()` walks all registered `UsageTrackableMixin` components and calls `gather_usage()` on each. Returns a structured dict (same shape as `collect_traces()`/`collect_configs()`). This goes into `report["usage"]`. - -```python -def collect_usage(self) -> Dict[str, Any]: - """Collect usage from all registered UsageTrackableMixin components.""" - usage = { - "metadata": {...}, - "agents": {}, - "models": {}, - "tools": {}, - ... - "environment": None, - "user": None, - } - - for key, component in self._usage_registry.items(): - category, comp_name = key.split(":", 1) - component_usage = component.gather_usage() - - # Store in structured dict (same pattern as traces/configs) - ... - - # Accumulate into persistent aggregates - self._usage_total += component_usage - self._usage_by_component[key] += component_usage - - return usage -``` - -### Persistent aggregates (survive `clear()`) - -The registry maintains running totals that persist across task repetitions: - -```python -class ComponentRegistry: - def __init__(self): - # ... existing per-repetition state ... - - # Persistent usage aggregates (NOT cleared between repetitions) - self._usage_total: Usage = Usage() - self._usage_by_component: Dict[str, Usage] = {} - - def clear(self): - # Clears per-repetition registrations - # Does NOT clear _usage_total or _usage_by_component - - @property - def total_usage(self) -> Usage: - """Running total across all repetitions. Queryable at any time.""" - return self._usage_total - - @property - def usage_by_component(self) -> Dict[str, Usage]: - """Per-component running totals across all repetitions.""" - return dict(self._usage_by_component) -``` - -### Registration - -The `register()` method gains an `isinstance(component, UsageTrackableMixin)` check, parallel to the existing `TraceableMixin` and `ConfigurableMixin` checks: - -```python -def register(self, category, name, component): - # ... existing trace/config registration ... 
    if isinstance(component, UsageTrackableMixin):
        self._usage_registry[key] = component
        self._usage_component_id_map[component_id] = key
```

`RegisterableComponent` type alias is updated to include `UsageTrackableMixin`.

---

## Benchmark Integration

### Report structure

Each report gains a top-level `"usage"` key alongside `"traces"` and `"config"`:

```python
report = {
    "task_id": str(task.id),
    "repeat_idx": repeat_idx,
    "status": execution_status.value,
    "traces": execution_traces,
    "config": execution_configs,
    "usage": execution_usage,  # <-- new
    "eval": eval_results,
    "task": {...},
}
```

### Live usage access

```python
benchmark.usage               # -> Usage (running grand total, delegates to registry)
benchmark.usage_by_component  # -> Dict[str, Usage] (per-component totals)
```

### Failed task usage

`collect_usage()` is called alongside `collect_all_traces()` and `collect_all_configs()` — before error status is determined. If a task fails mid-execution, whatever usage was accumulated up to the failure point is still collected and aggregated.

---

## Adapter `_chat_impl` Enrichment (per-provider)

Each adapter enriches the `ChatResponse.usage` dict with provider-specific fields beyond the basic three. The base class `TokenUsage.from_chat_response_usage()` handles mapping.

| Adapter | Extra fields to extract |
|---------|------------------------|
| OpenAI | `reasoning_tokens` from `completion_tokens_details`, `cached_input_tokens` from `prompt_tokens_details.cached_tokens` |
| Anthropic | `cached_input_tokens` from `cache_read_input_tokens` |
| Google | `reasoning_tokens` from `thoughts_token_count` |
| LiteLLM | `reasoning_tokens` + `cached_input_tokens` from details; `cost` from `response._hidden_params` if available |
| HuggingFace | No change (local inference, no API cost) |

---

## UsageReporter (post-hoc)

Post-run utility that walks `report["usage"]` across all reports for sliced analysis.

```
UsageReporter
    @staticmethod from_reports(reports: List[Dict]) -> UsageReporter

    by_task() -> Dict[str, Usage]        # keyed by task_id
    by_component() -> Dict[str, Usage]   # keyed by registry key (e.g., "models:main_model")
    by_model() -> Dict[str, TokenUsage]  # keyed by model_id (LLM-only)
    total() -> Usage                     # grand total

    summary() -> Dict[str, Any]          # nested dict with all breakdowns
```

Unlike the registry's live aggregates, `UsageReporter` can slice by task (since it sees the full report list with task IDs).

---

## Evaluators

Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. That model should be registered in the benchmark via `self.register("evaluator_models", "judge", model)` inside `setup_evaluators()`. Since `ModelAdapter` now inherits `UsageTrackableMixin`, its usage is automatically collected under `usage.evaluator_models.judge`.

No changes to the `Evaluator` base class. This is a registration convention.

## LLMUser / AgenticLLMUser

These already hold a `ModelAdapter`. Their model's usage is collected automatically (since `ModelAdapter` inherits `UsageTrackableMixin` and `chat()` accumulates records). The model is already registered by the benchmark. No changes needed.
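Per-task slicing is the one view the registry's live aggregates cannot provide. A rough sketch of how `UsageReporter.by_task()` could fold report-level usage into per-task totals — the flat `{"cost", "total_tokens"}` shape under `report["usage"]["total"]` is an assumption for illustration, not the actual collected structure:

```python
from collections import defaultdict
from typing import Any, Dict, List


def by_task(reports: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group usage across repetitions of the same task_id.

    Assumes each report carries report["usage"]["total"] as a flat
    {"cost": float, "total_tokens": int} dict (illustrative shape only).
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"cost": 0.0, "total_tokens": 0}
    )
    for report in reports:
        total = (report.get("usage") or {}).get("total") or {}
        entry = totals[report["task_id"]]
        entry["cost"] += total.get("cost") or 0.0
        entry["total_tokens"] += total.get("total_tokens") or 0
    return dict(totals)


reports = [
    {"task_id": "t1", "usage": {"total": {"cost": 0.5, "total_tokens": 100}}},
    {"task_id": "t1", "usage": {"total": {"cost": 0.25, "total_tokens": 50}}},
    {"task_id": "t2", "usage": {"total": {"cost": 1.0, "total_tokens": 10}}},
]
per_task = by_task(reports)
```

Repetitions of the same task collapse into one entry, which is exactly what the registry cannot do live (it has no task IDs at collection time).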

---

## File Plan

| File | Action | Content |
|------|--------|---------|
| `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` |
| `maseval/core/cost.py` | **Create** | `CostCalculator` protocol, `StaticPricingCalculator` |
| `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property |
| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()`, accept `cost_calculator` param |
| `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict |
| `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility |
| `maseval/interface/cost.py` | **Create** | `LiteLLMCostCalculator` (optional `litellm` dependency) |
| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost`; accept `cost_calculator` |
| `maseval/interface/inference/huggingface.py` | **Edit** | Accept `cost_calculator` |
| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `CostCalculator`, `StaticPricingCalculator`, `UsageReporter` |
| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation, cost calculators |

No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`.

---

## Cost Calculation

Most LLM APIs return token counts but **not** cost. Cost calculation is a client-side concern.

### CostCalculator protocol

A `CostCalculator` is a simple protocol with one method:

```python
class CostCalculator(Protocol):
    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: ...
```

`ModelAdapter` accepts an optional `cost_calculator` parameter. After each `chat()` call, if the provider didn't report cost and a calculator is present, the calculator fills in `TokenUsage.cost`. Provider-reported cost always takes precedence.

### Built-in implementations

| Calculator | Location | Dependencies | Use case |
|-----------|----------|-------------|----------|
| `StaticPricingCalculator` | `maseval.core.cost` | None | User-supplied per-model rates. Supports custom units (USD, EUR, credits). |
| `LiteLLMCostCalculator` | `maseval.interface.cost` | `litellm` | Automatic pricing via LiteLLM's bundled model database. Covers OpenAI, Anthropic, Google, Mistral, etc. |

### Cost flow (priority order)

1. **Provider-reported cost** — e.g., LiteLLM's `response._hidden_params.response_cost`. Set directly in `ChatResponse.usage["cost"]`.
2. **CostCalculator** — if no provider cost, `ModelAdapter.chat()` calls `calculator.calculate_cost(token_usage, model_id)`.
3. **None** — if neither source provides cost, `Usage.cost` stays `None`.
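The priority order condenses to a small decision helper. This is an illustrative sketch of the resolution logic, not the actual `ModelAdapter` internals:

```python
from typing import Callable, Optional


def resolve_cost(
    provider_cost: Optional[float],
    calculator: Optional[Callable[[], Optional[float]]] = None,
) -> Optional[float]:
    """Resolve cost per the priority order: provider report, calculator, unknown."""
    if provider_cost is not None:  # 1. provider-reported cost always wins
        return provider_cost
    if calculator is not None:     # 2. otherwise ask the configured calculator
        return calculator()
    return None                    # 3. neither source -> cost stays unknown
```

Note that a provider-reported cost of `0.0` still wins over the calculator — only a truly absent (`None`) provider cost falls through.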

### Examples

```python
# Static pricing for a university cluster (credits per token)
calculator = StaticPricingCalculator({
    "llama-3-70b": {"input": 0.5, "output": 1.0},
})

# Automatic pricing via LiteLLM's database
from maseval.interface.cost import LiteLLMCostCalculator
calculator = LiteLLMCostCalculator()

# Pass to any model adapter
model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)
```

### Non-LLM components

Non-LLM components (tools, environments) set cost directly in their `gather_usage()` implementation — there is no calculator involvement. Each component knows its own billing model.

---

## Non-goals

- **Hardcoded pricing tables** — prices change too often; delegated to LiteLLM or user-supplied.
- **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now.
- **Billing integration** — no webhook/billing system integration.
- **Streaming usage** — not supported yet (usage is captured after completion).
- **Currency conversion** — `Usage.cost` is a bare float in whatever unit the calculator uses. Mixing units in one benchmark is a user error.

## Open Questions

1. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1.
diff --git a/usage_tracking/api_usage_results.json b/usage_tracking/api_usage_results.json deleted file mode 100644 index 4dcd9b8e..00000000 --- a/usage_tracking/api_usage_results.json +++ /dev/null @@ -1,523 +0,0 @@ -{ - "direct__openai__gpt5_mini": { - "id": "chatcmpl-DFJoysUJJeWtuOVc5UIq3EnA580Ok", - "choices": [ - { - "finish_reason": "length", - "index": 0, - "logprobs": null, - "message": { - "content": "", - "refusal": null, - "role": "assistant", - "annotations": [], - "audio": null, - "function_call": null, - "tool_calls": null - } - } - ], - "created": 1772543484, - "model": "gpt-5-mini-2025-08-07", - "object": "chat.completion", - "service_tier": "default", - "system_fingerprint": null, - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": 0, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0 - } - } - }, - "direct__anthropic__claude_haiku": { - "id": "msg_01UDvWsS78tyf4xQ1wwDsNop", - "content": [ - { - "citations": null, - "text": "# Hello! \ud83d\udc4b\n\nWelcome! I'm Claude, an AI assistant made by Anthropic. 
How can I help you today?", - "type": "text" - } - ], - "model": "claude-haiku-4-5-20251001", - "role": "assistant", - "stop_reason": "end_turn", - "stop_sequence": null, - "type": "message", - "usage": { - "cache_creation": { - "ephemeral_1h_input_tokens": 0, - "ephemeral_5m_input_tokens": 0 - }, - "cache_creation_input_tokens": 0, - "cache_read_input_tokens": 0, - "input_tokens": 11, - "output_tokens": 32, - "server_tool_use": null, - "service_tier": "standard", - "inference_geo": "not_available" - } - }, - "direct__google__gemini3_flash": { - "sdk_http_response": { - "headers": { - "content-type": "application/json; charset=UTF-8", - "vary": "Origin, X-Origin, Referer", - "content-encoding": "gzip", - "date": "Tue, 03 Mar 2026 13:11:32 GMT", - "server": "scaffolding on HTTPServer2", - "x-xss-protection": "0", - "x-frame-options": "SAMEORIGIN", - "x-content-type-options": "nosniff", - "server-timing": "gfet4t7; dur=4579", - "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000", - "transfer-encoding": "chunked" - }, - "body": null - }, - "candidates": [ - { - "content": { - "parts": [ - { - "media_resolution": null, - "code_execution_result": null, - "executable_code": null, - "file_data": null, - "function_call": null, - "function_response": null, - "inline_data": null, - "text": "Hello", - "thought": null, - "thought_signature": "EqwCCqkCAb4-9vtlfRlETMXR13Bw3xpBm-D3EzoUlVhmePvHy720UANX0hdyBGaq1d8FfiHVTMccuBl5r7sg3fy_2GoTexpytWLm15I7GfRloHt278ioOMDH3Ua8SVuGCIiRyIVSye1vkQw7p0KwZMzJ51fjhuBH-4_weZe24FglHg0p3eo79cKZIMz8eiWpcGtK6Xb25Gk1mXuKCi7GaifkKaOmhXTSjVZ-P-w5qERlTscMv-2YMD26Th8MUEg13PwFlz385A9RnHLH_oXkdr0lXAHemNj7dHdJEfNzjSgqCJdeVT3PCwH0v6-AIqdQuqD6jnvODLDPms5liN7VAVAAOZiq8tLDE771c3Xc-7eIPFdD3h9_cdvb82hjefYjEwC-aWNXQrl1SlVw0Un0", - "video_metadata": null - } - ], - "role": "model" - }, - "citation_metadata": null, - "finish_message": null, - "token_count": null, - "finish_reason": "MAX_TOKENS", - "avg_logprobs": null, - "grounding_metadata": null, - 
"index": 0, - "logprobs_result": null, - "safety_ratings": null, - "url_context_metadata": null - } - ], - "create_time": null, - "model_version": "gemini-3-flash-preview", - "prompt_feedback": null, - "response_id": "BN6maev1HoqB7M8P3fvBkAo", - "usage_metadata": { - "cache_tokens_details": null, - "cached_content_token_count": null, - "candidates_token_count": 1, - "candidates_tokens_details": null, - "prompt_token_count": 5, - "prompt_tokens_details": [ - { - "modality": "TEXT", - "token_count": 5 - } - ], - "thoughts_token_count": 59, - "tool_use_prompt_token_count": null, - "tool_use_prompt_tokens_details": null, - "total_token_count": 65, - "traffic_type": null - }, - "automatic_function_calling_history": [], - "parsed": null - }, - "litellm__openai__gpt5_mini": { - "id": "chatcmpl-DFJp7YfyGQH1HtypDlKDucycpUAdk", - "created": 1772543493, - "model": "gpt-5-mini-2025-08-07", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": "", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null - }, - "annotations": [] - }, - "provider_specific_fields": {} - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": 0, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": 0, - "text_tokens": null - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null - } - }, - "service_tier": "default" - }, - "litellm__anthropic__claude_haiku": { - "id": "chatcmpl-d251dec3-5b1a-424c-a432-cb71ea3d600f", - "created": 1772543495, - "model": "claude-haiku-4-5-20251001", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello! 
\ud83d\udc4b How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "citations": null, - "thinking_blocks": null - } - } - } - ], - "usage": { - "completion_tokens": 16, - "prompt_tokens": 11, - "total_tokens": 27, - "completion_tokens_details": null, - "prompt_tokens_details": { - "audio_tokens": null, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_creation_tokens": 0, - "cache_creation_token_details": { - "ephemeral_5m_input_tokens": 0, - "ephemeral_1h_input_tokens": 0 - } - }, - "cache_creation_input_tokens": 0, - "cache_read_input_tokens": 0 - } - }, - "litellm__google__gemini3_flash": { - "id": "DN6macurC97hnsEPvs-FmA0", - "created": 1772543495, - "model": "gemini-3-flash-preview", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": "Hello", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "images": [], - "thinking_blocks": [ - { - "type": "thinking", - "thinking": "{\"text\": \"Hello\"}", - "signature": "EpoCCpcCAb4+9vtdR++YPo/XeAmaLKPKkk7+YyeGjuHP9w646HEu9lG0xhb6qHOfkUTcH7xh08RlU6QXrTKAkXwfBAsSbiBfIBCGlzygFq+QGAS4LzUFaCLOD73MmSk7WiB393VWRw04NsxbhNtTH5aM9JFaxb7yvZMwWMckTON8L9Rv7gFlo6NmYjn01ct+kBKxleJzyD8d2AnAA4wMw9zqz8pLSAU9swKxmuqs0JkHt8WNRzwtw11xGt5zR909g/v/swLY/Oh+lcHiO7PMBsPHtBvzmPHTMM/ecn1VdA9sWqmoc8suFfzTaOPeegvtkhaytoZnaNZ/FoV9y9qVex5r8R0zvPd4ennA9/asI5P1i9HL0NedNJ78avW4" - } - ], - "provider_specific_fields": null - } - } - ], - "usage": { - "completion_tokens": 60, - "prompt_tokens": 5, - "total_tokens": 65, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": null, - "reasoning_tokens": 59, - "rejected_prediction_tokens": null, - "text_tokens": 1 - }, - "prompt_tokens_details": { - "audio_tokens": null, - "cached_tokens": null, - "text_tokens": 5, - "image_tokens": null - } - }, - 
"vertex_ai_grounding_metadata": [], - "vertex_ai_url_context_metadata": [], - "vertex_ai_safety_results": [], - "vertex_ai_citation_metadata": [] - }, - "openrouter__openai__gpt5_mini": { - "id": "gen-1772543500-cToh8SauCW1u8pGlb4qQ", - "created": 1772543500, - "model": "openai/gpt-5-mini-2025-08-07", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": null, - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null, - "reasoning_details": [ - { - "type": "reasoning.summary", - "summary": "**Greet and respond warmly**\n\nThe user says, \"Hello, World!\" \u2014 it feels like a friendly greeting. I should definitely greet back warmly and ask how I can help them! It\u2019s a classic reference in programming, but they might just want a simple hello. I'm thinking of keeping my response concise. So, I\u2019ll reply with a friendly greeting and a question about what they\u2019d like to know or discuss. That seems like a nice approach!**Greet and engage**\n\nThe user is saying \"Hello, World!\" which feels like a classic greeting\u2014possibly a nod to programming culture. I think it\u2019s best to respond warmly, so I\u2019ll greet back with enthusiasm. Since they might just be saying hello, I\u2019ll keep it concise and friendly by asking how I can help. 
It\u2019s always nice to invite a conversation!", - "format": "openai-responses-v1", - "index": 0 - }, - { - "type": "reasoning.encrypted", - "data": "gAAAAABppt4SyBWDwFHFtAmspfjQOATVwEjInh21J297wGrovStyUcpp_QsIc3E18qz3EreGtoiQVrUwb4UnffV87xCLDfmoBtxDaSxzbYEJUgYNtjZ6hPr8peySEgtsGPypJmtVJJQ2In9BN-57EeNEqAifTsmKxtCPCm4KHRRAmiXI3zXpokxr8IldC8LYXFM4stVdsWBJxwBYWM6G_vV4VWgmJr15jHIk0tVhx2Rtoca5JQ-0MAf0mQQQbLHBAFnGKhNgBoi_Qnq06A87xSejoUkb7Lb8N6_1u9nFYyixciACYaJIqMeRU5timTRIivBsypP8GPgx-6HyCfqRGhi2nbd5HvTKw4vLTFbtDBR2lRUUsFfJXnbLZvBZO2jbWhYAPvQnnQjpbU5jXE6jPM8z-J4eyGvUg49u3n7fFqe-Nxph4Fuophnbz1-ZCdboejHXfbz9-zKcX-FaVhCkuT82gUWNBq09lLpmOjGERQr5EHguZbhGC1QKkSG59iXMTfRlPssV42xYDpWL1ci0Jbg96TAq8sEnlaY9AMtJTh0NH14Ou8rX3g-g7U2MDomJbcZtY8oNZtyY_3s7ENSatmmaCsX6eQsRuuhSOrxXZSz1l4Zyxes-TseYCQya0YPu3eCNA7-7qhYBWDbtxdBqaTyN9krqM9rkC_p4fQn3q4-2S-Wt9kElCX-SrdMR_qXYZPz8O4BsJwM1aA8gQQji5X8CnYFWTBkBLEQuv2MaR6dDuwvZUsWuCf41YJGw4GJHmdBdDbZflvgpVmuBRwk476MqDac6jXl2VlOgQ0v0zk4M6j5Hb29uCgUFDv0aTyf24wqAZdYRsKQOSLV2Wke38K1qLvUfn99yqkBllsFpdk0DsJJBG4axiK4Kr10BhhApNJokRqIjkT8HU7w5PDRLPryFoc6kuMIuS72RhOKXxZrDu7D_fuWHseOMyVrDULSYhf_GfZEIcnFwGBcIRhhQZG-lzSs_wssCojIGjRX0J67fOZk8YCCvjeabRCbbGTbHDXZxhRL_5Niwz0V4Jgd_97pOlIsVVOgS2-IuIc4445WjpkqGk6mRplBTZEPwZV2ny3v9w3aq-W6_lasXOOmv342RTXXo-pSKaZrowkI3rQUJ_fR5y7mumdDI82C-2onxbNfWI65PUgRW5KUXVgL4RXPu0yI0wu2z7LTyNaVoLSaF9wOtOzEtLux9Pf50EYjqlfD7niQoVR8Pv9D-1fhvrFDmeAzmgBdaqmCWhJWJgZUvtN43Wv2UNjk=", - "format": "openai-responses-v1", - "id": "rs_0f89d92952e937610169a6de0d3f28819085143727619d92cb", - "index": 1 - } - ] - } - }, - "provider_specific_fields": { - "native_finish_reason": "max_output_tokens" - } - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - 
"cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 0.0001305, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 0.0001305, - "upstream_inference_prompt_cost": 2.5e-06, - "upstream_inference_completions_cost": 0.000128 - } - }, - "provider": "OpenAI" - }, - "openrouter__anthropic__claude_haiku": { - "id": "gen-1772543509-FAaKhTwazzoJmVDDd3ih", - "created": 1772543509, - "model": "anthropic/claude-4.5-haiku-20251001", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello! \ud83d\udc4b How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null - } - }, - "provider_specific_fields": { - "native_finish_reason": "stop" - } - } - ], - "usage": { - "completion_tokens": 16, - "prompt_tokens": 11, - "total_tokens": 27, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 0, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 9.1e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 9.1e-05, - "upstream_inference_prompt_cost": 1.1e-05, - "upstream_inference_completions_cost": 8e-05 - } - }, - "provider": "Google" - }, - "openrouter__google__gemini3_flash": { - "id": "gen-1772543512-Mxn343CzRXITNLaWa3uw", - "created": 1772543512, - "model": "google/gemini-3-flash-preview-20251217", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello, World! 
How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null, - "reasoning_details": [ - { - "type": "reasoning.encrypted", - "data": "CiEBjz1rX5mfLGj1Fml96xozj3K4fv7JeTBdOSaUUlxd96c=", - "format": "google-gemini-v1", - "index": 0 - } - ] - } - }, - "provider_specific_fields": { - "native_finish_reason": "STOP" - } - } - ], - "usage": { - "completion_tokens": 11, - "prompt_tokens": 4, - "total_tokens": 15, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 0, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 3.5e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 3.5e-05, - "upstream_inference_prompt_cost": 2e-06, - "upstream_inference_completions_cost": 3.3e-05 - } - }, - "provider": "Google" - }, - "openrouter__qwen__qwen3_30b": { - "id": "gen-1772543515-76qFgjV9ySYOE8mtplV6", - "created": 1772543515, - "model": "qwen/qwen3-30b-a3b-04-28", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": null, - "role": "assistant", - "tool_calls": null, - "function_call": null, - "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", - "provider_specific_fields": { - "refusal": null, - "reasoning": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. 
I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", - "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like," - } - }, - "provider_specific_fields": { - "native_finish_reason": "length" - } - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 13, - "total_tokens": 77, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 75, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 1.896e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 1.896e-05, - "upstream_inference_prompt_cost": 1.04e-06, - "upstream_inference_completions_cost": 1.792e-05 - } - }, - "provider": "DeepInfra" - } -} \ No newline at end of file diff --git a/usage_tracking/api_usage_test.py b/usage_tracking/api_usage_test.py deleted file mode 100644 index 1c0a34b2..00000000 --- a/usage_tracking/api_usage_test.py +++ /dev/null @@ -1,154 +0,0 @@ -""" -Test script that calls GPT-5 mini, Claude Haiku 4.5, and Gemini 3 Flash in three -conditions each — (1) native client, (2) LiteLLM, (3) LiteLLM via OpenRouter — -plus Qwen 3 via LiteLLM+OpenRouter. Saves full response dicts to JSON for -usage/cost analysis. 
-""" - -import json -import os -import time -from pathlib import Path - -import anthropic -import litellm -import requests -from dotenv import load_dotenv -from google import genai -from google.genai import types -from openai import OpenAI - -load_dotenv() - -OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] -ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"] -GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"] -OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"] - -# LiteLLM reads OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY from env. -# For Gemini it expects GEMINI_API_KEY, so alias it. -os.environ.setdefault("GEMINI_API_KEY", GOOGLE_API_KEY) - -PROMPT = "Hello, World!" -MAX_TOKENS = 64 -TOTAL = 10 - -results = {} - - -def step(n: int, label: str): - print(f"{n}/{TOTAL} {label} ...") - - -# =========================================================================== # -# CONDITION 1 — Native SDKs (direct) -# =========================================================================== # - -# -- 1. GPT-5 mini (OpenAI) ------------------------------------------------ # -step(1, "GPT-5 mini — direct (OpenAI SDK)") -openai_client = OpenAI(api_key=OPENAI_API_KEY) -resp = openai_client.chat.completions.create( - model="gpt-5-mini", - messages=[{"role": "user", "content": PROMPT}], - max_completion_tokens=MAX_TOKENS, -) -results["direct__openai__gpt5_mini"] = resp.model_dump() -print(f" done — {resp.usage.total_tokens} tokens") - -# -- 2. Claude Haiku 4.5 (Anthropic) --------------------------------------- # -step(2, "Claude Haiku 4.5 — direct (Anthropic SDK)") -anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) -resp = anthropic_client.messages.create( - model="claude-haiku-4-5-20251001", - max_tokens=MAX_TOKENS, - messages=[{"role": "user", "content": PROMPT}], -) -results["direct__anthropic__claude_haiku"] = resp.model_dump() -print(f" done — {resp.usage.input_tokens + resp.usage.output_tokens} tokens") - -# -- 3. 
Gemini 3 Flash (Google) -------------------------------------------- # -step(3, "Gemini 3 Flash — direct (Google GenAI SDK)") -google_client = genai.Client(api_key=GOOGLE_API_KEY) -resp = google_client.models.generate_content( - model="gemini-3-flash-preview", - contents=PROMPT, - config=types.GenerateContentConfig(max_output_tokens=MAX_TOKENS), -) -results["direct__google__gemini3_flash"] = resp.model_dump(mode="json") -total = resp.usage_metadata.total_token_count if resp.usage_metadata else "n/a" -print(f" done — {total} tokens") - - -# =========================================================================== # -# CONDITION 2 — LiteLLM (direct to providers) -# =========================================================================== # - -litellm_direct_models = { - "litellm__openai__gpt5_mini": "gpt-5-mini", - "litellm__anthropic__claude_haiku": "claude-haiku-4-5-20251001", - "litellm__google__gemini3_flash": "gemini/gemini-3-flash-preview", -} - -for i, (label, model) in enumerate(litellm_direct_models.items(), start=4): - step(i, f"{model} — LiteLLM (direct)") - resp = litellm.completion( - model=model, - messages=[{"role": "user", "content": PROMPT}], - max_tokens=MAX_TOKENS, - ) - results[label] = resp.model_dump() - usage_total = resp.usage.total_tokens if resp.usage else "n/a" - print(f" done — {usage_total} tokens") - - -# =========================================================================== # -# CONDITION 3 — LiteLLM via OpenRouter (+Qwen) -# =========================================================================== # - - -def fetch_openrouter_generation(gen_id: str) -> dict | None: - """Query OpenRouter's generation endpoint for cost metadata.""" - time.sleep(2) # brief wait for metadata to be available - r = requests.get( - f"https://openrouter.ai/api/v1/generation?id={gen_id}", - headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}, - ) - if r.status_code == 200: - return r.json() - return None - - -litellm_openrouter_models = { - 
"openrouter__openai__gpt5_mini": "openrouter/openai/gpt-5-mini", - "openrouter__anthropic__claude_haiku": "openrouter/anthropic/claude-haiku-4-5", - "openrouter__google__gemini3_flash": "openrouter/google/gemini-3-flash-preview", - "openrouter__qwen__qwen3_30b": "openrouter/qwen/qwen3-30b-a3b", -} - -for i, (label, model) in enumerate(litellm_openrouter_models.items(), start=7): - step(i, f"{model} — LiteLLM (OpenRouter)") - resp = litellm.completion( - model=model, - messages=[{"role": "user", "content": PROMPT}], - max_tokens=MAX_TOKENS, - ) - result = resp.model_dump() - - # Fetch OpenRouter generation metadata (cost, native tokens, etc.) - gen_meta = fetch_openrouter_generation(resp.id) - if gen_meta: - result["_openrouter_generation"] = gen_meta - - results[label] = result - usage_total = resp.usage.total_tokens if resp.usage else "n/a" - print(f" done — {usage_total} tokens") - - -# =========================================================================== # -# Save results -# =========================================================================== # -out_path = Path(__file__).parent / "api_usage_results.json" -with open(out_path, "w") as f: - json.dump(results, f, indent=2, default=str) - -print(f"\nResults saved to {out_path}") From aaf1662f75301fa973ca6ad7b4beb16f2ee1cc8b Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 01:35:56 +0100 Subject: [PATCH 16/19] fixed type hinting issue --- .../test_agent_integration/test_langgraph_integration.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index 4a4d5e1b..1c2e9f82 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -420,7 +420,7 @@ def agent_node(state: State) -> State: def 
test_langgraph_adapter_no_cost_without_model_id(): """Test that LangGraph adapter cannot auto-detect model_id (by design).""" from maseval.interface.agents.langgraph import LangGraphAgentAdapter - from maseval.core.usage import StaticPricingCalculator + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator from langgraph.graph import StateGraph, END from typing_extensions import TypedDict from langchain_core.messages import AIMessage @@ -448,5 +448,6 @@ def agent_node(state: State) -> State: adapter.run("Test") usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) assert usage.cost == 0.0 assert usage.input_tokens == 100 From 3103ef86a8621fcae139a0798b78efce036fe7c8 Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 19:37:29 +0100 Subject: [PATCH 17/19] [skip ci] fixed guide --- docs/guides/usage-tracking.md | 19 ++++++------------- maseval/core/usage.py | 3 +-- 2 files changed, 7 insertions(+), 15 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index b2adc5c8..db29bb75 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -16,7 +16,7 @@ MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) **Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). +**Agent adapters** aggregate token usage from the underlying framework's execution. 
Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -57,7 +57,7 @@ This works across all supported frameworks (smolagents, CAMEL, LangGraph, and Ll ### Agent Cost Tracking -Agent adapters auto-detect cost when possible. For smolagents, CAMEL, and LlamaIndex, the adapter reads the model ID from the framework's agent object and uses `LiteLLMCostCalculator` if litellm is installed. No configuration needed: +Agent adapters compute cost automatically when litellm is installed. The adapter detects the model ID from the framework's agent object and uses `LiteLLMCostCalculator` behind the scenes. No configuration needed: ```python # Cost tracking works automatically if litellm is installed @@ -66,16 +66,16 @@ adapter.run("What's the capital of France?") print(f"Cost: ${adapter.gather_usage().cost:.4f}") ``` -For **LangGraph**, the model ID cannot be auto-detected because a graph can contain multiple models across its nodes. 
Pass `model_id` explicitly: +If auto-detection doesn't work for your setup (e.g., the adapter can't find the model ID), pass `model_id` explicitly: ```python adapter = LangGraphAgentAdapter( compiled_graph, "agent", - model_id="gpt-4o-mini", # Required for cost tracking + model_id="gpt-4o-mini", ) ``` -To override auto-detection or use custom pricing, pass `cost_calculator` and/or `model_id`: +To use custom pricing instead, pass `cost_calculator` and/or `model_id`: ```python from maseval import StaticPricingCalculator @@ -91,13 +91,6 @@ adapter = SmolAgentAdapter( ) ``` -| Framework | Model ID | Cost Calculator | -|-----------|----------|-----------------| -| smolagents | Auto (`agent.model.model_id`) | Auto (`LiteLLMCostCalculator`) | -| CAMEL | Auto (`agent.model_backend.model_type`) | Auto (`LiteLLMCostCalculator`) | -| LlamaIndex | Auto (`agent.llm.metadata.model_name`) | Auto (`LiteLLMCostCalculator`) | -| LangGraph | **Manual** (`model_id=...`) | Auto (`LiteLLMCostCalculator`) | - If litellm is not installed, auto-creation of the calculator is skipped and cost stays at `0.0`. Tokens are always tracked regardless. ### In Benchmarks @@ -215,7 +208,7 @@ model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cos When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: -1. **Provider-reported cost**: e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +1. **Provider-reported cost**: some providers (e.g., LiteLLM) include cost in the API response. This always wins. 2. **CostCalculator**: if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. 3. **Zero**: if neither source provides cost, `usage.cost` stays `0.0`. 
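The three-step resolution order can be sketched as a small standalone function. This is an illustrative sketch only, not MASEval's actual implementation; the `provider_cost` parameter and the per-token rates in the example pricing function are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int


def resolve_cost(
    usage: TokenUsage,
    provider_cost: Optional[float],
    calculator: Optional[Callable[[TokenUsage, str], Optional[float]]],
    model_id: str,
) -> float:
    # 1. Provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise ask the configured cost calculator, if any.
    if calculator is not None:
        calculated = calculator(usage, model_id)
        if calculated is not None:
            return calculated
    # 3. Neither source available: cost stays at 0.0.
    return 0.0


usage = TokenUsage(input_tokens=1000, output_tokens=500)
# Hypothetical flat per-token rates, standing in for a CostCalculator.
pricing = lambda u, m: u.input_tokens * 2e-6 + u.output_tokens * 8e-6
print(round(resolve_cost(usage, None, pricing, "gpt-4o-mini"), 6))  # 0.006
```

Because step 1 short-circuits, a calculator passed alongside a cost-reporting provider is simply never consulted for those calls.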
diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 651e4d6d..81bf8bda 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -15,8 +15,7 @@ ``Usage.cost`` defaults to ``0.0``, so ``Usage()`` works as a starting value for accumulation (e.g., ``sum(records, Usage())``). Cost calculators are optional — if no calculator is provided to a ``ModelAdapter``, cost stays -at ``0.0`` unless the provider reports it directly (e.g., LiteLLM's -``response._hidden_params.response_cost``). +at ``0.0`` unless the provider reports it directly. For automatic pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. """ From acd73db990756a93e89aab9b69ccf35054720610 Mon Sep 17 00:00:00 2001 From: cemde Date: Tue, 17 Mar 2026 18:08:29 +0100 Subject: [PATCH 18/19] [skip ci] fix smaller issues --- AGENTS.md | 26 ++++++------- .../five_a_day_benchmark.py | 2 +- maseval/core/agent.py | 8 ++-- maseval/core/model.py | 4 +- maseval/core/usage.py | 7 +++- maseval/interface/agents/_cost.py | 38 +++++++++++++++++++ maseval/interface/agents/camel.py | 30 ++++++++------- maseval/interface/agents/langgraph.py | 23 +++++------ maseval/interface/agents/llamaindex.py | 25 ++++++------ maseval/interface/agents/smolagents.py | 33 ++++++++-------- pyproject.toml | 8 ++-- tests/test_core/test_registry.py | 37 ------------------ 12 files changed, 121 insertions(+), 120 deletions(-) create mode 100644 maseval/interface/agents/_cost.py diff --git a/AGENTS.md b/AGENTS.md index 2ba2a935..d16d775b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -265,12 +265,11 @@ mkdocs serve 1. Create a feature branch (never commit to `main`) 2. Make changes following code style guidelines -3. Run formatters and linters: `ruff format . && ruff check . --fix` -4. Run tests: `pytest -v` -5. Update documentation if needed -6. Open PR against `main` branch -7. Request review from `cemde` -8. Ensure all CI checks pass +3. Run `just all` before committing. 
This formats, lints, typechecks, and tests in one step. See the `justfile` for all available recipes.
+4. Update documentation if needed
+5. Open PR against `main` branch
+6. Request review from `cemde`
+7. Ensure all CI checks pass
 
 **CI Pipeline:** GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.
 
@@ -301,18 +300,15 @@ Example workflow:
 ## Common Tasks Quick Reference
 
 ```bash
-# Fresh environment setup
-uv sync --all-extras --all-groups
+# Fresh environment setup / Update after pulling changes
+just install  # uv sync --all-extras --all-groups
 
-# Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
+# Before committing (format, lint, typecheck, test)
+just all
 
 # Run example
 uv run python examples/amazon_collab.py
 
-# Update after pulling changes
-uv sync --all-extras --all-groups
-
 # Add optional dependency
 uv add --optional
 
@@ -320,6 +316,8 @@ uv add --optional
 uv run pytest tests/test_core/test_agent.py -v
 ```
 
+For more commands, see the `justfile`.
+
 ## Security and Confidentiality
 
 **IMPORTANT:** This project contains confidential research material.
 
@@ -540,4 +538,4 @@
 class Evaluator: ...
 ```
 
-**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
\ No newline at end of file
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index 77bc00ee..23ddfc62 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -978,7 +978,7 @@ def _fmt_usage(usage): # Group components by category if benchmark.usage_by_component: - by_category: dict[str, dict[str, object]] = defaultdict(dict) + by_category: Dict[str, Dict[str, object]] = defaultdict(dict) for key, usage in benchmark.usage_by_component.items(): category, name = key.split(":", 1) by_category[category][name] = usage diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 19b68f0e..39584700 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -1,4 +1,5 @@ from abc import ABC, abstractmethod +from dataclasses import replace from typing import List, Any, Optional, Dict from .callback import AgentCallback @@ -28,16 +29,13 @@ class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): litellm is installed). This means cost tracking often works with zero configuration. - To override or disable auto-detection, pass explicit values:: + To override auto-detection, pass explicit values:: adapter = SmolAgentAdapter( agent, name="researcher", cost_calculator=StaticPricingCalculator({...}), model_id="my-custom-model", ) - - Pass ``cost_calculator=None`` explicitly to disable cost calculation - even when auto-detection would otherwise enable it. 
""" def __init__( @@ -161,7 +159,7 @@ def gather_usage(self) -> Usage: if mid: cost = calculator.calculate_cost(usage, mid) if cost is not None: - usage.cost = cost + usage = replace(usage, cost=cost) return usage def _gather_usage(self) -> Usage: diff --git a/maseval/core/model.py b/maseval/core/model.py index 86399d7f..528f5ac2 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -48,7 +48,7 @@ from __future__ import annotations from abc import ABC, abstractmethod -from dataclasses import dataclass +from dataclasses import dataclass, replace from typing import Any, Optional, Dict, List, Union from datetime import datetime import time @@ -316,7 +316,7 @@ def chat( if token_usage.cost == 0.0 and self._cost_calculator is not None: calculated = self._cost_calculator.calculate_cost(token_usage, self.model_id) if calculated is not None: - token_usage.cost = calculated + token_usage = replace(token_usage, cost=calculated) self._usage_records.append(token_usage) diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 81bf8bda..6e32eae1 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -92,6 +92,11 @@ def __add__(self, other: Usage) -> Usage: if not isinstance(other, Usage): return NotImplemented + # Delegate to TokenUsage.__add__ when the right operand is a + # TokenUsage but self is a plain Usage, so token fields are preserved. 
+ if type(self) is Usage and isinstance(other, TokenUsage): + return TokenUsage.__add__(other, self) + cost = self.cost + other.cost # Sum units @@ -228,7 +233,7 @@ def to_dict(self) -> Dict[str, Any]: @classmethod def from_chat_response_usage( cls, - usage_dict: Dict[str, int], + usage_dict: Dict[str, Any], *, cost: float = 0.0, provider: Optional[str] = None, diff --git a/maseval/interface/agents/_cost.py b/maseval/interface/agents/_cost.py new file mode 100644 index 00000000..11f1e0f7 --- /dev/null +++ b/maseval/interface/agents/_cost.py @@ -0,0 +1,38 @@ +"""Shared cost-calculator auto-detection for agent adapters.""" + +from typing import Optional, Tuple + +from maseval.core.usage import CostCalculator + + +def resolve_auto_cost_calculator( + explicit: Optional[CostCalculator], + cached: Optional[CostCalculator], + attempted: bool, +) -> Tuple[Optional[CostCalculator], Optional[CostCalculator], bool]: + """Resolve the cost calculator, auto-creating one if litellm is available. + + Args: + explicit: The calculator passed explicitly by the user (may be ``None``). + cached: The cached auto-calculator from a previous call (``None`` if + not yet created or creation failed). + attempted: Whether auto-creation has been attempted before. + + Returns: + Tuple of ``(calculator_to_use, updated_cache, updated_attempted)``. + Callers should store the second and third elements back into + ``self._auto_calculator`` and ``self._auto_attempted``. 
+ """ + if explicit is not None: + return explicit, cached, attempted + + if not attempted: + attempted = True + try: + from maseval.interface.usage import LiteLLMCostCalculator + + cached = LiteLLMCostCalculator() + except (ImportError, Exception): + cached = None + + return cached, cached, attempted diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index 1c440687..d67b65dd 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -19,7 +19,7 @@ from maseval import AgentAdapter, MessageHistory, LLMUser, User from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = [ "CamelAgentAdapter", @@ -176,7 +176,12 @@ class CamelAgentAdapter(AgentAdapter): """ def __init__( - self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, cost_calculator: Any = None, model_id: Optional[str] = None + self, + agent_instance: Any, + name: str, + callbacks: Optional[List[Any]] = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, ): """Initialize the CAMEL adapter. @@ -199,7 +204,8 @@ def __init__( self.messages = None self._cost_calculator = cost_calculator self._model_id = model_id - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False # Store responses from each step() call self._responses: List[Any] = [] # Store errors that occur during execution (for comprehensive logging) @@ -429,7 +435,7 @@ def gather_traces(self) -> Dict[str, Any]: return base_traces - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from CAMEL agent. 
CAMEL's ChatAgent stores the model backend in ``model_backend`` @@ -445,18 +451,14 @@ def _resolve_model_id(self): pass return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage across all CAMEL agent responses. diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py index a2e9fd06..63afafe4 100644 --- a/maseval/interface/agents/langgraph.py +++ b/maseval/interface/agents/langgraph.py @@ -9,7 +9,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["LangGraphAgentAdapter", "LangGraphLLMUser"] @@ -123,7 +123,7 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, config: Optional[Dict[str, Any]] = None, - cost_calculator: Any = None, + cost_calculator: Optional[CostCalculator] = None, model_id: Optional[str] = None, ): """Initialize the LangGraph adapter. 
@@ -149,7 +149,8 @@ def __init__( super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id) self._langgraph_config = config self._last_result = None - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False def get_messages(self) -> MessageHistory: """Get message history from LangGraph. @@ -234,18 +235,14 @@ def gather_config(self) -> dict[str, Any]: return base_config - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage from LangGraph message metadata. 
diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index 5c1de402..30ce4283 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -10,7 +10,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["LlamaIndexAgentAdapter", "LlamaIndexLLMUser"] @@ -118,7 +118,7 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, max_iterations: Optional[int] = None, - cost_calculator: Any = None, + cost_calculator: Optional[CostCalculator] = None, model_id: Optional[str] = None, ): """Initialize the LlamaIndex adapter. @@ -143,7 +143,8 @@ def __init__( self._last_result = None self._message_cache: List[Dict[str, Any]] = [] self._max_iterations = max_iterations - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False def get_messages(self) -> MessageHistory: """Get message history from LlamaIndex. @@ -224,7 +225,7 @@ def gather_config(self) -> Dict[str, Any]: return base_config - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from LlamaIndex agent. 
LlamaIndex agents store their LLM in ``self.llm``, which has a @@ -244,18 +245,14 @@ def _resolve_model_id(self): pass return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage from LlamaIndex execution logs. 
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py index d0ae1905..25057208 100644 --- a/maseval/interface/agents/smolagents.py +++ b/maseval/interface/agents/smolagents.py @@ -7,7 +7,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["SmolAgentAdapter", "SmolAgentLLMUser"] @@ -102,7 +102,14 @@ class SmolAgentAdapter(AgentAdapter): smolagents to be installed: `pip install maseval[smolagents]` """ - def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_calculator: Any = None, model_id: Optional[str] = None): + def __init__( + self, + agent_instance: Any, + name: str, + callbacks: Any = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, + ): """Initialize the Smolagent adapter. Note: We don't call super().__init__() to avoid initializing self.logs as a list, @@ -124,7 +131,8 @@ def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_c self.messages = None self._cost_calculator = cost_calculator self._model_id = model_id - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False @property def logs(self) -> List[Dict[str, Any]]: # type: ignore[override] @@ -333,7 +341,7 @@ def gather_traces(self) -> dict: return base_logs - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from smolagents agent. 
All smolagents model classes (LiteLLMModel, OpenAIServerModel, @@ -345,19 +353,14 @@ def _resolve_model_id(self): except AttributeError: return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - # Lazy auto-create: try LiteLLMCostCalculator once - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False # Sentinel: don't retry - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage across all agent steps. 
diff --git a/pyproject.toml b/pyproject.toml index a352a908..45805087 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -31,10 +31,10 @@ dependencies = [ # Enable optional dependencies for end users [project.optional-dependencies] # Agent frameworks -smolagents = ["smolagents>=1.21.3"] -langgraph = ["langgraph>=0.6.0"] -llamaindex = ["llama-index-core>=0.12.0"] -camel = ["camel-ai>=0.2.0"] +smolagents = ["smolagents>=1.21.3", "litellm>=1.0.0"] +langgraph = ["langgraph>=0.6.0", "litellm>=1.0.0"] +llamaindex = ["llama-index-core>=0.12.0", "litellm>=1.0.0"] +camel = ["camel-ai>=0.2.0", "litellm>=1.0.0"] # Inference engines anthropic = ["anthropic>=0.40.0"] diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py index 17b30c5b..f3afd27f 100644 --- a/tests/test_core/test_registry.py +++ b/tests/test_core/test_registry.py @@ -262,43 +262,6 @@ def worker(worker_id: int): # ==================== Usage Tracking Tests ==================== -class MockUsageComponent(TraceableMixin): - """Component that implements UsageTrackableMixin for testing.""" - - def __init__(self, name: str, cost: float = 0.0, input_tokens: int = 0, output_tokens: int = 0): - super().__init__() - self._name = name - self._cost = cost - self._input_tokens = input_tokens - self._output_tokens = output_tokens - - def gather_traces(self) -> Dict[str, Any]: - return {"name": self._name} - - def gather_usage(self): - from maseval.core.usage import TokenUsage - - return TokenUsage( - cost=self._cost, - input_tokens=self._input_tokens, - output_tokens=self._output_tokens, - total_tokens=self._input_tokens + self._output_tokens, - ) - - -class MockBrokenUsageComponent(TraceableMixin): - """Component whose gather_usage raises an exception.""" - - def __init__(self): - super().__init__() - - def gather_traces(self) -> Dict[str, Any]: - return {} - - def gather_usage(self): - raise RuntimeError("Usage collection failed") - - class UsageAwareComponent(TraceableMixin, UsageTrackableMixin): 
"""Component with both tracing and usage tracking.""" From d7a8b0b9507e6b7310b420f24f520876061be3fa Mon Sep 17 00:00:00 2001 From: cemde Date: Tue, 17 Mar 2026 18:20:04 +0100 Subject: [PATCH 19/19] fixed changelog --- CHANGELOG.md | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 61e35d06..53911dc0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,13 +11,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** -- Usage and cost tracking as a first-class collection axis alongside tracing and configuration. `Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) -- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) -- `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) -- Cost calculation for agent adapters. `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both the model ID and cost calculator are auto-detected (model ID from the framework's agent object, calculator via `LiteLLMCostCalculator` if litellm is installed). For LangGraph, `model_id` must be passed explicitly since graphs can contain multiple models. 
Explicit `cost_calculator` and `model_id` always override auto-detection. (PR: #45) -- `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) -- Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) -- `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. (PR: #45) +- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45) +- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45) +- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. 
(PR: #45) - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42) - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42) @@ -55,8 +51,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Documentation** -- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) covering automatic LLM tracking, cost calculators, non-LLM usage, post-hoc analysis with `UsageReporter`, and the data model. (PR: #45) -- Usage & Cost reference page (`docs/reference/usage.md`) with API documentation for all usage and cost classes. (PR: #45) +- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45) **Core**
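The summable-usage semantics described above (records with `__add__`, accumulated via `sum(records, Usage())`, with costs summing and per-unit counters merging key-wise) can be illustrated with a self-contained sketch. Field names follow the plan's data model, but this is not the library code:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Usage:
    cost: float = 0.0
    units: Dict[str, float] = field(default_factory=dict)

    def __add__(self, other: "Usage") -> "Usage":
        # Costs sum; per-unit counters merge key-wise.
        merged = dict(self.units)
        for key, value in other.units.items():
            merged[key] = merged.get(key, 0) + value
        return Usage(cost=self.cost + other.cost, units=merged)


records = [
    Usage(cost=0.01, units={"api_calls": 1}),
    Usage(cost=0.03, units={"api_calls": 2, "bytes": 1024}),
]
# Usage() as the start value makes empty record lists well-defined.
total = sum(records, Usage())
print(round(total.cost, 4))  # 0.04
print(total.units)           # {'api_calls': 3, 'bytes': 1024}
```

The frozen dataclass mirrors the `dataclasses.replace(usage, cost=...)` pattern used in the diffs above: records are never mutated in place, only recombined.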