From 6c9614f6d39463140dc48821356057e806f62896 Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 16:30:00 +0100 Subject: [PATCH 01/19] updated plan --- usage_tracking/PLAN.md | 317 ++++++++++++++++ usage_tracking/api_usage_results.json | 523 ++++++++++++++++++++++++++ usage_tracking/api_usage_test.py | 154 ++++++++ 3 files changed, 994 insertions(+) create mode 100644 usage_tracking/PLAN.md create mode 100644 usage_tracking/api_usage_results.json create mode 100644 usage_tracking/api_usage_test.py diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md new file mode 100644 index 00000000..f9b45133 --- /dev/null +++ b/usage_tracking/PLAN.md @@ -0,0 +1,317 @@ +# Usage & Cost Tracking — Implementation Plan + +## Motivation + +Benchmarking multi-agent systems incurs real costs: LLM API calls (the primary driver), but also external service calls (e.g., Bloomberg data API, geocoding services, paid search APIs). MASEval currently extracts basic token counts into `ChatResponse.usage` but does not persist, enrich, or aggregate this data. We want first-class usage tracking that: + +- Captures token usage and cost per LLM call with provider-specific detail +- Supports non-token costs (external service calls billed per-request or per-unit) +- Aggregates across provider, task, component role, and total +- Is queryable live during benchmark execution (not just post-hoc) +- Captures usage even for failed tasks +- Requires zero changes from benchmark implementers for the common LLM case + +## Design Principles + +1. **LLM-first, not LLM-only.** The base abstraction is generic (cost + arbitrary units), with an LLM-specific subclass that adds token semantics. +2. **No hardcoded prices.** Pricing changes constantly. Users supply pricing or rely on provider-reported cost (e.g., OpenRouter). If neither is available, cost is `None`. +3. **Automatic for models, opt-in for tools.** ModelAdapter tracks usage automatically via the base `chat()` method. 
Tool/environment authors opt in via `UsageTrackableMixin`. +4. **Non-breaking.** `ChatResponse.usage` stays a `Dict[str, int]` with additional optional keys. Existing code that reads `usage["input_tokens"]` continues to work. +5. **First-class collection axis.** Usage is collected via `gather_usage()` / `collect_usage()`, parallel to `gather_traces()` / `collect_traces()` and `gather_config()` / `collect_configs()`. It is not embedded inside traces. +6. **Live queryable.** The registry maintains a running usage total across repetitions, queryable at any time via `benchmark.usage`. + +--- + +## Data Model + +### `Usage` (base) + +Generic usage record for any billable resource. Stored as a simple dataclass. + +``` +Usage + cost: Optional[float] # Total cost in USD (None = unknown) + cost_details: Dict[str, float] # Breakdown (e.g., {"input": 0.01, "output": 0.03}) + units: Dict[str, int | float] # Arbitrary countable units (e.g., {"api_calls": 3, "bytes": 1024}) + metadata: Dict[str, Any] # Provider-specific extras +``` + +Supports `__add__` to sum two records (costs sum if both known, else None; units sum; metadata merges). + +### `TokenUsage(Usage)` (LLM-specific) + +Extends `Usage` with token fields that every LLM provider reports. + +``` +TokenUsage(Usage) + input_tokens: int + output_tokens: int + total_tokens: int + # Optional provider-specific detail + cached_input_tokens: int # Anthropic cache_read, OpenAI cached_tokens + reasoning_tokens: int # OpenAI reasoning, Google thoughts + audio_tokens: int # OpenAI audio +``` + +`TokenUsage.__add__` sums all token fields plus delegates to `Usage.__add__` for cost/units. + +Class method `TokenUsage.from_chat_response_usage(usage_dict) -> TokenUsage` maps the dict returned by adapters today into a `TokenUsage` instance, handling provider-specific key names. + +--- + +## UsageTrackableMixin + +Follows the established mixin pattern (`TraceableMixin`, `ConfigurableMixin`). 
Any component that inherits `UsageTrackableMixin` will have its usage automatically collected by the registry when registered. + +```python +class UsageTrackableMixin: + """Mixin that provides usage tracking capability to any component.""" + + def gather_usage(self) -> Usage: + """Return accumulated usage for this component. + + Subclasses must override this to return their accumulated Usage. + Base implementation returns an empty Usage. + """ + return Usage() +``` + +Components internally accumulate `Usage` records however they see fit (typically a list + sum). The mixin only defines the collection protocol — `gather_usage() -> Usage`. + +### Usage in components + +**ModelAdapter** (automatic): + +```python +class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): + def __init__(self, seed=None): + super().__init__() + self._usage_records: List[Usage] = [] + + def chat(self, messages, ...): + response = self._chat_impl(messages, ...) + if response.usage: + self._usage_records.append( + TokenUsage.from_chat_response_usage(response.usage) + ) + return response + + def gather_usage(self) -> Usage: + if not self._usage_records: + return Usage() + return sum(self._usage_records[1:], self._usage_records[0]) +``` + +**Non-model components** (opt-in): + +```python +class BloombergEnvironment(Environment, UsageTrackableMixin): + def __init__(self, task_data): + super().__init__(task_data) + self._usage_records: List[Usage] = [] + + def _call_bloomberg(self, query): + result = bloomberg_client.query(query) + self._usage_records.append(Usage( + cost=result.billed_amount, + units={"api_calls": 1, "data_points": result.count}, + )) + return result + + def gather_usage(self) -> Usage: + if not self._usage_records: + return Usage() + return sum(self._usage_records[1:], self._usage_records[0]) +``` + +--- + +## Registry Integration + +The `ComponentRegistry` gains a third collection axis for usage, parallel to traces and configs. 
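Both the component-level accumulation shown above (`sum(records[1:], records[0])`) and the registry's running totals rely on `Usage` supporting `+`. A minimal sketch of the summing semantics from the data-model section — cost sums only when both sides are known, units and cost breakdowns sum per key, metadata merges — is below; field names follow the plan, but this is illustrative, not the final implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Usage:
    """Illustrative sketch of the Usage summing semantics described above."""

    cost: Optional[float] = None
    cost_details: Dict[str, float] = field(default_factory=dict)
    units: Dict[str, float] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __add__(self, other: "Usage") -> "Usage":
        # Cost is known only if both operands report one; otherwise the sum is unknown.
        cost = (
            self.cost + other.cost
            if self.cost is not None and other.cost is not None
            else None
        )

        def merge_sums(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
            # Per-key addition over the union of keys.
            return {k: a.get(k, 0) + b.get(k, 0) for k in {**a, **b}}

        return Usage(
            cost=cost,
            cost_details=merge_sums(self.cost_details, other.cost_details),
            units=merge_sums(self.units, other.units),
            metadata={**self.metadata, **other.metadata},  # later record wins on clashes
        )
```

This is what makes `self._usage_total += component_usage` in the registry well-defined even when some records carry no cost: one unknown cost poisons the total to `None` rather than silently under-reporting spend.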
+ +### Per-repetition collection + +`collect_usage()` walks all registered `UsageTrackableMixin` components and calls `gather_usage()` on each. Returns a structured dict (same shape as `collect_traces()`/`collect_configs()`). This goes into `report["usage"]`. + +```python +def collect_usage(self) -> Dict[str, Any]: + """Collect usage from all registered UsageTrackableMixin components.""" + usage = { + "metadata": {...}, + "agents": {}, + "models": {}, + "tools": {}, + ... + "environment": None, + "user": None, + } + + for key, component in self._usage_registry.items(): + category, comp_name = key.split(":", 1) + component_usage = component.gather_usage() + + # Store in structured dict (same pattern as traces/configs) + ... + + # Accumulate into persistent aggregates + self._usage_total += component_usage + self._usage_by_component[key] += component_usage + + return usage +``` + +### Persistent aggregates (survive `clear()`) + +The registry maintains running totals that persist across task repetitions: + +```python +class ComponentRegistry: + def __init__(self): + # ... existing per-repetition state ... + + # Persistent usage aggregates (NOT cleared between repetitions) + self._usage_total: Usage = Usage() + self._usage_by_component: Dict[str, Usage] = {} + + def clear(self): + # Clears per-repetition registrations + # Does NOT clear _usage_total or _usage_by_component + + @property + def total_usage(self) -> Usage: + """Running total across all repetitions. Queryable at any time.""" + return self._usage_total + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running totals across all repetitions.""" + return dict(self._usage_by_component) +``` + +### Registration + +The `register()` method gains an `isinstance(component, UsageTrackableMixin)` check, parallel to the existing `TraceableMixin` and `ConfigurableMixin` checks: + +```python +def register(self, category, name, component): + # ... existing trace/config registration ... 
+ + if isinstance(component, UsageTrackableMixin): + self._usage_registry[key] = component + self._usage_component_id_map[component_id] = key +``` + +`RegisterableComponent` type alias is updated to include `UsageTrackableMixin`. + +--- + +## Benchmark Integration + +### Report structure + +Each report gains a top-level `"usage"` key alongside `"traces"` and `"config"`: + +```python +report = { + "task_id": str(task.id), + "repeat_idx": repeat_idx, + "status": execution_status.value, + "traces": execution_traces, + "config": execution_configs, + "usage": execution_usage, # <-- new + "eval": eval_results, + "task": {...}, +} +``` + +### Live usage access + +```python +benchmark.usage # -> Usage (running grand total, delegates to registry) +benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) +``` + +### Failed task usage + +`collect_usage()` is called alongside `collect_all_traces()` and `collect_all_configs()` — before error status is determined. If a task fails mid-execution, whatever usage was accumulated up to the failure point is still collected and aggregated. + +--- + +## Adapter `_chat_impl` Enrichment (per-provider) + +Each adapter enriches the `ChatResponse.usage` dict with provider-specific fields beyond the basic three. The base class `TokenUsage.from_chat_response_usage()` handles mapping. + +| Adapter | Extra fields to extract | +|---------|------------------------| +| OpenAI | `reasoning_tokens` from `completion_tokens_details`, `cached_input_tokens` from `prompt_tokens_details.cached_tokens` | +| Anthropic | `cached_input_tokens` from `cache_read_input_tokens` | +| Google | `reasoning_tokens` from `thoughts_token_count` | +| LiteLLM | `reasoning_tokens` + `cached_input_tokens` from details; `cost` from `response._hidden_params` if available | +| HuggingFace | No change (local inference, no API cost) | + +--- + +## UsageReporter (post-hoc) + +Post-run utility that walks `report["usage"]` across all reports for sliced analysis. 
+ +``` +UsageReporter + @staticmethod from_reports(reports: List[Dict]) -> UsageReporter + + by_task() -> Dict[str, Usage] # keyed by task_id + by_component() -> Dict[str, Usage] # keyed by registry key (e.g., "models:main_model") + by_model() -> Dict[str, TokenUsage] # keyed by model_id (LLM-only) + total() -> Usage # grand total + + summary() -> Dict[str, Any] # nested dict with all breakdowns +``` + +Unlike the registry's live aggregates, `UsageReporter` can slice by task (since it sees the full report list with task IDs). + +--- + +## Evaluators + +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. That model should be registered in the benchmark via `self.register("evaluator_models", "judge", model)` inside `setup_evaluators()`. Since `ModelAdapter` now inherits `UsageTrackableMixin`, its usage is automatically collected under `usage.evaluator_models.judge`. + +No changes to the `Evaluator` base class. This is a registration convention. + +## LLMUser / AgenticLLMUser + +These already hold a `ModelAdapter`. Their model's usage is collected automatically (since `ModelAdapter` inherits `UsageTrackableMixin` and `chat()` accumulates records). The model is already registered by the benchmark. No changes needed. 
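As a concrete sketch of the key mapping `TokenUsage.from_chat_response_usage()` must perform, here is a plain-function version covering the two naming conventions visible in `api_usage_results.json` (Anthropic-style `input_tokens`/`output_tokens` vs. OpenAI/LiteLLM-style `prompt_tokens`/`completion_tokens`), plus the nested detail fields from the per-provider table above. The function name and dict return type are illustrative only; the real classmethod returns a `TokenUsage` instance:

```python
from typing import Any, Dict


def normalize_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Map a provider usage dict onto TokenUsage-style fields (illustrative sketch)."""
    input_tokens = int(usage.get("input_tokens") or usage.get("prompt_tokens") or 0)
    output_tokens = int(usage.get("output_tokens") or usage.get("completion_tokens") or 0)

    # Detail fields are nested differently per provider (see the enrichment table).
    completion_details = usage.get("completion_tokens_details") or {}
    prompt_details = usage.get("prompt_tokens_details") or {}

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": int(usage.get("total_tokens") or (input_tokens + output_tokens)),
        "reasoning_tokens": int(completion_details.get("reasoning_tokens") or 0),
        "cached_input_tokens": int(
            usage.get("cache_read_input_tokens")        # Anthropic
            or prompt_details.get("cached_tokens")      # OpenAI / LiteLLM
            or 0
        ),
    }
```

The `or 0` fallbacks matter in practice: the captured payloads show providers reporting `null` rather than omitting keys (e.g. LiteLLM's `completion_tokens_details: null` for Anthropic), so `dict.get` with a default alone is not enough.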
+ +--- + +## File Plan + +| File | Action | Content | +|------|--------|---------| +| `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` | +| `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property | +| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()` | +| `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict | +| `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility | +| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens` | +| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens` | +| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens` | +| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost` | +| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `UsageReporter` | +| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation | + +No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`. + +--- + +## Non-goals + +- **Hardcoded pricing tables** — prices change too often; user-supplied or provider-reported only. +- **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now. +- **Billing integration** — no webhook/billing system integration. +- **Streaming usage** — not supported yet (usage is captured after completion). + +## Open Questions + +1. 
**Pricing config format**: Should pricing be passed to `ModelAdapter.__init__` as a new param, or set externally after construction? Leaning toward a `pricing` kwarg on adapter init for ergonomics. When `pricing` is provided and a `TokenUsage` record has `cost=None`, cost is computed from `pricing["input"] * input_tokens + pricing["output"] * output_tokens`. +2. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. diff --git a/usage_tracking/api_usage_results.json b/usage_tracking/api_usage_results.json new file mode 100644 index 00000000..4dcd9b8e --- /dev/null +++ b/usage_tracking/api_usage_results.json @@ -0,0 +1,523 @@ +{ + "direct__openai__gpt5_mini": { + "id": "chatcmpl-DFJoysUJJeWtuOVc5UIq3EnA580Ok", + "choices": [ + { + "finish_reason": "length", + "index": 0, + "logprobs": null, + "message": { + "content": "", + "refusal": null, + "role": "assistant", + "annotations": [], + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1772543484, + "model": "gpt-5-mini-2025-08-07", + "object": "chat.completion", + "service_tier": "default", + "system_fingerprint": null, + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": 0, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0 + } + } + }, + "direct__anthropic__claude_haiku": { + "id": "msg_01UDvWsS78tyf4xQ1wwDsNop", + "content": [ + { + "citations": null, + "text": "# Hello! \ud83d\udc4b\n\nWelcome! I'm Claude, an AI assistant made by Anthropic. 
How can I help you today?", + "type": "text" + } + ], + "model": "claude-haiku-4-5-20251001", + "role": "assistant", + "stop_reason": "end_turn", + "stop_sequence": null, + "type": "message", + "usage": { + "cache_creation": { + "ephemeral_1h_input_tokens": 0, + "ephemeral_5m_input_tokens": 0 + }, + "cache_creation_input_tokens": 0, + "cache_read_input_tokens": 0, + "input_tokens": 11, + "output_tokens": 32, + "server_tool_use": null, + "service_tier": "standard", + "inference_geo": "not_available" + } + }, + "direct__google__gemini3_flash": { + "sdk_http_response": { + "headers": { + "content-type": "application/json; charset=UTF-8", + "vary": "Origin, X-Origin, Referer", + "content-encoding": "gzip", + "date": "Tue, 03 Mar 2026 13:11:32 GMT", + "server": "scaffolding on HTTPServer2", + "x-xss-protection": "0", + "x-frame-options": "SAMEORIGIN", + "x-content-type-options": "nosniff", + "server-timing": "gfet4t7; dur=4579", + "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000", + "transfer-encoding": "chunked" + }, + "body": null + }, + "candidates": [ + { + "content": { + "parts": [ + { + "media_resolution": null, + "code_execution_result": null, + "executable_code": null, + "file_data": null, + "function_call": null, + "function_response": null, + "inline_data": null, + "text": "Hello", + "thought": null, + "thought_signature": "EqwCCqkCAb4-9vtlfRlETMXR13Bw3xpBm-D3EzoUlVhmePvHy720UANX0hdyBGaq1d8FfiHVTMccuBl5r7sg3fy_2GoTexpytWLm15I7GfRloHt278ioOMDH3Ua8SVuGCIiRyIVSye1vkQw7p0KwZMzJ51fjhuBH-4_weZe24FglHg0p3eo79cKZIMz8eiWpcGtK6Xb25Gk1mXuKCi7GaifkKaOmhXTSjVZ-P-w5qERlTscMv-2YMD26Th8MUEg13PwFlz385A9RnHLH_oXkdr0lXAHemNj7dHdJEfNzjSgqCJdeVT3PCwH0v6-AIqdQuqD6jnvODLDPms5liN7VAVAAOZiq8tLDE771c3Xc-7eIPFdD3h9_cdvb82hjefYjEwC-aWNXQrl1SlVw0Un0", + "video_metadata": null + } + ], + "role": "model" + }, + "citation_metadata": null, + "finish_message": null, + "token_count": null, + "finish_reason": "MAX_TOKENS", + "avg_logprobs": null, + "grounding_metadata": null, + 
"index": 0, + "logprobs_result": null, + "safety_ratings": null, + "url_context_metadata": null + } + ], + "create_time": null, + "model_version": "gemini-3-flash-preview", + "prompt_feedback": null, + "response_id": "BN6maev1HoqB7M8P3fvBkAo", + "usage_metadata": { + "cache_tokens_details": null, + "cached_content_token_count": null, + "candidates_token_count": 1, + "candidates_tokens_details": null, + "prompt_token_count": 5, + "prompt_tokens_details": [ + { + "modality": "TEXT", + "token_count": 5 + } + ], + "thoughts_token_count": 59, + "tool_use_prompt_token_count": null, + "tool_use_prompt_tokens_details": null, + "total_token_count": 65, + "traffic_type": null + }, + "automatic_function_calling_history": [], + "parsed": null + }, + "litellm__openai__gpt5_mini": { + "id": "chatcmpl-DFJp7YfyGQH1HtypDlKDucycpUAdk", + "created": 1772543493, + "model": "gpt-5-mini-2025-08-07", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": "", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null + }, + "annotations": [] + }, + "provider_specific_fields": {} + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": 0, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": 0, + "text_tokens": null + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null + } + }, + "service_tier": "default" + }, + "litellm__anthropic__claude_haiku": { + "id": "chatcmpl-d251dec3-5b1a-424c-a432-cb71ea3d600f", + "created": 1772543495, + "model": "claude-haiku-4-5-20251001", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello! 
\ud83d\udc4b How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "citations": null, + "thinking_blocks": null + } + } + } + ], + "usage": { + "completion_tokens": 16, + "prompt_tokens": 11, + "total_tokens": 27, + "completion_tokens_details": null, + "prompt_tokens_details": { + "audio_tokens": null, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_creation_tokens": 0, + "cache_creation_token_details": { + "ephemeral_5m_input_tokens": 0, + "ephemeral_1h_input_tokens": 0 + } + }, + "cache_creation_input_tokens": 0, + "cache_read_input_tokens": 0 + } + }, + "litellm__google__gemini3_flash": { + "id": "DN6macurC97hnsEPvs-FmA0", + "created": 1772543495, + "model": "gemini-3-flash-preview", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": "Hello", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "images": [], + "thinking_blocks": [ + { + "type": "thinking", + "thinking": "{\"text\": \"Hello\"}", + "signature": "EpoCCpcCAb4+9vtdR++YPo/XeAmaLKPKkk7+YyeGjuHP9w646HEu9lG0xhb6qHOfkUTcH7xh08RlU6QXrTKAkXwfBAsSbiBfIBCGlzygFq+QGAS4LzUFaCLOD73MmSk7WiB393VWRw04NsxbhNtTH5aM9JFaxb7yvZMwWMckTON8L9Rv7gFlo6NmYjn01ct+kBKxleJzyD8d2AnAA4wMw9zqz8pLSAU9swKxmuqs0JkHt8WNRzwtw11xGt5zR909g/v/swLY/Oh+lcHiO7PMBsPHtBvzmPHTMM/ecn1VdA9sWqmoc8suFfzTaOPeegvtkhaytoZnaNZ/FoV9y9qVex5r8R0zvPd4ennA9/asI5P1i9HL0NedNJ78avW4" + } + ], + "provider_specific_fields": null + } + } + ], + "usage": { + "completion_tokens": 60, + "prompt_tokens": 5, + "total_tokens": 65, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": null, + "reasoning_tokens": 59, + "rejected_prediction_tokens": null, + "text_tokens": 1 + }, + "prompt_tokens_details": { + "audio_tokens": null, + "cached_tokens": null, + "text_tokens": 5, + "image_tokens": null + } + }, + 
"vertex_ai_grounding_metadata": [], + "vertex_ai_url_context_metadata": [], + "vertex_ai_safety_results": [], + "vertex_ai_citation_metadata": [] + }, + "openrouter__openai__gpt5_mini": { + "id": "gen-1772543500-cToh8SauCW1u8pGlb4qQ", + "created": 1772543500, + "model": "openai/gpt-5-mini-2025-08-07", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": null, + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null, + "reasoning_details": [ + { + "type": "reasoning.summary", + "summary": "**Greet and respond warmly**\n\nThe user says, \"Hello, World!\" \u2014 it feels like a friendly greeting. I should definitely greet back warmly and ask how I can help them! It\u2019s a classic reference in programming, but they might just want a simple hello. I'm thinking of keeping my response concise. So, I\u2019ll reply with a friendly greeting and a question about what they\u2019d like to know or discuss. That seems like a nice approach!**Greet and engage**\n\nThe user is saying \"Hello, World!\" which feels like a classic greeting\u2014possibly a nod to programming culture. I think it\u2019s best to respond warmly, so I\u2019ll greet back with enthusiasm. Since they might just be saying hello, I\u2019ll keep it concise and friendly by asking how I can help. 
It\u2019s always nice to invite a conversation!", + "format": "openai-responses-v1", + "index": 0 + }, + { + "type": "reasoning.encrypted", + "data": "gAAAAABppt4SyBWDwFHFtAmspfjQOATVwEjInh21J297wGrovStyUcpp_QsIc3E18qz3EreGtoiQVrUwb4UnffV87xCLDfmoBtxDaSxzbYEJUgYNtjZ6hPr8peySEgtsGPypJmtVJJQ2In9BN-57EeNEqAifTsmKxtCPCm4KHRRAmiXI3zXpokxr8IldC8LYXFM4stVdsWBJxwBYWM6G_vV4VWgmJr15jHIk0tVhx2Rtoca5JQ-0MAf0mQQQbLHBAFnGKhNgBoi_Qnq06A87xSejoUkb7Lb8N6_1u9nFYyixciACYaJIqMeRU5timTRIivBsypP8GPgx-6HyCfqRGhi2nbd5HvTKw4vLTFbtDBR2lRUUsFfJXnbLZvBZO2jbWhYAPvQnnQjpbU5jXE6jPM8z-J4eyGvUg49u3n7fFqe-Nxph4Fuophnbz1-ZCdboejHXfbz9-zKcX-FaVhCkuT82gUWNBq09lLpmOjGERQr5EHguZbhGC1QKkSG59iXMTfRlPssV42xYDpWL1ci0Jbg96TAq8sEnlaY9AMtJTh0NH14Ou8rX3g-g7U2MDomJbcZtY8oNZtyY_3s7ENSatmmaCsX6eQsRuuhSOrxXZSz1l4Zyxes-TseYCQya0YPu3eCNA7-7qhYBWDbtxdBqaTyN9krqM9rkC_p4fQn3q4-2S-Wt9kElCX-SrdMR_qXYZPz8O4BsJwM1aA8gQQji5X8CnYFWTBkBLEQuv2MaR6dDuwvZUsWuCf41YJGw4GJHmdBdDbZflvgpVmuBRwk476MqDac6jXl2VlOgQ0v0zk4M6j5Hb29uCgUFDv0aTyf24wqAZdYRsKQOSLV2Wke38K1qLvUfn99yqkBllsFpdk0DsJJBG4axiK4Kr10BhhApNJokRqIjkT8HU7w5PDRLPryFoc6kuMIuS72RhOKXxZrDu7D_fuWHseOMyVrDULSYhf_GfZEIcnFwGBcIRhhQZG-lzSs_wssCojIGjRX0J67fOZk8YCCvjeabRCbbGTbHDXZxhRL_5Niwz0V4Jgd_97pOlIsVVOgS2-IuIc4445WjpkqGk6mRplBTZEPwZV2ny3v9w3aq-W6_lasXOOmv342RTXXo-pSKaZrowkI3rQUJ_fR5y7mumdDI82C-2onxbNfWI65PUgRW5KUXVgL4RXPu0yI0wu2z7LTyNaVoLSaF9wOtOzEtLux9Pf50EYjqlfD7niQoVR8Pv9D-1fhvrFDmeAzmgBdaqmCWhJWJgZUvtN43Wv2UNjk=", + "format": "openai-responses-v1", + "id": "rs_0f89d92952e937610169a6de0d3f28819085143727619d92cb", + "index": 1 + } + ] + } + }, + "provider_specific_fields": { + "native_finish_reason": "max_output_tokens" + } + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 10, + "total_tokens": 74, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 64, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + 
"cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 0.0001305, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 0.0001305, + "upstream_inference_prompt_cost": 2.5e-06, + "upstream_inference_completions_cost": 0.000128 + } + }, + "provider": "OpenAI" + }, + "openrouter__anthropic__claude_haiku": { + "id": "gen-1772543509-FAaKhTwazzoJmVDDd3ih", + "created": 1772543509, + "model": "anthropic/claude-4.5-haiku-20251001", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello! \ud83d\udc4b How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null + } + }, + "provider_specific_fields": { + "native_finish_reason": "stop" + } + } + ], + "usage": { + "completion_tokens": 16, + "prompt_tokens": 11, + "total_tokens": 27, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 0, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 9.1e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 9.1e-05, + "upstream_inference_prompt_cost": 1.1e-05, + "upstream_inference_completions_cost": 8e-05 + } + }, + "provider": "Google" + }, + "openrouter__google__gemini3_flash": { + "id": "gen-1772543512-Mxn343CzRXITNLaWa3uw", + "created": 1772543512, + "model": "google/gemini-3-flash-preview-20251217", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "message": { + "content": "Hello, World! 
How can I help you today?", + "role": "assistant", + "tool_calls": null, + "function_call": null, + "provider_specific_fields": { + "refusal": null, + "reasoning": null, + "reasoning_details": [ + { + "type": "reasoning.encrypted", + "data": "CiEBjz1rX5mfLGj1Fml96xozj3K4fv7JeTBdOSaUUlxd96c=", + "format": "google-gemini-v1", + "index": 0 + } + ] + } + }, + "provider_specific_fields": { + "native_finish_reason": "STOP" + } + } + ], + "usage": { + "completion_tokens": 11, + "prompt_tokens": 4, + "total_tokens": 15, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 0, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 3.5e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 3.5e-05, + "upstream_inference_prompt_cost": 2e-06, + "upstream_inference_completions_cost": 3.3e-05 + } + }, + "provider": "Google" + }, + "openrouter__qwen__qwen3_30b": { + "id": "gen-1772543515-76qFgjV9ySYOE8mtplV6", + "created": 1772543515, + "model": "qwen/qwen3-30b-a3b-04-28", + "object": "chat.completion", + "system_fingerprint": null, + "choices": [ + { + "finish_reason": "length", + "index": 0, + "message": { + "content": null, + "role": "assistant", + "tool_calls": null, + "function_call": null, + "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", + "provider_specific_fields": { + "refusal": null, + "reasoning": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. 
I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", + "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like," + } + }, + "provider_specific_fields": { + "native_finish_reason": "length" + } + } + ], + "usage": { + "completion_tokens": 64, + "prompt_tokens": 13, + "total_tokens": 77, + "completion_tokens_details": { + "accepted_prediction_tokens": null, + "audio_tokens": 0, + "reasoning_tokens": 75, + "rejected_prediction_tokens": null, + "text_tokens": null, + "image_tokens": 0 + }, + "prompt_tokens_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": null, + "image_tokens": null, + "cache_write_tokens": 0, + "video_tokens": 0 + }, + "cost": 1.896e-05, + "is_byok": false, + "cost_details": { + "upstream_inference_cost": 1.896e-05, + "upstream_inference_prompt_cost": 1.04e-06, + "upstream_inference_completions_cost": 1.792e-05 + } + }, + "provider": "DeepInfra" + } +} \ No newline at end of file diff --git a/usage_tracking/api_usage_test.py b/usage_tracking/api_usage_test.py new file mode 100644 index 00000000..1c0a34b2 --- /dev/null +++ b/usage_tracking/api_usage_test.py @@ -0,0 +1,154 @@ +""" +Test script that calls GPT-5 mini, Claude Haiku 4.5, and Gemini 3 Flash in three +conditions each — (1) native client, (2) LiteLLM, (3) LiteLLM via OpenRouter — +plus Qwen 3 via LiteLLM+OpenRouter. Saves full response dicts to JSON for +usage/cost analysis. 
+""" + +import json +import os +import time +from pathlib import Path + +import anthropic +import litellm +import requests +from dotenv import load_dotenv +from google import genai +from google.genai import types +from openai import OpenAI + +load_dotenv() + +OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] +ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"] +GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"] +OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"] + +# LiteLLM reads OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY from env. +# For Gemini it expects GEMINI_API_KEY, so alias it. +os.environ.setdefault("GEMINI_API_KEY", GOOGLE_API_KEY) + +PROMPT = "Hello, World!" +MAX_TOKENS = 64 +TOTAL = 10 + +results = {} + + +def step(n: int, label: str): + print(f"{n}/{TOTAL} {label} ...") + + +# =========================================================================== # +# CONDITION 1 — Native SDKs (direct) +# =========================================================================== # + +# -- 1. GPT-5 mini (OpenAI) ------------------------------------------------ # +step(1, "GPT-5 mini — direct (OpenAI SDK)") +openai_client = OpenAI(api_key=OPENAI_API_KEY) +resp = openai_client.chat.completions.create( + model="gpt-5-mini", + messages=[{"role": "user", "content": PROMPT}], + max_completion_tokens=MAX_TOKENS, +) +results["direct__openai__gpt5_mini"] = resp.model_dump() +print(f" done — {resp.usage.total_tokens} tokens") + +# -- 2. Claude Haiku 4.5 (Anthropic) --------------------------------------- # +step(2, "Claude Haiku 4.5 — direct (Anthropic SDK)") +anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) +resp = anthropic_client.messages.create( + model="claude-haiku-4-5-20251001", + max_tokens=MAX_TOKENS, + messages=[{"role": "user", "content": PROMPT}], +) +results["direct__anthropic__claude_haiku"] = resp.model_dump() +print(f" done — {resp.usage.input_tokens + resp.usage.output_tokens} tokens") + +# -- 3. 
Gemini 3 Flash (Google) -------------------------------------------- # +step(3, "Gemini 3 Flash — direct (Google GenAI SDK)") +google_client = genai.Client(api_key=GOOGLE_API_KEY) +resp = google_client.models.generate_content( + model="gemini-3-flash-preview", + contents=PROMPT, + config=types.GenerateContentConfig(max_output_tokens=MAX_TOKENS), +) +results["direct__google__gemini3_flash"] = resp.model_dump(mode="json") +total = resp.usage_metadata.total_token_count if resp.usage_metadata else "n/a" +print(f" done — {total} tokens") + + +# =========================================================================== # +# CONDITION 2 — LiteLLM (direct to providers) +# =========================================================================== # + +litellm_direct_models = { + "litellm__openai__gpt5_mini": "gpt-5-mini", + "litellm__anthropic__claude_haiku": "claude-haiku-4-5-20251001", + "litellm__google__gemini3_flash": "gemini/gemini-3-flash-preview", +} + +for i, (label, model) in enumerate(litellm_direct_models.items(), start=4): + step(i, f"{model} — LiteLLM (direct)") + resp = litellm.completion( + model=model, + messages=[{"role": "user", "content": PROMPT}], + max_tokens=MAX_TOKENS, + ) + results[label] = resp.model_dump() + usage_total = resp.usage.total_tokens if resp.usage else "n/a" + print(f" done — {usage_total} tokens") + + +# =========================================================================== # +# CONDITION 3 — LiteLLM via OpenRouter (+Qwen) +# =========================================================================== # + + +def fetch_openrouter_generation(gen_id: str) -> dict | None: + """Query OpenRouter's generation endpoint for cost metadata.""" + time.sleep(2) # brief wait for metadata to be available + r = requests.get( + f"https://openrouter.ai/api/v1/generation?id={gen_id}", + headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}, + ) + if r.status_code == 200: + return r.json() + return None + + +litellm_openrouter_models = { + 
"openrouter__openai__gpt5_mini": "openrouter/openai/gpt-5-mini", + "openrouter__anthropic__claude_haiku": "openrouter/anthropic/claude-haiku-4-5", + "openrouter__google__gemini3_flash": "openrouter/google/gemini-3-flash-preview", + "openrouter__qwen__qwen3_30b": "openrouter/qwen/qwen3-30b-a3b", +} + +for i, (label, model) in enumerate(litellm_openrouter_models.items(), start=7): + step(i, f"{model} — LiteLLM (OpenRouter)") + resp = litellm.completion( + model=model, + messages=[{"role": "user", "content": PROMPT}], + max_tokens=MAX_TOKENS, + ) + result = resp.model_dump() + + # Fetch OpenRouter generation metadata (cost, native tokens, etc.) + gen_meta = fetch_openrouter_generation(resp.id) + if gen_meta: + result["_openrouter_generation"] = gen_meta + + results[label] = result + usage_total = resp.usage.total_tokens if resp.usage else "n/a" + print(f" done — {usage_total} tokens") + + +# =========================================================================== # +# Save results +# =========================================================================== # +out_path = Path(__file__).parent / "api_usage_results.json" +with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + +print(f"\nResults saved to {out_path}") From 4eb847f7b28268bcc208702bb36f1c96535c4b4b Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 18:05:58 +0100 Subject: [PATCH 02/19] initial commit --- maseval/__init__.py | 10 + maseval/core/benchmark.py | 43 ++- maseval/core/cost.py | 132 +++++++++ maseval/core/model.py | 40 ++- maseval/core/registry.py | 125 +++++++- maseval/core/reporting.py | 148 ++++++++++ maseval/core/usage.py | 301 ++++++++++++++++++++ maseval/interface/cost.py | 102 +++++++ maseval/interface/inference/anthropic.py | 10 +- maseval/interface/inference/google_genai.py | 10 +- maseval/interface/inference/huggingface.py | 6 +- maseval/interface/inference/litellm.py | 25 +- maseval/interface/inference/openai.py | 20 +- usage_tracking/PLAN.md | 85 
+++++- 14 files changed, 1029 insertions(+), 28 deletions(-) create mode 100644 maseval/core/cost.py create mode 100644 maseval/core/reporting.py create mode 100644 maseval/core/usage.py create mode 100644 maseval/interface/cost.py diff --git a/maseval/__init__.py b/maseval/__init__.py index d50350ff..addedf3e 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -37,6 +37,9 @@ from .core.evaluator import Evaluator from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin +from .core.usage import Usage, TokenUsage, UsageTrackableMixin +from .core.cost import CostCalculator, StaticPricingCalculator +from .core.reporting import UsageReporter from .core.registry import ComponentRegistry from .core.context import TaskContext from .core.exceptions import ( @@ -87,6 +90,13 @@ "MessageHistory", "ToolInvocationHistory", "TraceableMixin", + # Usage tracking + "Usage", + "TokenUsage", + "UsageTrackableMixin", + "CostCalculator", + "StaticPricingCalculator", + "UsageReporter", # Registry and execution context "ComponentRegistry", "TaskContext", diff --git a/maseval/core/benchmark.py b/maseval/core/benchmark.py index a027eb9d..d59ad233 100644 --- a/maseval/core/benchmark.py +++ b/maseval/core/benchmark.py @@ -16,6 +16,7 @@ from .callback import BenchmarkCallback from .user import User from .tracing import TraceableMixin +from .usage import Usage from .registry import ComponentRegistry, RegisterableComponent from .context import TaskContext from .utils.system_info import gather_benchmark_config @@ -442,6 +443,36 @@ def collect_all_configs(self) -> Dict[str, Any]: """ return self._registry.collect_configs() + def collect_all_usage(self) -> Dict[str, Any]: + """Collect usage from all registered components for the current task repetition. + + This method is called automatically by ``run()`` after each task repetition + completes. 
It gathers usage from all registered ``UsageTrackableMixin`` + components and also accumulates into persistent running totals accessible + via ``usage`` and ``usage_by_component``. + + Returns: + Structured dictionary containing usage from all registered components. + """ + return self._registry.collect_usage() + + @property + def usage(self) -> Usage: + """Running usage total across all task repetitions. + + Queryable at any time, including while the benchmark is still running. + Returns the grand total of all usage collected so far. + """ + return self._registry.total_usage + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running usage totals across all repetitions. + + Keys are registry keys (e.g., ``"models:main_model"``). + """ + return self._registry.usage_by_component + def _invoke_callbacks(self, method_name: str, *args, suppress_errors: bool = True, **kwargs) -> List[Exception]: """Invoke a callback method on all registered callbacks (thread-safe). @@ -1176,14 +1207,16 @@ def _execute_task_repetition( final_answers = None - # 3. Collect traces and configs (always attempt this) + # 3. Collect traces, configs, and usage (always attempt this) + execution_usage: Optional[Dict[str, Any]] = None try: execution_configs = self.collect_all_configs() execution_traces = self.collect_all_traces() + execution_usage = self.collect_all_usage() # Store in context for potential timeout errors context.set_collected_traces(execution_traces) except Exception as e: - # If trace/config collection fails, record it but continue + # If collection fails, record it but continue execution_configs = { "error": f"Failed to collect configs: {e}", "error_type": type(e).__name__, @@ -1192,6 +1225,11 @@ def _execute_task_repetition( "error": f"Failed to collect traces: {e}", "error_type": type(e).__name__, } + if execution_usage is None: + execution_usage = { + "error": f"Failed to collect usage: {e}", + "error_type": type(e).__name__, + } # 4. 
Evaluate (skip if task execution failed) if execution_status == TaskExecutionStatus.SUCCESS: @@ -1234,6 +1272,7 @@ def _execute_task_repetition( "error": error_info, "traces": execution_traces, "config": execution_configs, + "usage": execution_usage, "eval": eval_results, "task": { "query": task.query, diff --git a/maseval/core/cost.py b/maseval/core/cost.py new file mode 100644 index 00000000..a6eb8583 --- /dev/null +++ b/maseval/core/cost.py @@ -0,0 +1,132 @@ +"""Pluggable cost calculation for usage records. + +This module provides the ``CostCalculator`` protocol and a built-in +``StaticPricingCalculator`` that computes cost from token counts and +user-supplied pricing tables. For automatic pricing via LiteLLM's +bundled model database, see ``maseval.interface.cost``. + +Cost calculators are optional — if no calculator is provided to a +``ModelAdapter``, cost is only set when the provider reports it directly +(e.g., LiteLLM's ``response._hidden_params.response_cost``). +""" + +from __future__ import annotations + +from typing import Any, Dict, Optional, Protocol, runtime_checkable + +from .usage import TokenUsage + + +@runtime_checkable +class CostCalculator(Protocol): + """Protocol for computing cost from token usage. + + Implementations receive a ``TokenUsage`` and the model ID, and return + the cost in whatever unit the calculator declares (typically USD). + + Example: + ```python + class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens + ``` + """ + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost for a single chat call. + + Args: + usage: Token usage from the call. + model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). 
+ + Returns: + Cost as a float, or ``None`` if pricing is unknown for this model. + """ + ... + + +class StaticPricingCalculator: + """Cost calculator using user-supplied per-model pricing. + + Pricing is specified as cost per token (not per 1K or 1M tokens). + If a model is not in the pricing table, ``calculate_cost`` returns ``None``. + + Args: + pricing: Dict mapping model IDs to their per-token rates. + Each value is a dict with keys: + + - ``"input"`` — cost per input token (required) + - ``"output"`` — cost per output token (required) + - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + + Example: + ```python + calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, + }) + + model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) + ``` + + For university clusters or custom credit systems, the "cost" unit + is whatever the pricing values represent (credits, EUR, etc.): + + ```python + calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token + }) + ``` + """ + + def __init__(self, pricing: Dict[str, Dict[str, float]]): + self._pricing = pricing + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost from static per-token rates. + + Args: + usage: Token usage from the call. + model_id: The model identifier to look up in the pricing table. + + Returns: + Computed cost, or ``None`` if the model is not in the pricing table. 
+ """ + rates = self._pricing.get(model_id) + if rates is None: + return None + + input_rate = rates.get("input", 0.0) + output_rate = rates.get("output", 0.0) + cached_rate = rates.get("cached_input", input_rate) + + # Non-cached input tokens = total input - cached + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + + cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + + return cost + + def add_model(self, model_id: str, rates: Dict[str, float]) -> None: + """Add or update pricing for a model. + + Args: + model_id: The model identifier. + rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). + """ + self._pricing[model_id] = rates + + @property + def models(self) -> list[str]: + """List of model IDs with pricing configured.""" + return list(self._pricing.keys()) + + def gather_config(self) -> Dict[str, Any]: + """Return pricing configuration for reproducibility.""" + return { + "type": type(self).__name__, + "pricing": dict(self._pricing), + } diff --git a/maseval/core/model.py b/maseval/core/model.py index cac1c2ed..f33e48f4 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -55,6 +55,8 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import Usage, TokenUsage, UsageTrackableMixin +from .cost import CostCalculator from .history import MessageHistory @@ -133,7 +135,7 @@ def to_message(self) -> Dict[str, Any]: return msg -class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): +class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Abstract base class for model adapters. ModelAdapter provides a consistent interface for LLM inference across @@ -169,17 +171,24 @@ class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin): adapter's seed parameter. 
""" - def __init__(self, seed: Optional[int] = None): + def __init__(self, seed: Optional[int] = None, cost_calculator: Optional[CostCalculator] = None): """Initialize the model adapter with call tracing. Args: seed: Seed for deterministic generation. Passed to the underlying provider API if supported. If the provider doesn't support seeding, subclasses should raise SeedingError. + cost_calculator: Optional cost calculator for computing USD (or + other unit) cost from token counts. If provided and the + provider does not report cost directly, the calculator is + used to fill in ``Usage.cost`` after each call. Provider- + reported cost always takes precedence. """ super().__init__() self._seed = seed + self._cost_calculator = cost_calculator self.logs: List[Dict[str, Any]] = [] + self._usage_records: List[Usage] = [] @property def seed(self) -> Optional[int]: @@ -298,6 +307,17 @@ def chat( } ) + # Record token usage if available + if result.usage: + cost = result.usage.get("cost") if isinstance(result.usage.get("cost"), (int, float)) else None + token_usage = TokenUsage.from_chat_response_usage(result.usage, cost=cost, kind="llm") + + # If no provider-reported cost, try the cost calculator + if token_usage.cost is None and self._cost_calculator is not None: + token_usage.cost = self._cost_calculator.calculate_cost(token_usage, self.model_id) + + self._usage_records.append(token_usage) + return result except Exception as e: @@ -375,6 +395,16 @@ def generate( response = self.chat(messages, generation_params=generation_params, **kwargs) return response.content or "" + def gather_usage(self) -> Usage: + """Gather accumulated token usage from all chat calls. + + Returns: + Summed TokenUsage across all calls, or empty Usage if no calls were made. + """ + if not self._usage_records: + return Usage() + return sum(self._usage_records, Usage()) + def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this model adapter. 
@@ -431,9 +461,13 @@ def gather_config(self) -> Dict[str, Any]: Returns: Dictionary containing model configuration. """ - return { + config = { **super().gather_config(), "model_id": self.model_id, "adapter_type": type(self).__name__, "seed": self._seed, } + if self._cost_calculator is not None: + gather = getattr(self._cost_calculator, "gather_config", None) + config["cost_calculator"] = gather() if callable(gather) else type(self._cost_calculator).__name__ + return config diff --git a/maseval/core/registry.py b/maseval/core/registry.py index 267e9ee2..e34fc972 100644 --- a/maseval/core/registry.py +++ b/maseval/core/registry.py @@ -11,9 +11,10 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import Usage, UsageTrackableMixin # Type alias for components that can be registered -RegisterableComponent = Union[TraceableMixin, ConfigurableMixin] +RegisterableComponent = Union[TraceableMixin, ConfigurableMixin, UsageTrackableMixin] class ComponentRegistry: @@ -48,6 +49,12 @@ def __init__(self, benchmark_config: Optional[Dict[str, Any]] = None): self._local = threading.local() self._benchmark_config = benchmark_config or {} + # Persistent usage aggregates (NOT cleared between repetitions). + # Protected by a lock since multiple threads may call collect_usage(). 
+ self._usage_lock = threading.Lock() + self._usage_total: Usage = Usage() + self._usage_by_component: Dict[str, Usage] = {} + # --- Thread-local state properties --- @property @@ -74,6 +81,18 @@ def _config_component_id_map(self) -> Dict[int, str]: self._local.config_component_id_map = {} return self._local.config_component_id_map + @property + def _usage_registry(self) -> Dict[str, UsageTrackableMixin]: + if not hasattr(self._local, "usage_registry"): + self._local.usage_registry = {} + return self._local.usage_registry + + @property + def _usage_component_id_map(self) -> Dict[int, str]: + if not hasattr(self._local, "usage_component_id_map"): + self._local.usage_component_id_map = {} + return self._local.usage_component_id_map + # --- Public API --- def register(self, category: str, name: str, component: RegisterableComponent) -> RegisterableComponent: @@ -94,7 +113,11 @@ def register(self, category: str, name: str, component: RegisterableComponent) - key = f"{category}:{name}" # Check for duplicate registration under different key - existing_key = self._component_id_map.get(component_id) or self._config_component_id_map.get(component_id) + existing_key = ( + self._component_id_map.get(component_id) + or self._config_component_id_map.get(component_id) + or self._usage_component_id_map.get(component_id) + ) if existing_key and existing_key != key: raise ValueError( f"Component is already registered as '{existing_key}' and cannot be " @@ -114,14 +137,25 @@ def register(self, category: str, name: str, component: RegisterableComponent) - self._config_registry[key] = component self._config_component_id_map[component_id] = key + # Register for usage tracking if supported + if isinstance(component, UsageTrackableMixin): + self._usage_registry[key] = component + self._usage_component_id_map[component_id] = key + return component def clear(self) -> None: - """Clear all registrations for the current thread.""" + """Clear per-repetition registrations for the current 
thread. + + Does NOT clear persistent usage aggregates (``total_usage``, + ``usage_by_component``), which accumulate across all repetitions. + """ self._trace_registry.clear() self._component_id_map.clear() self._config_registry.clear() self._config_component_id_map.clear() + self._usage_registry.clear() + self._usage_component_id_map.clear() def collect_traces(self) -> Dict[str, Any]: """Collect execution traces from all registered components.""" @@ -238,6 +272,91 @@ def collect_configs(self) -> Dict[str, Any]: return configs + def collect_usage(self) -> Dict[str, Any]: + """Collect usage from all registered UsageTrackableMixin components. + + Returns a structured dict (same shape as ``collect_traces()`` and + ``collect_configs()``). Also accumulates into persistent aggregates + (``total_usage``, ``usage_by_component``) that survive ``clear()``. + """ + usage: Dict[str, Any] = { + "metadata": { + "timestamp": datetime.now().isoformat(), + "thread_id": threading.current_thread().ident, + "total_components": len(self._usage_registry), + }, + "agents": {}, + "models": {}, + "tools": {}, + "simulators": {}, + "callbacks": {}, + "environment": None, + "user": None, + "other": {}, + } + + for key, component in self._usage_registry.items(): + category, comp_name = key.split(":", 1) + + try: + component_usage = component.gather_usage() + + # Inject grouping fields from registry context if not set + if component_usage.category is None: + component_usage.category = category + if component_usage.component_name is None: + component_usage.component_name = comp_name + + usage_dict = component_usage.to_dict() + + # Handle environment and user as direct values (not nested in dict) + if category == "environment": + usage["environment"] = usage_dict + elif category == "user": + usage["user"] = usage_dict + else: + if category not in usage: + usage[category] = {} + usage[category][comp_name] = usage_dict + + # Accumulate into persistent aggregates (thread-safe) + with 
self._usage_lock: + self._usage_total = self._usage_total + component_usage + if key in self._usage_by_component: + self._usage_by_component[key] = self._usage_by_component[key] + component_usage + else: + self._usage_by_component[key] = component_usage + + except Exception as e: + error_info = { + "error": f"Failed to gather usage: {e}", + "error_type": type(e).__name__, + "component_type": type(component).__name__, + } + + if category == "environment": + usage["environment"] = error_info + elif category == "user": + usage["user"] = error_info + else: + if category not in usage: + usage[category] = {} + usage[category][comp_name] = error_info + + return usage + + @property + def total_usage(self) -> Usage: + """Running usage total across all repetitions. Queryable at any time.""" + with self._usage_lock: + return self._usage_total + + @property + def usage_by_component(self) -> Dict[str, Usage]: + """Per-component running totals across all repetitions.""" + with self._usage_lock: + return dict(self._usage_by_component) + def update_benchmark_config(self, benchmark_config: Dict[str, Any]) -> None: """Update the benchmark-level configuration. diff --git a/maseval/core/reporting.py b/maseval/core/reporting.py new file mode 100644 index 00000000..2465fba0 --- /dev/null +++ b/maseval/core/reporting.py @@ -0,0 +1,148 @@ +"""Post-hoc usage reporting utilities. + +This module provides ``UsageReporter`` for slicing and analyzing usage data +from benchmark reports. Unlike the registry's live aggregates (which provide +running totals), the reporter can slice by task since it sees the full report +list with task IDs. +""" + +from __future__ import annotations + +from typing import Any, Dict, List + +from .usage import Usage, TokenUsage + + +class UsageReporter: + """Post-hoc utility for analyzing usage across benchmark reports. + + Walks ``report["usage"]`` across all reports to produce breakdowns + by task, component, model, etc. 
+ + Example: + ```python + reporter = UsageReporter.from_reports(benchmark.reports) + print(reporter.total()) + print(reporter.by_task()) + print(reporter.by_component()) + ``` + """ + + def __init__(self, entries: List[Dict[str, Any]]): + """Initialize with raw entries extracted from reports. + + Args: + entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, + and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). + """ + self._entries = entries + + @staticmethod + def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: + """Create a UsageReporter from benchmark reports. + + Args: + reports: The ``benchmark.reports`` list. + + Returns: + A UsageReporter ready for analysis. + """ + entries = [] + for report in reports: + usage_data = report.get("usage") + if not usage_data or "error" in usage_data: + continue + + usage_items = [] + for category, value in usage_data.items(): + if category == "metadata": + continue + if isinstance(value, dict) and "cost" in value: + # Direct value (environment/user) — it's a usage dict + usage_items.append((category, value)) + elif isinstance(value, dict): + # Category dict with component names as keys + for comp_name, comp_usage in value.items(): + if isinstance(comp_usage, dict) and "error" not in comp_usage: + usage_items.append((f"{category}:{comp_name}", comp_usage)) + + entries.append( + { + "task_id": report.get("task_id"), + "repeat_idx": report.get("repeat_idx"), + "usage_items": usage_items, + } + ) + + return UsageReporter(entries) + + @staticmethod + def _usage_from_dict(d: Dict[str, Any]) -> Usage: + """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" + has_tokens = "input_tokens" in d + if has_tokens: + return TokenUsage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + input_tokens=d.get("input_tokens", 0), + output_tokens=d.get("output_tokens", 0), + 
total_tokens=d.get("total_tokens", 0), + cached_input_tokens=d.get("cached_input_tokens", 0), + reasoning_tokens=d.get("reasoning_tokens", 0), + audio_tokens=d.get("audio_tokens", 0), + ) + return Usage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + ) + + def by_task(self) -> Dict[str, Usage]: + """Aggregate usage by task_id across all repetitions.""" + result: Dict[str, Usage] = {} + for entry in self._entries: + task_id = entry["task_id"] + for _key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if task_id in result: + result[task_id] = result[task_id] + usage + else: + result[task_id] = usage + return result + + def by_component(self) -> Dict[str, Usage]: + """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" + result: Dict[str, Usage] = {} + for entry in self._entries: + for key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if key in result: + result[key] = result[key] + usage + else: + result[key] = usage + return result + + def total(self) -> Usage: + """Grand total across all tasks and components.""" + all_usages = [] + for entry in self._entries: + for _key, usage_dict in entry["usage_items"]: + all_usages.append(self._usage_from_dict(usage_dict)) + if not all_usages: + return Usage() + return sum(all_usages, Usage()) + + def summary(self) -> Dict[str, Any]: + """Nested dict with all breakdowns.""" + return { + "total": self.total().to_dict(), + "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, + "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, + } diff --git a/maseval/core/usage.py b/maseval/core/usage.py new file mode 100644 index 00000000..78edeab9 --- /dev/null +++ b/maseval/core/usage.py @@ -0,0 +1,301 @@ +"""Core usage tracking infrastructure for API cost and resource monitoring. 
+ +This module provides the `Usage` and `TokenUsage` data classes for recording +billable resource consumption, and the `UsageTrackableMixin` that enables +automatic usage collection through the component registry. + +Usage tracking is a first-class collection axis alongside tracing +(`TraceableMixin`) and configuration (`ConfigurableMixin`). Components that +inherit `UsageTrackableMixin` have their usage automatically collected by the +registry via `gather_usage()`. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Any, Optional, Dict + + +@dataclass +class Usage: + """Generic usage record for any billable resource. + + Represents accumulated cost and countable units for a component or + aggregated group. Grouping fields (`provider`, `category`, + `component_name`, `kind`) identify what scope the record covers. + When two records are summed, matching grouping fields are preserved; + mismatches become `None` (meaning "aggregated over"). + + Attributes: + cost: Total cost in USD. `None` means unknown/not reported. + units: Arbitrary countable units (e.g., ``{"api_calls": 3}``). + provider: Provider identifier (e.g., ``"anthropic"``, ``"bloomberg"``). + category: Registry category (e.g., ``"models"``, ``"tools"``). + component_name: Component name within category (e.g., ``"main_model"``). + kind: Component kind (e.g., ``"llm"``, ``"service"``, ``"local"``). 
+ + Example: + ```python + usage = Usage(cost=0.05, units={"api_calls": 1}, provider="bloomberg", kind="service") + + # Summing preserves matching fields + total = usage + Usage(cost=0.03, units={"api_calls": 2}, provider="bloomberg", kind="service") + assert total.cost == 0.08 + assert total.units == {"api_calls": 3} + assert total.provider == "bloomberg" + + # Mismatched fields become None + mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm") + assert mixed.provider is None # aggregated over + assert mixed.kind is None # aggregated over + ``` + """ + + cost: Optional[float] = None + units: Dict[str, int | float] = field(default_factory=dict) + provider: Optional[str] = None + category: Optional[str] = None + component_name: Optional[str] = None + kind: Optional[str] = None + + def __add__(self, other: Usage) -> Usage: + if not isinstance(other, Usage): + return NotImplemented + + # Sum costs: both known -> sum, either unknown -> None + if self.cost is not None and other.cost is not None: + cost = self.cost + other.cost + else: + cost = None + + # Sum units + units: Dict[str, int | float] = dict(self.units) + for key, value in other.units.items(): + units[key] = units.get(key, 0) + value + + # Grouping fields: preserve on match, None on mismatch + provider = self.provider if self.provider == other.provider else None + category = self.category if self.category == other.category else None + component_name = self.component_name if self.component_name == other.component_name else None + kind = self.kind if self.kind == other.kind else None + + return Usage( + cost=cost, + units=units, + provider=provider, + category=category, + component_name=component_name, + kind=kind, + ) + + def __radd__(self, other: object) -> Usage: + """Support sum() by handling 0 + Usage.""" + if other == 0: + return self + if isinstance(other, Usage): + return other.__add__(self) + return NotImplemented + + def to_dict(self) -> Dict[str, Any]: + """Serialize to a 
JSON-compatible dictionary.""" + return { + "cost": self.cost, + "units": dict(self.units), + "provider": self.provider, + "category": self.category, + "component_name": self.component_name, + "kind": self.kind, + } + + +@dataclass +class TokenUsage(Usage): + """LLM-specific usage record with token counts. + + Extends `Usage` with token fields reported by LLM providers. Use + `from_chat_response_usage()` to create from the dict returned by + model adapters. + + Attributes: + input_tokens: Number of input/prompt tokens. + output_tokens: Number of output/completion tokens. + total_tokens: Total tokens (input + output). + cached_input_tokens: Tokens served from cache (Anthropic ``cache_read_input_tokens``, + OpenAI ``cached_tokens``). + reasoning_tokens: Tokens used for reasoning (OpenAI ``reasoning_tokens``, + Google ``thoughts_token_count``). + audio_tokens: Tokens for audio processing (OpenAI). + + Example: + ```python + token_usage = TokenUsage.from_chat_response_usage({ + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + }) + assert token_usage.input_tokens == 100 + ``` + """ + + input_tokens: int = 0 + output_tokens: int = 0 + total_tokens: int = 0 + cached_input_tokens: int = 0 + reasoning_tokens: int = 0 + audio_tokens: int = 0 + + def __add__(self, other: Usage) -> Usage: + base = super().__add__(other) + if not isinstance(base, Usage): + return NotImplemented + + if isinstance(other, TokenUsage): + return TokenUsage( + cost=base.cost, + units=base.units, + provider=base.provider, + category=base.category, + component_name=base.component_name, + kind=base.kind, + input_tokens=self.input_tokens + other.input_tokens, + output_tokens=self.output_tokens + other.output_tokens, + total_tokens=self.total_tokens + other.total_tokens, + cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens, + reasoning_tokens=self.reasoning_tokens + other.reasoning_tokens, + audio_tokens=self.audio_tokens + other.audio_tokens, + ) + + # Adding 
TokenUsage + plain Usage: preserve token fields from self
+        return TokenUsage(
+            cost=base.cost,
+            units=base.units,
+            provider=base.provider,
+            category=base.category,
+            component_name=base.component_name,
+            kind=base.kind,
+            input_tokens=self.input_tokens,
+            output_tokens=self.output_tokens,
+            total_tokens=self.total_tokens,
+            cached_input_tokens=self.cached_input_tokens,
+            reasoning_tokens=self.reasoning_tokens,
+            audio_tokens=self.audio_tokens,
+        )
+
+    def __radd__(self, other: object) -> Usage:
+        """Support ``sum()`` and ``Usage + TokenUsage`` without dropping token fields.
+
+        Python only tries the reflected method first when the right operand's
+        class *overrides* it; inheriting ``Usage.__radd__`` is not enough.
+        Without this override, ``Usage() + TokenUsage(...)`` (e.g., the start
+        value of a ``sum()``) would dispatch to ``Usage.__add__`` and return a
+        plain ``Usage``, silently discarding all token counts.
+        """
+        if other == 0:
+            return self
+        if isinstance(other, Usage):
+            return self.__add__(other)
+        return NotImplemented
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to a JSON-compatible dictionary."""
+        return {
+            **super().to_dict(),
+            "input_tokens": self.input_tokens,
+            "output_tokens": self.output_tokens,
+            "total_tokens": self.total_tokens,
+            "cached_input_tokens": self.cached_input_tokens,
+            "reasoning_tokens": self.reasoning_tokens,
+            "audio_tokens": self.audio_tokens,
+        }
+
+    @classmethod
+    def from_chat_response_usage(
+        cls,
+        usage_dict: Dict[str, int],
+        *,
+        cost: Optional[float] = None,
+        provider: Optional[str] = None,
+        category: Optional[str] = None,
+        component_name: Optional[str] = None,
+        kind: str = "llm",
+    ) -> TokenUsage:
+        """Create a TokenUsage from a ChatResponse.usage dict.
+
+        Maps provider-specific key names to the canonical fields.
+
+        Args:
+            usage_dict: The usage dict from ``ChatResponse.usage``.
+            cost: Cost in USD if known (e.g., from provider-reported cost).
+            provider: Provider identifier.
+            category: Registry category.
+            component_name: Component name.
+            kind: Component kind, defaults to ``"llm"``.
+
+        Returns:
+            A TokenUsage instance with mapped fields.
+        """
+        return cls(
+            cost=cost,
+            provider=provider,
+            category=category,
+            component_name=component_name,
+            kind=kind,
+            input_tokens=usage_dict.get("input_tokens", 0),
+            output_tokens=usage_dict.get("output_tokens", 0),
+            total_tokens=usage_dict.get("total_tokens", 0),
+            cached_input_tokens=usage_dict.get("cached_input_tokens", 0),
+            reasoning_tokens=usage_dict.get("reasoning_tokens", 0),
+            audio_tokens=usage_dict.get("audio_tokens", 0),
+        )
+
+
+class UsageTrackableMixin:
+    """Mixin that provides usage tracking capability to any component.
+
+    Classes that inherit from UsageTrackableMixin can be registered with a
+    Benchmark instance and will have their usage automatically collected
+    by the registry via `collect_usage()`.
+
+    The `gather_usage()` method provides a default implementation that returns
+    an empty `Usage`. Subclasses should override this to return their
+    accumulated usage data.
+
+    How to use:
+        For custom components that incur billable costs, inherit from
+        UsageTrackableMixin and override `gather_usage()`:
+
+        ```python
+        class MyPaidService(TraceableMixin, UsageTrackableMixin):
+            def __init__(self):
+                super().__init__()  # initialize cooperating mixin state
+                self._usage_records: list[Usage] = []
+
+            def call_api(self, query):
+                result = api.call(query)
+                self._usage_records.append(Usage(
+                    cost=result.cost,
+                    units={"api_calls": 1},
+                ))
+                return result
+
+            def gather_usage(self) -> Usage:
+                return sum(self._usage_records, Usage())
+        ```
+
+        Then register it with your benchmark:
+
+        ```python
+        service = MyPaidService()
+        benchmark.register("tools", "my_service", service)
+        ```
+
+    Thread Safety:
+        Usage collection happens synchronously in the main thread after
+        task execution completes. Components should use thread-safe data
+        structures when accumulating usage during concurrent execution,
+        but `gather_usage()` itself is called sequentially.
+    """
+
+    def gather_usage(self) -> Usage:
+        """Gather accumulated usage from this component.

        Provides a default implementation that returns an empty Usage.
        Subclasses should override this to return their accumulated
        usage data.

        Returns:
            Accumulated usage for this component.

        How to use:
            Override this method to return your component's usage:

            ```python
            def gather_usage(self) -> Usage:
                return sum(self._usage_records, Usage())
            ```
        """
        return Usage()
diff --git a/maseval/interface/cost.py b/maseval/interface/cost.py
new file mode 100644
index 00000000..178b0cc6
--- /dev/null
+++ b/maseval/interface/cost.py
@@ -0,0 +1,102 @@
"""Cost calculators that depend on optional third-party packages.

This module provides ``LiteLLMCostCalculator``, which uses LiteLLM's
bundled model pricing database to compute cost from token counts.

Requires: ``pip install litellm``
"""

from __future__ import annotations

from typing import Any, Dict, Optional

from maseval.core.cost import CostCalculator  # noqa: F401 — re-export protocol
from maseval.core.usage import TokenUsage


class LiteLLMCostCalculator:
    """Cost calculator using LiteLLM's bundled pricing database.

    LiteLLM maintains a comprehensive `model_prices_and_context_window.json
    <https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json>`_
    that covers most major LLM providers. This calculator delegates to
    ``litellm.cost_per_token`` for per-token rates and computes the total.

    This is the recommended calculator for most users — it covers OpenAI,
    Anthropic, Google, Mistral, Cohere, and many more without requiring
    manual pricing tables.

    Note:
        If you're already using the ``LiteLLMModelAdapter``, it extracts
        provider-reported cost from ``response._hidden_params.response_cost``
        automatically. This calculator is useful as a fallback when using
        other adapters (OpenAI, Anthropic, Google) directly.
+ + Example: + ```python + from maseval.interface.cost import LiteLLMCostCalculator + from maseval.interface.inference import OpenAIModelAdapter + + calculator = LiteLLMCostCalculator() + model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) + + # Cost is now computed automatically after each chat() call + response = model.chat([{"role": "user", "content": "Hello"}]) + print(model.gather_usage().cost) # e.g., 0.00123 + ``` + """ + + def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None): + """Initialize the LiteLLM cost calculator. + + Args: + custom_pricing: Optional overrides for specific models. Keys are + model IDs, values are dicts with ``"input_cost_per_token"`` + and ``"output_cost_per_token"``. These take precedence over + LiteLLM's built-in pricing. + """ + try: + import litellm # noqa: F401 + except ImportError as e: + raise ImportError("LiteLLMCostCalculator requires litellm. Install it with: pip install litellm") from e + + self._custom_pricing = custom_pricing or {} + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost using LiteLLM's pricing database. + + Args: + usage: Token usage from the call. + model_id: The model identifier (must match LiteLLM's naming). + + Returns: + Cost in USD, or ``None`` if LiteLLM doesn't have pricing for + this model and no custom pricing was provided. 
+ """ + # Check custom overrides first + if model_id in self._custom_pricing: + rates = self._custom_pricing[model_id] + input_cost = rates.get("input_cost_per_token", 0.0) * usage.input_tokens + output_cost = rates.get("output_cost_per_token", 0.0) * usage.output_tokens + return input_cost + output_cost + + # Fall back to LiteLLM's built-in pricing + try: + import litellm + + input_cost, output_cost = litellm.cost_per_token( + model=model_id, + prompt_tokens=usage.input_tokens, + completion_tokens=usage.output_tokens, + ) + return input_cost + output_cost + except Exception: + # Model not in LiteLLM's pricing database + return None + + def gather_config(self) -> Dict[str, Any]: + """Return calculator configuration for reproducibility.""" + return { + "type": type(self).__name__, + "custom_pricing": dict(self._custom_pricing) if self._custom_pricing else None, + } diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 1c6389ea..5e0c92d4 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -52,6 +52,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator from maseval.core.seeding import SeedingError @@ -76,6 +77,7 @@ def __init__( default_generation_params: Optional[Dict[str, Any]] = None, max_tokens: int = 4096, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize Anthropic model adapter. @@ -88,6 +90,8 @@ def __init__( parameter. Default is 4096. seed: Seed for deterministic generation. Note: Anthropic does NOT support seeding. Providing a seed will raise SeedingError. + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. Raises: SeedingError: If seed is provided (Anthropic doesn't support seeding). 
@@ -98,7 +102,7 @@ def __init__( f"Model '{model_id}' cannot use seed={seed}. " f"Remove the seed parameter or use a provider that supports seeding." ) - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -344,6 +348,10 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(response.usage, "output_tokens", 0), "total_tokens": (getattr(response.usage, "input_tokens", 0) + getattr(response.usage, "output_tokens", 0)), } + # Provider-specific detail + cached = getattr(response.usage, "cache_read_input_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/google_genai.py b/maseval/interface/inference/google_genai.py index 8a38a281..8ceb466a 100644 --- a/maseval/interface/inference/google_genai.py +++ b/maseval/interface/inference/google_genai.py @@ -47,6 +47,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class GoogleGenAIModelAdapter(ModelAdapter): @@ -65,6 +66,7 @@ def __init__( model_id: str, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize Google GenAI model adapter. @@ -74,8 +76,10 @@ def __init__( default_generation_params: Default parameters for all calls. Common parameters: temperature, max_output_tokens, top_p. seed: Seed for deterministic generation. Google GenAI supports this. + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -291,6 +295,10 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(um, "candidates_token_count", 0), "total_tokens": getattr(um, "total_token_count", 0), } + # Provider-specific detail + thoughts = getattr(um, "thoughts_token_count", 0) + if thoughts: + usage["reasoning_tokens"] = thoughts # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index 45fac7e8..f28cc293 100644 --- a/maseval/interface/inference/huggingface.py +++ b/maseval/interface/inference/huggingface.py @@ -34,6 +34,7 @@ from typing import Any, Optional, Dict, List, Callable, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class ToolCallingNotSupportedError(Exception): @@ -65,6 +66,7 @@ def __init__( model_id: Optional[str] = None, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize HuggingFace model adapter. @@ -78,8 +80,10 @@ def __init__( Common parameters: max_new_tokens, temperature, top_p, do_sample. seed: Seed for deterministic generation. Sets the random seed before each generation call using transformers.set_seed(). + cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._model = model self._model_id = model_id or getattr(model, "name_or_path", "huggingface:unknown") self._default_generation_params = default_generation_params or {} diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index ed932247..a13fcd6d 100644 --- a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -44,6 +44,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class LiteLLMModelAdapter(ModelAdapter): @@ -70,6 +71,7 @@ def __init__( api_key: Optional[str] = None, api_base: Optional[str] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize LiteLLM model adapter. @@ -87,8 +89,12 @@ def __init__( api_base: Custom API base URL for self-hosted or Azure endpoints. seed: Seed for deterministic generation. LiteLLM passes this to the underlying provider. Note: Not all providers support seeding. + cost_calculator: Optional cost calculator for computing cost from + token counts. Note: LiteLLM already reports cost via + ``response._hidden_params.response_cost`` for most models, + so a calculator is only needed as a fallback or override. 
""" - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._model_id = model_id self._default_generation_params = default_generation_params or {} self._api_key = api_key @@ -176,6 +182,23 @@ def _chat_impl( "output_tokens": getattr(response.usage, "completion_tokens", 0), "total_tokens": getattr(response.usage, "total_tokens", 0), } + # Provider-specific detail + completion_details = getattr(response.usage, "completion_tokens_details", None) + if completion_details: + reasoning = getattr(completion_details, "reasoning_tokens", 0) + if reasoning: + usage["reasoning_tokens"] = reasoning + prompt_details = getattr(response.usage, "prompt_tokens_details", None) + if prompt_details: + cached = getattr(prompt_details, "cached_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached + # LiteLLM provider-reported cost + hidden = getattr(response, "_hidden_params", None) + if hidden and isinstance(hidden, dict): + cost = hidden.get("response_cost") + if isinstance(cost, (int, float)): + usage["cost"] = cost return ChatResponse( content=message.content, diff --git a/maseval/interface/inference/openai.py b/maseval/interface/inference/openai.py index 5855b12d..fd31fd44 100644 --- a/maseval/interface/inference/openai.py +++ b/maseval/interface/inference/openai.py @@ -50,6 +50,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse +from maseval.core.cost import CostCalculator class OpenAIModelAdapter(ModelAdapter): @@ -70,6 +71,7 @@ def __init__( model_id: str, default_generation_params: Optional[Dict[str, Any]] = None, seed: Optional[int] = None, + cost_calculator: Optional[CostCalculator] = None, ): """Initialize OpenAI model adapter. @@ -81,8 +83,10 @@ def __init__( Common parameters: temperature, max_tokens, top_p. seed: Seed for deterministic generation. OpenAI supports this natively. Note: Determinism is best-effort, not guaranteed by OpenAI. 
+ cost_calculator: Optional cost calculator for computing cost from + token counts when the provider doesn't report cost directly. """ - super().__init__(seed=seed) + super().__init__(seed=seed, cost_calculator=cost_calculator) self._client = client self._model_id = model_id self._default_generation_params = default_generation_params or {} @@ -209,6 +213,20 @@ def _parse_response(self, response: Any) -> ChatResponse: "output_tokens": getattr(response.usage, "completion_tokens", 0), "total_tokens": getattr(response.usage, "total_tokens", 0), } + # Provider-specific detail + completion_details = getattr(response.usage, "completion_tokens_details", None) + if completion_details: + reasoning = getattr(completion_details, "reasoning_tokens", 0) + if reasoning: + usage["reasoning_tokens"] = reasoning + audio = getattr(completion_details, "audio_tokens", 0) + if audio: + usage["audio_tokens"] = audio + prompt_details = getattr(response.usage, "prompt_tokens_details", None) + if prompt_details: + cached = getattr(prompt_details, "cached_tokens", 0) + if cached: + usage["cached_input_tokens"] = cached return ChatResponse( content=message.content, diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md index f9b45133..91683ee0 100644 --- a/usage_tracking/PLAN.md +++ b/usage_tracking/PLAN.md @@ -30,13 +30,15 @@ Generic usage record for any billable resource. Stored as a simple dataclass. 
``` Usage - cost: Optional[float] # Total cost in USD (None = unknown) - cost_details: Dict[str, float] # Breakdown (e.g., {"input": 0.01, "output": 0.03}) - units: Dict[str, int | float] # Arbitrary countable units (e.g., {"api_calls": 3, "bytes": 1024}) - metadata: Dict[str, Any] # Provider-specific extras + cost: Optional[float] # Total cost in USD (None = unknown) + units: Dict[str, int | float] # Countable units (e.g., {"api_calls": 3, "bytes": 1024}) + provider: Optional[str] # e.g., "anthropic", "openai", "bloomberg" + category: Optional[str] # e.g., "models", "evaluator_models", "tools" + component_name: Optional[str] # e.g., "main_model", "judge", "bloomberg_api" + kind: Optional[str] # e.g., "llm", "service", "local" ``` -Supports `__add__` to sum two records (costs sum if both known, else None; units sum; metadata merges). +Supports `__add__`: costs sum (if both known, else None), units sum. Grouping fields (`provider`, `category`, `component_name`, `kind`) are preserved when they match, set to `None` on mismatch. `None` means "aggregated over" — e.g., `provider=None, category="models"` represents all models summed across providers. A fully `None` grouping is a grand total. ### `TokenUsage(Usage)` (LLM-specific) @@ -289,29 +291,82 @@ These already hold a `ModelAdapter`. 
Their model's usage is collected automatica | File | Action | Content | |------|--------|---------| | `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` | +| `maseval/core/cost.py` | **Create** | `CostCalculator` protocol, `StaticPricingCalculator` | | `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property | -| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()` | +| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()`, accept `cost_calculator` param | | `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict | | `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility | -| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens` | -| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens` | -| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens` | -| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost` | -| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `UsageReporter` | -| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation | +| `maseval/interface/cost.py` | **Create** | `LiteLLMCostCalculator` (optional `litellm` dependency) | +| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens`; accept `cost_calculator` | +| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens`; accept 
`cost_calculator` | +| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens`; accept `cost_calculator` | +| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost`; accept `cost_calculator` | +| `maseval/interface/inference/huggingface.py` | **Edit** | Accept `cost_calculator` | +| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `CostCalculator`, `StaticPricingCalculator`, `UsageReporter` | +| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation, cost calculators | No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`. --- +## Cost Calculation + +Most LLM APIs return token counts but **not** cost. Cost calculation is a client-side concern. + +### CostCalculator protocol + +A `CostCalculator` is a simple protocol with one method: + +```python +class CostCalculator(Protocol): + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: ... +``` + +`ModelAdapter` accepts an optional `cost_calculator` parameter. After each `chat()` call, if the provider didn't report cost and a calculator is present, the calculator fills in `TokenUsage.cost`. Provider-reported cost always takes precedence. + +### Built-in implementations + +| Calculator | Location | Dependencies | Use case | +|-----------|----------|-------------|----------| +| `StaticPricingCalculator` | `maseval.core.cost` | None | User-supplied per-model rates. Supports custom units (USD, EUR, credits). | +| `LiteLLMCostCalculator` | `maseval.interface.cost` | `litellm` | Automatic pricing via LiteLLM's bundled model database. Covers OpenAI, Anthropic, Google, Mistral, etc. | + +### Cost flow (priority order) + +1. **Provider-reported cost** — e.g., LiteLLM's `response._hidden_params.response_cost`. Set directly in `ChatResponse.usage["cost"]`. +2. 
**CostCalculator** — if no provider cost, `ModelAdapter.chat()` calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `Usage.cost` stays `None`. + +### Examples + +```python +# Static pricing for a university cluster (credits per token) +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, +}) + +# Automatic pricing via LiteLLM's database +from maseval.interface.cost import LiteLLMCostCalculator +calculator = LiteLLMCostCalculator() + +# Pass to any model adapter +model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) +``` + +### Non-LLM components + +Non-LLM components (tools, environments) set cost directly in their `gather_usage()` implementation — there is no calculator involvement. Each component knows its own billing model. + +--- + ## Non-goals -- **Hardcoded pricing tables** — prices change too often; user-supplied or provider-reported only. +- **Hardcoded pricing tables** — prices change too often; delegated to LiteLLM or user-supplied. - **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now. - **Billing integration** — no webhook/billing system integration. - **Streaming usage** — not supported yet (usage is captured after completion). +- **Currency conversion** — `Usage.cost` is a bare float in whatever unit the calculator uses. Mixing units in one benchmark is a user error. ## Open Questions -1. **Pricing config format**: Should pricing be passed to `ModelAdapter.__init__` as a new param, or set externally after construction? Leaning toward a `pricing` kwarg on adapter init for ergonomics. When `pricing` is provided and a `TokenUsage` record has `cost=None`, cost is computed from `pricing["input"] * input_tokens + pricing["output"] * output_tokens`. -2. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. 
+1. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1. From dd4864f76f6253df36cb42e04daafd6d3831cc2c Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 18:23:35 +0100 Subject: [PATCH 03/19] updated cost tracking --- maseval/interface/cost.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/maseval/interface/cost.py b/maseval/interface/cost.py index 178b0cc6..77d4f48d 100644 --- a/maseval/interface/cost.py +++ b/maseval/interface/cost.py @@ -46,7 +46,11 @@ class LiteLLMCostCalculator: ``` """ - def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None): + def __init__( + self, + custom_pricing: Optional[Dict[str, Dict[str, float]]] = None, + model_id_map: Optional[Dict[str, str]] = None, + ): """Initialize the LiteLLM cost calculator. Args: @@ -54,6 +58,18 @@ def __init__(self, custom_pricing: Optional[Dict[str, Dict[str, float]]] = None) model IDs, values are dicts with ``"input_cost_per_token"`` and ``"output_cost_per_token"``. These take precedence over LiteLLM's built-in pricing. + model_id_map: Optional mapping from adapter model IDs to LiteLLM + model IDs. Use this when your adapter's ``model_id`` doesn't + match LiteLLM's naming convention — e.g., when using Google's + OpenAI-compatible endpoint where the adapter sees + ``"gemini-2.0-flash"`` but LiteLLM expects + ``"gemini/gemini-2.0-flash"``. 
+ + Example:: + + LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + }) """ try: import litellm # noqa: F401 From 4ab0efe80a263879586bda122c0ca81479baee65 Mon Sep 17 00:00:00 2001 From: cemde Date: Thu, 12 Mar 2026 19:55:59 +0100 Subject: [PATCH 04/19] updated litellm cost calculator --- docs/guides/index.md | 1 + docs/guides/usage-tracking.md | 309 ++++++++++++++++++++ docs/reference/usage.md | 31 ++ maseval/__init__.py | 2 +- maseval/core/cost.py | 132 --------- maseval/core/model.py | 3 +- maseval/core/usage.py | 128 +++++++- maseval/interface/inference/anthropic.py | 2 +- maseval/interface/inference/google_genai.py | 2 +- maseval/interface/inference/huggingface.py | 2 +- maseval/interface/inference/litellm.py | 2 +- maseval/interface/inference/openai.py | 2 +- maseval/interface/{cost.py => usage.py} | 16 +- mkdocs.yml | 2 + 14 files changed, 486 insertions(+), 148 deletions(-) create mode 100644 docs/guides/usage-tracking.md create mode 100644 docs/reference/usage.md delete mode 100644 maseval/core/cost.py rename maseval/interface/{cost.py => usage.py} (87%) diff --git a/docs/guides/index.md b/docs/guides/index.md index f659e5f4..9ff77ba2 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices. 
| [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility | | [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures | | [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds | +| [Usage & Cost Tracking](usage-tracking.md) | Track token usage and compute cost across providers | diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md new file mode 100644 index 00000000..6b249a05 --- /dev/null +++ b/docs/guides/usage-tracking.md @@ -0,0 +1,309 @@ +# Usage & Cost Tracking + +## Overview + +MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for: + +- **Cost control**: Track how much each benchmark run costs across providers +- **Budgeting**: Compare cost across models, tasks, and components +- **Billing**: Support custom credit systems (university clusters, internal APIs) +- **Analysis**: Understand token usage patterns per task, agent, or model + +!!! info "Usage vs Cost" + + **Usage** = Token counts and arbitrary resource units (API calls, data points, etc.) + + **Cost** = Monetary value computed from usage (USD, EUR, credits, etc.) + + Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator. + +## Core Concepts + +**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata. + +**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.). + +**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`. + +**`CostCalculator`**: Protocol for pluggable cost computation from token counts. + +## Automatic LLM Usage Tracking + +All `ModelAdapter` subclasses track token usage automatically. 
No configuration needed — every `chat()` call records a `TokenUsage` entry internally. + +```python +from maseval.interface.inference import OpenAIModelAdapter + +model = OpenAIModelAdapter(client=client, model_id="gpt-4") + +# Make some calls +model.chat([{"role": "user", "content": "Hello"}]) +model.chat([{"role": "user", "content": "How are you?"}]) + +# Inspect accumulated usage +usage = model.gather_usage() +print(usage.input_tokens) # e.g., 25 +print(usage.output_tokens) # e.g., 42 +print(usage.cost) # None (no cost calculator configured) +``` + +### In Benchmarks + +Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key: + +```python +results = benchmark.run() + +for report in results: + print(f"Task {report['task_id']}: {report['usage']}") +``` + +Live running totals are available during execution: + +```python +benchmark.usage # -> Usage (grand total across all tasks) +benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) +``` + +## Cost Calculation + +Most LLM APIs return token counts but not cost. Cost is a client-side concern. MASEval provides two built-in cost calculators and a protocol for custom ones. + +### Cost Priority + +When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order: + +1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `Usage.cost` stays `None`. + +### StaticPricingCalculator + +Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`. 
+ +```python +from maseval import StaticPricingCalculator + +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, +}) + +model = OpenAIModelAdapter( + client=client, + model_id="gpt-4", + cost_calculator=calculator, +) + +response = model.chat([{"role": "user", "content": "Hello"}]) +print(model.gather_usage().cost) # e.g., 0.00234 +``` + +Pricing is per token (not per 1K or 1M). Cached input tokens are handled automatically — set a `"cached_input"` rate to differentiate: + +```python +calculator = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.000003, + "output": 0.000015, + "cached_input": 0.0000003, # 10x cheaper for cached tokens + }, +}) +``` + +For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents: + +```python +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token +}) +``` + +### LiteLLMCostCalculator + +Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more. + +```python +from maseval.interface.usage import LiteLLMCostCalculator + +calculator = LiteLLMCostCalculator() + +model = OpenAIModelAdapter( + client=client, + model_id="gpt-4", + cost_calculator=calculator, +) +``` + +!!! tip "LiteLLMModelAdapter already reports cost" + + If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
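Tying the pieces together, the three-step priority order can be sketched as a small standalone resolver. Everything below is illustrative: `resolve_cost` and `FlatRateCalculator` are hypothetical names, plain dicts stand in for `TokenUsage`, and the real resolution lives inside `ModelAdapter.chat()`.

```python
from typing import Dict, Optional


class FlatRateCalculator:
    """Toy calculator charging one flat per-token rate (illustration only)."""

    def __init__(self, rate_per_token: float) -> None:
        self.rate = rate_per_token

    def calculate_cost(self, usage: Dict[str, int], model_id: str) -> Optional[float]:
        # Same shape as the CostCalculator protocol, with a dict standing in for TokenUsage.
        return self.rate * (usage["input_tokens"] + usage["output_tokens"])


def resolve_cost(provider_cost, calculator, usage, model_id):
    if provider_cost is not None:
        return provider_cost  # 1. provider-reported cost always wins
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)  # 2. fall back to a calculator
    return None  # 3. otherwise cost stays unknown


usage = {"input_tokens": 100, "output_tokens": 50}
calc = FlatRateCalculator(rate_per_token=0.5)

resolve_cost(0.42, calc, usage, "gpt-4")  # -> 0.42 (provider-reported wins)
resolve_cost(None, calc, usage, "gpt-4")  # -> 75.0 (150 tokens at 0.5 each)
resolve_cost(None, None, usage, "gpt-4")  # -> None
```

This also shows why a calculator attached to `LiteLLMModelAdapter` rarely fires: step 1 usually succeeds there.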
+ +#### Custom Pricing Overrides + +Override pricing for specific models while using LiteLLM's database for the rest: + +```python +calculator = LiteLLMCostCalculator(custom_pricing={ + "my-finetuned-gpt4": { + "input_cost_per_token": 0.00006, + "output_cost_per_token": 0.00012, + }, +}) +``` + +#### Model ID Remapping + +When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap: + +```python +calculator = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + "my-custom-gpt4": "gpt-4", +}) +``` + +The map is applied before both custom pricing and LiteLLM lookup. + +### Custom Cost Calculator + +Implement the `CostCalculator` protocol for custom pricing logic: + +```python +from maseval import CostCalculator, TokenUsage +from typing import Optional + +class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING_TABLE.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens +``` + +The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model. + +### Sharing Calculators Across Adapters + +A single calculator instance can be shared across multiple model adapters. 
The `model_id` is passed on each call, so the calculator can look up the right pricing:

```python
calculator = StaticPricingCalculator({
    "gpt-4": {"input": 0.00003, "output": 0.00006},
    "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
})

model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)
model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator)
```

## Non-LLM Usage Tracking

Tools, environments, and other components can track usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`:

```python
from maseval import Usage, UsageTrackableMixin

class BloombergEnvironment(Environment, UsageTrackableMixin):
    def __init__(self, task_data):
        super().__init__(task_data)
        self._usage_records = []

    def _call_bloomberg(self, query):
        result = bloomberg_client.query(query)
        self._usage_records.append(Usage(
            cost=result.billed_amount,
            units={"api_calls": 1, "data_points": result.count},
            provider="bloomberg",
            kind="service",
        ))
        return result

    def gather_usage(self) -> Usage:
        if not self._usage_records:
            return Usage()
        return sum(self._usage_records, Usage())
```

Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model.
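When records like these are summed, the `Usage` addition rules apply: costs sum only when both sides are known, units sum key-wise, and grouping fields collapse to `None` on mismatch. A minimal MASEval-independent sketch of those rules, with a single grouping field for brevity (the real class carries several):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Usage:
    """Stripped-down stand-in for maseval's Usage, showing only the addition rules."""
    cost: Optional[float] = None
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None  # one grouping field; the real class has several

    def __add__(self, other: "Usage") -> "Usage":
        # Costs sum only when both sides are known; otherwise the total is unknown.
        if self.cost is not None and other.cost is not None:
            cost = self.cost + other.cost
        else:
            cost = None
        # Units sum key-wise.
        units = dict(self.units)
        for key, value in other.units.items():
            units[key] = units.get(key, 0) + value
        # Grouping fields survive a match, collapse to None ("aggregated over") otherwise.
        provider = self.provider if self.provider == other.provider else None
        return Usage(cost=cost, units=units, provider=provider)


a = Usage(cost=0.25, units={"api_calls": 1}, provider="bloomberg")
b = Usage(cost=0.75, units={"api_calls": 2}, provider="bloomberg")
c = Usage(cost=None, units={"api_calls": 1}, provider="openai")

total = a + b        # cost=1.0, units={"api_calls": 3}, provider="bloomberg"
unknown = total + c  # cost=None (one side unknown), provider=None (mismatch)
```

This collapsing behavior is what makes a grand total, with every grouping field `None`, well-defined.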
+ +## Post-hoc Analysis with UsageReporter + +`UsageReporter` provides sliced analysis across all benchmark reports: + +```python +from maseval import UsageReporter + +reporter = UsageReporter.from_reports(benchmark.reports) + +# Grand total +total = reporter.total() +print(f"Total cost: ${total.cost:.4f}") +print(f"Total tokens: {total.input_tokens + total.output_tokens}") + +# Per-task breakdown +for task_id, usage in reporter.by_task().items(): + print(f" {task_id}: ${usage.cost:.4f}") + +# Per-component breakdown +for component, usage in reporter.by_component().items(): + print(f" {component}: ${usage.cost:.4f}") + +# Full nested summary dict +summary = reporter.summary() +``` + +## Usage Data Model + +### Usage + +Generic record for any billable resource: + +| Field | Type | Description | +|-------|------|-------------| +| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. | +| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). | +| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). | +| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). | +| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). | +| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). | + +`Usage` supports addition: costs sum (both known) or become `None` (either unknown), units sum, grouping fields are preserved on match or set to `None` on mismatch. + +### TokenUsage + +Extends `Usage` with LLM-specific token counts: + +| Field | Type | Description | +|-------|------|-------------| +| `input_tokens` | `int` | Input/prompt tokens. | +| `output_tokens` | `int` | Output/completion tokens. | +| `total_tokens` | `int` | Total tokens. | +| `cached_input_tokens` | `int` | Tokens served from cache. | +| `reasoning_tokens` | `int` | Reasoning/thinking tokens. | +| `audio_tokens` | `int` | Audio processing tokens. 
| + +## Evaluator Usage + +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically: + +```python +class MyBenchmark(Benchmark): + def setup_evaluators(self, task, environment): + judge_model = OpenAIModelAdapter(client=client, model_id="gpt-4") + self.register("evaluator_models", "judge", judge_model) + return [MyLLMEvaluator(judge_model)] +``` + +The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage. + +## Tips + +**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates. + +**For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming. + +**For failed tasks**: Usage is collected before error status is determined, so partial usage from failed tasks is still tracked. + +**For live monitoring**: Access `benchmark.usage` during execution to check running totals. diff --git a/docs/reference/usage.md b/docs/reference/usage.md new file mode 100644 index 00000000..2326aaef --- /dev/null +++ b/docs/reference/usage.md @@ -0,0 +1,31 @@ +# Usage & Cost Tracking + +Usage and cost tracking provides data classes for recording resource consumption, a mixin for automatic collection, and pluggable cost calculators. + +See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage patterns and examples. 
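For orientation before the API listings, the arithmetic implemented by `StaticPricingCalculator` (flat per-token rates, with cached input billed at its own optional rate) reduces to a few lines. The `static_cost` helper below is a hypothetical sketch of that formula, not part of the maseval API:

```python
from typing import Dict, Optional


def static_cost(
    rates: Optional[Dict[str, float]],
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> Optional[float]:
    # Mirrors StaticPricingCalculator.calculate_cost: rates are per token
    # (not per 1K/1M tokens); unknown models yield None.
    if rates is None:
        return None
    input_rate = rates.get("input", 0.0)
    output_rate = rates.get("output", 0.0)
    cached_rate = rates.get("cached_input", input_rate)  # defaults to the input rate
    non_cached_input = max(0, input_tokens - cached_input_tokens)
    return (
        non_cached_input * input_rate
        + cached_input_tokens * cached_rate
        + output_tokens * output_rate
    )


# 1,000 input tokens (400 of them cached) and 500 output tokens:
cost = static_cost({"input": 0.00003, "output": 0.00006}, 1000, 500, 400)
print(round(cost, 6))  # 0.06
```

A cheaper `"cached_input"` rate then affects only the cached share: `static_cost({"input": 0.00003, "output": 0.00006, "cached_input": 0.000003}, 1000, 0, 1000)` bills all 1,000 input tokens at the cached rate.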
+ +## Core + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/usage.py){ .md-source-file } + +::: maseval.core.usage.Usage + +::: maseval.core.usage.TokenUsage + +::: maseval.core.usage.UsageTrackableMixin + +::: maseval.core.usage.CostCalculator + +::: maseval.core.usage.StaticPricingCalculator + +## Reporting + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/reporting.py){ .md-source-file } + +::: maseval.core.reporting.UsageReporter + +## Interface + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/usage.py){ .md-source-file } + +::: maseval.interface.usage.LiteLLMCostCalculator diff --git a/maseval/__init__.py b/maseval/__init__.py index addedf3e..460bf1a9 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -38,7 +38,7 @@ from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin from .core.usage import Usage, TokenUsage, UsageTrackableMixin -from .core.cost import CostCalculator, StaticPricingCalculator +from .core.usage import CostCalculator, StaticPricingCalculator from .core.reporting import UsageReporter from .core.registry import ComponentRegistry from .core.context import TaskContext diff --git a/maseval/core/cost.py b/maseval/core/cost.py deleted file mode 100644 index a6eb8583..00000000 --- a/maseval/core/cost.py +++ /dev/null @@ -1,132 +0,0 @@ -"""Pluggable cost calculation for usage records. - -This module provides the ``CostCalculator`` protocol and a built-in -``StaticPricingCalculator`` that computes cost from token counts and -user-supplied pricing tables. For automatic pricing via LiteLLM's -bundled model database, see ``maseval.interface.cost``. 
- -Cost calculators are optional — if no calculator is provided to a -``ModelAdapter``, cost is only set when the provider reports it directly -(e.g., LiteLLM's ``response._hidden_params.response_cost``). -""" - -from __future__ import annotations - -from typing import Any, Dict, Optional, Protocol, runtime_checkable - -from .usage import TokenUsage - - -@runtime_checkable -class CostCalculator(Protocol): - """Protocol for computing cost from token usage. - - Implementations receive a ``TokenUsage`` and the model ID, and return - the cost in whatever unit the calculator declares (typically USD). - - Example: - ```python - class MyCostCalculator: - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - rate = MY_PRICING.get(model_id) - if rate is None: - return None - return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens - ``` - """ - - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - """Compute cost for a single chat call. - - Args: - usage: Token usage from the call. - model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). - - Returns: - Cost as a float, or ``None`` if pricing is unknown for this model. - """ - ... - - -class StaticPricingCalculator: - """Cost calculator using user-supplied per-model pricing. - - Pricing is specified as cost per token (not per 1K or 1M tokens). - If a model is not in the pricing table, ``calculate_cost`` returns ``None``. - - Args: - pricing: Dict mapping model IDs to their per-token rates. 
- Each value is a dict with keys: - - - ``"input"`` — cost per input token (required) - - ``"output"`` — cost per output token (required) - - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) - - Example: - ```python - calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, - }) - - model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) - ``` - - For university clusters or custom credit systems, the "cost" unit - is whatever the pricing values represent (credits, EUR, etc.): - - ```python - calculator = StaticPricingCalculator({ - "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token - }) - ``` - """ - - def __init__(self, pricing: Dict[str, Dict[str, float]]): - self._pricing = pricing - - def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: - """Compute cost from static per-token rates. - - Args: - usage: Token usage from the call. - model_id: The model identifier to look up in the pricing table. - - Returns: - Computed cost, or ``None`` if the model is not in the pricing table. - """ - rates = self._pricing.get(model_id) - if rates is None: - return None - - input_rate = rates.get("input", 0.0) - output_rate = rates.get("output", 0.0) - cached_rate = rates.get("cached_input", input_rate) - - # Non-cached input tokens = total input - cached - non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) - - cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate - - return cost - - def add_model(self, model_id: str, rates: Dict[str, float]) -> None: - """Add or update pricing for a model. - - Args: - model_id: The model identifier. - rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). 
- """ - self._pricing[model_id] = rates - - @property - def models(self) -> list[str]: - """List of model IDs with pricing configured.""" - return list(self._pricing.keys()) - - def gather_config(self) -> Dict[str, Any]: - """Return pricing configuration for reproducibility.""" - return { - "type": type(self).__name__, - "pricing": dict(self._pricing), - } diff --git a/maseval/core/model.py b/maseval/core/model.py index f33e48f4..110d9879 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -55,8 +55,7 @@ from .tracing import TraceableMixin from .config import ConfigurableMixin -from .usage import Usage, TokenUsage, UsageTrackableMixin -from .cost import CostCalculator +from .usage import Usage, TokenUsage, UsageTrackableMixin, CostCalculator from .history import MessageHistory diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 78edeab9..aa2c2e08 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -1,19 +1,26 @@ """Core usage tracking infrastructure for API cost and resource monitoring. This module provides the `Usage` and `TokenUsage` data classes for recording -billable resource consumption, and the `UsageTrackableMixin` that enables -automatic usage collection through the component registry. +billable resource consumption, the `UsageTrackableMixin` that enables +automatic usage collection through the component registry, and pluggable +cost calculators (`CostCalculator`, `StaticPricingCalculator`) for translating +token counts into monetary cost. Usage tracking is a first-class collection axis alongside tracing (`TraceableMixin`) and configuration (`ConfigurableMixin`). Components that inherit `UsageTrackableMixin` have their usage automatically collected by the registry via `gather_usage()`. + +Cost calculators are optional — if no calculator is provided to a +``ModelAdapter``, cost is only set when the provider reports it directly +(e.g., LiteLLM's ``response._hidden_params.response_cost``). 
For automatic +pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. """ from __future__ import annotations from dataclasses import dataclass, field -from typing import Any, Optional, Dict +from typing import Any, Dict, Optional, Protocol, runtime_checkable @dataclass @@ -299,3 +306,118 @@ def gather_usage(self) -> Usage: ``` """ return Usage() + + +@runtime_checkable +class CostCalculator(Protocol): + """Protocol for computing cost from token usage. + + Implementations receive a ``TokenUsage`` and the model ID, and return + the cost in whatever unit the calculator declares (typically USD). + + Example: + ```python + class MyCostCalculator: + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + rate = MY_PRICING.get(model_id) + if rate is None: + return None + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens + ``` + """ + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost for a single chat call. + + Args: + usage: Token usage from the call. + model_id: The model identifier (e.g., ``"gpt-4"``, ``"claude-sonnet-4-5"``). + + Returns: + Cost as a float, or ``None`` if pricing is unknown for this model. + """ + ... + + +class StaticPricingCalculator: + """Cost calculator using user-supplied per-model pricing. + + Pricing is specified as cost per token (not per 1K or 1M tokens). + If a model is not in the pricing table, ``calculate_cost`` returns ``None``. + + Args: + pricing: Dict mapping model IDs to their per-token rates. 
+ Each value is a dict with keys: + + - ``"input"`` — cost per input token (required) + - ``"output"`` — cost per output token (required) + - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + + Example: + ```python + calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, + }) + + model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator) + ``` + + For university clusters or custom credit systems, the "cost" unit + is whatever the pricing values represent (credits, EUR, etc.): + + ```python + calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token + }) + ``` + """ + + def __init__(self, pricing: Dict[str, Dict[str, float]]): + self._pricing = pricing + + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: + """Compute cost from static per-token rates. + + Args: + usage: Token usage from the call. + model_id: The model identifier to look up in the pricing table. + + Returns: + Computed cost, or ``None`` if the model is not in the pricing table. + """ + rates = self._pricing.get(model_id) + if rates is None: + return None + + input_rate = rates.get("input", 0.0) + output_rate = rates.get("output", 0.0) + cached_rate = rates.get("cached_input", input_rate) + + # Non-cached input tokens = total input - cached + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + + cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + + return cost + + def add_model(self, model_id: str, rates: Dict[str, float]) -> None: + """Add or update pricing for a model. + + Args: + model_id: The model identifier. + rates: Per-token rates (``"input"``, ``"output"``, optionally ``"cached_input"``). 
+ """ + self._pricing[model_id] = rates + + @property + def models(self) -> list[str]: + """List of model IDs with pricing configured.""" + return list(self._pricing.keys()) + + def gather_config(self) -> Dict[str, Any]: + """Return pricing configuration for reproducibility.""" + return { + "type": type(self).__name__, + "pricing": dict(self._pricing), + } diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 5e0c92d4..5c816e76 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -52,7 +52,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator from maseval.core.seeding import SeedingError diff --git a/maseval/interface/inference/google_genai.py b/maseval/interface/inference/google_genai.py index 8ceb466a..5bbf33ce 100644 --- a/maseval/interface/inference/google_genai.py +++ b/maseval/interface/inference/google_genai.py @@ -47,7 +47,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class GoogleGenAIModelAdapter(ModelAdapter): diff --git a/maseval/interface/inference/huggingface.py b/maseval/interface/inference/huggingface.py index f28cc293..9de0c7df 100644 --- a/maseval/interface/inference/huggingface.py +++ b/maseval/interface/inference/huggingface.py @@ -34,7 +34,7 @@ from typing import Any, Optional, Dict, List, Callable, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class ToolCallingNotSupportedError(Exception): diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index a13fcd6d..ce5385e7 100644 --- 
a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -44,7 +44,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class LiteLLMModelAdapter(ModelAdapter): diff --git a/maseval/interface/inference/openai.py b/maseval/interface/inference/openai.py index fd31fd44..ff0c4245 100644 --- a/maseval/interface/inference/openai.py +++ b/maseval/interface/inference/openai.py @@ -50,7 +50,7 @@ from typing import Any, Optional, Dict, List, Union from maseval.core.model import ModelAdapter, ChatResponse -from maseval.core.cost import CostCalculator +from maseval.core.usage import CostCalculator class OpenAIModelAdapter(ModelAdapter): diff --git a/maseval/interface/cost.py b/maseval/interface/usage.py similarity index 87% rename from maseval/interface/cost.py rename to maseval/interface/usage.py index 77d4f48d..87070f13 100644 --- a/maseval/interface/cost.py +++ b/maseval/interface/usage.py @@ -1,4 +1,4 @@ -"""Cost calculators that depend on optional third-party packages. +"""Usage and cost utilities that depend on optional third-party packages. This module provides ``LiteLLMCostCalculator``, which uses LiteLLM's bundled model pricing database to compute cost from token counts. 
@@ -10,8 +10,7 @@ from typing import Any, Dict, Optional -from maseval.core.cost import CostCalculator # noqa: F401 — re-export protocol -from maseval.core.usage import TokenUsage +from maseval.core.usage import CostCalculator, TokenUsage # noqa: F401 — re-export protocol class LiteLLMCostCalculator: @@ -34,7 +33,7 @@ class LiteLLMCostCalculator: Example: ```python - from maseval.interface.cost import LiteLLMCostCalculator + from maseval.interface.usage import LiteLLMCostCalculator from maseval.interface.inference import OpenAIModelAdapter calculator = LiteLLMCostCalculator() @@ -77,18 +76,24 @@ def __init__( raise ImportError("LiteLLMCostCalculator requires litellm. Install it with: pip install litellm") from e self._custom_pricing = custom_pricing or {} + self._model_id_map = model_id_map or {} def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: """Compute cost using LiteLLM's pricing database. Args: usage: Token usage from the call. - model_id: The model identifier (must match LiteLLM's naming). + model_id: The model identifier. Remapped via ``model_id_map`` + if configured, then looked up in custom pricing and + LiteLLM's database. Returns: Cost in USD, or ``None`` if LiteLLM doesn't have pricing for this model and no custom pricing was provided. 
""" + # Remap model ID if configured + model_id = self._model_id_map.get(model_id, model_id) + # Check custom overrides first if model_id in self._custom_pricing: rates = self._custom_pricing[model_id] @@ -115,4 +120,5 @@ def gather_config(self) -> Dict[str, Any]: return { "type": type(self).__name__, "custom_pricing": dict(self._custom_pricing) if self._custom_pricing else None, + "model_id_map": dict(self._model_id_map) if self._model_id_map else None, } diff --git a/mkdocs.yml b/mkdocs.yml index 4b489f50..6cbce841 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -95,6 +95,7 @@ nav: - Configuration Gathering: guides/config-gathering.md - Exception Handling: guides/exception-handling.md - Seeding: guides/seeding.md + - Usage & Cost Tracking: guides/usage-tracking.md - Examples: - examples/index.md - Tiny Tutorial: examples/tutorial.ipynb @@ -113,6 +114,7 @@ nav: - Seeding: reference/seeding.md - Simulator: reference/simulator.md - Tasks: reference/task.md + - Usage & Cost: reference/usage.md - User: reference/user.md - Utilities: reference/utilities.md - Interface: From afe16a71488802a5fde38330e50e02a22b9b7728 Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 14:30:09 +0100 Subject: [PATCH 05/19] - updated examples - moved reporting to usage --- CHANGELOG.md | 16 +- docs/reference/usage.md | 6 +- .../five_a_day_benchmark.ipynb | 69 ++------ .../five_a_day_benchmark.py | 22 ++- maseval/__init__.py | 3 +- maseval/core/reporting.py | 148 ------------------ maseval/core/usage.py | 144 ++++++++++++++++- 7 files changed, 192 insertions(+), 216 deletions(-) delete mode 100644 maseval/core/reporting.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 6f74dab6..98a6f0d3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** +- Usage and cost tracking as a first-class collection axis alongside tracing and configuration. 
`Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) +- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) +- `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) +- `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) +- Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) +- `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. (PR: #45) + - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42) - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. 
(PR: #42) @@ -39,10 +46,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Examples** +- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45) + - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34) - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28) - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26) +**Documentation** + +- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) covering automatic LLM tracking, cost calculators, non-LLM usage, post-hoc analysis with `UsageReporter`, and the data model. (PR: #45) +- Usage & Cost reference page (`docs/reference/usage.md`) with API documentation for all usage and cost classes. (PR: #45) + **Core** - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24) @@ -108,8 +122,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - `LangGraphUser` → `LangGraphLLMUser` - `LlamaIndexUser` → `LlamaIndexLLMUser` -**Documentation** - - All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. 
(PR: #39) **Testing** diff --git a/docs/reference/usage.md b/docs/reference/usage.md index 2326aaef..460769bf 100644 --- a/docs/reference/usage.md +++ b/docs/reference/usage.md @@ -18,11 +18,7 @@ See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage pat ::: maseval.core.usage.StaticPricingCalculator -## Reporting - -[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/reporting.py){ .md-source-file } - -::: maseval.core.reporting.UsageReporter +::: maseval.core.usage.UsageReporter ## Interface diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb index 1eebd71f..2e4a6375 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb @@ -651,64 +651,25 @@ " print(f\"{k:<35} {v}\")" ] }, + { + "cell_type": "markdown", + "id": "tspsj2zzdyo", + "source": "### 2.7 Usage & Cost Tracking\n\nMASEval automatically tracks token usage for every LLM call made during benchmark execution. Each report includes a `\"usage\"` key with per-component breakdowns, and the benchmark maintains running totals across all tasks.\n\nFor cost estimation, pass a `CostCalculator` to your model adapters. MASEval ships two built-in calculators:\n\n- **`StaticPricingCalculator`** — user-supplied per-token rates (no dependencies)\n- **`LiteLLMCostCalculator`** — automatic pricing via LiteLLM's model database (requires `litellm`)\n\nSince this benchmark uses smolagents with LiteLLM models (which don't go through MASEval's `ModelAdapter`), token usage is tracked at the tool level. 
In benchmarks that use MASEval's model adapters directly, token-level usage and cost are captured automatically.", + "metadata": {} + }, + { + "cell_type": "code", + "id": "amrylkbxkb7", + "source": "from maseval import UsageReporter\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total cost: {f'${total.cost:.6f}' if total.cost is not None else 'N/A (no cost calculator)'}\")\nprint(f\" Total units: {dict(total.units) if total.units else '{}'}\")\nprint()\n\n# Per-component breakdown\nprint(\"Per-Component Breakdown\")\nprint(\"-\" * 60)\nfor component_key, usage in benchmark.usage_by_component.items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n units_str = dict(usage.units) if usage.units else \"\"\n print(f\" {component_key:<35} cost={cost_str} units={units_str}\")\nprint()\n\n# --- Post-hoc analysis with UsageReporter ---\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n print(f\" {task_id:<35} cost={cost_str}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", + "metadata": {}, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "id": "080f6216", "metadata": {}, - "source": [ - "## Summary and Key Takeaways\n", - "\n", - "### What You've Learned\n", - "\n", - "You now understand how to build production agent benchmarks with MASEval:\n", - "\n", - "#### Part 1: Multi-Agent Systems\n", - "- **Model creation** with LiteLLM for framework compatibility\n", - "- **Framework-agnostic tools** that convert to any agent library\n", - "- **Multi-agent architecture** with orchestrators and specialists\n", - "- **Tool state management** for realistic task environments\n", - "\n", 
- "#### Part 2: MASEval Framework\n", - "- **Task abstraction** packages queries, environments, and evaluation criteria\n", - "- **Environment class** creates tools and enables automatic tracing\n", - "- **Benchmark class** orchestrates evaluation across multiple tasks\n", - "- **Custom evaluators** for diverse evaluation approaches (unit tests, LLM judges, etc.)\n", - "- **Automatic tracing** captures all tool calls and agent interactions\n", - "\n", - "### Key Design Patterns\n", - "\n", - "1. **Separation of Concerns**:\n", - " - Tasks define WHAT to evaluate\n", - " - Environments provides a world in which the agents act (tools and state)\n", - " - Benchmarks orchestrate WHEN and WHERE\n", - " - Evaluators determine SUCCESS\n", - "\n", - "2. **Framework Agnostic**:\n", - " - Same tasks work with smolagents, LangGraph, LlamaIndex\n", - " - Tools convert automatically to framework-specific formats\n", - " - Easy to compare frameworks on identical tasks\n", - "\n", - "3. **Reproducibility**:\n", - " - Seeds derived systematically from task_id + agent_id\n", - " - All parameters logged automatically\n", - " - Results saved in structured JSONL format\n", - "\n", - "## Next Steps\n", - "\n", - "1. **Explore evaluators** — Check `evaluators/` for different evaluation strategies\n", - "2. **Try single-agent mode** — Load `data/singleagent.json` to compare architectures\n", - "3. **Run from CLI** — Use `five_a_day_benchmark.py` for scripted runs with different frameworks\n", - "4. **Add custom tasks** — Create your own task definitions and evaluators\n", - "5. 
**Compare frameworks** — Run the same benchmark with LangGraph or LlamaIndex\n", - "\n", - "## Resources\n", - "\n", - "- [MASEval Documentation](https://github.com/parameterlab/MASEval)\n", - "- Example code: [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark)\n", - "- Example data: [`examples/five_a_day_benchmark/data/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/data)\n", - "- Tool implementations: [`examples/five_a_day_benchmark/tools/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/tools)\n", - "- Evaluator implementations: [`examples/five_a_day_benchmark/evaluators/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/evaluators)" - ] + "source": "## Summary and Key Takeaways\n\n### What You've Learned\n\nYou now understand how to build production agent benchmarks with MASEval:\n\n#### Part 1: Multi-Agent Systems\n- **Model creation** with LiteLLM for framework compatibility\n- **Framework-agnostic tools** that convert to any agent library\n- **Multi-agent architecture** with orchestrators and specialists\n- **Tool state management** for realistic task environments\n\n#### Part 2: MASEval Framework\n- **Task abstraction** packages queries, environments, and evaluation criteria\n- **Environment class** creates tools and enables automatic tracing\n- **Benchmark class** orchestrates evaluation across multiple tasks\n- **Custom evaluators** for diverse evaluation approaches (unit tests, LLM judges, etc.)\n- **Automatic tracing** captures all tool calls and agent interactions\n- **Usage & cost tracking** monitors token consumption and computes cost across providers\n\n### Key Design Patterns\n\n1. 
**Separation of Concerns**:\n - Tasks define WHAT to evaluate\n - Environments provide a world in which the agents act (tools and state)\n - Benchmarks orchestrate WHEN and WHERE\n - Evaluators determine SUCCESS\n\n2. **Framework Agnostic**:\n - Same tasks work with smolagents, LangGraph, LlamaIndex\n - Tools convert automatically to framework-specific formats\n - Easy to compare frameworks on identical tasks\n\n3. **Reproducibility**:\n - Seeds derived systematically from task_id + agent_id\n - All parameters logged automatically\n - Results saved in structured JSONL format\n\n## Next Steps\n\n1. **Explore evaluators** — Check `evaluators/` for different evaluation strategies\n2. **Try single-agent mode** — Load `data/singleagent.json` to compare architectures\n3. **Run from CLI** — Use `five_a_day_benchmark.py` for scripted runs with different frameworks\n4. **Add custom tasks** — Create your own task definitions and evaluators\n5. **Compare frameworks** — Run the same benchmark with LangGraph or LlamaIndex\n\n## Resources\n\n- [MASEval Documentation](https://github.com/parameterlab/MASEval)\n- Example code: [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark)\n- Example data: [`examples/five_a_day_benchmark/data/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/data)\n- Tool implementations: [`examples/five_a_day_benchmark/tools/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/tools)\n- Evaluator implementations: [`examples/five_a_day_benchmark/evaluators/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/evaluators)" } ], "metadata": { diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py index 6972a203..a3972a9c 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ 
b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -26,7 +26,7 @@ from utils import sanitize_name # type: ignore[unresolved-import] -from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue, AgentAdapter, ModelAdapter, SeedGenerator +from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue, AgentAdapter, ModelAdapter, SeedGenerator, UsageReporter from maseval.core.callbacks.result_logger import FileResultLogger # Import tool implementations @@ -960,6 +960,26 @@ def load_benchmark_data( ) results = benchmark.run(tasks=tasks, agent_data=agent_configs) + # --- Usage summary --- + print("\n--- Usage Summary ---") + total = benchmark.usage + cost_str = f"${total.cost:.6f}" if total.cost is not None else "N/A (no cost calculator)" + print(f"Total cost: {cost_str}") + + if benchmark.usage_by_component: + print("\nPer-component:") + for key, usage in benchmark.usage_by_component.items(): + c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" + print(f" {key:<35} cost={c} units={dict(usage.units) if usage.units else '{}'}") + + reporter = UsageReporter.from_reports(results) + by_task = reporter.by_task() + if by_task: + print("\nPer-task:") + for task_id, usage in by_task.items(): + c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" + print(f" {task_id:<35} cost={c}") + print("\n--- Benchmark Complete ---") print(f"Total tasks: {len(tasks)}") print(f"Results saved to: {logger.output_dir}") diff --git a/maseval/__init__.py b/maseval/__init__.py index 460bf1a9..c50012ac 100644 --- a/maseval/__init__.py +++ b/maseval/__init__.py @@ -38,8 +38,7 @@ from .core.history import MessageHistory, ToolInvocationHistory from .core.tracing import TraceableMixin from .core.usage import Usage, TokenUsage, UsageTrackableMixin -from .core.usage import CostCalculator, StaticPricingCalculator -from .core.reporting import UsageReporter +from .core.usage import CostCalculator, StaticPricingCalculator, UsageReporter from .core.registry import 
ComponentRegistry from .core.context import TaskContext from .core.exceptions import ( diff --git a/maseval/core/reporting.py b/maseval/core/reporting.py deleted file mode 100644 index 2465fba0..00000000 --- a/maseval/core/reporting.py +++ /dev/null @@ -1,148 +0,0 @@ -"""Post-hoc usage reporting utilities. - -This module provides ``UsageReporter`` for slicing and analyzing usage data -from benchmark reports. Unlike the registry's live aggregates (which provide -running totals), the reporter can slice by task since it sees the full report -list with task IDs. -""" - -from __future__ import annotations - -from typing import Any, Dict, List - -from .usage import Usage, TokenUsage - - -class UsageReporter: - """Post-hoc utility for analyzing usage across benchmark reports. - - Walks ``report["usage"]`` across all reports to produce breakdowns - by task, component, model, etc. - - Example: - ```python - reporter = UsageReporter.from_reports(benchmark.reports) - print(reporter.total()) - print(reporter.by_task()) - print(reporter.by_component()) - ``` - """ - - def __init__(self, entries: List[Dict[str, Any]]): - """Initialize with raw entries extracted from reports. - - Args: - entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, - and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). - """ - self._entries = entries - - @staticmethod - def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: - """Create a UsageReporter from benchmark reports. - - Args: - reports: The ``benchmark.reports`` list. - - Returns: - A UsageReporter ready for analysis. 
- """ - entries = [] - for report in reports: - usage_data = report.get("usage") - if not usage_data or "error" in usage_data: - continue - - usage_items = [] - for category, value in usage_data.items(): - if category == "metadata": - continue - if isinstance(value, dict) and "cost" in value: - # Direct value (environment/user) — it's a usage dict - usage_items.append((category, value)) - elif isinstance(value, dict): - # Category dict with component names as keys - for comp_name, comp_usage in value.items(): - if isinstance(comp_usage, dict) and "error" not in comp_usage: - usage_items.append((f"{category}:{comp_name}", comp_usage)) - - entries.append( - { - "task_id": report.get("task_id"), - "repeat_idx": report.get("repeat_idx"), - "usage_items": usage_items, - } - ) - - return UsageReporter(entries) - - @staticmethod - def _usage_from_dict(d: Dict[str, Any]) -> Usage: - """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" - has_tokens = "input_tokens" in d - if has_tokens: - return TokenUsage( - cost=d.get("cost"), - units=d.get("units", {}), - provider=d.get("provider"), - category=d.get("category"), - component_name=d.get("component_name"), - kind=d.get("kind"), - input_tokens=d.get("input_tokens", 0), - output_tokens=d.get("output_tokens", 0), - total_tokens=d.get("total_tokens", 0), - cached_input_tokens=d.get("cached_input_tokens", 0), - reasoning_tokens=d.get("reasoning_tokens", 0), - audio_tokens=d.get("audio_tokens", 0), - ) - return Usage( - cost=d.get("cost"), - units=d.get("units", {}), - provider=d.get("provider"), - category=d.get("category"), - component_name=d.get("component_name"), - kind=d.get("kind"), - ) - - def by_task(self) -> Dict[str, Usage]: - """Aggregate usage by task_id across all repetitions.""" - result: Dict[str, Usage] = {} - for entry in self._entries: - task_id = entry["task_id"] - for _key, usage_dict in entry["usage_items"]: - usage = self._usage_from_dict(usage_dict) - if task_id in result: - result[task_id] = 
result[task_id] + usage - else: - result[task_id] = usage - return result - - def by_component(self) -> Dict[str, Usage]: - """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" - result: Dict[str, Usage] = {} - for entry in self._entries: - for key, usage_dict in entry["usage_items"]: - usage = self._usage_from_dict(usage_dict) - if key in result: - result[key] = result[key] + usage - else: - result[key] = usage - return result - - def total(self) -> Usage: - """Grand total across all tasks and components.""" - all_usages = [] - for entry in self._entries: - for _key, usage_dict in entry["usage_items"]: - all_usages.append(self._usage_from_dict(usage_dict)) - if not all_usages: - return Usage() - return sum(all_usages, Usage()) - - def summary(self) -> Dict[str, Any]: - """Nested dict with all breakdowns.""" - return { - "total": self.total().to_dict(), - "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, - "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, - } diff --git a/maseval/core/usage.py b/maseval/core/usage.py index aa2c2e08..319d89bb 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -2,9 +2,10 @@ This module provides the `Usage` and `TokenUsage` data classes for recording billable resource consumption, the `UsageTrackableMixin` that enables -automatic usage collection through the component registry, and pluggable +automatic usage collection through the component registry, pluggable cost calculators (`CostCalculator`, `StaticPricingCalculator`) for translating -token counts into monetary cost. +token counts into monetary cost, and `UsageReporter` for post-hoc analysis +of usage data from benchmark reports. Usage tracking is a first-class collection axis alongside tracing (`TraceableMixin`) and configuration (`ConfigurableMixin`). 
Components that @@ -20,7 +21,7 @@ from __future__ import annotations from dataclasses import dataclass, field -from typing import Any, Dict, Optional, Protocol, runtime_checkable +from typing import Any, Dict, List, Optional, Protocol, runtime_checkable @dataclass @@ -411,7 +412,7 @@ def add_model(self, model_id: str, rates: Dict[str, float]) -> None: self._pricing[model_id] = rates @property - def models(self) -> list[str]: + def models(self) -> List[str]: """List of model IDs with pricing configured.""" return list(self._pricing.keys()) @@ -421,3 +422,138 @@ def gather_config(self) -> Dict[str, Any]: "type": type(self).__name__, "pricing": dict(self._pricing), } + + +class UsageReporter: + """Post-hoc utility for analyzing usage across benchmark reports. + + Walks ``report["usage"]`` across all reports to produce breakdowns + by task, component, model, etc. + + Example: + ```python + reporter = UsageReporter.from_reports(benchmark.reports) + print(reporter.total()) + print(reporter.by_task()) + print(reporter.by_component()) + ``` + """ + + def __init__(self, entries: List[Dict[str, Any]]): + """Initialize with raw entries extracted from reports. + + Args: + entries: List of dicts, each with ``"task_id"``, ``"repeat_idx"``, + and ``"usage_items"`` (list of ``(key, usage_dict)`` tuples). + """ + self._entries = entries + + @staticmethod + def from_reports(reports: List[Dict[str, Any]]) -> UsageReporter: + """Create a UsageReporter from benchmark reports. + + Args: + reports: The ``benchmark.reports`` list. + + Returns: + A UsageReporter ready for analysis. 
+ """ + entries = [] + for report in reports: + usage_data = report.get("usage") + if not usage_data or "error" in usage_data: + continue + + usage_items = [] + for category, value in usage_data.items(): + if category == "metadata": + continue + if isinstance(value, dict) and "cost" in value: + # Direct value (environment/user) — it's a usage dict + usage_items.append((category, value)) + elif isinstance(value, dict): + # Category dict with component names as keys + for comp_name, comp_usage in value.items(): + if isinstance(comp_usage, dict) and "error" not in comp_usage: + usage_items.append((f"{category}:{comp_name}", comp_usage)) + + entries.append( + { + "task_id": report.get("task_id"), + "repeat_idx": report.get("repeat_idx"), + "usage_items": usage_items, + } + ) + + return UsageReporter(entries) + + @staticmethod + def _usage_from_dict(d: Dict[str, Any]) -> Usage: + """Reconstruct a Usage (or TokenUsage) from a serialized dict.""" + has_tokens = "input_tokens" in d + if has_tokens: + return TokenUsage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + input_tokens=d.get("input_tokens", 0), + output_tokens=d.get("output_tokens", 0), + total_tokens=d.get("total_tokens", 0), + cached_input_tokens=d.get("cached_input_tokens", 0), + reasoning_tokens=d.get("reasoning_tokens", 0), + audio_tokens=d.get("audio_tokens", 0), + ) + return Usage( + cost=d.get("cost"), + units=d.get("units", {}), + provider=d.get("provider"), + category=d.get("category"), + component_name=d.get("component_name"), + kind=d.get("kind"), + ) + + def by_task(self) -> Dict[str, Usage]: + """Aggregate usage by task_id across all repetitions.""" + result: Dict[str, Usage] = {} + for entry in self._entries: + task_id = entry["task_id"] + for _key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if task_id in result: + result[task_id] = 
result[task_id] + usage + else: + result[task_id] = usage + return result + + def by_component(self) -> Dict[str, Usage]: + """Aggregate usage by registry key (e.g., ``"models:main_model"``).""" + result: Dict[str, Usage] = {} + for entry in self._entries: + for key, usage_dict in entry["usage_items"]: + usage = self._usage_from_dict(usage_dict) + if key in result: + result[key] = result[key] + usage + else: + result[key] = usage + return result + + def total(self) -> Usage: + """Grand total across all tasks and components.""" + all_usages = [] + for entry in self._entries: + for _key, usage_dict in entry["usage_items"]: + all_usages.append(self._usage_from_dict(usage_dict)) + if not all_usages: + return Usage() + return sum(all_usages, Usage()) + + def summary(self) -> Dict[str, Any]: + """Nested dict with all breakdowns.""" + return { + "total": self.total().to_dict(), + "by_task": {k: v.to_dict() for k, v in self.by_task().items()}, + "by_component": {k: v.to_dict() for k, v in self.by_component().items()}, + } From 13067b93c242abef1cfc587a760034b0bbc4e24f Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 16:49:30 +0100 Subject: [PATCH 06/19] updated tests and fixed bugs in cost tracking --- docs/reference/usage.md | 4 - maseval/core/model.py | 5 +- maseval/core/usage.py | 26 +- maseval/interface/inference/anthropic.py | 3 + maseval/interface/inference/litellm.py | 3 + maseval/interface/usage.py | 2 + tests/test_core/test_usage.py | 681 ++++++++++++++++++ .../test_api_contracts.py | 422 +++++++++++ 8 files changed, 1137 insertions(+), 9 deletions(-) create mode 100644 tests/test_core/test_usage.py diff --git a/docs/reference/usage.md b/docs/reference/usage.md index 460769bf..87bbcfe9 100644 --- a/docs/reference/usage.md +++ b/docs/reference/usage.md @@ -4,8 +4,6 @@ Usage and cost tracking provides data classes for recording resource consumption See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage patterns and examples. 
-## Core - [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/usage.py){ .md-source-file } ::: maseval.core.usage.Usage @@ -20,8 +18,6 @@ See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage pat ::: maseval.core.usage.UsageReporter -## Interface - [:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/usage.py){ .md-source-file } ::: maseval.interface.usage.LiteLLMCostCalculator diff --git a/maseval/core/model.py b/maseval/core/model.py index 110d9879..4c4a7f6a 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -402,7 +402,10 @@ def gather_usage(self) -> Usage: """ if not self._usage_records: return Usage() - return sum(self._usage_records, Usage()) + result = self._usage_records[0] + for record in self._usage_records[1:]: + result = result + record + return result def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this model adapter. diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 319d89bb..2a53931c 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -130,6 +130,8 @@ class TokenUsage(Usage): total_tokens: Total tokens (input + output). cached_input_tokens: Tokens served from cache (Anthropic ``cache_read_input_tokens``, OpenAI ``cached_tokens``). + cache_creation_input_tokens: Tokens used to create a new cache entry + (Anthropic ``cache_creation_input_tokens``). Billed at a higher rate. reasoning_tokens: Tokens used for reasoning (OpenAI ``reasoning_tokens``, Google ``thoughts_token_count``). audio_tokens: Tokens for audio processing (OpenAI). 
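The extended addition semantics described above can be sanity-checked with a standalone sketch. This is a simplified stand-in dataclass mirroring the `TokenUsage.__add__` behavior this patch specifies (token fields sum; cost sums only when both operands report one), not the real maseval import:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for maseval's TokenUsage, mirroring the addition
# semantics in this patch: every token counter sums, and cost is only
# summed when both operands know their cost (otherwise it stays None).
@dataclass
class TokenUsageSketch:
    cost: Optional[float] = None
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0
    cached_input_tokens: int = 0
    cache_creation_input_tokens: int = 0

    def __add__(self, other: "TokenUsageSketch") -> "TokenUsageSketch":
        cost = (
            self.cost + other.cost
            if self.cost is not None and other.cost is not None
            else None
        )
        return TokenUsageSketch(
            cost=cost,
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            total_tokens=self.total_tokens + other.total_tokens,
            cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens,
            cache_creation_input_tokens=self.cache_creation_input_tokens
            + other.cache_creation_input_tokens,
        )

a = TokenUsageSketch(cost=0.10, input_tokens=100, output_tokens=50,
                     total_tokens=150, cache_creation_input_tokens=10)
b = TokenUsageSketch(cost=None, input_tokens=200, output_tokens=30,
                     total_tokens=230, cache_creation_input_tokens=5)
total = a + b
print(total.cache_creation_input_tokens)  # 15
print(total.cost)  # None — one unknown cost poisons the whole sum
```

The None-propagation is deliberate: a partially known total cost would be misleading, so any record without a cost makes the aggregate cost `None` while token counts stay exact.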
@@ -149,6 +151,7 @@ class TokenUsage(Usage): output_tokens: int = 0 total_tokens: int = 0 cached_input_tokens: int = 0 + cache_creation_input_tokens: int = 0 reasoning_tokens: int = 0 audio_tokens: int = 0 @@ -169,6 +172,7 @@ def __add__(self, other: Usage) -> Usage: output_tokens=self.output_tokens + other.output_tokens, total_tokens=self.total_tokens + other.total_tokens, cached_input_tokens=self.cached_input_tokens + other.cached_input_tokens, + cache_creation_input_tokens=self.cache_creation_input_tokens + other.cache_creation_input_tokens, reasoning_tokens=self.reasoning_tokens + other.reasoning_tokens, audio_tokens=self.audio_tokens + other.audio_tokens, ) @@ -185,6 +189,7 @@ def __add__(self, other: Usage) -> Usage: output_tokens=self.output_tokens, total_tokens=self.total_tokens, cached_input_tokens=self.cached_input_tokens, + cache_creation_input_tokens=self.cache_creation_input_tokens, reasoning_tokens=self.reasoning_tokens, audio_tokens=self.audio_tokens, ) @@ -197,6 +202,7 @@ def to_dict(self) -> Dict[str, Any]: "output_tokens": self.output_tokens, "total_tokens": self.total_tokens, "cached_input_tokens": self.cached_input_tokens, + "cache_creation_input_tokens": self.cache_creation_input_tokens, "reasoning_tokens": self.reasoning_tokens, "audio_tokens": self.audio_tokens, } @@ -237,6 +243,7 @@ def from_chat_response_usage( output_tokens=usage_dict.get("output_tokens", 0), total_tokens=usage_dict.get("total_tokens", 0), cached_input_tokens=usage_dict.get("cached_input_tokens", 0), + cache_creation_input_tokens=usage_dict.get("cache_creation_input_tokens", 0), reasoning_tokens=usage_dict.get("reasoning_tokens", 0), audio_tokens=usage_dict.get("audio_tokens", 0), ) @@ -353,6 +360,7 @@ class StaticPricingCalculator: - ``"input"`` — cost per input token (required) - ``"output"`` — cost per output token (required) - ``"cached_input"`` — cost per cached input token (optional, defaults to ``"input"`` rate) + - ``"cache_creation_input"`` — cost per cache 
creation token (optional, defaults to ``"input"`` rate) Example: ```python @@ -394,11 +402,17 @@ def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: input_rate = rates.get("input", 0.0) output_rate = rates.get("output", 0.0) cached_rate = rates.get("cached_input", input_rate) + cache_creation_rate = rates.get("cache_creation_input", input_rate) - # Non-cached input tokens = total input - cached - non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens) + # Non-cached input tokens = total input - cached - cache_creation + non_cached_input = max(0, usage.input_tokens - usage.cached_input_tokens - usage.cache_creation_input_tokens) - cost = non_cached_input * input_rate + usage.cached_input_tokens * cached_rate + usage.output_tokens * output_rate + cost = ( + non_cached_input * input_rate + + usage.cached_input_tokens * cached_rate + + usage.cache_creation_input_tokens * cache_creation_rate + + usage.output_tokens * output_rate + ) return cost @@ -503,6 +517,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: output_tokens=d.get("output_tokens", 0), total_tokens=d.get("total_tokens", 0), cached_input_tokens=d.get("cached_input_tokens", 0), + cache_creation_input_tokens=d.get("cache_creation_input_tokens", 0), reasoning_tokens=d.get("reasoning_tokens", 0), audio_tokens=d.get("audio_tokens", 0), ) @@ -548,7 +563,10 @@ def total(self) -> Usage: all_usages.append(self._usage_from_dict(usage_dict)) if not all_usages: return Usage() - return sum(all_usages, Usage()) + result = all_usages[0] + for u in all_usages[1:]: + result = result + u + return result def summary(self) -> Dict[str, Any]: """Nested dict with all breakdowns.""" diff --git a/maseval/interface/inference/anthropic.py b/maseval/interface/inference/anthropic.py index 5c816e76..dfd07579 100644 --- a/maseval/interface/inference/anthropic.py +++ b/maseval/interface/inference/anthropic.py @@ -352,6 +352,9 @@ def _parse_response(self, response: Any) -> ChatResponse: 
cached = getattr(response.usage, "cache_read_input_tokens", 0) if cached: usage["cached_input_tokens"] = cached + cache_creation = getattr(response.usage, "cache_creation_input_tokens", 0) + if cache_creation: + usage["cache_creation_input_tokens"] = cache_creation # Extract stop reason stop_reason = None diff --git a/maseval/interface/inference/litellm.py b/maseval/interface/inference/litellm.py index ce5385e7..b12b618e 100644 --- a/maseval/interface/inference/litellm.py +++ b/maseval/interface/inference/litellm.py @@ -193,6 +193,9 @@ def _chat_impl( cached = getattr(prompt_details, "cached_tokens", 0) if cached: usage["cached_input_tokens"] = cached + cache_creation = getattr(prompt_details, "cache_creation_tokens", 0) + if cache_creation: + usage["cache_creation_input_tokens"] = cache_creation # LiteLLM provider-reported cost hidden = getattr(response, "_hidden_params", None) if hidden and isinstance(hidden, dict): diff --git a/maseval/interface/usage.py b/maseval/interface/usage.py index 87070f13..f5767e2d 100644 --- a/maseval/interface/usage.py +++ b/maseval/interface/usage.py @@ -109,6 +109,8 @@ def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: model=model_id, prompt_tokens=usage.input_tokens, completion_tokens=usage.output_tokens, + cache_read_input_tokens=usage.cached_input_tokens, + cache_creation_input_tokens=usage.cache_creation_input_tokens, ) return input_cost + output_cost except Exception: diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py new file mode 100644 index 00000000..fa6da9a6 --- /dev/null +++ b/tests/test_core/test_usage.py @@ -0,0 +1,681 @@ +"""Tests for usage tracking and cost calculation correctness. 
+ +Verifies that: +- TokenUsage arithmetic produces correct results +- StaticPricingCalculator computes exact expected costs +- LiteLLMCostCalculator passes the right parameters to litellm +- Full pipeline (adapter → TokenUsage → CostCalculator → cost) is correct +- UsageReporter aggregates correctly from report dicts +- Serialization roundtrips preserve all fields +""" + +import pytest + +from maseval.core.usage import ( + Usage, + TokenUsage, + StaticPricingCalculator, + UsageReporter, +) + +pytestmark = [pytest.mark.core] + + +# ============================================================================= +# TokenUsage — Construction & Serialization +# ============================================================================= + + +class TestTokenUsageConstruction: + """Verify TokenUsage fields map correctly from various sources.""" + + def test_from_chat_response_basic(self): + """Minimal usage dict maps to the right fields.""" + tu = TokenUsage.from_chat_response_usage( + {"input_tokens": 100, "output_tokens": 50, "total_tokens": 150} + ) + assert tu.input_tokens == 100 + assert tu.output_tokens == 50 + assert tu.total_tokens == 150 + assert tu.cached_input_tokens == 0 + assert tu.cache_creation_input_tokens == 0 + assert tu.reasoning_tokens == 0 + assert tu.audio_tokens == 0 + assert tu.cost is None + + def test_from_chat_response_all_fields(self): + """All optional fields are mapped when present.""" + tu = TokenUsage.from_chat_response_usage( + { + "input_tokens": 1000, + "output_tokens": 200, + "total_tokens": 1200, + "cached_input_tokens": 800, + "cache_creation_input_tokens": 50, + "reasoning_tokens": 100, + "audio_tokens": 10, + }, + cost=0.05, + provider="anthropic", + ) + assert tu.input_tokens == 1000 + assert tu.output_tokens == 200 + assert tu.cached_input_tokens == 800 + assert tu.cache_creation_input_tokens == 50 + assert tu.reasoning_tokens == 100 + assert tu.audio_tokens == 10 + assert tu.cost == 0.05 + assert tu.provider == "anthropic" + + 
def test_serialization_roundtrip(self): + """to_dict → from_dict preserves every field.""" + original = TokenUsage( + cost=0.123, + input_tokens=500, + output_tokens=100, + total_tokens=600, + cached_input_tokens=200, + cache_creation_input_tokens=50, + reasoning_tokens=80, + audio_tokens=5, + provider="openai", + category="models", + component_name="main_model", + kind="llm", + ) + d = original.to_dict() + + # Verify dict has all expected keys + assert d["input_tokens"] == 500 + assert d["output_tokens"] == 100 + assert d["total_tokens"] == 600 + assert d["cached_input_tokens"] == 200 + assert d["cache_creation_input_tokens"] == 50 + assert d["reasoning_tokens"] == 80 + assert d["audio_tokens"] == 5 + assert d["cost"] == 0.123 + assert d["provider"] == "openai" + assert d["category"] == "models" + assert d["component_name"] == "main_model" + assert d["kind"] == "llm" + + # Reconstruct via UsageReporter's deserialization path + reconstructed = UsageReporter._usage_from_dict(d) + assert isinstance(reconstructed, TokenUsage) + assert reconstructed.input_tokens == original.input_tokens + assert reconstructed.output_tokens == original.output_tokens + assert reconstructed.cached_input_tokens == original.cached_input_tokens + assert reconstructed.cache_creation_input_tokens == original.cache_creation_input_tokens + assert reconstructed.reasoning_tokens == original.reasoning_tokens + assert reconstructed.audio_tokens == original.audio_tokens + assert reconstructed.cost == original.cost + + +# ============================================================================= +# TokenUsage — Arithmetic +# ============================================================================= + + +class TestTokenUsageArithmetic: + """Verify addition produces mathematically correct results.""" + + def test_add_two_token_usages(self): + """All token fields and cost sum correctly.""" + a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150, cached_input_tokens=20, 
cache_creation_input_tokens=10) + b = TokenUsage(cost=0.05, input_tokens=200, output_tokens=30, total_tokens=230, cached_input_tokens=50, cache_creation_input_tokens=5) + total = a + b + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.15) + assert total.input_tokens == 300 + assert total.output_tokens == 80 + assert total.total_tokens == 380 + assert total.cached_input_tokens == 70 + assert total.cache_creation_input_tokens == 15 + + def test_sum_multiple(self): + """sum() over a list of TokenUsages works correctly.""" + records = [ + TokenUsage(cost=0.01, input_tokens=10, output_tokens=5, total_tokens=15), + TokenUsage(cost=0.02, input_tokens=20, output_tokens=10, total_tokens=30), + TokenUsage(cost=0.03, input_tokens=30, output_tokens=15, total_tokens=45), + ] + total = records[0] + for r in records[1:]: + total = total + r + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.06) + assert total.input_tokens == 60 + assert total.output_tokens == 30 + assert total.total_tokens == 90 + + def test_none_cost_propagates(self): + """If either cost is None, sum cost is None.""" + a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=None, input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.cost is None + # Token fields still sum correctly + assert total.input_tokens == 300 + assert total.output_tokens == 80 + + def test_grouping_fields_match(self): + """Matching grouping fields are preserved.""" + a = TokenUsage(cost=0.10, provider="anthropic", kind="llm", input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=0.05, provider="anthropic", kind="llm", input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.provider == "anthropic" + assert total.kind == "llm" + + def test_grouping_fields_mismatch(self): + """Mismatched grouping fields become None.""" + a = TokenUsage(cost=0.10, 
provider="anthropic", input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(cost=0.05, provider="openai", input_tokens=200, output_tokens=30, total_tokens=230) + total = a + b + + assert total.provider is None + + def test_add_token_usage_plus_plain_usage(self): + """TokenUsage + plain Usage preserves token fields from left operand.""" + token = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150, cached_input_tokens=20) + plain = Usage(cost=0.05, units={"api_calls": 1}) + total = token + plain + + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.15) + assert total.input_tokens == 100 + assert total.cached_input_tokens == 20 + assert total.units == {"api_calls": 1} + + +# ============================================================================= +# StaticPricingCalculator — Cost Correctness +# ============================================================================= + + +class TestStaticPricingCalculator: + """Verify cost formulas with hand-calculated expected values.""" + + def test_basic_cost(self): + """Simple input + output cost with no caching. + + 100 input * $0.01 = $1.00 + 50 output * $0.02 = $1.00 + Total = $2.00 + """ + calc = StaticPricingCalculator({ + "test-model": {"input": 0.01, "output": 0.02}, + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + cost = calc.calculate_cost(usage, "test-model") + + assert cost == pytest.approx(2.00) + + def test_cached_input_tokens(self): + """Cached tokens use the cheaper rate. 
+ + input_tokens=1000, cached_input_tokens=800 + Non-cached: 200 * $0.003 = $0.60 + Cached: 800 * $0.0003 = $0.24 + Output: 100 * $0.015 = $1.50 + Total = $2.34 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + }) + usage = TokenUsage(input_tokens=1000, output_tokens=100, total_tokens=1100, cached_input_tokens=800) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(2.34) + + def test_cache_creation_tokens(self): + """Cache creation tokens use the higher rate. + + input_tokens=1000, cached_input_tokens=600, cache_creation_input_tokens=200 + Non-cached: (1000 - 600 - 200) = 200 * $0.003 = $0.60 + Cached: 600 * $0.0003 = $0.18 + Cache creation: 200 * $0.00375 = $0.75 + Output: 100 * $0.015 = $1.50 + Total = $3.03 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + "cache_creation_input": 0.00375, + }, + }) + usage = TokenUsage( + input_tokens=1000, + output_tokens=100, + total_tokens=1100, + cached_input_tokens=600, + cache_creation_input_tokens=200, + ) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(3.03) + + def test_cache_creation_defaults_to_input_rate(self): + """When cache_creation_input is not specified, it defaults to the input rate. 
+ + input_tokens=1000, cache_creation_input_tokens=200 + Non-cached: 800 * $0.003 = $2.40 + Cache creation: 200 * $0.003 = $0.60 (uses input rate) + Output: 100 * $0.015 = $1.50 + Total = $4.50 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, + }) + usage = TokenUsage( + input_tokens=1000, + output_tokens=100, + total_tokens=1100, + cache_creation_input_tokens=200, + ) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(4.50) + + def test_unknown_model_returns_none(self): + """Model not in pricing table returns None, not zero.""" + calc = StaticPricingCalculator({"gpt-4": {"input": 0.01, "output": 0.02}}) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + assert calc.calculate_cost(usage, "unknown-model") is None + + def test_zero_tokens(self): + """Zero tokens produces zero cost.""" + calc = StaticPricingCalculator({"m": {"input": 0.01, "output": 0.02}}) + usage = TokenUsage(input_tokens=0, output_tokens=0, total_tokens=0) + + assert calc.calculate_cost(usage, "m") == pytest.approx(0.0) + + def test_real_world_anthropic_pricing(self): + """Real Anthropic Sonnet 4 pricing: $3/$15 per 1M tokens. + + 500 input * $0.000003 = $0.0015 + 200 output * $0.000015 = $0.003 + Total = $0.0045 + """ + calc = StaticPricingCalculator({ + "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, + }) + usage = TokenUsage(input_tokens=500, output_tokens=200, total_tokens=700) + cost = calc.calculate_cost(usage, "claude-sonnet-4-5") + + assert cost == pytest.approx(0.0045) + + def test_real_world_openai_pricing(self): + """Real GPT-4o pricing: $2.50/$10 per 1M tokens. 
+ + 1000 input * $0.0000025 = $0.0025 + 500 output * $0.000010 = $0.005 + Total = $0.0075 + """ + calc = StaticPricingCalculator({ + "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, + }) + usage = TokenUsage(input_tokens=1000, output_tokens=500, total_tokens=1500) + cost = calc.calculate_cost(usage, "gpt-4o") + + assert cost == pytest.approx(0.0075) + + +# ============================================================================= +# LiteLLMCostCalculator — Parameter Passing +# ============================================================================= + + +class TestLiteLLMCostCalculator: + """Verify LiteLLMCostCalculator passes the right params to litellm.""" + + def test_passes_cache_tokens_to_cost_per_token(self): + """Verify cache_read and cache_creation tokens are forwarded.""" + litellm = pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator() + usage = TokenUsage( + input_tokens=1000, + output_tokens=200, + total_tokens=1200, + cached_input_tokens=600, + cache_creation_input_tokens=100, + ) + + with patch("litellm.cost_per_token", return_value=(0.003, 0.006)) as mock_cpt: + cost = calc.calculate_cost(usage, "claude-sonnet-4-5-20250514") + + mock_cpt.assert_called_once_with( + model="claude-sonnet-4-5-20250514", + prompt_tokens=1000, + completion_tokens=200, + cache_read_input_tokens=600, + cache_creation_input_tokens=100, + ) + assert cost == pytest.approx(0.009) + + def test_model_id_map_remapping(self): + """model_id_map remaps before calling litellm.""" + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token", return_value=(0.001, 0.002)) as mock_cpt: + 
calc.calculate_cost(usage, "gemini-2.0-flash") + + # Verify it called with the remapped ID + assert mock_cpt.call_args.kwargs["model"] == "gemini/gemini-2.0-flash" + + def test_custom_pricing_overrides_litellm(self): + """custom_pricing takes precedence over litellm database. + + 100 input * $0.0001 = $0.01 + 50 output * $0.0002 = $0.01 + Total = $0.02 + """ + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator(custom_pricing={ + "my-model": { + "input_cost_per_token": 0.0001, + "output_cost_per_token": 0.0002, + }, + }) + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token") as mock_cpt: + cost = calc.calculate_cost(usage, "my-model") + + # litellm.cost_per_token should NOT be called + mock_cpt.assert_not_called() + assert cost == pytest.approx(0.02) + + def test_unknown_model_returns_none(self): + """Model not in litellm's database returns None.""" + pytest.importorskip("litellm") + from unittest.mock import patch + from maseval.interface.usage import LiteLLMCostCalculator + + calc = LiteLLMCostCalculator() + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + + with patch("litellm.cost_per_token", side_effect=Exception("not found")): + cost = calc.calculate_cost(usage, "nonexistent-model-xyz") + + assert cost is None + + +# ============================================================================= +# Full Pipeline — DummyModelAdapter + CostCalculator +# ============================================================================= + + +class TestFullPipeline: + """End-to-end: adapter → TokenUsage → CostCalculator → gather_usage().cost. + + Uses DummyModelAdapter from conftest with known usage dicts and a + StaticPricingCalculator with known rates, then verifies the final cost + matches hand-calculated values. 
+    """
+
+    def test_basic_pipeline(self):
+        """Single chat call → correct cost on gather_usage().
+
+        100 input * $0.01 + 50 output * $0.02 = $2.00
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.input_tokens == 100
+        assert total.output_tokens == 50
+        assert total.cost == pytest.approx(2.00)
+
+    def test_pipeline_multiple_calls_accumulate(self):
+        """Multiple chat calls accumulate usage correctly.
+
+        Call 1: 100 input * $0.01 + 50 output * $0.02 = $2.00
+        Call 2: 100 input * $0.01 + 50 output * $0.02 = $2.00
+        Total = $4.00, 200 input, 100 output
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        adapter.chat([{"role": "user", "content": "World"}])
+        total = adapter.gather_usage()
+
+        assert total.input_tokens == 200
+        assert total.output_tokens == 100
+        assert total.cost == pytest.approx(4.00)
+
+    def test_pipeline_provider_cost_takes_precedence(self):
+        """Provider-reported cost wins over calculator.
+
+        Usage dict has cost=0.99 (provider-reported).
+        Calculator would compute $2.00.
+        Provider cost should win.
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "test-model": {"input": 0.01, "output": 0.02},
+        })
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150, "cost": 0.99},
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.cost == pytest.approx(0.99)
+
+    def test_pipeline_no_calculator_no_provider_cost(self):
+        """Without calculator or provider cost, cost is None."""
+        from tests.conftest import DummyModelAdapter
+
+        adapter = DummyModelAdapter(
+            model_id="test-model",
+            usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150},
+        )
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.input_tokens == 100
+        assert total.cost is None
+
+    def test_pipeline_with_cached_tokens(self):
+        """Pipeline correctly handles cached tokens in cost calculation.
+
+        input_tokens=1000, cached_input_tokens=800
+        Non-cached: 200 * $0.003 = $0.60
+        Cached: 800 * $0.0003 = $0.24
+        Output: 100 * $0.015 = $1.50
+        Total = $2.34
+        """
+        from tests.conftest import DummyModelAdapter
+
+        calc = StaticPricingCalculator({
+            "claude-sonnet-4-5": {
+                "input": 0.003,
+                "output": 0.015,
+                "cached_input": 0.0003,
+            },
+        })
+        adapter = DummyModelAdapter(
+            model_id="claude-sonnet-4-5",
+            usage={
+                "input_tokens": 1000,
+                "output_tokens": 100,
+                "total_tokens": 1100,
+                "cached_input_tokens": 800,
+            },
+        )
+        adapter._cost_calculator = calc
+
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert total.cached_input_tokens == 800
+        assert total.cost == pytest.approx(2.34)
+
+
+# =============================================================================
+# UsageReporter — Aggregation Correctness
+# =============================================================================
+
+
+class TestUsageReporter:
+    """Verify UsageReporter produces correct aggregations from report dicts."""
+
+    @pytest.fixture
+    def sample_reports(self):
+        """Two tasks, each with a model component."""
+        return [
+            {
+                "task_id": "task_1",
+                "repeat_idx": 0,
+                "usage": {
+                    "models": {
+                        "main_model": {
+                            "cost": 0.10,
+                            "input_tokens": 100,
+                            "output_tokens": 50,
+                            "total_tokens": 150,
+                            "cached_input_tokens": 0,
+                            "cache_creation_input_tokens": 0,
+                            "reasoning_tokens": 0,
+                            "audio_tokens": 0,
+                            "units": {},
+                            "provider": "openai",
+                            "category": "models",
+                            "component_name": "main_model",
+                            "kind": "llm",
+                        }
+                    }
+                },
+            },
+            {
+                "task_id": "task_2",
+                "repeat_idx": 0,
+                "usage": {
+                    "models": {
+                        "main_model": {
+                            "cost": 0.20,
+                            "input_tokens": 200,
+                            "output_tokens": 100,
+                            "total_tokens": 300,
+                            "cached_input_tokens": 50,
+                            "cache_creation_input_tokens": 0,
+                            "reasoning_tokens": 0,
+                            "audio_tokens": 0,
+                            "units": {},
+                            "provider": "openai",
+                            "category": "models",
+                            "component_name": "main_model",
+                            "kind": "llm",
+                        }
+                    }
+                },
+            },
+        ]
+
+    def test_total(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        total = reporter.total()
+
+        assert total.cost == pytest.approx(0.30)
+        assert total.input_tokens == 300
+        assert total.output_tokens == 150
+        assert total.cached_input_tokens == 50
+
+    def test_by_task(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        by_task = reporter.by_task()
+
+        assert len(by_task) == 2
+        assert by_task["task_1"].cost == pytest.approx(0.10)
+        assert by_task["task_1"].input_tokens == 100
+        assert by_task["task_2"].cost == pytest.approx(0.20)
+        assert by_task["task_2"].input_tokens == 200
+
+    def test_by_component(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        by_comp = reporter.by_component()
+
+        assert len(by_comp) == 1
+        assert "models:main_model" in by_comp
+        assert by_comp["models:main_model"].cost == pytest.approx(0.30)
+        assert by_comp["models:main_model"].input_tokens == 300
+
+    def test_summary_structure(self, sample_reports):
+        reporter = UsageReporter.from_reports(sample_reports)
+        summary = reporter.summary()
+
+        assert "total" in summary
+        assert "by_task" in summary
+        assert "by_component" in summary
+        assert summary["total"]["cost"] == pytest.approx(0.30)
+        assert summary["total"]["input_tokens"] == 300
+
+    def test_empty_reports(self):
+        reporter = UsageReporter.from_reports([])
+        total = reporter.total()
+
+        # Empty reports return a plain Usage with no cost
+        assert total.cost is None
+        assert isinstance(total, Usage)
+
+    def test_skips_error_reports(self):
+        reports = [
+            {
+                "task_id": "task_1",
+                "repeat_idx": 0,
+                "usage": {"error": "setup failed"},
+            },
+        ]
+        reporter = UsageReporter.from_reports(reports)
+        total = reporter.total()
+        assert total.cost is None
+        assert isinstance(total, Usage)
diff --git a/tests/test_interface/test_model_integration/test_api_contracts.py b/tests/test_interface/test_model_integration/test_api_contracts.py
index 022d3ec5..b32732b2 100644
--- a/tests/test_interface/test_model_integration/test_api_contracts.py
+++ b/tests/test_interface/test_model_integration/test_api_contracts.py
@@ -580,3 +580,425 @@ def test_tool_call_response(self):
         assert response.usage is not None
         assert response.usage["input_tokens"] == 82
         assert response.usage["output_tokens"] == 18
+
+
+# =============================================================================
+# Usage Extraction Contract Tests
+# =============================================================================
+#
+# These tests verify that each adapter correctly extracts ALL usage fields
+# (including cache tokens, reasoning tokens, provider cost) from realistic
+# API response payloads, and that the cost calculator produces correct costs.
+# =============================================================================
+
+
+# -- OpenAI usage-rich fixture ------------------------------------------------
+
+OPENAI_USAGE_RICH_RESPONSE = {
+    "id": "chatcmpl-usage-test",
+    "object": "chat.completion",
+    "created": 1700000000,
+    "model": "gpt-4o",
+    "system_fingerprint": "fp_usage_test",
+    "choices": [
+        {
+            "index": 0,
+            "message": {
+                "role": "assistant",
+                "content": "Hello!",
+                "refusal": None,
+            },
+            "logprobs": None,
+            "finish_reason": "stop",
+        }
+    ],
+    "usage": {
+        "prompt_tokens": 500,
+        "completion_tokens": 200,
+        "total_tokens": 700,
+        "prompt_tokens_details": {
+            "cached_tokens": 300,
+        },
+        "completion_tokens_details": {
+            "reasoning_tokens": 80,
+            "audio_tokens": 0,
+            "accepted_prediction_tokens": 0,
+            "rejected_prediction_tokens": 0,
+        },
+    },
+}
+
+
+# -- Anthropic usage-rich fixture --------------------------------------------
+
+ANTHROPIC_USAGE_RICH_RESPONSE = {
+    "id": "msg_usage_test",
+    "type": "message",
+    "role": "assistant",
+    "content": [{"type": "text", "text": "Hello!"}],
+    "model": "claude-sonnet-4-5-20250514",
+    "stop_reason": "end_turn",
+    "stop_sequence": None,
+    "usage": {
+        "input_tokens": 1000,
+        "output_tokens": 200,
+        "cache_read_input_tokens": 600,
+        "cache_creation_input_tokens": 100,
+    },
+}
+
+
+# -- Google usage-rich fixture -----------------------------------------------
+
+GOOGLE_USAGE_RICH_RESPONSE = {
+    "candidates": [
+        {
+            "content": {
+                "parts": [{"text": "Hello!"}],
+                "role": "model",
+            },
+            "finishReason": "STOP",
+        }
+    ],
+    "usageMetadata": {
+        "promptTokenCount": 500,
+        "candidatesTokenCount": 200,
+        "totalTokenCount": 700,
+        "thoughtsTokenCount": 120,
+    },
+    "modelVersion": "gemini-2.0-flash-thinking",
+}
+
+
+class TestOpenAIUsageExtraction:
+    """Verify OpenAI adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_cached_and_reasoning_tokens(self):
+        """Cached tokens and reasoning tokens are extracted from nested details."""
+        pytest.importorskip("openai")
+        from openai import OpenAI
+        from maseval.interface.inference.openai import OpenAIModelAdapter
+
+        respx.post("https://api.openai.com/v1/chat/completions").respond(
+            200, json=OPENAI_USAGE_RICH_RESPONSE
+        )
+
+        client = OpenAI(api_key="test-key-not-real")
+        adapter = OpenAIModelAdapter(client=client, model_id="gpt-4o")
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 500
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 700
+        assert response.usage["cached_input_tokens"] == 300
+        assert response.usage["reasoning_tokens"] == 80
+
+    @respx.mock
+    def test_cost_calculation_with_cached_tokens(self):
+        """Full pipeline: OpenAI adapter + StaticPricingCalculator with caching.
+
+        input_tokens=500, cached_input_tokens=300
+        Non-cached: 200 * $2.5e-6 = $0.0005
+        Cached: 300 * $1.25e-6 = $0.000375
+        Output: 200 * $10e-6 = $0.002
+        Total = $0.002875
+        """
+        pytest.importorskip("openai")
+        from openai import OpenAI
+        from maseval.interface.inference.openai import OpenAIModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.post("https://api.openai.com/v1/chat/completions").respond(
+            200, json=OPENAI_USAGE_RICH_RESPONSE
+        )
+
+        calc = StaticPricingCalculator({
+            "gpt-4o": {
+                "input": 2.5e-6,
+                "output": 10e-6,
+                "cached_input": 1.25e-6,
+            },
+        })
+
+        client = OpenAI(api_key="test-key-not-real")
+        adapter = OpenAIModelAdapter(
+            client=client, model_id="gpt-4o", cost_calculator=calc
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.input_tokens == 500
+        assert total.cached_input_tokens == 300
+        assert total.reasoning_tokens == 80
+        assert total.cost == pytest.approx(0.002875)
+
+
+class TestAnthropicUsageExtraction:
+    """Verify Anthropic adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_cache_read_and_creation_tokens(self):
+        """Both cache_read and cache_creation tokens are extracted."""
+        pytest.importorskip("anthropic")
+        from anthropic import Anthropic
+        from maseval.interface.inference.anthropic import AnthropicModelAdapter
+
+        respx.post("https://api.anthropic.com/v1/messages").respond(
+            200, json=ANTHROPIC_USAGE_RICH_RESPONSE
+        )
+
+        client = Anthropic(api_key="test-key-not-real")
+        adapter = AnthropicModelAdapter(
+            client=client, model_id="claude-sonnet-4-5-20250514"
+        )
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 1000
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 1200  # computed by adapter
+        assert response.usage["cached_input_tokens"] == 600
+        assert response.usage["cache_creation_input_tokens"] == 100
+
+    @respx.mock
+    def test_cost_calculation_with_cache_creation(self):
+        """Full pipeline: Anthropic adapter + StaticPricingCalculator with cache creation.
+
+        input_tokens=1000, cached=600, cache_creation=100
+        Non-cached: (1000 - 600 - 100) = 300 * $3e-6 = $0.0009
+        Cached: 600 * $0.3e-6 = $0.00018
+        Cache creation: 100 * $3.75e-6 = $0.000375
+        Output: 200 * $15e-6 = $0.003
+        Total = $0.004455
+        """
+        pytest.importorskip("anthropic")
+        from anthropic import Anthropic
+        from maseval.interface.inference.anthropic import AnthropicModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.post("https://api.anthropic.com/v1/messages").respond(
+            200, json=ANTHROPIC_USAGE_RICH_RESPONSE
+        )
+
+        calc = StaticPricingCalculator({
+            "claude-sonnet-4-5-20250514": {
+                "input": 3e-6,
+                "output": 15e-6,
+                "cached_input": 0.3e-6,
+                "cache_creation_input": 3.75e-6,
+            },
+        })
+
+        client = Anthropic(api_key="test-key-not-real")
+        adapter = AnthropicModelAdapter(
+            client=client,
+            model_id="claude-sonnet-4-5-20250514",
+            cost_calculator=calc,
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cached_input_tokens == 600
+        assert total.cache_creation_input_tokens == 100
+        assert total.cost == pytest.approx(0.004455)
+
+
+class TestGoogleGenAIUsageExtraction:
+    """Verify Google GenAI adapter extracts all usage fields correctly."""
+
+    @respx.mock
+    def test_extracts_thoughts_as_reasoning_tokens(self):
+        """Google's thoughtsTokenCount maps to reasoning_tokens."""
+        pytest.importorskip("google.genai")
+        from google import genai
+        from maseval.interface.inference.google_genai import GoogleGenAIModelAdapter
+
+        respx.route(
+            method="POST",
+            url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*",
+        ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE)
+
+        client = genai.Client(
+            api_key="test-key-not-real",
+            http_options={"api_version": "v1beta"},
+        )
+        adapter = GoogleGenAIModelAdapter(
+            client=client, model_id="gemini-2.0-flash-thinking"
+        )
+        response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 500
+        assert response.usage["output_tokens"] == 200
+        assert response.usage["total_tokens"] == 700
+        assert response.usage["reasoning_tokens"] == 120
+
+    @respx.mock
+    def test_cost_calculation_basic(self):
+        """Full pipeline: Google adapter + StaticPricingCalculator.
+
+        500 input * $0.075e-6 = $0.0000375
+        200 output * $0.3e-6 = $0.00006
+        Total = $0.0000975
+        """
+        pytest.importorskip("google.genai")
+        from google import genai
+        from maseval.interface.inference.google_genai import GoogleGenAIModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        respx.route(
+            method="POST",
+            url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*",
+        ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE)
+
+        calc = StaticPricingCalculator({
+            "gemini-2.0-flash-thinking": {
+                "input": 0.075e-6,
+                "output": 0.3e-6,
+            },
+        })
+
+        client = genai.Client(
+            api_key="test-key-not-real",
+            http_options={"api_version": "v1beta"},
+        )
+        adapter = GoogleGenAIModelAdapter(
+            client=client,
+            model_id="gemini-2.0-flash-thinking",
+            cost_calculator=calc,
+        )
+        adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.reasoning_tokens == 120
+        assert total.cost == pytest.approx(0.0000975)
+
+
+class TestLiteLLMUsageExtraction:
+    """Verify LiteLLM adapter extracts all usage fields correctly."""
+
+    def test_extracts_cached_and_cache_creation_tokens(self):
+        """LiteLLM's prompt_tokens_details with cached_tokens and cache_creation_tokens."""
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+
+        mock_prompt_details = MagicMock()
+        mock_prompt_details.cached_tokens = 400
+        mock_prompt_details.cache_creation_tokens = 50
+
+        mock_completion_details = MagicMock()
+        mock_completion_details.reasoning_tokens = 60
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 800
+        mock_usage.completion_tokens = 150
+        mock_usage.total_tokens = 950
+        mock_usage.prompt_tokens_details = mock_prompt_details
+        mock_usage.completion_tokens_details = mock_completion_details
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "claude-sonnet-4-5-20250514"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {"response_cost": 0.0042}
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(model_id="claude-sonnet-4-5-20250514")
+            response = adapter.chat([{"role": "user", "content": "Hello"}])
+
+        assert response.usage["input_tokens"] == 800
+        assert response.usage["output_tokens"] == 150
+        assert response.usage["total_tokens"] == 950
+        assert response.usage["cached_input_tokens"] == 400
+        assert response.usage["cache_creation_input_tokens"] == 50
+        assert response.usage["reasoning_tokens"] == 60
+
+    def test_provider_cost_from_hidden_params(self):
+        """LiteLLM's _hidden_params.response_cost is extracted as provider cost.
+
+        Provider cost ($0.0042) should take precedence over calculator.
+        """
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 100
+        mock_usage.completion_tokens = 50
+        mock_usage.total_tokens = 150
+        mock_usage.prompt_tokens_details = None
+        mock_usage.completion_tokens_details = None
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "gpt-4o"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {"response_cost": 0.0042}
+
+        # Calculator would compute a different cost — provider should win
+        calc = StaticPricingCalculator({
+            "gpt-4o": {"input": 0.01, "output": 0.02},
+        })
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(
+                model_id="gpt-4o", cost_calculator=calc
+            )
+            adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cost == pytest.approx(0.0042)
+
+    def test_calculator_used_when_no_provider_cost(self):
+        """When _hidden_params has no cost, calculator is used.
+
+        100 input * $0.01 + 50 output * $0.02 = $2.00
+        """
+        pytest.importorskip("litellm")
+        from unittest.mock import patch, MagicMock
+        from maseval.interface.inference.litellm import LiteLLMModelAdapter
+        from maseval.core.usage import StaticPricingCalculator, TokenUsage
+
+        mock_usage = MagicMock()
+        mock_usage.prompt_tokens = 100
+        mock_usage.completion_tokens = 50
+        mock_usage.total_tokens = 150
+        mock_usage.prompt_tokens_details = None
+        mock_usage.completion_tokens_details = None
+
+        mock_response = MagicMock()
+        mock_response.choices = [MagicMock()]
+        mock_response.choices[0].message.content = "Hello!"
+        mock_response.choices[0].message.role = "assistant"
+        mock_response.choices[0].message.tool_calls = None
+        mock_response.choices[0].finish_reason = "stop"
+        mock_response.model = "gpt-4o"
+        mock_response.usage = mock_usage
+        mock_response._hidden_params = {}
+
+        calc = StaticPricingCalculator({
+            "gpt-4o": {"input": 0.01, "output": 0.02},
+        })
+
+        with patch("litellm.completion", return_value=mock_response):
+            adapter = LiteLLMModelAdapter(
+                model_id="gpt-4o", cost_calculator=calc
+            )
+            adapter.chat([{"role": "user", "content": "Hello"}])
+        total = adapter.gather_usage()
+
+        assert isinstance(total, TokenUsage)
+        assert total.cost == pytest.approx(2.00)

From 60f205aa0eecb1dd9401df6f83720e6502c91ea4 Mon Sep 17 00:00:00 2001
From: cemde
Date: Fri, 13 Mar 2026 17:56:01 +0100
Subject: [PATCH 07/19] upgraded testing

---
 maseval/core/registry.py                    |   9 +-
 .../test_benchmark/test_usage_collection.py |  97 ++++++++
 tests/test_core/test_registry.py            | 227 ++++++++++++++++++
 3 files changed, 331 insertions(+), 2 deletions(-)
 create mode 100644 tests/test_core/test_benchmark/test_usage_collection.py

diff --git a/maseval/core/registry.py b/maseval/core/registry.py
index e34fc972..cf2193e8 100644
--- a/maseval/core/registry.py
+++ b/maseval/core/registry.py
@@ -319,9 +319,14 @@ def collect_usage(self) -> Dict[str, Any]:
                 usage[category] = {}
             usage[category][comp_name] = usage_dict
 
-        # Accumulate into persistent aggregates (thread-safe)
+        # Accumulate into persistent aggregates (thread-safe).
+        # _usage_total starts as Usage(cost=None); adding to it would
+        # poison the cost (None + X = None). Assign directly on first use.
         with self._usage_lock:
-            self._usage_total = self._usage_total + component_usage
+            if self._usage_total.cost is None and not self._usage_total.units:
+                self._usage_total = component_usage
+            else:
+                self._usage_total = self._usage_total + component_usage
             if key in self._usage_by_component:
                 self._usage_by_component[key] = self._usage_by_component[key] + component_usage
             else:
diff --git a/tests/test_core/test_benchmark/test_usage_collection.py b/tests/test_core/test_benchmark/test_usage_collection.py
new file mode 100644
index 00000000..fdf2a74e
--- /dev/null
+++ b/tests/test_core/test_benchmark/test_usage_collection.py
@@ -0,0 +1,97 @@
+"""Test usage collection through the benchmark execution loop.
+
+These tests verify that benchmark.run() collects usage from registered
+model adapters and includes it in report dicts.
+"""
+
+import pytest
+from maseval import TaskQueue
+
+
+@pytest.mark.core
+class TestBenchmarkUsageCollection:
+    """Tests for usage collection during benchmark execution."""
+
+    def test_usage_in_report(self):
+        """Benchmark run includes a 'usage' key in each report."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        assert "usage" in reports[0]
+        usage = reports[0]["usage"]
+        assert "metadata" in usage
+        assert "models" in usage
+        assert "agents" in usage
+
+    def test_usage_has_correct_structure(self):
+        """Usage dict has the expected category keys and metadata."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        usage = reports[0]["usage"]
+        assert "metadata" in usage
+        assert "total_components" in usage["metadata"]
+        assert "timestamp" in usage["metadata"]
+
+    def test_model_with_usage_appears_in_report(self):
+        """A benchmark that overrides get_model_adapter still yields a usage section."""
+        from conftest import DummyModelAdapter, DummyBenchmark
+
+        class UsageBenchmark(DummyBenchmark):
+            def get_model_adapter(self, model_id, **kwargs):
+                return DummyModelAdapter(
+                    model_id=model_id,
+                    usage={
+                        "input_tokens": 100,
+                        "output_tokens": 50,
+                        "total_tokens": 150,
+                    },
+                )
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = UsageBenchmark()
+
+        reports = benchmark.run(tasks, agent_data={"model": "test"})
+
+        # The DummyBenchmark doesn't register a model via register(), so
+        # the model's usage won't appear unless the benchmark hooks it up.
+        # This test verifies the usage structure exists.
+        assert "usage" in reports[0]
+
+    def test_usage_reported_for_each_task(self):
+        """Every task in a multi-task run produces a report containing usage."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([
+            {"query": "Task 1", "environment_data": {}},
+            {"query": "Task 2", "environment_data": {}},
+        ])
+        benchmark = DummyBenchmark()
+        benchmark.run(tasks, agent_data={"model": "test"})
+
+        # Both tasks should have produced reports with usage
+        assert len(benchmark.reports) == 2
+        assert "usage" in benchmark.reports[0]
+        assert "usage" in benchmark.reports[1]
+
+    def test_usage_property_returns_total(self):
+        """benchmark.usage returns the running total."""
+        from conftest import DummyBenchmark
+
+        tasks = TaskQueue.from_list([{"query": "Test", "environment_data": {}}])
+        benchmark = DummyBenchmark()
+        benchmark.run(tasks, agent_data={"model": "test"})
+
+        # usage property should return a Usage object (even if empty)
+        total = benchmark.usage
+        assert total is not None
+        # cost may be None if DummyModelAdapter doesn't provide usage
diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py
index 59408775..c62db645 100644
--- a/tests/test_core/test_registry.py
+++ b/tests/test_core/test_registry.py
@@ -256,3 +256,230 @@ def worker(worker_id: int):
     for worker_id, traces in results.items():
         assert f"agent_{worker_id}" in traces["agents"]
         assert len(traces["agents"]) == 1
+
+
+# ==================== Usage Tracking Tests ====================
+
+
+from maseval.core.usage import UsageTrackableMixin
+
+
+class UsageAwareComponent(TraceableMixin, UsageTrackableMixin):
+    """Component with both tracing and usage tracking."""
+
+    def __init__(self, cost: float = 0.0, input_tokens: int = 0, output_tokens: int = 0):
+        TraceableMixin.__init__(self)
+        self._cost = cost
+        self._input_tokens = input_tokens
+        self._output_tokens = output_tokens
+
+    def gather_traces(self) -> Dict[str, Any]:
+        return {"traced": True}
+
+    def gather_usage(self):
+        from maseval.core.usage import TokenUsage
+        return TokenUsage(
+            cost=self._cost,
+            input_tokens=self._input_tokens,
+            output_tokens=self._output_tokens,
+            total_tokens=self._input_tokens + self._output_tokens,
+        )
+
+
+class BrokenUsageComponent(TraceableMixin, UsageTrackableMixin):
+    """Component whose gather_usage raises an exception."""
+
+    def __init__(self):
+        TraceableMixin.__init__(self)
+
+    def gather_traces(self) -> Dict[str, Any]:
+        return {}
+
+    def gather_usage(self):
+        raise RuntimeError("Usage collection failed")
+
+
+@pytest.mark.core
+class TestRegistryUsageCollection:
+    """Tests for usage tracking through the component registry."""
+
+    def test_register_usage_trackable_component(self):
+        """UsageTrackableMixin component is registered in the usage registry."""
+        registry = ComponentRegistry()
+        component = UsageAwareComponent(cost=0.05, input_tokens=100, output_tokens=50)
+
+        registry.register("models", "main_model", component)
+
+        assert "models:main_model" in registry._usage_registry
+        assert registry._usage_registry["models:main_model"] is component
+
+    def test_non_usage_component_not_in_usage_registry(self):
+        """Components without UsageTrackableMixin are NOT in the usage registry."""
+        registry = ComponentRegistry()
+        component = MockTraceableComponent("test")
+
+        registry.register("agents", "my_agent", component)
+
+        assert "agents:my_agent" in registry._trace_registry
+        assert "agents:my_agent" not in registry._usage_registry
+
+    def test_collect_usage_basic(self):
+        """collect_usage returns structured dict with usage from registered components."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200)
+        registry.register("models", "main_model", model)
+
+        usage = registry.collect_usage()
+
+        assert "metadata" in usage
+        assert "models" in usage
+        assert "main_model" in usage["models"]
+
+        model_usage = usage["models"]["main_model"]
+        assert model_usage["cost"] == 0.10
+        assert model_usage["input_tokens"] == 500
+        assert model_usage["output_tokens"] == 200
+        assert model_usage["total_tokens"] == 700
+
+    def test_collect_usage_multiple_components(self):
+        """Multiple components across categories are all collected."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200)
+        tool = UsageAwareComponent(cost=0.05, input_tokens=0, output_tokens=0)
+
+        registry.register("models", "main_model", model)
+        registry.register("tools", "search_tool", tool)
+
+        usage = registry.collect_usage()
+
+        assert "main_model" in usage["models"]
+        assert "search_tool" in usage["tools"]
+        assert usage["models"]["main_model"]["cost"] == 0.10
+        assert usage["tools"]["search_tool"]["cost"] == 0.05
+
+    def test_collect_usage_injects_grouping_fields(self):
+        """Registry injects category and component_name into usage records."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+
+        usage = registry.collect_usage()
+
+        model_usage = usage["models"]["main_model"]
+        assert model_usage["category"] == "models"
+        assert model_usage["component_name"] == "main_model"
+
+    def test_total_usage_accumulates(self):
+        """total_usage property reflects accumulated usage across collect_usage calls."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+
+        # First collection
+        registry.collect_usage()
+        total1 = registry.total_usage
+        assert total1.cost == pytest.approx(0.10)
+
+        # Clear and re-register (simulates next repetition)
+        registry.clear()
+        model2 = UsageAwareComponent(cost=0.20, input_tokens=200, output_tokens=100)
+        registry.register("models", "main_model", model2)
+
+        # Second collection
+        registry.collect_usage()
+        total2 = registry.total_usage
+        assert total2.cost == pytest.approx(0.30)
+
+    def test_usage_by_component_accumulates(self):
+        """usage_by_component accumulates per key across repetitions."""
+        registry = ComponentRegistry()
+        model = UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50)
+        registry.register("models", "main_model", model)
+        registry.collect_usage()
+
+        # Clear and re-register for second repetition
+        registry.clear()
+        model2 = UsageAwareComponent(cost=0.20, input_tokens=200, output_tokens=100)
+        registry.register("models", "main_model", model2)
+        registry.collect_usage()
+
+        by_comp = registry.usage_by_component
+        assert "models:main_model" in by_comp
+
+        total = by_comp["models:main_model"]
+        assert total.input_tokens == 300
+        assert total.output_tokens == 150
+        assert total.cost == pytest.approx(0.30)
+
+    def test_usage_persists_across_clear(self):
+        """clear() does NOT reset total_usage or usage_by_component."""
+        registry = ComponentRegistry()
UsageAwareComponent(cost=0.10, input_tokens=100, output_tokens=50) + registry.register("models", "main_model", model) + registry.collect_usage() + + # Clear only removes per-repetition state + registry.clear() + + assert registry.total_usage.cost == pytest.approx(0.10) + assert "models:main_model" in registry.usage_by_component + + def test_collect_usage_handles_error_gracefully(self): + """If gather_usage raises, the error is captured in the usage dict.""" + registry = ComponentRegistry() + broken = BrokenUsageComponent() + registry.register("models", "bad_model", broken) + + usage = registry.collect_usage() + + assert "bad_model" in usage["models"] + assert "error" in usage["models"]["bad_model"] + assert "RuntimeError" in usage["models"]["bad_model"]["error_type"] + + def test_collect_usage_empty_registry(self): + """collect_usage with no components returns empty structure.""" + registry = ComponentRegistry() + usage = registry.collect_usage() + + assert usage["metadata"]["total_components"] == 0 + assert usage["models"] == {} + assert usage["agents"] == {} From 742ffb5bfef2a9a6ce9e39fd9cb32ed0e532ad9b Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 18:13:50 +0100 Subject: [PATCH 08/19] upgraded testing --- tests/test_core/test_usage.py | 228 ++++++++++++++++++++++++++++++++++ 1 file changed, 228 insertions(+) diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index fa6da9a6..ca9b551d 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -679,3 +679,231 @@ def test_skips_error_reports(self): total = reporter.total() assert total.cost is None assert isinstance(total, Usage) + + def test_by_task_accumulates_repeats(self): + """by_task sums usage when a task_id appears in multiple reports.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "models": { + "m": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + 
"cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + } + }, + }, + { + "task_id": "task_1", + "repeat_idx": 1, + "usage": { + "models": { + "m": { + "cost": 0.20, + "input_tokens": 200, + "output_tokens": 100, + "total_tokens": 300, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + by_task = reporter.by_task() + + assert len(by_task) == 1 + assert by_task["task_1"].cost == pytest.approx(0.30) + assert by_task["task_1"].input_tokens == 300 + + def test_plain_usage_fallback(self): + """_usage_from_dict returns plain Usage when no token fields present.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "tools": { + "my_tool": { + "cost": 0.05, + "units": {"api_calls": 3}, + "provider": None, + "category": "tools", + "component_name": "my_tool", + "kind": "tool", + } + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.05) + assert isinstance(total, Usage) + assert not isinstance(total, TokenUsage) + + def test_metadata_key_skipped(self): + """The 'metadata' key in usage dicts is not treated as a component.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "metadata": {"timestamp": "2025-01-01", "total_components": 1}, + "models": { + "m": { + "cost": 0.10, + "input_tokens": 50, + "output_tokens": 25, + "total_tokens": 75, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + }, + }, + }, + ] + reporter = 
UsageReporter.from_reports(reports) + total = reporter.total() + + # Only the model's cost, metadata should not contribute + assert total.cost == pytest.approx(0.10) + assert total.input_tokens == 50 + + def test_skips_component_with_error(self): + """Components with error dicts are skipped, others still counted.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "models": { + "good_model": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "good_model", + "kind": "llm", + }, + "bad_model": { + "error": "Failed to gather usage", + "error_type": "RuntimeError", + }, + } + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.10) + assert total.input_tokens == 100 + + def test_environment_direct_usage(self): + """Environment/user usage (direct dicts with 'cost') are parsed.""" + reports = [ + { + "task_id": "task_1", + "repeat_idx": 0, + "usage": { + "environment": { + "cost": 0.05, + "units": {"steps": 10}, + "provider": None, + "category": "environment", + "component_name": "env", + "kind": "env", + }, + "models": { + "m": { + "cost": 0.10, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + "cached_input_tokens": 0, + "cache_creation_input_tokens": 0, + "reasoning_tokens": 0, + "audio_tokens": 0, + "units": {}, + "provider": None, + "category": "models", + "component_name": "m", + "kind": "llm", + } + }, + }, + }, + ] + reporter = UsageReporter.from_reports(reports) + total = reporter.total() + + assert total.cost == pytest.approx(0.15) + + +# ============================================================================= +# StaticPricingCalculator — Utility Methods +# 
============================================================================= + + +class TestStaticPricingCalculatorUtilities: + """Tests for add_model, models property, and gather_config.""" + + def test_add_model(self): + calc = StaticPricingCalculator({}) + calc.add_model("new-model", {"input": 0.01, "output": 0.02}) + + usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + cost = calc.calculate_cost(usage, "new-model") + assert cost == pytest.approx(2.00) + + def test_models_property(self): + calc = StaticPricingCalculator({ + "model-a": {"input": 0.01, "output": 0.02}, + "model-b": {"input": 0.001, "output": 0.002}, + }) + assert sorted(calc.models) == ["model-a", "model-b"] + + def test_gather_config(self): + pricing = {"model-a": {"input": 0.01, "output": 0.02}} + calc = StaticPricingCalculator(pricing) + config = calc.gather_config() + + assert config["type"] == "StaticPricingCalculator" + assert config["pricing"] == pricing From 6af219b9a0ead64b2b6b9a5d91bdb732a8e35d52 Mon Sep 17 00:00:00 2001 From: cemde Date: Fri, 13 Mar 2026 18:48:02 +0100 Subject: [PATCH 09/19] fixed linting and tests --- maseval/core/model.py | 2 +- tests/conftest.py | 2 +- .../test_benchmark/test_usage_collection.py | 11 +- tests/test_core/test_registry.py | 11 +- tests/test_core/test_usage.py | 160 +++++++++++------- .../test_api_contracts.py | 104 ++++++------ 6 files changed, 162 insertions(+), 128 deletions(-) diff --git a/maseval/core/model.py b/maseval/core/model.py index 4c4a7f6a..e8bb0a3f 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -100,7 +100,7 @@ class ChatResponse: content: Optional[str] = None tool_calls: Optional[List[Dict[str, Any]]] = None role: str = "assistant" - usage: Optional[Dict[str, int]] = None + usage: Optional[Dict[str, Any]] = None model: Optional[str] = None stop_reason: Optional[str] = None diff --git a/tests/conftest.py b/tests/conftest.py index 6bd1cc54..00407458 100644 --- a/tests/conftest.py +++ 
b/tests/conftest.py @@ -48,7 +48,7 @@ def __init__( model_id: str = "test-model", responses: Optional[List[Optional[str]]] = None, tool_calls: Optional[List[Optional[List[Dict[str, Any]]]]] = None, - usage: Optional[Dict[str, int]] = None, + usage: Optional[Dict[str, Any]] = None, stop_reason: Optional[str] = None, seed: Optional[int] = None, ): diff --git a/tests/test_core/test_benchmark/test_usage_collection.py b/tests/test_core/test_benchmark/test_usage_collection.py index fdf2a74e..ac42a92a 100644 --- a/tests/test_core/test_benchmark/test_usage_collection.py +++ b/tests/test_core/test_benchmark/test_usage_collection.py @@ -6,7 +6,6 @@ import pytest from maseval import TaskQueue -from maseval.core.usage import StaticPricingCalculator @pytest.mark.core @@ -71,10 +70,12 @@ def test_usage_persists_across_task_repetitions(self): """Benchmark.usage accumulates across multiple tasks.""" from conftest import DummyBenchmark - tasks = TaskQueue.from_list([ - {"query": "Task 1", "environment_data": {}}, - {"query": "Task 2", "environment_data": {}}, - ]) + tasks = TaskQueue.from_list( + [ + {"query": "Task 1", "environment_data": {}}, + {"query": "Task 2", "environment_data": {}}, + ] + ) benchmark = DummyBenchmark() benchmark.run(tasks, agent_data={"model": "test"}) diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py index c62db645..17b30c5b 100644 --- a/tests/test_core/test_registry.py +++ b/tests/test_core/test_registry.py @@ -13,6 +13,7 @@ from maseval.core.registry import ComponentRegistry from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin +from maseval.core.usage import UsageTrackableMixin # ==================== Test Components ==================== @@ -276,6 +277,7 @@ def gather_traces(self) -> Dict[str, Any]: def gather_usage(self): from maseval.core.usage import TokenUsage + return TokenUsage( cost=self._cost, input_tokens=self._input_tokens, @@ -297,10 +299,6 @@ def gather_usage(self): 
raise RuntimeError("Usage collection failed") -# Ensure MockUsageComponent also inherits UsageTrackableMixin -from maseval.core.usage import UsageTrackableMixin - - class UsageAwareComponent(TraceableMixin, UsageTrackableMixin): """Component with both tracing and usage tracking.""" @@ -315,6 +313,7 @@ def gather_traces(self) -> Dict[str, Any]: def gather_usage(self): from maseval.core.usage import TokenUsage + return TokenUsage( cost=self._cost, input_tokens=self._input_tokens, @@ -362,7 +361,6 @@ def test_non_usage_component_not_in_usage_registry(self): def test_collect_usage_basic(self): """collect_usage returns structured dict with usage from registered components.""" - from maseval.core.usage import TokenUsage registry = ComponentRegistry() model = UsageAwareComponent(cost=0.10, input_tokens=500, output_tokens=200) @@ -445,7 +443,10 @@ def test_usage_by_component_accumulates(self): by_comp = registry.usage_by_component assert "models:main_model" in by_comp + from maseval.core.usage import TokenUsage + total = by_comp["models:main_model"] + assert isinstance(total, TokenUsage) assert total.input_tokens == 300 assert total.output_tokens == 150 assert total.cost == pytest.approx(0.30) diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index ca9b551d..5348585b 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -31,9 +31,7 @@ class TestTokenUsageConstruction: def test_from_chat_response_basic(self): """Minimal usage dict maps to the right fields.""" - tu = TokenUsage.from_chat_response_usage( - {"input_tokens": 100, "output_tokens": 50, "total_tokens": 150} - ) + tu = TokenUsage.from_chat_response_usage({"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}) assert tu.input_tokens == 100 assert tu.output_tokens == 50 assert tu.total_tokens == 150 @@ -158,6 +156,7 @@ def test_none_cost_propagates(self): assert total.cost is None # Token fields still sum correctly + assert isinstance(total, TokenUsage) 
assert total.input_tokens == 300 assert total.output_tokens == 80 @@ -206,9 +205,11 @@ def test_basic_cost(self): 50 output * $0.02 = $1.00 Total = $2.00 """ - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) cost = calc.calculate_cost(usage, "test-model") @@ -223,13 +224,15 @@ def test_cached_input_tokens(self): Output: 100 * $0.015 = $1.50 Total = $2.34 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + } + ) usage = TokenUsage(input_tokens=1000, output_tokens=100, total_tokens=1100, cached_input_tokens=800) cost = calc.calculate_cost(usage, "claude-sonnet-4-5") @@ -245,14 +248,16 @@ def test_cache_creation_tokens(self): Output: 100 * $0.015 = $1.50 Total = $3.03 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - "cache_creation_input": 0.00375, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + "cache_creation_input": 0.00375, + }, + } + ) usage = TokenUsage( input_tokens=1000, output_tokens=100, @@ -273,9 +278,11 @@ def test_cache_creation_defaults_to_input_rate(self): Output: 100 * $0.015 = $1.50 Total = $4.50 """ - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": {"input": 0.003, "output": 0.015}, + } + ) usage = TokenUsage( input_tokens=1000, output_tokens=100, @@ -307,9 +314,11 @@ def test_real_world_anthropic_pricing(self): 200 output * $0.000015 = $0.003 Total = $0.0045 
""" - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": {"input": 3e-6, "output": 15e-6}, + } + ) usage = TokenUsage(input_tokens=500, output_tokens=200, total_tokens=700) cost = calc.calculate_cost(usage, "claude-sonnet-4-5") @@ -322,9 +331,11 @@ def test_real_world_openai_pricing(self): 500 output * $0.000010 = $0.005 Total = $0.0075 """ - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 2.5e-6, "output": 10e-6}, + } + ) usage = TokenUsage(input_tokens=1000, output_tokens=500, total_tokens=1500) cost = calc.calculate_cost(usage, "gpt-4o") @@ -341,7 +352,7 @@ class TestLiteLLMCostCalculator: def test_passes_cache_tokens_to_cost_per_token(self): """Verify cache_read and cache_creation tokens are forwarded.""" - litellm = pytest.importorskip("litellm") + pytest.importorskip("litellm") from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator @@ -372,9 +383,11 @@ def test_model_id_map_remapping(self): from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator - calc = LiteLLMCostCalculator(model_id_map={ - "gemini-2.0-flash": "gemini/gemini-2.0-flash", - }) + calc = LiteLLMCostCalculator( + model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) with patch("litellm.cost_per_token", return_value=(0.001, 0.002)) as mock_cpt: @@ -394,12 +407,14 @@ def test_custom_pricing_overrides_litellm(self): from unittest.mock import patch from maseval.interface.usage import LiteLLMCostCalculator - calc = LiteLLMCostCalculator(custom_pricing={ - "my-model": { - "input_cost_per_token": 0.0001, - "output_cost_per_token": 0.0002, - }, - }) + calc = LiteLLMCostCalculator( + custom_pricing={ + "my-model": { + "input_cost_per_token": 
0.0001, + "output_cost_per_token": 0.0002, + }, + } + ) usage = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) with patch("litellm.cost_per_token") as mock_cpt: @@ -444,9 +459,11 @@ def test_basic_pipeline(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}, @@ -470,9 +487,11 @@ def test_pipeline_multiple_calls_accumulate(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150}, @@ -483,6 +502,7 @@ def test_pipeline_multiple_calls_accumulate(self): adapter.chat([{"role": "user", "content": "World"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.input_tokens == 200 assert total.output_tokens == 100 assert total.cost == pytest.approx(4.00) @@ -496,9 +516,11 @@ def test_pipeline_provider_cost_takes_precedence(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "test-model": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "test-model": {"input": 0.01, "output": 0.02}, + } + ) adapter = DummyModelAdapter( model_id="test-model", usage={"input_tokens": 100, "output_tokens": 50, "total_tokens": 150, "cost": 0.99}, @@ -522,6 +544,7 @@ def test_pipeline_no_calculator_no_provider_cost(self): adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.input_tokens == 100 
assert total.cost is None @@ -536,13 +559,15 @@ def test_pipeline_with_cached_tokens(self): """ from tests.conftest import DummyModelAdapter - calc = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.003, - "output": 0.015, - "cached_input": 0.0003, - }, - }) + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5": { + "input": 0.003, + "output": 0.015, + "cached_input": 0.0003, + }, + } + ) adapter = DummyModelAdapter( model_id="claude-sonnet-4-5", usage={ @@ -557,6 +582,7 @@ def test_pipeline_with_cached_tokens(self): adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() + assert isinstance(total, TokenUsage) assert total.cached_input_tokens == 800 assert total.cost == pytest.approx(2.34) @@ -625,6 +651,7 @@ def test_total(self, sample_reports): reporter = UsageReporter.from_reports(sample_reports) total = reporter.total() + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.30) assert total.input_tokens == 300 assert total.output_tokens == 150 @@ -635,8 +662,10 @@ def test_by_task(self, sample_reports): by_task = reporter.by_task() assert len(by_task) == 2 + assert isinstance(by_task["task_1"], TokenUsage) assert by_task["task_1"].cost == pytest.approx(0.10) assert by_task["task_1"].input_tokens == 100 + assert isinstance(by_task["task_2"], TokenUsage) assert by_task["task_2"].cost == pytest.approx(0.20) assert by_task["task_2"].input_tokens == 200 @@ -646,8 +675,10 @@ def test_by_component(self, sample_reports): assert len(by_comp) == 1 assert "models:main_model" in by_comp - assert by_comp["models:main_model"].cost == pytest.approx(0.30) - assert by_comp["models:main_model"].input_tokens == 300 + total = by_comp["models:main_model"] + assert isinstance(total, TokenUsage) + assert total.cost == pytest.approx(0.30) + assert total.input_tokens == 300 def test_summary_structure(self, sample_reports): reporter = UsageReporter.from_reports(sample_reports) @@ -734,6 +765,7 @@ def 
test_by_task_accumulates_repeats(self): by_task = reporter.by_task() assert len(by_task) == 1 + assert isinstance(by_task["task_1"], TokenUsage) assert by_task["task_1"].cost == pytest.approx(0.30) assert by_task["task_1"].input_tokens == 300 @@ -796,6 +828,7 @@ def test_metadata_key_skipped(self): total = reporter.total() # Only the model's cost, metadata should not contribute + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.10) assert total.input_tokens == 50 @@ -833,6 +866,7 @@ def test_skips_component_with_error(self): reporter = UsageReporter.from_reports(reports) total = reporter.total() + assert isinstance(total, TokenUsage) assert total.cost == pytest.approx(0.10) assert total.input_tokens == 100 @@ -894,10 +928,12 @@ def test_add_model(self): assert cost == pytest.approx(2.00) def test_models_property(self): - calc = StaticPricingCalculator({ - "model-a": {"input": 0.01, "output": 0.02}, - "model-b": {"input": 0.001, "output": 0.002}, - }) + calc = StaticPricingCalculator( + { + "model-a": {"input": 0.01, "output": 0.02}, + "model-b": {"input": 0.001, "output": 0.002}, + } + ) assert sorted(calc.models) == ["model-a", "model-b"] def test_gather_config(self): diff --git a/tests/test_interface/test_model_integration/test_api_contracts.py b/tests/test_interface/test_model_integration/test_api_contracts.py index b32732b2..088d7057 100644 --- a/tests/test_interface/test_model_integration/test_api_contracts.py +++ b/tests/test_interface/test_model_integration/test_api_contracts.py @@ -680,14 +680,13 @@ def test_extracts_cached_and_reasoning_tokens(self): from openai import OpenAI from maseval.interface.inference.openai import OpenAIModelAdapter - respx.post("https://api.openai.com/v1/chat/completions").respond( - 200, json=OPENAI_USAGE_RICH_RESPONSE - ) + respx.post("https://api.openai.com/v1/chat/completions").respond(200, json=OPENAI_USAGE_RICH_RESPONSE) client = OpenAI(api_key="test-key-not-real") adapter = 
OpenAIModelAdapter(client=client, model_id="gpt-4o") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 500 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 700 @@ -709,22 +708,20 @@ def test_cost_calculation_with_cached_tokens(self): from maseval.interface.inference.openai import OpenAIModelAdapter from maseval.core.usage import StaticPricingCalculator, TokenUsage - respx.post("https://api.openai.com/v1/chat/completions").respond( - 200, json=OPENAI_USAGE_RICH_RESPONSE - ) + respx.post("https://api.openai.com/v1/chat/completions").respond(200, json=OPENAI_USAGE_RICH_RESPONSE) - calc = StaticPricingCalculator({ - "gpt-4o": { - "input": 2.5e-6, - "output": 10e-6, - "cached_input": 1.25e-6, - }, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": { + "input": 2.5e-6, + "output": 10e-6, + "cached_input": 1.25e-6, + }, + } + ) client = OpenAI(api_key="test-key-not-real") - adapter = OpenAIModelAdapter( - client=client, model_id="gpt-4o", cost_calculator=calc - ) + adapter = OpenAIModelAdapter(client=client, model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() @@ -745,16 +742,13 @@ def test_extracts_cache_read_and_creation_tokens(self): from anthropic import Anthropic from maseval.interface.inference.anthropic import AnthropicModelAdapter - respx.post("https://api.anthropic.com/v1/messages").respond( - 200, json=ANTHROPIC_USAGE_RICH_RESPONSE - ) + respx.post("https://api.anthropic.com/v1/messages").respond(200, json=ANTHROPIC_USAGE_RICH_RESPONSE) client = Anthropic(api_key="test-key-not-real") - adapter = AnthropicModelAdapter( - client=client, model_id="claude-sonnet-4-5-20250514" - ) + adapter = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5-20250514") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None 
assert response.usage["input_tokens"] == 1000 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 1200 # computed by adapter @@ -777,19 +771,19 @@ def test_cost_calculation_with_cache_creation(self): from maseval.interface.inference.anthropic import AnthropicModelAdapter from maseval.core.usage import StaticPricingCalculator, TokenUsage - respx.post("https://api.anthropic.com/v1/messages").respond( - 200, json=ANTHROPIC_USAGE_RICH_RESPONSE + respx.post("https://api.anthropic.com/v1/messages").respond(200, json=ANTHROPIC_USAGE_RICH_RESPONSE) + + calc = StaticPricingCalculator( + { + "claude-sonnet-4-5-20250514": { + "input": 3e-6, + "output": 15e-6, + "cached_input": 0.3e-6, + "cache_creation_input": 3.75e-6, + }, + } ) - calc = StaticPricingCalculator({ - "claude-sonnet-4-5-20250514": { - "input": 3e-6, - "output": 15e-6, - "cached_input": 0.3e-6, - "cache_creation_input": 3.75e-6, - }, - }) - client = Anthropic(api_key="test-key-not-real") adapter = AnthropicModelAdapter( client=client, @@ -824,11 +818,10 @@ def test_extracts_thoughts_as_reasoning_tokens(self): api_key="test-key-not-real", http_options={"api_version": "v1beta"}, ) - adapter = GoogleGenAIModelAdapter( - client=client, model_id="gemini-2.0-flash-thinking" - ) + adapter = GoogleGenAIModelAdapter(client=client, model_id="gemini-2.0-flash-thinking") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 500 assert response.usage["output_tokens"] == 200 assert response.usage["total_tokens"] == 700 @@ -852,12 +845,14 @@ def test_cost_calculation_basic(self): url__regex=r".*generativelanguage\.googleapis\.com.*models.*generateContent.*", ).respond(200, json=GOOGLE_USAGE_RICH_RESPONSE) - calc = StaticPricingCalculator({ - "gemini-2.0-flash-thinking": { - "input": 0.075e-6, - "output": 0.3e-6, - }, - }) + calc = StaticPricingCalculator( + { + "gemini-2.0-flash-thinking": { + "input": 
0.075e-6, + "output": 0.3e-6, + }, + } + ) client = genai.Client( api_key="test-key-not-real", @@ -913,6 +908,7 @@ def test_extracts_cached_and_cache_creation_tokens(self): adapter = LiteLLMModelAdapter(model_id="claude-sonnet-4-5-20250514") response = adapter.chat([{"role": "user", "content": "Hello"}]) + assert response.usage is not None assert response.usage["input_tokens"] == 800 assert response.usage["output_tokens"] == 150 assert response.usage["total_tokens"] == 950 @@ -948,14 +944,14 @@ def test_provider_cost_from_hidden_params(self): mock_response._hidden_params = {"response_cost": 0.0042} # Calculator would compute a different cost — provider should win - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 0.01, "output": 0.02}, + } + ) with patch("litellm.completion", return_value=mock_response): - adapter = LiteLLMModelAdapter( - model_id="gpt-4o", cost_calculator=calc - ) + adapter = LiteLLMModelAdapter(model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() @@ -989,14 +985,14 @@ def test_calculator_used_when_no_provider_cost(self): mock_response.usage = mock_usage mock_response._hidden_params = {} - calc = StaticPricingCalculator({ - "gpt-4o": {"input": 0.01, "output": 0.02}, - }) + calc = StaticPricingCalculator( + { + "gpt-4o": {"input": 0.01, "output": 0.02}, + } + ) with patch("litellm.completion", return_value=mock_response): - adapter = LiteLLMModelAdapter( - model_id="gpt-4o", cost_calculator=calc - ) + adapter = LiteLLMModelAdapter(model_id="gpt-4o", cost_calculator=calc) adapter.chat([{"role": "user", "content": "Hello"}]) total = adapter.gather_usage() From 3d772bc02096d382dc3a89e43ca1d23e8db323f1 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 20:30:08 +0100 Subject: [PATCH 10/19] added agent usage tracking --- maseval/core/agent.py | 3 +- 
maseval/interface/agents/camel.py | 69 +++-- maseval/interface/agents/langgraph.py | 83 ++++++ maseval/interface/agents/llamaindex.py | 30 ++ maseval/interface/agents/smolagents.py | 66 +++-- .../test_camel_integration.py | 173 ++++++++++- .../test_langgraph_integration.py | 136 +++++++++ .../test_llamaindex_integration.py | 207 +++++++++++++ .../test_smolagents_integration.py | 280 ++++++++++++++++-- 9 files changed, 974 insertions(+), 73 deletions(-) diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 97011527..e481843c 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -5,9 +5,10 @@ from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin +from .usage import UsageTrackableMixin -class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin): +class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Wraps an agent from any framework to provide a standard interface. This Adapter provides: diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index 6166d108..b6ccebba 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -19,6 +19,7 @@ from maseval import AgentAdapter, MessageHistory, LLMUser, User from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin +from maseval.core.usage import TokenUsage, Usage __all__ = [ "CamelAgentAdapter", @@ -135,9 +136,12 @@ class CamelAgentAdapter(AgentAdapter): for msg in agent_adapter.get_messages(): print(f"{msg['role']}: {msg['content']}") - # Gather execution traces with token usage and tool calls + # Gather aggregated usage + usage = agent_adapter.gather_usage() + print(f"Total tokens: {usage.total_tokens}") + + # Gather execution traces with tool call counts traces = agent_adapter.gather_traces() - print(f"Total tokens: {traces['total_tokens']}") print(f"Tool calls: {traces['total_tool_calls']}") # Gather configuration 
@@ -383,25 +387,16 @@ def _convert_base_message(self, msg) -> Dict[str, Any]: def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this CAMEL agent. - Extends the base class to include CAMEL-specific execution data - with aggregated statistics from all responses. + Extends the base class to include CAMEL-specific per-step execution + data. Aggregated usage totals are available via ``gather_usage()``. Returns: - Dictionary containing: - - Base traces (type, gathered_at, name, messages, logs, etc.) - - total_steps: Number of step() calls made - - total_input_tokens: Aggregated input tokens - - total_output_tokens: Aggregated output tokens - - total_tokens: Aggregated total tokens - - total_tool_calls: Total number of tool calls made - - last_terminated: Whether the last response indicated termination + Dictionary containing base traces plus step count, tool call count, + and termination status. """ base_traces = super().gather_traces() _check_camel_installed() - # Calculate aggregated statistics from responses - total_input_tokens = 0 - total_output_tokens = 0 total_tool_calls = 0 last_terminated = False @@ -411,24 +406,12 @@ def gather_traces(self) -> Dict[str, Any]: if hasattr(response, "info") and response.info: info = response.info - - # Aggregate token usage - if "usage" in info and isinstance(info["usage"], dict): - usage = info["usage"] - total_input_tokens += usage.get("prompt_tokens", 0) - total_output_tokens += usage.get("completion_tokens", 0) - - # Count tool calls if "tool_calls" in info and info["tool_calls"]: total_tool_calls += len(info["tool_calls"]) - # Add aggregated statistics base_traces.update( { "total_steps": len(self._responses), - "total_input_tokens": total_input_tokens, - "total_output_tokens": total_output_tokens, - "total_tokens": total_input_tokens + total_output_tokens, "total_tool_calls": total_tool_calls, "last_terminated": last_terminated, } @@ -436,6 +419,38 @@ def gather_traces(self) -> Dict[str, Any]: return 
base_traces + def gather_usage(self) -> Usage: + """Gather aggregated token usage across all CAMEL agent responses. + + Walks stored ``ChatAgentResponse`` objects and sums their + ``info["usage"]`` dicts (which contain ``prompt_tokens`` and + ``completion_tokens``). + + Returns: + Aggregated token usage, or empty ``Usage`` if no responses or no usage data. + """ + total_input = 0 + total_output = 0 + has_usage = False + + for response in self._responses: + if hasattr(response, "info") and response.info: + info = response.info + if "usage" in info and isinstance(info["usage"], dict): + usage_dict = info["usage"] + total_input += usage_dict.get("prompt_tokens", 0) + total_output += usage_dict.get("completion_tokens", 0) + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def gather_config(self) -> Dict[str, Any]: """Gather configuration from this CAMEL agent. diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py index 5831c81d..6749e549 100644 --- a/maseval/interface/agents/langgraph.py +++ b/maseval/interface/agents/langgraph.py @@ -9,6 +9,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["LangGraphAgentAdapter", "LangGraphLLMUser"] @@ -213,6 +214,88 @@ def gather_config(self) -> dict[str, Any]: return base_config + def gather_usage(self) -> Usage: + """Gather aggregated token usage from LangGraph message metadata. + + Walks messages from the last graph execution (or persistent state) + and sums their ``usage_metadata``, including detailed token breakdowns + for caching, reasoning, and audio tokens when available. + + Returns: + Aggregated token usage, or empty ``Usage`` if no messages or no usage data. 
+ """ + _check_langgraph_installed() + messages = self._get_usage_messages() + if not messages: + return Usage() + + total_input = 0 + total_output = 0 + cached_input = 0 + cache_creation_input = 0 + reasoning = 0 + audio_input = 0 + audio_output = 0 + has_usage = False + + for msg in messages: + if not (hasattr(msg, "usage_metadata") and msg.usage_metadata): + continue + meta = msg.usage_metadata + has_usage = True + + # Core counts — usage_metadata is a TypedDict (dict-like) + if isinstance(meta, dict): + total_input += meta.get("input_tokens", 0) + total_output += meta.get("output_tokens", 0) + # Detailed breakdowns (optional) + input_details = meta.get("input_token_details", {}) or {} + output_details = meta.get("output_token_details", {}) or {} + else: + total_input += getattr(meta, "input_tokens", 0) + total_output += getattr(meta, "output_tokens", 0) + input_details = getattr(meta, "input_token_details", {}) or {} + output_details = getattr(meta, "output_token_details", {}) or {} + + if isinstance(input_details, dict): + cached_input += input_details.get("cache_read", 0) + cache_creation_input += input_details.get("cache_creation", 0) + audio_input += input_details.get("audio", 0) + if isinstance(output_details, dict): + reasoning += output_details.get("reasoning", 0) + audio_output += output_details.get("audio", 0) + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + cached_input_tokens=cached_input, + cache_creation_input_tokens=cache_creation_input, + reasoning_tokens=reasoning, + audio_tokens=audio_input + audio_output, + ) + + def _get_usage_messages(self) -> list: + """Get messages for usage extraction, preferring persistent state.""" + # Try persistent state first + if self._langgraph_config and hasattr(self.agent, "get_state"): + try: + state = self.agent.get_state(self._langgraph_config) + messages = state.values.get("messages", []) + if 
messages: + return messages + except Exception: + pass + + # Fall back to cached result + if self._last_result and isinstance(self._last_result, dict): + return self._last_result.get("messages", []) + + return [] + def _run_agent(self, query: str) -> Any: _check_langgraph_installed() from langchain_core.messages import HumanMessage diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index facf3794..e0e063d4 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -10,6 +10,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["LlamaIndexAgentAdapter", "LlamaIndexLLMUser"] @@ -215,6 +216,35 @@ def gather_config(self) -> Dict[str, Any]: return base_config + def gather_usage(self) -> Usage: + """Gather aggregated token usage from LlamaIndex execution logs. + + Sums token counts recorded in ``self.logs`` during agent execution. + LlamaIndex does not provide built-in cumulative usage tracking, so + this aggregates per-call usage extracted from LLM responses. + + Returns: + Aggregated token usage, or empty ``Usage`` if no usage data was recorded. + """ + total_input = 0 + total_output = 0 + has_usage = False + + for log_entry in self.logs: + if "input_tokens" in log_entry or "output_tokens" in log_entry: + total_input += log_entry.get("input_tokens", 0) + total_output += log_entry.get("output_tokens", 0) + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def _run_agent(self, query: str) -> Any: """Run the LlamaIndex agent and cache execution state. 
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py index 9e4169ef..4fcfe1db 100644 --- a/maseval/interface/agents/smolagents.py +++ b/maseval/interface/agents/smolagents.py @@ -7,6 +7,7 @@ from typing import TYPE_CHECKING, Any, Dict, List from maseval import AgentAdapter, MessageHistory, LLMUser +from maseval.core.usage import TokenUsage, Usage __all__ = ["SmolAgentAdapter", "SmolAgentLLMUser"] @@ -67,9 +68,12 @@ class SmolAgentAdapter(AgentAdapter): for msg in agent_adapter.get_messages(): print(f"{msg['role']}: {msg['content']}") - # Gather execution traces with timing and token usage + # Gather aggregated usage + usage = agent_adapter.gather_usage() + print(f"Total tokens: {usage.total_tokens}") + + # Gather execution traces with timing traces = agent_adapter.gather_traces() - print(f"Total tokens: {traces['total_tokens']}") print(f"Total duration: {traces['total_duration_seconds']}s") # Use in benchmark @@ -254,13 +258,12 @@ def logs(self) -> List[Dict[str, Any]]: # type: ignore[override] def gather_traces(self) -> dict: """Gather traces including message history and monitoring data. - Extends the base class to include smolagents' built-in monitoring data: - - Token usage (input, output, total) per step and aggregated - - Timing/duration per step and aggregated - - Step-level details including actions and observations + Extends the base class to include smolagents' per-step monitoring data + (token usage, timing, actions, observations). Aggregated usage totals + are available via ``gather_usage()``. Returns: - Dict containing messages and monitoring statistics + Dict containing messages and per-step monitoring statistics. 
""" base_logs = super().gather_traces() _check_smolagents_installed() @@ -268,17 +271,14 @@ def gather_traces(self) -> dict: # Extract monitoring data from agent's memory steps if hasattr(self.agent, "memory") and hasattr(self.agent.memory, "steps"): steps_stats = [] - total_input_tokens = 0 - total_output_tokens = 0 total_duration = 0.0 - # Import ActionStep for type checking from smolagents.memory import ActionStep, PlanningStep for step in self.agent.memory.steps: # Process ActionStep and PlanningStep (both have token_usage and timing) if isinstance(step, (ActionStep, PlanningStep)): - step_info = { + step_info: Dict[str, Any] = { "step_number": getattr(step, "step_number", None), } @@ -288,13 +288,11 @@ def gather_traces(self) -> dict: if step.timing.duration is not None: total_duration += step.timing.duration - # Add token usage information + # Add per-step token usage if hasattr(step, "token_usage") and step.token_usage: step_info["input_tokens"] = step.token_usage.input_tokens step_info["output_tokens"] = step.token_usage.output_tokens step_info["total_tokens"] = step.token_usage.total_tokens - total_input_tokens += step.token_usage.input_tokens - total_output_tokens += step.token_usage.output_tokens # Add action details for ActionStep if isinstance(step, ActionStep): @@ -312,13 +310,9 @@ def gather_traces(self) -> dict: steps_stats.append(step_info) - # Add aggregated statistics base_logs.update( { "total_steps": len(steps_stats), - "total_input_tokens": total_input_tokens, - "total_output_tokens": total_output_tokens, - "total_tokens": total_input_tokens + total_output_tokens, "total_duration_seconds": total_duration, "steps_detail": steps_stats, } @@ -326,6 +320,42 @@ def gather_traces(self) -> dict: return base_logs + def gather_usage(self) -> Usage: + """Gather aggregated token usage across all agent steps. + + Walks smolagents' memory steps (ActionStep and PlanningStep) and sums + their ``token_usage`` into a single ``TokenUsage``. 
+ + Returns: + Aggregated token usage, or empty ``Usage`` if no steps or no usage data. + """ + _check_smolagents_installed() + + if not (hasattr(self.agent, "memory") and hasattr(self.agent.memory, "steps")): + return Usage() + + from smolagents.memory import ActionStep, PlanningStep + + total_input = 0 + total_output = 0 + has_usage = False + + for step in self.agent.memory.steps: + if isinstance(step, (ActionStep, PlanningStep)): + if hasattr(step, "token_usage") and step.token_usage: + total_input += step.token_usage.input_tokens + total_output += step.token_usage.output_tokens + has_usage = True + + if not has_usage: + return Usage() + + return TokenUsage( + input_tokens=total_input, + output_tokens=total_output, + total_tokens=total_input + total_output, + ) + def gather_config(self) -> dict[str, Any]: """Gather configuration from this SmolAgent. diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index 46c55620..86be5722 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -84,13 +84,11 @@ def test_camel_adapter_gather_traces_with_response(): traces = adapter.gather_traces() - # New API uses last_terminated and aggregated stats + # New API uses last_terminated and aggregated stats (token totals moved to gather_usage) assert "last_terminated" in traces assert traces["last_terminated"] is True assert "total_steps" in traces assert traces["total_steps"] == 1 - assert "total_tokens" in traces - assert traces["total_tokens"] == 15 def test_camel_adapter_gather_config_basic(): @@ -1079,3 +1077,172 @@ def test_workforce_tracer_truncates_long_content(): assert len(traces["completed_tasks"][0]["content"]) == 200 assert len(traces["completed_tasks"][0]["result"]) == 200 + + +# ============================================================================= +# 
gather_usage() Tests +# ============================================================================= + + +def test_camel_adapter_gather_usage_with_responses(): + """Test that gather_usage() aggregates token usage across CAMEL responses.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + # Simulate responses with usage data + adapter._responses.append( + MockCamelResponse( + content="Response 1", + terminated=False, + info={"usage": {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150}}, + ) + ) + adapter._responses.append( + MockCamelResponse( + content="Response 2", + terminated=True, + info={"usage": {"prompt_tokens": 200, "completion_tokens": 80, "total_tokens": 280}}, + ) + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 300 # 100 + 200 + assert usage.output_tokens == 130 # 50 + 80 + assert usage.total_tokens == 430 + + +def test_camel_adapter_gather_usage_no_responses(): + """Test that gather_usage() returns empty Usage when no responses exist.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_camel_adapter_gather_usage_responses_without_usage(): + """Test that gather_usage() handles responses without usage info.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage + + mock_agent = create_mock_camel_agent() + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test_agent") + + # Response without info or usage + 
adapter._responses.append(MockCamelResponse(content="Response", terminated=True, info={})) + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +# ============================================================================= +# End-to-End Usage Collection Tests +# (real ChatAgent + StubModel execution, not pre-populated mock data) +# ============================================================================= + + +def test_e2e_camel_gather_usage_single_step(): + """Run a real CAMEL ChatAgent with StubModel, verify gather_usage() returns real token counts.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + # StubModel returns CompletionUsage(prompt_tokens=10, completion_tokens=10) + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are a helpful assistant.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + result = adapter.run("What is 2+2?") + + assert result == "Lorem Ipsum" + assert len(adapter._responses) == 1 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 10 + assert usage.output_tokens == 10 + assert usage.total_tokens == 20 + + +def test_e2e_camel_gather_usage_multi_step(): + """Run a real CAMEL ChatAgent multiple times, verify usage aggregation.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + adapter.run("First 
query") + adapter.run("Second query") + adapter.run("Third query") + + assert len(adapter._responses) == 3 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 30 # 10 * 3 + assert usage.output_tokens == 30 # 10 * 3 + assert usage.total_tokens == 60 # 20 * 3 + + +def test_e2e_camel_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the StubModel pipeline + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens > 0 + assert usage_after.output_tokens > 0 + + +def test_e2e_camel_logs_contain_usage(): + """Verify adapter.logs also contain usage data from real execution.""" + from camel.agents import ChatAgent + from camel.models import StubModel + from camel.types import ModelType + from maseval.interface.agents.camel import CamelAgentAdapter + + stub = StubModel(model_type=ModelType.STUB) + agent = ChatAgent(system_message="You are helpful.", model=stub) + adapter = CamelAgentAdapter(agent_instance=agent, name="e2e_test") + + adapter.run("Tell me something") + + assert len(adapter.logs) == 1 + assert adapter.logs[0]["status"] == "success" + assert adapter.logs[0]["input_tokens"] == 10 + assert adapter.logs[0]["output_tokens"] == 10 + 
assert adapter.logs[0]["total_tokens"] == 20 diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index c972a464..b39c4791 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -240,3 +240,139 @@ def agent_node(state: State) -> State: assert log_entry.get("input_tokens") in [None, 0] assert log_entry.get("output_tokens") in [None, 0] assert log_entry.get("total_tokens") in [None, 0] + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_langgraph_adapter_gather_usage_with_metadata(): + """Test that gather_usage() extracts token usage from message metadata.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages = state["messages"] + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata( + input_tokens=150, + output_tokens=60, + total_tokens=210, + ), + ) + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 150 + assert 
usage.output_tokens == 60 + assert usage.total_tokens == 210 + + +def test_langgraph_adapter_gather_usage_with_token_details(): + """Test that gather_usage() extracts detailed token breakdowns.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages = state["messages"] + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata( + input_tokens=200, + output_tokens=100, + total_tokens=300, + input_token_details={"cache_read": 50, "cache_creation": 30}, + output_token_details={"reasoning": 40}, + ), + ) + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 200 + assert usage.output_tokens == 100 + assert usage.cached_input_tokens == 50 + assert usage.cache_creation_input_tokens == 30 + assert usage.reasoning_tokens == 40 + + +def test_langgraph_adapter_gather_usage_no_metadata(): + """Test that gather_usage() returns empty Usage when no usage_metadata.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import Usage + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + messages 
= state["messages"] + response = AIMessage(content="Response") # No usage_metadata + return {"messages": messages + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test_agent") + adapter.run("Test query") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_langgraph_adapter_gather_usage_before_run(): + """Test that gather_usage() returns empty Usage before any run.""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LangGraphAgentAdapter(agent_instance=Mock(), name="test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index 52779f65..49998292 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -393,3 +393,210 @@ def test_llamaindex_adapter_error_logging(): assert log_entry["error"] == "Test error" assert log_entry["error_type"] == "ValueError" assert "duration_seconds" in log_entry + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_llamaindex_adapter_gather_usage_with_logs(): + """Test that gather_usage() aggregates token usage from execution logs.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from 
unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + # Simulate logs with token usage (as populated by _run_agent) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:00", + "query": "Query 1", + "status": "success", + "duration_seconds": 1.0, + "input_tokens": 100, + "output_tokens": 50, + "total_tokens": 150, + } + ) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:01", + "query": "Query 2", + "status": "success", + "duration_seconds": 0.5, + "input_tokens": 200, + "output_tokens": 80, + "total_tokens": 280, + } + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 300 # 100 + 200 + assert usage.output_tokens == 130 # 50 + 80 + assert usage.total_tokens == 430 + + +def test_llamaindex_adapter_gather_usage_no_logs(): + """Test that gather_usage() returns empty Usage with no logs.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +def test_llamaindex_adapter_gather_usage_logs_without_tokens(): + """Test that gather_usage() returns empty Usage when logs have no token fields.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import Usage + from unittest.mock import Mock + + adapter = LlamaIndexAgentAdapter(Mock(), "test_agent") + + # Log without token fields (error case, or model didn't report usage) + adapter.logs.append( + { + "timestamp": "2026-01-01T00:00:00", + "query": "Query", + "status": "error", + "duration_seconds": 0.1, + "error": "Something went wrong", + } + ) + + usage = adapter.gather_usage() + + assert isinstance(usage, Usage) + assert usage.cost is None + + +# 
============================================================================= +# End-to-End Usage Collection Tests +# (real ReActAgent + AgentWorkflow execution, not pre-populated mock data) +# ============================================================================= + + +def _make_llamaindex_adapter(prompt_tokens_per_call: int = 10, completion_tokens_per_call: int = 20): + """Create adapter wrapping a ReActAgent + AgentWorkflow with a mock LLM that reports usage. + + The mock LLM returns CompletionResponse with token usage in raw, using + ReAct format ("Thought: ... Answer: ...") so the output parser works. + """ + from types import SimpleNamespace + from typing import Any + from llama_index.core.base.llms.types import CompletionResponse, CompletionResponseGen + from llama_index.core.llms.custom import CustomLLM + from llama_index.core.llms import LLMMetadata + from llama_index.core.agent.workflow.react_agent import ReActAgent + from llama_index.core.agent.workflow import AgentWorkflow + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + + pt = prompt_tokens_per_call + ct = completion_tokens_per_call + + class _MockLLMWithUsage(CustomLLM): + prompt_tokens: int = pt + completion_tokens: int = ct + + @property + def metadata(self) -> LLMMetadata: + return LLMMetadata(num_output=256) + + def complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponse: + usage = SimpleNamespace( + prompt_tokens=self.prompt_tokens, + completion_tokens=self.completion_tokens, + total_tokens=self.prompt_tokens + self.completion_tokens, + ) + return CompletionResponse( + text="Thought: I can answer directly.\nAnswer: Mock answer.", + raw=SimpleNamespace(usage=usage), + ) + + def stream_complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponseGen: + raise NotImplementedError + + llm = _MockLLMWithUsage() + agent = ReActAgent(name="test_agent", description="test", llm=llm, tools=[], streaming=False) + 
workflow = AgentWorkflow(agents=[agent], root_agent="test_agent") + return LlamaIndexAgentAdapter(workflow, "test_agent") + + +def test_e2e_llamaindex_gather_usage_single_run(): + """Run a real ReActAgent → adapter.run() → gather_usage() returns real token counts.""" + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=10, completion_tokens_per_call=20) + + result = adapter.run("Hello?") + assert isinstance(result, str) + assert len(result) > 0 + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 10 + assert usage.output_tokens == 20 + assert usage.total_tokens == 30 + + +def test_e2e_llamaindex_gather_usage_accumulates(): + """Multiple adapter.run() calls accumulate in gather_usage().""" + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=15, completion_tokens_per_call=25) + + adapter.run("First query") + adapter.run("Second query") + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 30 # 15 + 15 + assert usage.output_tokens == 50 # 25 + 25 + assert usage.total_tokens == 80 + + +def test_e2e_llamaindex_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=50, completion_tokens_per_call=100) + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the LLM + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens == 50 + assert usage_after.output_tokens == 100 + assert 
usage_after.total_tokens == 150 + + +def test_e2e_llamaindex_logs_populated_by_real_execution(): + """Verify adapter.logs is populated by _run_agent, not manually.""" + adapter = _make_llamaindex_adapter(prompt_tokens_per_call=50, completion_tokens_per_call=100) + + assert len(adapter.logs) == 0 + + adapter.run("Test query") + + assert len(adapter.logs) == 1 + log = adapter.logs[0] + assert log["status"] == "success" + assert log["input_tokens"] == 50 + assert log["output_tokens"] == 100 + assert log["total_tokens"] == 150 + assert "timestamp" in log + assert "duration_seconds" in log diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index d3c40400..cc605fa0 100644 --- a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -113,19 +113,10 @@ def test_smolagents_adapter_gather_traces_with_monitoring(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics + # Verify aggregated statistics (token totals moved to gather_usage) assert "total_steps" in traces assert traces["total_steps"] == 2 - assert "total_input_tokens" in traces - assert traces["total_input_tokens"] == 300 # 100 + 200 - - assert "total_output_tokens" in traces - assert traces["total_output_tokens"] == 150 # 50 + 100 - - assert "total_tokens" in traces - assert traces["total_tokens"] == 450 # 300 + 150 - assert "total_duration_seconds" in traces assert traces["total_duration_seconds"] == pytest.approx(1.2, abs=0.01) # 0.5 + 0.7 @@ -171,19 +162,10 @@ def test_smolagents_adapter_gather_traces_without_monitoring(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics show zero usage + # Verify aggregated statistics (token totals moved to gather_usage) assert "total_steps" in traces assert traces["total_steps"] 
== 0 - assert "total_input_tokens" in traces - assert traces["total_input_tokens"] == 0 - - assert "total_output_tokens" in traces - assert traces["total_output_tokens"] == 0 - - assert "total_tokens" in traces - assert traces["total_tokens"] == 0 - assert "total_duration_seconds" in traces assert traces["total_duration_seconds"] == 0.0 @@ -224,11 +206,8 @@ def test_smolagents_adapter_gather_traces_with_planning_step(): # Call gather_traces traces = agent_adapter.gather_traces() - # Verify aggregated statistics + # Verify aggregated statistics (token totals moved to gather_usage) assert traces["total_steps"] == 1 - assert traces["total_input_tokens"] == 500 - assert traces["total_output_tokens"] == 200 - assert traces["total_tokens"] == 700 assert traces["total_duration_seconds"] == pytest.approx(1.0, abs=0.01) # Verify step details @@ -478,3 +457,256 @@ def test_smolagents_adapter_logs_empty_when_no_steps(): # Should be empty assert isinstance(logs, list) assert len(logs) == 0 + + +# ============================================================================= +# gather_usage() Tests +# ============================================================================= + + +def test_smolagents_adapter_gather_usage_with_steps(): + """Test that gather_usage() aggregates token usage across all memory steps.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + from smolagents.memory import ActionStep, PlanningStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from smolagents.models import ChatMessage, MessageRole + from unittest.mock import Mock + import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + + start = time.time() + + # ActionStep with usage + step1 = ActionStep( + step_number=1, + timing=Timing(start_time=start, end_time=start + 0.5), + observations_images=[], + ) + step1.token_usage = TokenUsage(input_tokens=100, 
output_tokens=50)
+    mock_agent.memory.steps.append(step1)
+
+    # PlanningStep with usage
+    step2 = PlanningStep(
+        timing=Timing(start_time=start + 0.5, end_time=start + 1.0),
+        model_input_messages=[],
+        model_output_message=ChatMessage(role=MessageRole.ASSISTANT, content="Plan"),
+        plan="My plan",
+    )
+    step2.token_usage = TokenUsage(input_tokens=200, output_tokens=80)
+    mock_agent.memory.steps.append(step2)
+
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+    usage = adapter.gather_usage()
+
+    assert isinstance(usage, MasevalTokenUsage)
+    assert usage.input_tokens == 300  # 100 + 200
+    assert usage.output_tokens == 130  # 50 + 80
+    assert usage.total_tokens == 430
+
+
+def test_smolagents_adapter_gather_usage_no_steps():
+    """Test that gather_usage() returns empty Usage when no steps exist."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import Usage
+    from smolagents.memory import AgentMemory
+    from unittest.mock import Mock
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+    usage = adapter.gather_usage()
+
+    assert isinstance(usage, Usage)
+    assert usage.cost is None
+    # Plain Usage carries no token fields; if any are present they must be zero
+    assert getattr(usage, "input_tokens", 0) == 0
+
+
+def test_smolagents_adapter_gather_usage_steps_without_token_usage():
+    """Test that gather_usage() returns empty Usage when steps have no token_usage."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import Usage
+    from smolagents.memory import ActionStep, AgentMemory
+    from smolagents.monitoring import Timing
+    from unittest.mock import Mock
+    import time
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+
+    start = time.time()
+    step = 
ActionStep(
+ step_number=1,
+ timing=Timing(start_time=start, end_time=start + 0.5),
+ observations_images=[],
+ )
+ # No token_usage set — defaults to None
+ mock_agent.memory.steps.append(step)
+ mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+ adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent")
+
+ usage = adapter.gather_usage()
+
+ # Should return plain Usage (not TokenUsage) since no usage data
+ assert isinstance(usage, Usage)
+ assert usage.cost is None
+
+
+# =============================================================================
+# End-to-End Usage Collection Tests
+# (real ToolCallingAgent execution, not pre-populated mock data)
+# =============================================================================
+
+
+class _FakeModelForUsageTest:
+ """Deterministic fake model that returns canned responses with token usage.
+
+ Not a real smolagents Model subclass; duck-types the same generate() signature.
+ Uses a list of responses; cycles the last one if calls exceed the list. 
+ """ + + def __init__(self, responses=None): + from smolagents.models import ( + ChatMessage, + ChatMessageToolCall, + ChatMessageToolCallFunction, + MessageRole, + ) + from smolagents.monitoring import TokenUsage + + self.model_id = "fake-model-for-test" + self._call_count = 0 + self._responses = responses or [ + ChatMessage( + role=MessageRole.ASSISTANT, + content="Here is the answer.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction( + name="final_answer", + arguments={"answer": "42"}, + ), + id="call_001", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=150, output_tokens=30), + ) + ] + + def generate(self, messages, stop_sequences=None, response_format=None, tools_to_call_from=None, **kwargs): + idx = min(self._call_count, len(self._responses) - 1) + self._call_count += 1 + return self._responses[idx] + + +def test_e2e_smolagents_gather_usage_single_step(): + """Run a real ToolCallingAgent → adapter.run() → gather_usage() returns real token counts.""" + from smolagents import ToolCallingAgent + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + agent = ToolCallingAgent(tools=[], model=_FakeModelForUsageTest(), max_steps=3, verbosity_level=0) + adapter = SmolAgentAdapter(agent_instance=agent, name="test_agent") + + result = adapter.run("What is the meaning of life?") + assert result == "42" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 150 + assert usage.output_tokens == 30 + assert usage.total_tokens == 180 + + +def test_e2e_smolagents_gather_usage_multi_step(): + """Run a real agent through tool call + final answer, verify usage aggregation.""" + from smolagents import ToolCallingAgent + from smolagents.models import ( + ChatMessage, + ChatMessageToolCall, + ChatMessageToolCallFunction, + MessageRole, + ) + from smolagents.monitoring import TokenUsage + from 
smolagents.tools import Tool + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage + + class AddTool(Tool): + name = "add_numbers" + description = "Adds two numbers" + inputs = {"a": {"type": "number", "description": "First"}, "b": {"type": "number", "description": "Second"}} + output_type = "number" + + def forward(self, a, b): + return a + b + + responses = [ + ChatMessage( + role=MessageRole.ASSISTANT, + content="Let me add.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction(name="add_numbers", arguments={"a": 20, "b": 22}), + id="call_001", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=200, output_tokens=40), + ), + ChatMessage( + role=MessageRole.ASSISTANT, + content="The sum is 42.", + tool_calls=[ + ChatMessageToolCall( + function=ChatMessageToolCallFunction(name="final_answer", arguments={"answer": "42"}), + id="call_002", + type="function", + ) + ], + token_usage=TokenUsage(input_tokens=350, output_tokens=20), + ), + ] + + agent = ToolCallingAgent(tools=[AddTool()], model=_FakeModelForUsageTest(responses), max_steps=5, verbosity_level=0) + adapter = SmolAgentAdapter(agent_instance=agent, name="test_agent") + + result = adapter.run("What is 20 + 22?") + assert result == "42" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 550 # 200 + 350 + assert usage.output_tokens == 60 # 40 + 20 + assert usage.total_tokens == 610 + + +def test_e2e_smolagents_gather_usage_empty_before_run(): + """Verify gather_usage() returns empty Usage before run, real TokenUsage after.""" + from smolagents import ToolCallingAgent + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import Usage, TokenUsage as MasevalTokenUsage + + agent = ToolCallingAgent(tools=[], model=_FakeModelForUsageTest(), max_steps=3, verbosity_level=0) + adapter = 
SmolAgentAdapter(agent_instance=agent, name="test_agent") + + # Before run: no usage + usage_before = adapter.gather_usage() + assert isinstance(usage_before, Usage) + assert not isinstance(usage_before, MasevalTokenUsage) + + # After run: real usage from the model + adapter.run("test query") + usage_after = adapter.gather_usage() + assert isinstance(usage_after, MasevalTokenUsage) + assert usage_after.input_tokens > 0 + assert usage_after.output_tokens > 0 From 8b9a374f5666c524135dc11df3d21342de600354 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 20:37:32 +0100 Subject: [PATCH 11/19] updated usage tracking guide --- docs/guides/usage-tracking.md | 260 +++++++++++++++------------------- 1 file changed, 112 insertions(+), 148 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index 6b249a05..fba1915e 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -2,12 +2,7 @@ ## Overview -MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for: - -- **Cost control**: Track how much each benchmark run costs across providers -- **Budgeting**: Compare cost across models, tasks, and components -- **Billing**: Support custom credit systems (university clusters, internal APIs) -- **Analysis**: Understand token usage patterns per task, agent, or model +MASEval tracks how much each benchmark run consumes — tokens, API calls, dollars — so you can compare models, stay within budget, and explain where money went. !!! info "Usage vs Cost" @@ -17,39 +12,52 @@ MASEval provides first-class usage and cost tracking to monitor resource consump Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator. -## Core Concepts - -**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata. 
+## What Gets Tracked Automatically -**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.). +**Model adapters** track every `chat()` call — input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`. +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures — memory steps, response metadata, message annotations, or execution logs respectively. -**`CostCalculator`**: Protocol for pluggable cost computation from token counts. +**Benchmarks** collect usage from all registered components after each task and include it in reports. -## Automatic LLM Usage Tracking +## Getting Started -All `ModelAdapter` subclasses track token usage automatically. No configuration needed — every `chat()` call records a `TokenUsage` entry internally. +### Reading Model Usage ```python from maseval.interface.inference import OpenAIModelAdapter model = OpenAIModelAdapter(client=client, model_id="gpt-4") -# Make some calls model.chat([{"role": "user", "content": "Hello"}]) model.chat([{"role": "user", "content": "How are you?"}]) -# Inspect accumulated usage +# Accumulated usage across both calls usage = model.gather_usage() -print(usage.input_tokens) # e.g., 25 -print(usage.output_tokens) # e.g., 42 -print(usage.cost) # None (no cost calculator configured) +print(f"{usage.input_tokens} in, {usage.output_tokens} out") +print(f"Cost: ${usage.cost}") # None if no cost calculator configured +``` + +### Reading Agent Usage + +Agent adapters expose the same `gather_usage()` interface. 
Each adapter knows how to extract usage from its framework's internals: + +```python +from maseval.interface.agents import SmolAgentAdapter + +adapter = SmolAgentAdapter(agent, name="researcher") +adapter.run("What's the capital of France?") + +# Usage is aggregated from the agent's memory steps +usage = adapter.gather_usage() +print(f"{usage.input_tokens} in, {usage.output_tokens} out") ``` +This works across all supported frameworks — smolagents, CAMEL, LangGraph, and LlamaIndex. The adapter handles the framework-specific extraction; you always call `gather_usage()`. + ### In Benchmarks -Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key: +Usage is collected automatically alongside traces and configs after each task. Each report includes a `"usage"` key: ```python results = benchmark.run() @@ -65,29 +73,18 @@ benchmark.usage # -> Usage (grand total across all tasks) benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) ``` -## Cost Calculation - -Most LLM APIs return token counts but not cost. Cost is a client-side concern. MASEval provides two built-in cost calculators and a protocol for custom ones. - -### Cost Priority - -When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order: +## Adding Cost Tracking -1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. -2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. -3. **None** — if neither source provides cost, `Usage.cost` stays `None`. +Most LLM APIs return token counts but not cost. MASEval provides two built-in cost calculators. -### StaticPricingCalculator +### Quick Start: LiteLLM Pricing -Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`. 
+The easiest path — uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: ```python -from maseval import StaticPricingCalculator +from maseval.interface.usage import LiteLLMCostCalculator -calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, -}) +calculator = LiteLLMCostCalculator() model = OpenAIModelAdapter( client=client, @@ -96,78 +93,90 @@ model = OpenAIModelAdapter( ) response = model.chat([{"role": "user", "content": "Hello"}]) -print(model.gather_usage().cost) # e.g., 0.00234 +print(f"Cost: ${model.gather_usage().cost:.4f}") ``` -Pricing is per token (not per 1K or 1M). Cached input tokens are handled automatically — set a `"cached_input"` rate to differentiate: +!!! tip "LiteLLMModelAdapter already reports cost" + + If you're using `LiteLLMModelAdapter`, it extracts provider-reported cost automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
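Whichever calculator you attach, the fallback behavior is the same: provider-reported cost wins, then the calculator, and otherwise cost stays unknown. The following stdlib-only sketch illustrates that precedence; the names `resolve_cost`, `PerTokenCalculator`, and `Tokens` are invented for this example and are not MASEval APIs:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tokens:
    """Minimal stand-in for a TokenUsage record."""
    input_tokens: int = 0
    output_tokens: int = 0


class PerTokenCalculator:
    """Toy calculator: per-token rates keyed by model id (illustrative only)."""

    def __init__(self, pricing: Dict[str, Dict[str, float]]):
        self.pricing = pricing

    def calculate_cost(self, usage: Tokens, model_id: str) -> Optional[float]:
        rate = self.pricing.get(model_id)
        if rate is None:
            return None  # no pricing known for this model
        return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens


def resolve_cost(provider_cost, calculator, usage, model_id):
    # 1. Provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise fall back to a configured calculator.
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)
    # 3. Neither source available: cost stays unknown.
    return None


calc = PerTokenCalculator({"gpt-4": {"input": 0.00003, "output": 0.00006}})
usage = Tokens(input_tokens=100, output_tokens=50)
print(resolve_cost(0.0123, calc, usage, "gpt-4"))  # provider cost wins
print(resolve_cost(None, calc, usage, "gpt-4"))    # calculator: 0.003 + 0.003
print(resolve_cost(None, None, usage, "gpt-4"))    # None
```

Note that the calculator returns `None` for unknown models, which propagates cleanly through the fallback chain.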
+ +If your model ID doesn't match LiteLLM's naming (e.g., using Google's OpenAI-compatible endpoint), remap it: ```python -calculator = StaticPricingCalculator({ - "claude-sonnet-4-5": { - "input": 0.000003, - "output": 0.000015, - "cached_input": 0.0000003, # 10x cheaper for cached tokens - }, +calculator = LiteLLMCostCalculator(model_id_map={ + "gemini-2.0-flash": "gemini/gemini-2.0-flash", }) ``` -For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents: +You can also override pricing for specific models while using LiteLLM's database for the rest: ```python -calculator = StaticPricingCalculator({ - "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token +calculator = LiteLLMCostCalculator(custom_pricing={ + "my-finetuned-gpt4": { + "input_cost_per_token": 0.00006, + "output_cost_per_token": 0.00012, + }, }) ``` -### LiteLLMCostCalculator +### Manual Pricing -Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more. +When you know your rates, use `StaticPricingCalculator` — zero dependencies, fully explicit: ```python -from maseval.interface.usage import LiteLLMCostCalculator - -calculator = LiteLLMCostCalculator() +from maseval import StaticPricingCalculator -model = OpenAIModelAdapter( - client=client, - model_id="gpt-4", - cost_calculator=calculator, -) +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, +}) ``` -!!! tip "LiteLLMModelAdapter already reports cost" - - If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. 
+Pricing is **per token** (not per 1K or 1M). For cached tokens, add a `"cached_input"` rate: -#### Custom Pricing Overrides +```python +calculator = StaticPricingCalculator({ + "claude-sonnet-4-5": { + "input": 0.000003, + "output": 0.000015, + "cached_input": 0.0000003, # 10x cheaper + }, +}) +``` -Override pricing for specific models while using LiteLLM's database for the rest: +The cost unit is whatever your pricing represents — USD, EUR, university credits: ```python -calculator = LiteLLMCostCalculator(custom_pricing={ - "my-finetuned-gpt4": { - "input_cost_per_token": 0.00006, - "output_cost_per_token": 0.00012, - }, +calculator = StaticPricingCalculator({ + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token }) ``` -#### Model ID Remapping +### Sharing a Calculator Across Models -When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap: +A single calculator instance works for multiple model adapters — the `model_id` is passed on each cost computation: ```python -calculator = LiteLLMCostCalculator(model_id_map={ - "gemini-2.0-flash": "gemini/gemini-2.0-flash", - "my-custom-gpt4": "gpt-4", +calculator = StaticPricingCalculator({ + "gpt-4": {"input": 0.00003, "output": 0.00006}, + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, }) + +model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) +model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator) ``` -The map is applied before both custom pricing and LiteLLM lookup. +### How Cost Is Resolved + +When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: + +1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. 
**CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. **None** — if neither source provides cost, `usage.cost` stays `None`. -### Custom Cost Calculator +### Writing a Custom Calculator -Implement the `CostCalculator` protocol for custom pricing logic: +Implement the `CostCalculator` protocol — a single method: ```python from maseval import CostCalculator, TokenUsage @@ -181,29 +190,40 @@ class MyCostCalculator: return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens ``` -The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model. +Return `None` if you don't have pricing for the given model. -### Sharing Calculators Across Adapters +## Post-hoc Analysis -A single calculator instance can be shared across multiple model adapters. The `model_id` is passed on each call, so the calculator can look up the right pricing: +After a benchmark completes, `UsageReporter` lets you slice usage by task, component, or both: ```python -calculator = StaticPricingCalculator({ - "gpt-4": {"input": 0.00003, "output": 0.00006}, - "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, -}) +from maseval import UsageReporter -model_a = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator) -model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cost_calculator=calculator) +reporter = UsageReporter.from_reports(benchmark.reports) + +# Grand total +total = reporter.total() +print(f"Total cost: ${total.cost:.4f}") +print(f"Total tokens: {total.input_tokens + total.output_tokens}") + +# Where did the money go? +for component, usage in reporter.by_component().items(): + print(f" {component}: ${usage.cost:.4f}") + +# Which tasks were expensive? 
+for task_id, usage in reporter.by_task().items(): + print(f" {task_id}: ${usage.cost:.4f}") + +# Full nested summary dict +summary = reporter.summary() ``` -## Non-LLM Usage Tracking +## Tracking Non-LLM Resources -Tools, environments, and other components can track usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`: +Tools, environments, and other components can track arbitrary usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`. Here's an example for a paid API: ```python from maseval import Usage, UsageTrackableMixin -from maseval.core.tracing import TraceableMixin class BloombergEnvironment(Environment, UsageTrackableMixin): def __init__(self, task_data): @@ -226,67 +246,11 @@ class BloombergEnvironment(Environment, UsageTrackableMixin): return sum(self._usage_records, Usage()) ``` -Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model. - -## Post-hoc Analysis with UsageReporter - -`UsageReporter` provides sliced analysis across all benchmark reports: - -```python -from maseval import UsageReporter - -reporter = UsageReporter.from_reports(benchmark.reports) - -# Grand total -total = reporter.total() -print(f"Total cost: ${total.cost:.4f}") -print(f"Total tokens: {total.input_tokens + total.output_tokens}") - -# Per-task breakdown -for task_id, usage in reporter.by_task().items(): - print(f" {task_id}: ${usage.cost:.4f}") - -# Per-component breakdown -for component, usage in reporter.by_component().items(): - print(f" {component}: ${usage.cost:.4f}") - -# Full nested summary dict -summary = reporter.summary() -``` - -## Usage Data Model - -### Usage - -Generic record for any billable resource: - -| Field | Type | Description | -|-------|------|-------------| -| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. | -| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). 
| -| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). | -| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). | -| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). | -| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). | - -`Usage` supports addition: costs sum (both known) or become `None` (either unknown), units sum, grouping fields are preserved on match or set to `None` on mismatch. - -### TokenUsage - -Extends `Usage` with LLM-specific token counts: - -| Field | Type | Description | -|-------|------|-------------| -| `input_tokens` | `int` | Input/prompt tokens. | -| `output_tokens` | `int` | Output/completion tokens. | -| `total_tokens` | `int` | Total tokens. | -| `cached_input_tokens` | `int` | Tokens served from cache. | -| `reasoning_tokens` | `int` | Reasoning/thinking tokens. | -| `audio_tokens` | `int` | Audio processing tokens. | +Non-LLM components set cost directly — there is no calculator involvement. Each component knows its own billing model. ## Evaluator Usage -Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically: +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register it in the benchmark and its usage is collected separately from agent usage: ```python class MyBenchmark(Benchmark): @@ -296,11 +260,11 @@ class MyBenchmark(Benchmark): return [MyLLMEvaluator(judge_model)] ``` -The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage. +The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, so you can distinguish evaluation cost from agent cost. ## Tips -**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates. 
+**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` when you know your rates. **For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming. From 276ee662449e63d59d080fa961d21869ef4977c0 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:23:06 +0100 Subject: [PATCH 12/19] fixed tests and usage tracking --- docs/guides/usage-tracking.md | 86 ++++++++++++++++--- maseval/core/model.py | 13 +-- maseval/core/registry.py | 7 +- maseval/core/usage.py | 59 ++++++++----- tests/test_core/test_usage.py | 25 ++++-- .../test_camel_integration.py | 4 +- .../test_langgraph_integration.py | 4 +- .../test_llamaindex_integration.py | 4 +- .../test_smolagents_integration.py | 4 +- 9 files changed, 144 insertions(+), 62 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index fba1915e..b59005c4 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -2,7 +2,7 @@ ## Overview -MASEval tracks how much each benchmark run consumes — tokens, API calls, dollars — so you can compare models, stay within budget, and explain where money went. +MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) so you can compare models, stay within budget, and explain where money went. !!! info "Usage vs Cost" @@ -14,9 +14,9 @@ MASEval tracks how much each benchmark run consumes — tokens, API calls, dolla ## What Gets Tracked Automatically -**Model adapters** track every `chat()` call — input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. +**Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. 
Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures — memory steps, response metadata, message annotations, or execution logs respectively. +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -35,7 +35,7 @@ model.chat([{"role": "user", "content": "How are you?"}]) # Accumulated usage across both calls usage = model.gather_usage() print(f"{usage.input_tokens} in, {usage.output_tokens} out") -print(f"Cost: ${usage.cost}") # None if no cost calculator configured +print(f"Cost: ${usage.cost}") # $0.0 if no cost calculator configured ``` ### Reading Agent Usage @@ -53,7 +53,7 @@ usage = adapter.gather_usage() print(f"{usage.input_tokens} in, {usage.output_tokens} out") ``` -This works across all supported frameworks — smolagents, CAMEL, LangGraph, and LlamaIndex. The adapter handles the framework-specific extraction; you always call `gather_usage()`. +This works across all supported frameworks (smolagents, CAMEL, LangGraph, and LlamaIndex). The adapter handles the framework-specific extraction; you always call `gather_usage()`. ### In Benchmarks @@ -79,7 +79,7 @@ Most LLM APIs return token counts but not cost. MASEval provides two built-in co ### Quick Start: LiteLLM Pricing -The easiest path — uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: +The easiest path. 
Uses LiteLLM's [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) covering OpenAI, Anthropic, Google, Mistral, and many more: ```python from maseval.interface.usage import LiteLLMCostCalculator @@ -121,7 +121,7 @@ calculator = LiteLLMCostCalculator(custom_pricing={ ### Manual Pricing -When you know your rates, use `StaticPricingCalculator` — zero dependencies, fully explicit: +When you know your rates, use `StaticPricingCalculator`. Zero dependencies, fully explicit: ```python from maseval import StaticPricingCalculator @@ -144,7 +144,7 @@ calculator = StaticPricingCalculator({ }) ``` -The cost unit is whatever your pricing represents — USD, EUR, university credits: +The cost unit is whatever your pricing represents (USD, EUR, university credits): ```python calculator = StaticPricingCalculator({ @@ -154,7 +154,7 @@ calculator = StaticPricingCalculator({ ### Sharing a Calculator Across Models -A single calculator instance works for multiple model adapters — the `model_id` is passed on each cost computation: +A single calculator instance works for multiple model adapters. The `model_id` is passed on each cost computation: ```python calculator = StaticPricingCalculator({ @@ -170,13 +170,13 @@ model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cos When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: -1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. -2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. -3. **None** — if neither source provides cost, `usage.cost` stays `None`. +1. **Provider-reported cost**: e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +2. **CostCalculator**: if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. +3. 
**Zero**: if neither source provides cost, `usage.cost` stays `0.0`. ### Writing a Custom Calculator -Implement the `CostCalculator` protocol — a single method: +Implement the `CostCalculator` protocol (a single method): ```python from maseval import CostCalculator, TokenUsage @@ -218,6 +218,64 @@ for task_id, usage in reporter.by_task().items(): summary = reporter.summary() ``` +## How Usage Addition Works + +`Usage` records can be added together with `+` or `sum()`. Understanding how fields combine helps you interpret aggregated results. + +### Cost + +`cost` defaults to `0.0`. Addition is straightforward numeric addition: + +```python +from maseval import Usage + +a = Usage(cost=0.05) +b = Usage(cost=0.03) +a + b # cost=0.08 + +# Components without cost tracking default to 0.0, so they don't affect the total +agent_usage = Usage() # cost=0.0 (default) +model_usage = Usage(cost=0.12) +agent_usage + model_usage # cost=0.12 + +# sum() works with Usage() as the starting value +records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)] +sum(records, Usage()) # cost=0.35 +``` + +### Units + +`units` dicts are merged by key. Matching keys are summed, new keys are added: + +```python +a = Usage(units={"api_calls": 3, "data_points": 100}) +b = Usage(units={"api_calls": 2, "images": 5}) +total = a + b +# total.units == {"api_calls": 5, "data_points": 100, "images": 5} +``` + +### Grouping Fields + +`provider`, `category`, `component_name`, and `kind` track where a record came from. 
When two records are added: + +- **Same value** → preserved +- **Different values** → becomes `None` (meaning "aggregated across multiple") + +```python +a = Usage(cost=0.05, provider="openai", kind="llm") +b = Usage(cost=0.03, provider="openai", kind="llm") +total = a + b +# total.provider == "openai" (both match) +# total.kind == "llm" (both match) + +c = Usage(cost=0.10, provider="anthropic", kind="llm") +mixed = a + c +# mixed.provider is None (openai ≠ anthropic → aggregated over) +# mixed.kind == "llm" (both match) +``` + +This lets you tell at a glance whether a summed record came from one source or many. + ## Tracking Non-LLM Resources Tools, environments, and other components can track arbitrary usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`. Here's an example for a paid API: @@ -246,7 +304,7 @@ class BloombergEnvironment(Environment, UsageTrackableMixin): return sum(self._usage_records, Usage()) ``` -Non-LLM components set cost directly — there is no calculator involvement. Each component knows its own billing model. +Non-LLM components set cost directly. There is no calculator involvement; each component knows its own billing model. 
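The `sum(self._usage_records, Usage())` pattern above works because of the addition rules described earlier. Here is a standalone sketch of those rules — a deliberately simplified reimplementation for illustration (MASEval's actual `Usage` has more fields and handles tokens as well):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Usage:
    """Simplified Usage record: costs sum, units merge, grouping collapses."""
    cost: float = 0.0
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None

    def __add__(self, other: "Usage") -> "Usage":
        # Units merge by key: matching keys sum, new keys are added.
        merged = dict(self.units)
        for key, value in other.units.items():
            merged[key] = merged.get(key, 0) + value
        # Grouping fields survive only when both sides agree.
        provider = self.provider if self.provider == other.provider else None
        # Costs add numerically; the 0.0 default keeps Usage() harmless as a term.
        return Usage(cost=self.cost + other.cost, units=merged, provider=provider)


a = Usage(cost=0.05, units={"api_calls": 3}, provider="openai")
b = Usage(cost=0.03, units={"api_calls": 2, "bytes": 1024}, provider="openai")
c = Usage(cost=0.10, provider="anthropic")

print((a + b).units)     # {'api_calls': 5, 'bytes': 1024}
print((a + b).provider)  # openai
print((a + c).provider)  # None (mixed providers)
```

One simplification to be aware of: this sketch collapses `provider` to `None` whenever the two sides differ, including when one side is an empty `Usage()` start value.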
## Evaluator Usage diff --git a/maseval/core/model.py b/maseval/core/model.py index e8bb0a3f..86399d7f 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -308,12 +308,15 @@ def chat( # Record token usage if available if result.usage: - cost = result.usage.get("cost") if isinstance(result.usage.get("cost"), (int, float)) else None + raw_cost = result.usage.get("cost") + cost = raw_cost if isinstance(raw_cost, (int, float)) else 0.0 token_usage = TokenUsage.from_chat_response_usage(result.usage, cost=cost, kind="llm") # If no provider-reported cost, try the cost calculator - if token_usage.cost is None and self._cost_calculator is not None: - token_usage.cost = self._cost_calculator.calculate_cost(token_usage, self.model_id) + if token_usage.cost == 0.0 and self._cost_calculator is not None: + calculated = self._cost_calculator.calculate_cost(token_usage, self.model_id) + if calculated is not None: + token_usage.cost = calculated self._usage_records.append(token_usage) @@ -398,10 +401,10 @@ def gather_usage(self) -> Usage: """Gather accumulated token usage from all chat calls. Returns: - Summed TokenUsage across all calls, or empty Usage if no calls were made. + Summed TokenUsage across all calls, or empty TokenUsage if no calls were made. """ if not self._usage_records: - return Usage() + return TokenUsage() result = self._usage_records[0] for record in self._usage_records[1:]: result = result + record diff --git a/maseval/core/registry.py b/maseval/core/registry.py index cf2193e8..d5e3813c 100644 --- a/maseval/core/registry.py +++ b/maseval/core/registry.py @@ -320,13 +320,8 @@ def collect_usage(self) -> Dict[str, Any]: usage[category][comp_name] = usage_dict # Accumulate into persistent aggregates (thread-safe). - # _usage_total starts as Usage(cost=None); adding to it would - # poison the cost (None + X = None). Assign directly on first use. 
with self._usage_lock: - if self._usage_total.cost is None and not self._usage_total.units: - self._usage_total = component_usage - else: - self._usage_total = self._usage_total + component_usage + self._usage_total = self._usage_total + component_usage if key in self._usage_by_component: self._usage_by_component[key] = self._usage_by_component[key] + component_usage else: diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 2a53931c..651e4d6d 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -12,10 +12,13 @@ inherit `UsageTrackableMixin` have their usage automatically collected by the registry via `gather_usage()`. -Cost calculators are optional — if no calculator is provided to a -``ModelAdapter``, cost is only set when the provider reports it directly -(e.g., LiteLLM's ``response._hidden_params.response_cost``). For automatic -pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. +``Usage.cost`` defaults to ``0.0``, so ``Usage()`` works as a starting value +for accumulation (e.g., ``sum(records, Usage())``). Cost calculators are +optional — if no calculator is provided to a ``ModelAdapter``, cost stays +at ``0.0`` unless the provider reports it directly (e.g., LiteLLM's +``response._hidden_params.response_cost``). +For automatic pricing via LiteLLM's bundled model database, see +``maseval.interface.usage``. """ from __future__ import annotations @@ -29,13 +32,26 @@ class Usage: """Generic usage record for any billable resource. Represents accumulated cost and countable units for a component or - aggregated group. Grouping fields (`provider`, `category`, - `component_name`, `kind`) identify what scope the record covers. - When two records are summed, matching grouping fields are preserved; - mismatches become `None` (meaning "aggregated over"). + aggregated group. All fields default to zero, so ``Usage()`` can be + used as a starting value for accumulation with ``+`` and ``sum()``. 
+ + Note: + ``cost`` defaults to ``0.0``. This means adding a ``Usage()`` + to another record never changes the cost: + ``Usage() + Usage(cost=0.05)`` gives ``cost=0.05``. + Components that track cost start at ``0.0`` and accumulate upward. + Components that *do not* track cost (e.g., agent adapters that only + count tokens) also default to ``0.0`` — their cost simply has no + effect when summed with components that do report cost. + + Grouping fields (``provider``, ``category``, ``component_name``, ``kind``) + identify what scope the record covers. When two records are summed, + matching grouping fields are preserved; mismatches become ``None`` + (meaning "aggregated over"). Attributes: - cost: Total cost in USD. `None` means unknown/not reported. + cost: Total cost in USD (or whatever unit your calculator uses). + Defaults to ``0.0``. units: Arbitrary countable units (e.g., ``{"api_calls": 3}``). provider: Provider identifier (e.g., ``"anthropic"``, ``"bloomberg"``). category: Registry category (e.g., ``"models"``, ``"tools"``). 
@@ -52,14 +68,21 @@ class Usage: assert total.units == {"api_calls": 3} assert total.provider == "bloomberg" - # Mismatched fields become None + # Usage() is the zero element + assert (usage + Usage()).cost == 0.05 + + # Accumulate with sum() + records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)] + assert sum(records, Usage()).cost == 0.35 + + # Mismatched grouping fields become None mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm") assert mixed.provider is None # aggregated over assert mixed.kind is None # aggregated over ``` """ - cost: Optional[float] = None + cost: float = 0.0 units: Dict[str, int | float] = field(default_factory=dict) provider: Optional[str] = None category: Optional[str] = None @@ -70,11 +93,7 @@ def __add__(self, other: Usage) -> Usage: if not isinstance(other, Usage): return NotImplemented - # Sum costs: both known -> sum, either unknown -> None - if self.cost is not None and other.cost is not None: - cost = self.cost + other.cost - else: - cost = None + cost = self.cost + other.cost # Sum units units: Dict[str, int | float] = dict(self.units) @@ -212,7 +231,7 @@ def from_chat_response_usage( cls, usage_dict: Dict[str, int], *, - cost: Optional[float] = None, + cost: float = 0.0, provider: Optional[str] = None, category: Optional[str] = None, component_name: Optional[str] = None, @@ -224,7 +243,7 @@ def from_chat_response_usage( Args: usage_dict: The usage dict from ``ChatResponse.usage``. - cost: Cost in USD if known (e.g., from provider-reported cost). + cost: Cost in USD (e.g., from provider-reported cost). Defaults to ``0.0``. provider: Provider identifier. category: Registry category. component_name: Component name. 
@@ -507,7 +526,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: has_tokens = "input_tokens" in d if has_tokens: return TokenUsage( - cost=d.get("cost"), + cost=d.get("cost", 0.0), units=d.get("units", {}), provider=d.get("provider"), category=d.get("category"), @@ -522,7 +541,7 @@ def _usage_from_dict(d: Dict[str, Any]) -> Usage: audio_tokens=d.get("audio_tokens", 0), ) return Usage( - cost=d.get("cost"), + cost=d.get("cost", 0.0), units=d.get("units", {}), provider=d.get("provider"), category=d.get("category"), diff --git a/tests/test_core/test_usage.py b/tests/test_core/test_usage.py index 5348585b..383aa0f1 100644 --- a/tests/test_core/test_usage.py +++ b/tests/test_core/test_usage.py @@ -39,7 +39,7 @@ def test_from_chat_response_basic(self): assert tu.cache_creation_input_tokens == 0 assert tu.reasoning_tokens == 0 assert tu.audio_tokens == 0 - assert tu.cost is None + assert tu.cost == 0.0 def test_from_chat_response_all_fields(self): """All optional fields are mapped when present.""" @@ -148,18 +148,26 @@ def test_sum_multiple(self): assert total.output_tokens == 30 assert total.total_tokens == 90 - def test_none_cost_propagates(self): - """If either cost is None, sum cost is None.""" + def test_zero_cost_preserves_known(self): + """Adding a zero-cost usage preserves the known cost.""" a = TokenUsage(cost=0.10, input_tokens=100, output_tokens=50, total_tokens=150) - b = TokenUsage(cost=None, input_tokens=200, output_tokens=30, total_tokens=230) + b = TokenUsage(input_tokens=200, output_tokens=30, total_tokens=230) total = a + b - assert total.cost is None + assert total.cost == pytest.approx(0.10) # Token fields still sum correctly assert isinstance(total, TokenUsage) assert total.input_tokens == 300 assert total.output_tokens == 80 + def test_both_zero_cost_stays_zero(self): + """Summing two zero-cost usages gives zero cost.""" + a = TokenUsage(input_tokens=100, output_tokens=50, total_tokens=150) + b = TokenUsage(input_tokens=200, output_tokens=30, 
total_tokens=230) + total = a + b + + assert total.cost == 0.0 + def test_grouping_fields_match(self): """Matching grouping fields are preserved.""" a = TokenUsage(cost=0.10, provider="anthropic", kind="llm", input_tokens=100, output_tokens=50, total_tokens=150) @@ -546,7 +554,7 @@ def test_pipeline_no_calculator_no_provider_cost(self): assert isinstance(total, TokenUsage) assert total.input_tokens == 100 - assert total.cost is None + assert total.cost == 0.0 def test_pipeline_with_cached_tokens(self): """Pipeline correctly handles cached tokens in cost calculation. @@ -694,8 +702,7 @@ def test_empty_reports(self): reporter = UsageReporter.from_reports([]) total = reporter.total() - # Empty reports return a plain Usage with no cost - assert total.cost is None + assert total.cost == 0.0 assert isinstance(total, Usage) def test_skips_error_reports(self): @@ -708,7 +715,7 @@ def test_skips_error_reports(self): ] reporter = UsageReporter.from_reports(reports) total = reporter.total() - assert total.cost is None + assert total.cost == 0.0 assert isinstance(total, Usage) def test_by_task_accumulates_repeats(self): diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index 86be5722..af5726f7 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -1127,7 +1127,7 @@ def test_camel_adapter_gather_usage_no_responses(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_camel_adapter_gather_usage_responses_without_usage(): @@ -1144,7 +1144,7 @@ def test_camel_adapter_gather_usage_responses_without_usage(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # 
============================================================================= diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index b39c4791..228d9758 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -361,7 +361,7 @@ def agent_node(state: State) -> State: usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_langgraph_adapter_gather_usage_before_run(): @@ -375,4 +375,4 @@ def test_langgraph_adapter_gather_usage_before_run(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index 49998292..b738f7d3 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -451,7 +451,7 @@ def test_llamaindex_adapter_gather_usage_no_logs(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 def test_llamaindex_adapter_gather_usage_logs_without_tokens(): @@ -476,7 +476,7 @@ def test_llamaindex_adapter_gather_usage_logs_without_tokens(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # ============================================================================= diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index cc605fa0..9b8eaba4 100644 --- 
a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -525,7 +525,7 @@ def test_smolagents_adapter_gather_usage_no_steps(): usage = adapter.gather_usage() assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 assert usage.input_tokens == 0 if hasattr(usage, "input_tokens") else True @@ -557,7 +557,7 @@ def test_smolagents_adapter_gather_usage_steps_without_token_usage(): # Should return plain Usage (not TokenUsage) since no usage data assert isinstance(usage, Usage) - assert usage.cost is None + assert usage.cost == 0.0 # ============================================================================= From f1565b55078767dfc97223a03d04c883dc401c43 Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:40:15 +0100 Subject: [PATCH 13/19] updated example --- .../five_a_day_benchmark.ipynb | 2 +- .../five_a_day_benchmark.py | 39 +++++++++++++++---- 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb index 2e4a6375..457c9a6b 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb @@ -660,7 +660,7 @@ { "cell_type": "code", "id": "amrylkbxkb7", - "source": "from maseval import UsageReporter\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total cost: {f'${total.cost:.6f}' if total.cost is not None else 'N/A (no cost calculator)'}\")\nprint(f\" Total units: {dict(total.units) if total.units else '{}'}\")\nprint()\n\n# Per-component breakdown\nprint(\"Per-Component Breakdown\")\nprint(\"-\" * 60)\nfor component_key, usage in benchmark.usage_by_component.items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else 
\"N/A\"\n units_str = dict(usage.units) if usage.units else \"\"\n print(f\" {component_key:<35} cost={cost_str} units={units_str}\")\nprint()\n\n# --- Post-hoc analysis with UsageReporter ---\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n cost_str = f\"${usage.cost:.6f}\" if usage.cost is not None else \"N/A\"\n print(f\" {task_id:<35} cost={cost_str}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", + "source": "from collections import defaultdict\nfrom maseval import UsageReporter, TokenUsage\n\n\ndef _fmt_usage(usage):\n \"\"\"Format a Usage record for display.\"\"\"\n parts = [f\"cost=${usage.cost:.6f}\"]\n if isinstance(usage, TokenUsage):\n parts.append(f\"in={usage.input_tokens} out={usage.output_tokens}\")\n if usage.units:\n parts.append(f\"units={dict(usage.units)}\")\n return \" \".join(parts)\n\n\n# --- Live totals (available during and after execution) ---\nprint(\"Live Usage Totals\")\nprint(\"=\" * 60)\ntotal = benchmark.usage\nprint(f\" Total: {_fmt_usage(total)}\")\n\n# Group components by category\nby_category = defaultdict(dict)\nfor key, usage in benchmark.usage_by_component.items():\n category, name = key.split(\":\", 1)\n by_category[category][name] = usage\n\nfor category in [\"agents\", \"models\", \"tools\", \"simulators\", \"callbacks\"]:\n if category not in by_category:\n continue\n print(f\"\\n{category.capitalize()}:\")\n for name, usage in by_category[category].items():\n print(f\" {name:<35} {_fmt_usage(usage)}\")\n\n# Print any remaining categories not in the standard list\nfor category, components in by_category.items():\n if category in {\"agents\", \"models\", \"tools\", \"simulators\", \"callbacks\"}:\n continue\n print(f\"\\n{category.capitalize()}:\")\n for name, usage in components.items():\n print(f\" {name:<35} {_fmt_usage(usage)}\")\n\n# --- Post-hoc 
analysis with UsageReporter ---\nprint()\nreporter = UsageReporter.from_reports(results)\n\nprint(\"Per-Task Usage\")\nprint(\"-\" * 60)\nfor task_id, usage in reporter.by_task().items():\n print(f\" {task_id:<35} {_fmt_usage(usage)}\")\n\nprint()\nprint(\"Summary dict (for JSON export):\")\nprint(json.dumps(reporter.summary(), indent=2, default=str))", "metadata": {}, "execution_count": null, "outputs": [] diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index a3972a9c..0b0e78de 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -961,24 +961,49 @@ def load_benchmark_data( results = benchmark.run(tasks=tasks, agent_data=agent_configs) # --- Usage summary --- + from collections import defaultdict + from maseval import TokenUsage + + def _fmt_usage(usage): + parts = [f"cost=${usage.cost:.6f}"] + if isinstance(usage, TokenUsage): + parts.append(f"in={usage.input_tokens} out={usage.output_tokens}") + if usage.units: + parts.append(f"units={dict(usage.units)}") + return " ".join(parts) + print("\n--- Usage Summary ---") total = benchmark.usage - cost_str = f"${total.cost:.6f}" if total.cost is not None else "N/A (no cost calculator)" - print(f"Total cost: {cost_str}") + print(f"Total: {_fmt_usage(total)}") + # Group components by category if benchmark.usage_by_component: - print("\nPer-component:") + by_category: dict[str, dict[str, object]] = defaultdict(dict) for key, usage in benchmark.usage_by_component.items(): - c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" - print(f" {key:<35} cost={c} units={dict(usage.units) if usage.units else '{}'}") + category, name = key.split(":", 1) + by_category[category][name] = usage + + for category in ["agents", "models", "tools", "simulators", "callbacks"]: + if category not in by_category: + continue + print(f"\n{category.capitalize()}:") + for name, usage in 
by_category[category].items(): + print(f" {name:<35} {_fmt_usage(usage)}") + + # Print any remaining categories not in the standard list + for category, components in by_category.items(): + if category in {"agents", "models", "tools", "simulators", "callbacks"}: + continue + print(f"\n{category.capitalize()}:") + for name, usage in components.items(): + print(f" {name:<35} {_fmt_usage(usage)}") reporter = UsageReporter.from_reports(results) by_task = reporter.by_task() if by_task: print("\nPer-task:") for task_id, usage in by_task.items(): - c = f"${usage.cost:.6f}" if usage.cost is not None else "N/A" - print(f" {task_id:<35} cost={c}") + print(f" {task_id:<35} {_fmt_usage(usage)}") print("\n--- Benchmark Complete ---") print(f"Total tasks: {len(tasks)}") From 15687d9fdf3238a7c21df62272245a437769692f Mon Sep 17 00:00:00 2001 From: cemde Date: Sun, 15 Mar 2026 21:41:49 +0100 Subject: [PATCH 14/19] added dependency for restricted python --- pyproject.toml | 1 + uv.lock | 12 ++++++++++++ 2 files changed, 13 insertions(+) diff --git a/pyproject.toml b/pyproject.toml index 51227d46..a352a908 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -121,6 +121,7 @@ examples = [ "ipykernel>=6.0.0", "ipywidgets>=8.0.0", "accelerate>=1.11.0", + "RestrictedPython>=7.0", ] # Complete installation with absolutely everything (uses self-reference for DRY) diff --git a/uv.lock b/uv.lock index 689debe3..229b4bc8 100644 --- a/uv.lock +++ b/uv.lock @@ -3490,6 +3490,7 @@ all = [ { name = "python-dotenv" }, { name = "pyyaml" }, { name = "requests" }, + { name = "restrictedpython" }, { name = "ruamel-yaml" }, { name = "scikit-learn", version = "1.7.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "scikit-learn", version = "1.8.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, @@ -3567,6 +3568,7 @@ examples = [ { name = "meta-agents-research-environments" }, { name = "openai" }, { name 
= "python-dotenv" }, + { name = "restrictedpython" }, { name = "scikit-learn", version = "1.7.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "scikit-learn", version = "1.8.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, { name = "scipy", version = "1.15.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, @@ -3712,6 +3714,7 @@ requires-dist = [ { name = "python-dotenv", marker = "extra == 'examples'", specifier = ">=1.0.0" }, { name = "pyyaml", marker = "extra == 'multiagentbench'", specifier = ">=6.0" }, { name = "requests", marker = "extra == 'multiagentbench'", specifier = ">=2.28.0" }, + { name = "restrictedpython", marker = "extra == 'examples'", specifier = ">=7.0" }, { name = "rich", specifier = ">=14.1.0" }, { name = "ruamel-yaml", marker = "extra == 'multiagentbench'", specifier = ">=0.17.0" }, { name = "scikit-learn", marker = "extra == 'disco'", specifier = ">=1.7.2" }, @@ -6558,6 +6561,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/8e/67/afbb0978d5399bc9ea200f1d4489a23c9a1dad4eee6376242b8182389c79/respx-0.22.0-py2.py3-none-any.whl", hash = "sha256:631128d4c9aba15e56903fb5f66fb1eff412ce28dd387ca3a81339e52dbd3ad0", size = 25127, upload-time = "2024-12-19T22:33:57.837Z" }, ] +[[package]] +name = "restrictedpython" +version = "8.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/5f/1c/aec08bcb4ab14a1521579fbe21ceff2a634bb1f737f11cf7f9c8bb96e680/restrictedpython-8.1.tar.gz", hash = "sha256:4a69304aceacf6bee74bdf153c728221d4e3109b39acbfe00b3494927080d898", size = 838331, upload-time = "2025-10-19T14:11:32.531Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/1a/c0/3848f4006f7e164ee20833ca984067e4b3fc99fe7f1dfa88b4927e681299/restrictedpython-8.1-py3-none-any.whl", hash = 
"sha256:4769449c6cdb10f2071649ba386902befff0eff2a8fd6217989fa7b16aeae926", size = 27651, upload-time = "2025-10-19T14:11:30.201Z" }, +] + [[package]] name = "rfc3339-validator" version = "0.1.4" From 4b235d45ece5a1d2dc3fc01f8d5b34b711f3096d Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 01:32:17 +0100 Subject: [PATCH 15/19] fixed cost tracking of agent frameworks --- CHANGELOG.md | 1 + docs/guides/usage-tracking.md | 47 +- .../five_a_day_benchmark.py | 2 +- maseval/core/agent.py | 99 +++- maseval/interface/agents/camel.py | 43 +- maseval/interface/agents/langgraph.py | 43 +- maseval/interface/agents/llamaindex.py | 45 +- maseval/interface/agents/smolagents.py | 45 +- .../test_camel_integration.py | 65 +++ .../test_langgraph_integration.py | 74 +++ .../test_llamaindex_integration.py | 47 ++ .../test_smolagents_integration.py | 111 ++++ usage_tracking/PLAN.md | 372 ------------- usage_tracking/api_usage_results.json | 523 ------------------ usage_tracking/api_usage_test.py | 154 ------ 15 files changed, 606 insertions(+), 1065 deletions(-) delete mode 100644 usage_tracking/PLAN.md delete mode 100644 usage_tracking/api_usage_results.json delete mode 100644 usage_tracking/api_usage_test.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 98a6f0d3..61e35d06 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Usage and cost tracking as a first-class collection axis alongside tracing and configuration. `Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) - Pluggable cost calculation via `CostCalculator` protocol. 
`StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) - `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) +- Cost calculation for agent adapters. `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both the model ID and cost calculator are auto-detected (model ID from the framework's agent object, calculator via `LiteLLMCostCalculator` if litellm is installed). For LangGraph, `model_id` must be passed explicitly since graphs can contain multiple models. Explicit `cost_calculator` and `model_id` always override auto-detection. (PR: #45) - `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) - Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) - `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. 
(PR: #45) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index b59005c4..b2adc5c8 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -16,7 +16,7 @@ MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) **Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). +**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -55,6 +55,51 @@ print(f"{usage.input_tokens} in, {usage.output_tokens} out") This works across all supported frameworks (smolagents, CAMEL, LangGraph, and LlamaIndex). The adapter handles the framework-specific extraction; you always call `gather_usage()`. +### Agent Cost Tracking + +Agent adapters auto-detect cost when possible. For smolagents, CAMEL, and LlamaIndex, the adapter reads the model ID from the framework's agent object and uses `LiteLLMCostCalculator` if litellm is installed. 
No configuration needed: + +```python +# Cost tracking works automatically if litellm is installed +adapter = SmolAgentAdapter(agent, name="researcher") +adapter.run("What's the capital of France?") +print(f"Cost: ${adapter.gather_usage().cost:.4f}") +``` + +For **LangGraph**, the model ID cannot be auto-detected because a graph can contain multiple models across its nodes. Pass `model_id` explicitly: + +```python +adapter = LangGraphAgentAdapter( + compiled_graph, "agent", + model_id="gpt-4o-mini", # Required for cost tracking +) +``` + +To override auto-detection or use custom pricing, pass `cost_calculator` and/or `model_id`: + +```python +from maseval import StaticPricingCalculator + +calculator = StaticPricingCalculator({ + "my-model": {"input": 0.001, "output": 0.002}, +}) + +adapter = SmolAgentAdapter( + agent, name="researcher", + cost_calculator=calculator, + model_id="my-model", +) +``` + +| Framework | Model ID | Cost Calculator | +|-----------|----------|-----------------| +| smolagents | Auto (`agent.model.model_id`) | Auto (`LiteLLMCostCalculator`) | +| CAMEL | Auto (`agent.model_backend.model_type`) | Auto (`LiteLLMCostCalculator`) | +| LlamaIndex | Auto (`agent.llm.metadata.model_name`) | Auto (`LiteLLMCostCalculator`) | +| LangGraph | **Manual** (`model_id=...`) | Auto (`LiteLLMCostCalculator`) | + +If litellm is not installed, auto-creation of the calculator is skipped and cost stays at `0.0`. Tokens are always tracked regardless. + ### In Benchmarks Usage is collected automatically alongside traces and configs after each task. 
Each report includes a `"usage"` key: diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index 0b0e78de..77bc00ee 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -967,7 +967,7 @@ def load_benchmark_data( def _fmt_usage(usage): parts = [f"cost=${usage.cost:.6f}"] if isinstance(usage, TokenUsage): - parts.append(f"in={usage.input_tokens} out={usage.output_tokens}") + parts.append(f"in={usage.input_tokens:>10,} out={usage.output_tokens:>10,}") if usage.units: parts.append(f"units={dict(usage.units)}") return " ".join(parts) diff --git a/maseval/core/agent.py b/maseval/core/agent.py index e481843c..19b68f0e 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -5,25 +5,56 @@ from .history import MessageHistory from .tracing import TraceableMixin from .config import ConfigurableMixin -from .usage import UsageTrackableMixin +from .usage import Usage, TokenUsage, UsageTrackableMixin, CostCalculator class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): """Wraps an agent from any framework to provide a standard interface. This Adapter provides: + - Unified execution interface via `run()` - Callback hooks for monitoring - Message history management via getter/setter - Framework-agnostic tracing + - Automatic cost calculation from token usage (when a cost calculator is available) + + Cost Tracking: + Agent adapters track token usage from the underlying framework. To also + compute cost, you can pass a ``cost_calculator`` and optionally a ``model_id``. + + Most framework adapters auto-detect both the model ID (from the framework's + agent object) and the cost calculator (using ``LiteLLMCostCalculator`` if + litellm is installed). This means cost tracking often works with zero + configuration. 
+ + To override or disable auto-detection, pass explicit values:: + + adapter = SmolAgentAdapter( + agent, name="researcher", + cost_calculator=StaticPricingCalculator({...}), + model_id="my-custom-model", + ) + + Pass ``cost_calculator=None`` explicitly to disable cost calculation + even when auto-detection would otherwise enable it. """ - def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[AgentCallback]] = None): + def __init__( + self, + agent_instance: Any, + name: str, + callbacks: Optional[List[AgentCallback]] = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, + ): self.agent = agent_instance self.name = name self.callbacks = callbacks or [] self.messages: Optional[MessageHistory] = None self.logs: List[Dict[str, Any]] = [] + self._cost_calculator = cost_calculator + self._model_id = model_id def run(self, query: str) -> Any: """Executes the agent and returns the result.""" @@ -105,6 +136,70 @@ def get_messages(self) -> MessageHistory: """ return self.messages if self.messages is not None else MessageHistory() + def gather_usage(self) -> Usage: + """Gather usage with automatic cost calculation. + + Calls ``_gather_usage()`` for raw token counts, then applies + the cost calculator if one is available and cost is still ``0.0``. + + The ``model_id`` used for cost calculation is resolved in order: + + 1. Explicit ``model_id`` passed to ``__init__`` + 2. Auto-detected from the framework agent via ``_resolve_model_id()`` + + Subclasses should override ``_gather_usage()`` (not this method) + to provide framework-specific token extraction. + + Returns: + Usage (or TokenUsage) with cost filled in when possible. 
+ """ + usage = self._gather_usage() + if isinstance(usage, TokenUsage) and usage.cost == 0.0: + calculator = self._resolve_cost_calculator() + if calculator is not None: + mid = self._model_id or self._resolve_model_id() + if mid: + cost = calculator.calculate_cost(usage, mid) + if cost is not None: + usage.cost = cost + return usage + + def _gather_usage(self) -> Usage: + """Gather raw token usage from the framework. + + Override this in subclasses to extract token counts from the + framework's native data structures. + + Returns: + Usage or TokenUsage with token counts (cost may be 0.0). + """ + return Usage() + + def _resolve_model_id(self) -> Optional[str]: + """Auto-detect the model ID from the framework agent. + + Override in subclasses to extract the model identifier from + the framework's agent object (e.g., ``self.agent.model.model_id`` + for smolagents). + + Returns: + Model ID string, or ``None`` if not detectable. + """ + return None + + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: + """Resolve the cost calculator to use. + + Returns the explicit calculator if one was provided, otherwise + returns ``None``. Framework-specific subclasses can override this + to auto-create a calculator (e.g., ``LiteLLMCostCalculator``) + when the required dependencies are available. + + Returns: + A CostCalculator, or ``None`` if cost calculation is not available. + """ + return self._cost_calculator + def gather_traces(self) -> Dict[str, Any]: """Gather execution traces from this agent. 
diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index b6ccebba..1c440687 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -175,7 +175,9 @@ class CamelAgentAdapter(AgentAdapter): camel-ai to be installed: `pip install maseval[camel]` """ - def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None): + def __init__( + self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, cost_calculator: Any = None, model_id: Optional[str] = None + ): """Initialize the CAMEL adapter. Note: We don't call super().__init__() to avoid initializing self.logs as a list, @@ -185,11 +187,19 @@ def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any] agent_instance: CAMEL ChatAgent instance name: Agent name for identification callbacks: Optional list of AgentCallback instances + cost_calculator: Optional cost calculator. If not provided, a + ``LiteLLMCostCalculator`` is created automatically when litellm + is available. + model_id: Optional model ID for cost calculation. If not provided, + auto-detected from ``agent.model_backend.model_type``. """ self.agent = agent_instance self.name = name self.callbacks = callbacks or [] self.messages = None + self._cost_calculator = cost_calculator + self._model_id = model_id + self._auto_calculator = None # Lazy-initialized # Store responses from each step() call self._responses: List[Any] = [] # Store errors that occur during execution (for comprehensive logging) @@ -419,7 +429,36 @@ def gather_traces(self) -> Dict[str, Any]: return base_traces - def gather_usage(self) -> Usage: + def _resolve_model_id(self): + """Auto-detect model ID from CAMEL agent. + + CAMEL's ChatAgent stores the model backend in ``model_backend`` + (or ``model`` for older versions). The backend has a ``model_type`` + enum whose ``.value`` is the model ID string. 
+        """
+        try:
+            backend = getattr(self.agent, "model_backend", None) or getattr(self.agent, "model", None)
+            if backend is not None and hasattr(backend, "model_type"):
+                model_type = backend.model_type
+                return model_type.value if hasattr(model_type, "value") else str(model_type)
+        except Exception:
+            pass
+        return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage across all CAMEL agent responses.
 
         Walks stored ``ChatAgentResponse`` objects and sums their
diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py
index 6749e549..a2e9fd06 100644
--- a/maseval/interface/agents/langgraph.py
+++ b/maseval/interface/agents/langgraph.py
@@ -117,19 +117,39 @@ def chatbot(state: MessagesState):
     langgraph to be installed: `pip install maseval[langgraph]`
     """
 
-    def __init__(self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, config: Optional[Dict[str, Any]] = None):
+    def __init__(
+        self,
+        agent_instance: Any,
+        name: str,
+        callbacks: Optional[List[Any]] = None,
+        config: Optional[Dict[str, Any]] = None,
+        cost_calculator: Any = None,
+        model_id: Optional[str] = None,
+    ):
         """Initialize the LangGraph adapter.

        Args:
            agent_instance: Compiled LangGraph graph
            name: Agent name
            callbacks: Optional list of callbacks
-            config: Optional LangGraph config dict (for stateful graphs with checkpointer)
-                Should include 'configurable': {'thread_id': '...'} for persistent state
+            config: Optional LangGraph config dict (for stateful graphs with checkpointer).
+                Should include ``'configurable': {'thread_id': '...'}`` for persistent state.
+            cost_calculator: Optional cost calculator. If not provided, a
+                ``LiteLLMCostCalculator`` is created automatically when litellm
+                is available.
+            model_id: Model ID for cost calculation. LangGraph graphs can contain
+                multiple models across nodes, so the model ID cannot be auto-detected.
+                Pass the primary model's ID here to enable cost tracking::
+
+                    LangGraphAgentAdapter(
+                        graph, "agent",
+                        model_id="gpt-4o-mini",
+                    )
        """
-        super().__init__(agent_instance, name, callbacks)
+        super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id)
         self._langgraph_config = config
         self._last_result = None
+        self._auto_calculator = None  # Lazy-initialized
 
     def get_messages(self) -> MessageHistory:
         """Get message history from LangGraph.
@@ -214,7 +234,20 @@ def gather_config(self) -> dict[str, Any]:
 
         return base_config
 
-    def gather_usage(self) -> Usage:
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage from LangGraph message metadata.
Walks messages from the last graph execution (or persistent state) diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index e0e063d4..5c1de402 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -118,6 +118,8 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, max_iterations: Optional[int] = None, + cost_calculator: Any = None, + model_id: Optional[str] = None, ): """Initialize the LlamaIndex adapter. @@ -131,11 +133,17 @@ def __init__( passing max_steps to it is silently swallowed by **kwargs. The actual iteration limit must be passed here so the adapter forwards it to AgentWorkflow.run(max_iterations=...). + cost_calculator: Optional cost calculator. If not provided, a + ``LiteLLMCostCalculator`` is created automatically when litellm + is available. + model_id: Optional model ID for cost calculation. If not provided, + auto-detected from ``agent.llm.metadata.model_name``. """ - super().__init__(agent_instance, name, callbacks) + super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id) self._last_result = None self._message_cache: List[Dict[str, Any]] = [] self._max_iterations = max_iterations + self._auto_calculator = None # Lazy-initialized def get_messages(self) -> MessageHistory: """Get message history from LlamaIndex. @@ -216,7 +224,40 @@ def gather_config(self) -> Dict[str, Any]: return base_config - def gather_usage(self) -> Usage: + def _resolve_model_id(self): + """Auto-detect model ID from LlamaIndex agent. + + LlamaIndex agents store their LLM in ``self.llm``, which has a + ``metadata`` property exposing ``LLMMetadata.model_name``. 
+        """
+        try:
+            return self.agent.llm.metadata.model_name
+        except AttributeError:
+            pass
+        # For AgentWorkflow, try the first agent's LLM
+        try:
+            if hasattr(self.agent, "agents"):
+                for agent in self.agent.agents:
+                    if hasattr(agent, "llm"):
+                        return agent.llm.metadata.model_name
+        except AttributeError:
+            pass
+        return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage from LlamaIndex execution logs.
 
         Sums token counts recorded in ``self.logs`` during agent execution.
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py
index 4fcfe1db..d0ae1905 100644
--- a/maseval/interface/agents/smolagents.py
+++ b/maseval/interface/agents/smolagents.py
@@ -4,7 +4,7 @@
     pip install maseval[smolagents]
 """
 
-from typing import TYPE_CHECKING, Any, Dict, List
+from typing import TYPE_CHECKING, Any, Dict, List, Optional
 
 from maseval import AgentAdapter, MessageHistory, LLMUser
 from maseval.core.usage import TokenUsage, Usage
@@ -102,16 +102,29 @@ class SmolAgentAdapter(AgentAdapter):
     smolagents to be installed: `pip install maseval[smolagents]`
     """
 
-    def __init__(self, agent_instance, name: str, callbacks=None):
+    def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_calculator: Any = None, model_id: Optional[str] = None):
         """Initialize the Smolagent adapter.

        Note: We don't call super().__init__() to avoid initializing self.logs as a list,
        since we override it as a property that dynamically fetches from agent.memory.
+
+        Args:
+            agent_instance: smolagents MultiStepAgent or similar
+            name: Agent name for identification
+            callbacks: Optional list of AgentCallback instances
+            cost_calculator: Optional cost calculator. If not provided, a
+                ``LiteLLMCostCalculator`` is created automatically when litellm
+                is available.
+            model_id: Optional model ID for cost calculation. If not provided,
+                auto-detected from ``agent.model.model_id``.
        """
        self.agent = agent_instance
        self.name = name
        self.callbacks = callbacks or []
        self.messages = None
+        self._cost_calculator = cost_calculator
+        self._model_id = model_id
+        self._auto_calculator = None  # Lazy-initialized
 
     @property
     def logs(self) -> List[Dict[str, Any]]:  # type: ignore[override]
@@ -320,7 +333,33 @@ def gather_traces(self) -> dict:
 
         return base_logs
 
-    def gather_usage(self) -> Usage:
+    def _resolve_model_id(self):
+        """Auto-detect model ID from smolagents agent.
+
+        All smolagents model classes (LiteLLMModel, OpenAIServerModel,
+        TransformersModel, etc.) inherit from ``Model`` which stores
+        ``model_id`` on the instance.
+        """
+        try:
+            return self.agent.model.model_id
+        except AttributeError:
+            return None
+
+    def _resolve_cost_calculator(self):
+        """Return the cost calculator, auto-creating one if litellm is available."""
+        if self._cost_calculator is not None:
+            return self._cost_calculator
+        # Lazy auto-create: try LiteLLMCostCalculator once
+        if self._auto_calculator is None:
+            try:
+                from maseval.interface.usage import LiteLLMCostCalculator
+
+                self._auto_calculator = LiteLLMCostCalculator()
+            except Exception:
+                self._auto_calculator = False  # Sentinel: don't retry
+        return self._auto_calculator if self._auto_calculator is not False else None
+
+    def _gather_usage(self) -> Usage:
         """Gather aggregated token usage across all agent steps.
Walks smolagents' memory steps (ActionStep and PlanningStep) and sums diff --git a/tests/test_interface/test_agent_integration/test_camel_integration.py b/tests/test_interface/test_agent_integration/test_camel_integration.py index af5726f7..263c6b66 100644 --- a/tests/test_interface/test_agent_integration/test_camel_integration.py +++ b/tests/test_interface/test_agent_integration/test_camel_integration.py @@ -1228,6 +1228,71 @@ def test_e2e_camel_gather_usage_empty_before_run(): assert usage_after.output_tokens > 0 +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_camel_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock() + mock_response = Mock() + mock_response.info = {"usage": {"prompt_tokens": 1000, "completion_tokens": 500}} + mock_response.terminated = False + mock_response.msgs = [Mock(content="response")] + + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + adapter._responses = [mock_response] + + calculator = StaticPricingCalculator({"gpt-4o-mini": {"input": 0.00001, "output": 0.00002}}) + adapter._cost_calculator = calculator + adapter._model_id = "gpt-4o-mini" + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00001 + 500 * 0.00002) + + +def test_camel_adapter_resolve_model_id(): + """Test that _resolve_model_id() reads from agent.model_backend.model_type.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.model_backend.model_type.value = "gpt-4o-mini" + + 
adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() == "gpt-4o-mini" + + +def test_camel_adapter_cost_auto_detect_model_id(): + """Test that cost calculation works with auto-detected model_id.""" + from maseval.interface.agents.camel import CamelAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.model_backend.model_type.value = "gpt-4o" + + mock_response = Mock() + mock_response.info = {"usage": {"prompt_tokens": 100, "completion_tokens": 50}} + mock_response.terminated = False + + adapter = CamelAgentAdapter(agent_instance=mock_agent, name="test") + adapter._responses = [mock_response] + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.001, "output": 0.002}}) + adapter._cost_calculator = calculator + # model_id not set — should be auto-detected + + usage = adapter.gather_usage() + assert usage.cost == pytest.approx(100 * 0.001 + 50 * 0.002) + + def test_e2e_camel_logs_contain_usage(): """Verify adapter.logs also contain usage data from real execution.""" from camel.agents import ChatAgent diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index 228d9758..4a4d5e1b 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -376,3 +376,77 @@ def test_langgraph_adapter_gather_usage_before_run(): assert isinstance(usage, Usage) assert usage.cost == 0.0 + + +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_langgraph_adapter_cost_with_explicit_model_id(): + """Test that passing model_id + calculator computes cost for LangGraph.""" + from 
maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata(input_tokens=1000, output_tokens=500, total_tokens=1500), + ) + return {"messages": state["messages"] + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.00001, "output": 0.00002}}) + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test", model_id="gpt-4o", cost_calculator=calculator) + adapter.run("Test") + + usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00001 + 500 * 0.00002) + + +def test_langgraph_adapter_no_cost_without_model_id(): + """Test that LangGraph adapter cannot auto-detect model_id (by design).""" + from maseval.interface.agents.langgraph import LangGraphAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from langgraph.graph import StateGraph, END + from typing_extensions import TypedDict + from langchain_core.messages import AIMessage + from langchain_core.messages.ai import UsageMetadata + + class State(TypedDict): + messages: list + + def agent_node(state: State) -> State: + response = AIMessage( + content="Response", + usage_metadata=UsageMetadata(input_tokens=100, output_tokens=50, total_tokens=150), + ) + return {"messages": state["messages"] + [response]} + + graph = StateGraph(State) # type: ignore[arg-type] + 
graph.add_node("agent", agent_node) + graph.set_entry_point("agent") + graph.add_edge("agent", END) + compiled = graph.compile() + + calculator = StaticPricingCalculator({"gpt-4o": {"input": 0.001, "output": 0.002}}) + # No model_id passed — cost should stay 0.0 despite calculator being available + adapter = LangGraphAgentAdapter(agent_instance=compiled, name="test", cost_calculator=calculator) + adapter.run("Test") + + usage = adapter.gather_usage() + assert usage.cost == 0.0 + assert usage.input_tokens == 100 diff --git a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py index b738f7d3..cc9c701b 100644 --- a/tests/test_interface/test_agent_integration/test_llamaindex_integration.py +++ b/tests/test_interface/test_agent_integration/test_llamaindex_integration.py @@ -600,3 +600,50 @@ def test_e2e_llamaindex_logs_populated_by_real_execution(): assert log["total_tokens"] == 150 assert "timestamp" in log assert "duration_seconds" in log + + +# ============================================================================= +# Cost Calculation Tests +# ============================================================================= + + +def test_llamaindex_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from unittest.mock import Mock + + mock_agent = Mock(spec=[]) + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + # Simulate logs from a run + adapter.logs = [{"input_tokens": 1000, "output_tokens": 500, "status": "success"}] + + calculator = StaticPricingCalculator({"gpt-4": {"input": 0.00003, "output": 0.00006}}) + adapter._cost_calculator = calculator + adapter._model_id = "gpt-4" + + usage = adapter.gather_usage() + 
assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == pytest.approx(1000 * 0.00003 + 500 * 0.00006) + + +def test_llamaindex_adapter_resolve_model_id(): + """Test that _resolve_model_id() reads from agent.llm.metadata.model_name.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock() + mock_agent.llm.metadata.model_name = "gpt-4o-mini" + + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() == "gpt-4o-mini" + + +def test_llamaindex_adapter_resolve_model_id_missing(): + """Test that _resolve_model_id() returns None when agent has no llm.""" + from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter + from unittest.mock import Mock + + mock_agent = Mock(spec=[]) + adapter = LlamaIndexAgentAdapter(agent_instance=mock_agent, name="test") + assert adapter._resolve_model_id() is None diff --git a/tests/test_interface/test_agent_integration/test_smolagents_integration.py b/tests/test_interface/test_agent_integration/test_smolagents_integration.py index 9b8eaba4..562107ac 100644 --- a/tests/test_interface/test_agent_integration/test_smolagents_integration.py +++ b/tests/test_interface/test_agent_integration/test_smolagents_integration.py @@ -710,3 +710,114 @@ def test_e2e_smolagents_gather_usage_empty_before_run(): assert isinstance(usage_after, MasevalTokenUsage) assert usage_after.input_tokens > 0 assert usage_after.output_tokens > 0 + + +# --- Cost calculation tests --- + + +def test_smolagents_adapter_cost_with_explicit_calculator(): + """Test that passing a cost_calculator computes cost from token usage.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator + from smolagents.memory import ActionStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from unittest.mock import Mock + 
import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + mock_agent.model.model_id = "gpt-4o-mini" + + start = time.time() + step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[]) + step.token_usage = TokenUsage(input_tokens=1000, output_tokens=500) + mock_agent.memory.steps.append(step) + + calculator = StaticPricingCalculator({"gpt-4o-mini": {"input": 0.00001, "output": 0.00002}}) + + adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test_agent", cost_calculator=calculator) + usage = adapter.gather_usage() + + assert isinstance(usage, MasevalTokenUsage) + assert usage.input_tokens == 1000 + assert usage.output_tokens == 500 + # Cost = 1000 * 0.00001 + 500 * 0.00002 = 0.01 + 0.01 = 0.02 + assert usage.cost == pytest.approx(0.02) + + +def test_smolagents_adapter_cost_with_explicit_model_id(): + """Test that explicit model_id overrides auto-detected one.""" + from maseval.interface.agents.smolagents import SmolAgentAdapter + from maseval.core.usage import StaticPricingCalculator + from smolagents.memory import ActionStep, AgentMemory + from smolagents.monitoring import TokenUsage, Timing + from unittest.mock import Mock + import time + + mock_agent = Mock() + mock_agent.memory = AgentMemory(system_prompt="Test") + mock_agent.model.model_id = "wrong-model" # Auto-detected, but overridden + + start = time.time() + step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[]) + step.token_usage = TokenUsage(input_tokens=100, output_tokens=50) + mock_agent.memory.steps.append(step) + + calculator = StaticPricingCalculator({"my-model": {"input": 0.001, "output": 0.002}}) + + adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test", cost_calculator=calculator, model_id="my-model") + usage = adapter.gather_usage() + + # Should use "my-model" pricing, not "wrong-model" + assert usage.cost == pytest.approx(0.001 * 
100 + 0.002 * 50)
+
+
+def test_smolagents_adapter_resolve_model_id():
+    """Test that _resolve_model_id() reads from agent.model.model_id."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from unittest.mock import Mock
+
+    mock_agent = Mock()
+    mock_agent.model.model_id = "gpt-4o"
+    mock_agent.write_memory_to_messages = Mock(return_value=[])
+
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+    assert adapter._resolve_model_id() == "gpt-4o"
+
+
+def test_smolagents_adapter_resolve_model_id_missing():
+    """Test that _resolve_model_id() returns None when model has no model_id."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from unittest.mock import Mock
+
+    mock_agent = Mock(spec=[])  # No attributes at all
+    adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+    assert adapter._resolve_model_id() is None
+
+
+def test_smolagents_adapter_no_cost_without_calculator():
+    """Test that cost stays 0.0 when no calculator is available and auto-create fails."""
+    from maseval.interface.agents.smolagents import SmolAgentAdapter
+    from maseval.core.usage import TokenUsage as MasevalTokenUsage
+    from smolagents.memory import ActionStep, AgentMemory
+    from smolagents.monitoring import TokenUsage, Timing
+    from unittest.mock import Mock, patch
+    import time
+
+    mock_agent = Mock()
+    mock_agent.memory = AgentMemory(system_prompt="Test")
+    mock_agent.model.model_id = "some-model"
+
+    start = time.time()
+    step = ActionStep(step_number=1, timing=Timing(start_time=start, end_time=start + 0.5), observations_images=[])
+    step.token_usage = TokenUsage(input_tokens=100, output_tokens=50)
+    mock_agent.memory.steps.append(step)
+
+    # Force _resolve_cost_calculator to return None, as if litellm were not installed
+    with patch("maseval.interface.agents.smolagents.SmolAgentAdapter._resolve_cost_calculator", return_value=None):
+        adapter = SmolAgentAdapter(agent_instance=mock_agent, name="test")
+        usage = adapter.gather_usage()
+
+    
assert isinstance(usage, MasevalTokenUsage) + assert usage.cost == 0.0 + assert usage.input_tokens == 100 diff --git a/usage_tracking/PLAN.md b/usage_tracking/PLAN.md deleted file mode 100644 index 91683ee0..00000000 --- a/usage_tracking/PLAN.md +++ /dev/null @@ -1,372 +0,0 @@ -# Usage & Cost Tracking — Implementation Plan - -## Motivation - -Benchmarking multi-agent systems incurs real costs: LLM API calls (the primary driver), but also external service calls (e.g., Bloomberg data API, geocoding services, paid search APIs). MASEval currently extracts basic token counts into `ChatResponse.usage` but does not persist, enrich, or aggregate this data. We want first-class usage tracking that: - -- Captures token usage and cost per LLM call with provider-specific detail -- Supports non-token costs (external service calls billed per-request or per-unit) -- Aggregates across provider, task, component role, and total -- Is queryable live during benchmark execution (not just post-hoc) -- Captures usage even for failed tasks -- Requires zero changes from benchmark implementers for the common LLM case - -## Design Principles - -1. **LLM-first, not LLM-only.** The base abstraction is generic (cost + arbitrary units), with an LLM-specific subclass that adds token semantics. -2. **No hardcoded prices.** Pricing changes constantly. Users supply pricing or rely on provider-reported cost (e.g., OpenRouter). If neither is available, cost is `None`. -3. **Automatic for models, opt-in for tools.** ModelAdapter tracks usage automatically via the base `chat()` method. Tool/environment authors opt in via `UsageTrackableMixin`. -4. **Non-breaking.** `ChatResponse.usage` stays a `Dict[str, int]` with additional optional keys. Existing code that reads `usage["input_tokens"]` continues to work. -5. **First-class collection axis.** Usage is collected via `gather_usage()` / `collect_usage()`, parallel to `gather_traces()` / `collect_traces()` and `gather_config()` / `collect_configs()`. 
It is not embedded inside traces. -6. **Live queryable.** The registry maintains a running usage total across repetitions, queryable at any time via `benchmark.usage`. - ---- - -## Data Model - -### `Usage` (base) - -Generic usage record for any billable resource. Stored as a simple dataclass. - -``` -Usage - cost: Optional[float] # Total cost in USD (None = unknown) - units: Dict[str, int | float] # Countable units (e.g., {"api_calls": 3, "bytes": 1024}) - provider: Optional[str] # e.g., "anthropic", "openai", "bloomberg" - category: Optional[str] # e.g., "models", "evaluator_models", "tools" - component_name: Optional[str] # e.g., "main_model", "judge", "bloomberg_api" - kind: Optional[str] # e.g., "llm", "service", "local" -``` - -Supports `__add__`: costs sum (if both known, else None), units sum. Grouping fields (`provider`, `category`, `component_name`, `kind`) are preserved when they match, set to `None` on mismatch. `None` means "aggregated over" — e.g., `provider=None, category="models"` represents all models summed across providers. A fully `None` grouping is a grand total. - -### `TokenUsage(Usage)` (LLM-specific) - -Extends `Usage` with token fields that every LLM provider reports. - -``` -TokenUsage(Usage) - input_tokens: int - output_tokens: int - total_tokens: int - # Optional provider-specific detail - cached_input_tokens: int # Anthropic cache_read, OpenAI cached_tokens - reasoning_tokens: int # OpenAI reasoning, Google thoughts - audio_tokens: int # OpenAI audio -``` - -`TokenUsage.__add__` sums all token fields plus delegates to `Usage.__add__` for cost/units. - -Class method `TokenUsage.from_chat_response_usage(usage_dict) -> TokenUsage` maps the dict returned by adapters today into a `TokenUsage` instance, handling provider-specific key names. - ---- - -## UsageTrackableMixin - -Follows the established mixin pattern (`TraceableMixin`, `ConfigurableMixin`). 
Any component that inherits `UsageTrackableMixin` will have its usage automatically collected by the registry when registered. - -```python -class UsageTrackableMixin: - """Mixin that provides usage tracking capability to any component.""" - - def gather_usage(self) -> Usage: - """Return accumulated usage for this component. - - Subclasses must override this to return their accumulated Usage. - Base implementation returns an empty Usage. - """ - return Usage() -``` - -Components internally accumulate `Usage` records however they see fit (typically a list + sum). The mixin only defines the collection protocol — `gather_usage() -> Usage`. - -### Usage in components - -**ModelAdapter** (automatic): - -```python -class ModelAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): - def __init__(self, seed=None): - super().__init__() - self._usage_records: List[Usage] = [] - - def chat(self, messages, ...): - response = self._chat_impl(messages, ...) - if response.usage: - self._usage_records.append( - TokenUsage.from_chat_response_usage(response.usage) - ) - return response - - def gather_usage(self) -> Usage: - if not self._usage_records: - return Usage() - return sum(self._usage_records[1:], self._usage_records[0]) -``` - -**Non-model components** (opt-in): - -```python -class BloombergEnvironment(Environment, UsageTrackableMixin): - def __init__(self, task_data): - super().__init__(task_data) - self._usage_records: List[Usage] = [] - - def _call_bloomberg(self, query): - result = bloomberg_client.query(query) - self._usage_records.append(Usage( - cost=result.billed_amount, - units={"api_calls": 1, "data_points": result.count}, - )) - return result - - def gather_usage(self) -> Usage: - if not self._usage_records: - return Usage() - return sum(self._usage_records[1:], self._usage_records[0]) -``` - ---- - -## Registry Integration - -The `ComponentRegistry` gains a third collection axis for usage, parallel to traces and configs. 
- -### Per-repetition collection - -`collect_usage()` walks all registered `UsageTrackableMixin` components and calls `gather_usage()` on each. Returns a structured dict (same shape as `collect_traces()`/`collect_configs()`). This goes into `report["usage"]`. - -```python -def collect_usage(self) -> Dict[str, Any]: - """Collect usage from all registered UsageTrackableMixin components.""" - usage = { - "metadata": {...}, - "agents": {}, - "models": {}, - "tools": {}, - ... - "environment": None, - "user": None, - } - - for key, component in self._usage_registry.items(): - category, comp_name = key.split(":", 1) - component_usage = component.gather_usage() - - # Store in structured dict (same pattern as traces/configs) - ... - - # Accumulate into persistent aggregates - self._usage_total += component_usage - self._usage_by_component[key] += component_usage - - return usage -``` - -### Persistent aggregates (survive `clear()`) - -The registry maintains running totals that persist across task repetitions: - -```python -class ComponentRegistry: - def __init__(self): - # ... existing per-repetition state ... - - # Persistent usage aggregates (NOT cleared between repetitions) - self._usage_total: Usage = Usage() - self._usage_by_component: Dict[str, Usage] = {} - - def clear(self): - # Clears per-repetition registrations - # Does NOT clear _usage_total or _usage_by_component - - @property - def total_usage(self) -> Usage: - """Running total across all repetitions. Queryable at any time.""" - return self._usage_total - - @property - def usage_by_component(self) -> Dict[str, Usage]: - """Per-component running totals across all repetitions.""" - return dict(self._usage_by_component) -``` - -### Registration - -The `register()` method gains an `isinstance(component, UsageTrackableMixin)` check, parallel to the existing `TraceableMixin` and `ConfigurableMixin` checks: - -```python -def register(self, category, name, component): - # ... existing trace/config registration ... 
    if isinstance(component, UsageTrackableMixin):
        self._usage_registry[key] = component
        self._usage_component_id_map[component_id] = key
```

`RegisterableComponent` type alias is updated to include `UsageTrackableMixin`.

---

## Benchmark Integration

### Report structure

Each report gains a top-level `"usage"` key alongside `"traces"` and `"config"`:

```python
report = {
    "task_id": str(task.id),
    "repeat_idx": repeat_idx,
    "status": execution_status.value,
    "traces": execution_traces,
    "config": execution_configs,
    "usage": execution_usage,  # <-- new
    "eval": eval_results,
    "task": {...},
}
```

### Live usage access

```python
benchmark.usage               # -> Usage (running grand total, delegates to registry)
benchmark.usage_by_component  # -> Dict[str, Usage] (per-component totals)
```

### Failed task usage

`collect_usage()` is called alongside `collect_all_traces()` and `collect_all_configs()` — before error status is determined. If a task fails mid-execution, whatever usage was accumulated up to the failure point is still collected and aggregated.

---

## Adapter `_chat_impl` Enrichment (per-provider)

Each adapter enriches the `ChatResponse.usage` dict with provider-specific fields beyond the basic three. The base class `TokenUsage.from_chat_response_usage()` handles mapping.

| Adapter | Extra fields to extract |
|---------|------------------------|
| OpenAI | `reasoning_tokens` from `completion_tokens_details`, `cached_input_tokens` from `prompt_tokens_details.cached_tokens` |
| Anthropic | `cached_input_tokens` from `cache_read_input_tokens` |
| Google | `reasoning_tokens` from `thoughts_token_count` |
| LiteLLM | `reasoning_tokens` + `cached_input_tokens` from details; `cost` from `response._hidden_params` if available |
| HuggingFace | No change (local inference, no API cost) |

---

## UsageReporter (post-hoc)

Post-run utility that walks `report["usage"]` across all reports for sliced analysis.

```
UsageReporter
    @staticmethod from_reports(reports: List[Dict]) -> UsageReporter

    by_task() -> Dict[str, Usage]        # keyed by task_id
    by_component() -> Dict[str, Usage]   # keyed by registry key (e.g., "models:main_model")
    by_model() -> Dict[str, TokenUsage]  # keyed by model_id (LLM-only)
    total() -> Usage                     # grand total

    summary() -> Dict[str, Any]          # nested dict with all breakdowns
```

Unlike the registry's live aggregates, `UsageReporter` can slice by task (since it sees the full report list with task IDs).

---

## Evaluators

Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. That model should be registered in the benchmark via `self.register("evaluator_models", "judge", model)` inside `setup_evaluators()`. Since `ModelAdapter` now inherits `UsageTrackableMixin`, its usage is automatically collected under `usage.evaluator_models.judge`.

No changes to the `Evaluator` base class. This is a registration convention.

## LLMUser / AgenticLLMUser

These already hold a `ModelAdapter`. Their model's usage is collected automatically (since `ModelAdapter` inherits `UsageTrackableMixin` and `chat()` accumulates records). The model is already registered by the benchmark. No changes needed.
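Per-task slicing is the one view the registry's live aggregates cannot provide. A rough sketch of how `UsageReporter.by_task()` could fold report-level usage into per-task totals — the flat `{"cost", "total_tokens"}` shape under `report["usage"]["total"]` is an assumption for illustration, not the actual collected structure:

```python
from collections import defaultdict
from typing import Any, Dict, List


def by_task(reports: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group usage across repetitions of the same task_id.

    Assumes each report carries report["usage"]["total"] as a flat
    {"cost": float, "total_tokens": int} dict (illustrative shape only).
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"cost": 0.0, "total_tokens": 0}
    )
    for report in reports:
        total = (report.get("usage") or {}).get("total") or {}
        entry = totals[report["task_id"]]
        entry["cost"] += total.get("cost") or 0.0
        entry["total_tokens"] += total.get("total_tokens") or 0
    return dict(totals)


reports = [
    {"task_id": "t1", "usage": {"total": {"cost": 0.5, "total_tokens": 100}}},
    {"task_id": "t1", "usage": {"total": {"cost": 0.25, "total_tokens": 50}}},
    {"task_id": "t2", "usage": {"total": {"cost": 1.0, "total_tokens": 10}}},
]
per_task = by_task(reports)
```

Repetitions of the same task collapse into one entry, which is exactly what the registry cannot do live (it has no task IDs at collection time).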

---

## File Plan

| File | Action | Content |
|------|--------|---------|
| `maseval/core/usage.py` | **Create** | `Usage`, `TokenUsage`, `UsageTrackableMixin` |
| `maseval/core/cost.py` | **Create** | `CostCalculator` protocol, `StaticPricingCalculator` |
| `maseval/core/registry.py` | **Edit** | Add `_usage_registry`, `_usage_total`, `_usage_by_component`, `collect_usage()`, `total_usage` property |
| `maseval/core/model.py` | **Edit** | Add `UsageTrackableMixin` to `ModelAdapter`, accumulate `TokenUsage` in `chat()`, implement `gather_usage()`, accept `cost_calculator` param |
| `maseval/core/benchmark.py` | **Edit** | Add `collect_all_usage()`, `usage` property, include `"usage"` in report dict |
| `maseval/core/reporting.py` | **Create** | `UsageReporter` post-hoc analysis utility |
| `maseval/interface/cost.py` | **Create** | `LiteLLMCostCalculator` (optional `litellm` dependency) |
| `maseval/interface/inference/openai.py` | **Edit** | Enrich `ChatResponse.usage` with `reasoning_tokens`, `cached_input_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/anthropic.py` | **Edit** | Enrich with `cached_input_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/google_genai.py` | **Edit** | Enrich with `reasoning_tokens`; accept `cost_calculator` |
| `maseval/interface/inference/litellm.py` | **Edit** | Enrich with detail tokens + provider-reported `cost`; accept `cost_calculator` |
| `maseval/interface/inference/huggingface.py` | **Edit** | Accept `cost_calculator` |
| `maseval/__init__.py` | **Edit** | Export `Usage`, `TokenUsage`, `UsageTrackableMixin`, `CostCalculator`, `StaticPricingCalculator`, `UsageReporter` |
| `tests/test_usage.py` | **Create** | Unit tests for data model, mixin, registry collection, aggregation, cost calculators |

No changes to: `evaluator.py`, `user.py`, `agent.py`, `environment.py`, `callback.py`, `tracing.py`, `config.py`.

---

## Cost Calculation

Most LLM APIs return token counts but **not** cost. Cost calculation is a client-side concern.

### CostCalculator protocol

A `CostCalculator` is a simple protocol with one method:

```python
class CostCalculator(Protocol):
    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: ...
```

`ModelAdapter` accepts an optional `cost_calculator` parameter. After each `chat()` call, if the provider didn't report cost and a calculator is present, the calculator fills in `TokenUsage.cost`. Provider-reported cost always takes precedence.

### Built-in implementations

| Calculator | Location | Dependencies | Use case |
|-----------|----------|-------------|----------|
| `StaticPricingCalculator` | `maseval.core.cost` | None | User-supplied per-model rates. Supports custom units (USD, EUR, credits). |
| `LiteLLMCostCalculator` | `maseval.interface.cost` | `litellm` | Automatic pricing via LiteLLM's bundled model database. Covers OpenAI, Anthropic, Google, Mistral, etc. |

### Cost flow (priority order)

1. **Provider-reported cost** — e.g., LiteLLM's `response._hidden_params.response_cost`. Set directly in `ChatResponse.usage["cost"]`.
2. **CostCalculator** — if no provider cost, `ModelAdapter.chat()` calls `calculator.calculate_cost(token_usage, model_id)`.
3. **None** — if neither source provides cost, `Usage.cost` stays `None`.
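The priority order condenses to a small decision helper. This is an illustrative sketch of the resolution logic, not the actual `ModelAdapter` internals:

```python
from typing import Callable, Optional


def resolve_cost(
    provider_cost: Optional[float],
    calculator: Optional[Callable[[], Optional[float]]] = None,
) -> Optional[float]:
    """Resolve cost per the priority order: provider report, calculator, unknown."""
    if provider_cost is not None:  # 1. provider-reported cost always wins
        return provider_cost
    if calculator is not None:     # 2. otherwise ask the configured calculator
        return calculator()
    return None                    # 3. neither source -> cost stays unknown
```

Note that a provider-reported cost of `0.0` still wins over the calculator — only a truly absent (`None`) provider cost falls through.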

### Examples

```python
# Static pricing for a university cluster (credits per token)
calculator = StaticPricingCalculator({
    "llama-3-70b": {"input": 0.5, "output": 1.0},
})

# Automatic pricing via LiteLLM's database
from maseval.interface.cost import LiteLLMCostCalculator
calculator = LiteLLMCostCalculator()

# Pass to any model adapter
model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)
```

### Non-LLM components

Non-LLM components (tools, environments) set cost directly in their `gather_usage()` implementation — there is no calculator involvement. Each component knows its own billing model.

---

## Non-goals

- **Hardcoded pricing tables** — prices change too often; delegated to LiteLLM or user-supplied.
- **Agent-internal model tracking** — models inside agent frameworks (AutoGen, LangGraph internals) are out of scope for now.
- **Billing integration** — no webhook/billing system integration.
- **Streaming usage** — not supported yet (usage is captured after completion).
- **Currency conversion** — `Usage.cost` is a bare float in whatever unit the calculator uses. Mixing units in one benchmark is a user error.

## Open Questions

1. **HuggingFace local inference**: Should we track compute-time as a "cost" proxy for local models? Probably not in v1.
diff --git a/usage_tracking/api_usage_results.json b/usage_tracking/api_usage_results.json deleted file mode 100644 index 4dcd9b8e..00000000 --- a/usage_tracking/api_usage_results.json +++ /dev/null @@ -1,523 +0,0 @@ -{ - "direct__openai__gpt5_mini": { - "id": "chatcmpl-DFJoysUJJeWtuOVc5UIq3EnA580Ok", - "choices": [ - { - "finish_reason": "length", - "index": 0, - "logprobs": null, - "message": { - "content": "", - "refusal": null, - "role": "assistant", - "annotations": [], - "audio": null, - "function_call": null, - "tool_calls": null - } - } - ], - "created": 1772543484, - "model": "gpt-5-mini-2025-08-07", - "object": "chat.completion", - "service_tier": "default", - "system_fingerprint": null, - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": 0, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0 - } - } - }, - "direct__anthropic__claude_haiku": { - "id": "msg_01UDvWsS78tyf4xQ1wwDsNop", - "content": [ - { - "citations": null, - "text": "# Hello! \ud83d\udc4b\n\nWelcome! I'm Claude, an AI assistant made by Anthropic. 
How can I help you today?", - "type": "text" - } - ], - "model": "claude-haiku-4-5-20251001", - "role": "assistant", - "stop_reason": "end_turn", - "stop_sequence": null, - "type": "message", - "usage": { - "cache_creation": { - "ephemeral_1h_input_tokens": 0, - "ephemeral_5m_input_tokens": 0 - }, - "cache_creation_input_tokens": 0, - "cache_read_input_tokens": 0, - "input_tokens": 11, - "output_tokens": 32, - "server_tool_use": null, - "service_tier": "standard", - "inference_geo": "not_available" - } - }, - "direct__google__gemini3_flash": { - "sdk_http_response": { - "headers": { - "content-type": "application/json; charset=UTF-8", - "vary": "Origin, X-Origin, Referer", - "content-encoding": "gzip", - "date": "Tue, 03 Mar 2026 13:11:32 GMT", - "server": "scaffolding on HTTPServer2", - "x-xss-protection": "0", - "x-frame-options": "SAMEORIGIN", - "x-content-type-options": "nosniff", - "server-timing": "gfet4t7; dur=4579", - "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000", - "transfer-encoding": "chunked" - }, - "body": null - }, - "candidates": [ - { - "content": { - "parts": [ - { - "media_resolution": null, - "code_execution_result": null, - "executable_code": null, - "file_data": null, - "function_call": null, - "function_response": null, - "inline_data": null, - "text": "Hello", - "thought": null, - "thought_signature": "EqwCCqkCAb4-9vtlfRlETMXR13Bw3xpBm-D3EzoUlVhmePvHy720UANX0hdyBGaq1d8FfiHVTMccuBl5r7sg3fy_2GoTexpytWLm15I7GfRloHt278ioOMDH3Ua8SVuGCIiRyIVSye1vkQw7p0KwZMzJ51fjhuBH-4_weZe24FglHg0p3eo79cKZIMz8eiWpcGtK6Xb25Gk1mXuKCi7GaifkKaOmhXTSjVZ-P-w5qERlTscMv-2YMD26Th8MUEg13PwFlz385A9RnHLH_oXkdr0lXAHemNj7dHdJEfNzjSgqCJdeVT3PCwH0v6-AIqdQuqD6jnvODLDPms5liN7VAVAAOZiq8tLDE771c3Xc-7eIPFdD3h9_cdvb82hjefYjEwC-aWNXQrl1SlVw0Un0", - "video_metadata": null - } - ], - "role": "model" - }, - "citation_metadata": null, - "finish_message": null, - "token_count": null, - "finish_reason": "MAX_TOKENS", - "avg_logprobs": null, - "grounding_metadata": null, - 
"index": 0, - "logprobs_result": null, - "safety_ratings": null, - "url_context_metadata": null - } - ], - "create_time": null, - "model_version": "gemini-3-flash-preview", - "prompt_feedback": null, - "response_id": "BN6maev1HoqB7M8P3fvBkAo", - "usage_metadata": { - "cache_tokens_details": null, - "cached_content_token_count": null, - "candidates_token_count": 1, - "candidates_tokens_details": null, - "prompt_token_count": 5, - "prompt_tokens_details": [ - { - "modality": "TEXT", - "token_count": 5 - } - ], - "thoughts_token_count": 59, - "tool_use_prompt_token_count": null, - "tool_use_prompt_tokens_details": null, - "total_token_count": 65, - "traffic_type": null - }, - "automatic_function_calling_history": [], - "parsed": null - }, - "litellm__openai__gpt5_mini": { - "id": "chatcmpl-DFJp7YfyGQH1HtypDlKDucycpUAdk", - "created": 1772543493, - "model": "gpt-5-mini-2025-08-07", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": "", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null - }, - "annotations": [] - }, - "provider_specific_fields": {} - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": 0, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": 0, - "text_tokens": null - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null - } - }, - "service_tier": "default" - }, - "litellm__anthropic__claude_haiku": { - "id": "chatcmpl-d251dec3-5b1a-424c-a432-cb71ea3d600f", - "created": 1772543495, - "model": "claude-haiku-4-5-20251001", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello! 
\ud83d\udc4b How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "citations": null, - "thinking_blocks": null - } - } - } - ], - "usage": { - "completion_tokens": 16, - "prompt_tokens": 11, - "total_tokens": 27, - "completion_tokens_details": null, - "prompt_tokens_details": { - "audio_tokens": null, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_creation_tokens": 0, - "cache_creation_token_details": { - "ephemeral_5m_input_tokens": 0, - "ephemeral_1h_input_tokens": 0 - } - }, - "cache_creation_input_tokens": 0, - "cache_read_input_tokens": 0 - } - }, - "litellm__google__gemini3_flash": { - "id": "DN6macurC97hnsEPvs-FmA0", - "created": 1772543495, - "model": "gemini-3-flash-preview", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": "Hello", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "images": [], - "thinking_blocks": [ - { - "type": "thinking", - "thinking": "{\"text\": \"Hello\"}", - "signature": "EpoCCpcCAb4+9vtdR++YPo/XeAmaLKPKkk7+YyeGjuHP9w646HEu9lG0xhb6qHOfkUTcH7xh08RlU6QXrTKAkXwfBAsSbiBfIBCGlzygFq+QGAS4LzUFaCLOD73MmSk7WiB393VWRw04NsxbhNtTH5aM9JFaxb7yvZMwWMckTON8L9Rv7gFlo6NmYjn01ct+kBKxleJzyD8d2AnAA4wMw9zqz8pLSAU9swKxmuqs0JkHt8WNRzwtw11xGt5zR909g/v/swLY/Oh+lcHiO7PMBsPHtBvzmPHTMM/ecn1VdA9sWqmoc8suFfzTaOPeegvtkhaytoZnaNZ/FoV9y9qVex5r8R0zvPd4ennA9/asI5P1i9HL0NedNJ78avW4" - } - ], - "provider_specific_fields": null - } - } - ], - "usage": { - "completion_tokens": 60, - "prompt_tokens": 5, - "total_tokens": 65, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": null, - "reasoning_tokens": 59, - "rejected_prediction_tokens": null, - "text_tokens": 1 - }, - "prompt_tokens_details": { - "audio_tokens": null, - "cached_tokens": null, - "text_tokens": 5, - "image_tokens": null - } - }, - 
"vertex_ai_grounding_metadata": [], - "vertex_ai_url_context_metadata": [], - "vertex_ai_safety_results": [], - "vertex_ai_citation_metadata": [] - }, - "openrouter__openai__gpt5_mini": { - "id": "gen-1772543500-cToh8SauCW1u8pGlb4qQ", - "created": 1772543500, - "model": "openai/gpt-5-mini-2025-08-07", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": null, - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null, - "reasoning_details": [ - { - "type": "reasoning.summary", - "summary": "**Greet and respond warmly**\n\nThe user says, \"Hello, World!\" \u2014 it feels like a friendly greeting. I should definitely greet back warmly and ask how I can help them! It\u2019s a classic reference in programming, but they might just want a simple hello. I'm thinking of keeping my response concise. So, I\u2019ll reply with a friendly greeting and a question about what they\u2019d like to know or discuss. That seems like a nice approach!**Greet and engage**\n\nThe user is saying \"Hello, World!\" which feels like a classic greeting\u2014possibly a nod to programming culture. I think it\u2019s best to respond warmly, so I\u2019ll greet back with enthusiasm. Since they might just be saying hello, I\u2019ll keep it concise and friendly by asking how I can help. 
It\u2019s always nice to invite a conversation!", - "format": "openai-responses-v1", - "index": 0 - }, - { - "type": "reasoning.encrypted", - "data": "gAAAAABppt4SyBWDwFHFtAmspfjQOATVwEjInh21J297wGrovStyUcpp_QsIc3E18qz3EreGtoiQVrUwb4UnffV87xCLDfmoBtxDaSxzbYEJUgYNtjZ6hPr8peySEgtsGPypJmtVJJQ2In9BN-57EeNEqAifTsmKxtCPCm4KHRRAmiXI3zXpokxr8IldC8LYXFM4stVdsWBJxwBYWM6G_vV4VWgmJr15jHIk0tVhx2Rtoca5JQ-0MAf0mQQQbLHBAFnGKhNgBoi_Qnq06A87xSejoUkb7Lb8N6_1u9nFYyixciACYaJIqMeRU5timTRIivBsypP8GPgx-6HyCfqRGhi2nbd5HvTKw4vLTFbtDBR2lRUUsFfJXnbLZvBZO2jbWhYAPvQnnQjpbU5jXE6jPM8z-J4eyGvUg49u3n7fFqe-Nxph4Fuophnbz1-ZCdboejHXfbz9-zKcX-FaVhCkuT82gUWNBq09lLpmOjGERQr5EHguZbhGC1QKkSG59iXMTfRlPssV42xYDpWL1ci0Jbg96TAq8sEnlaY9AMtJTh0NH14Ou8rX3g-g7U2MDomJbcZtY8oNZtyY_3s7ENSatmmaCsX6eQsRuuhSOrxXZSz1l4Zyxes-TseYCQya0YPu3eCNA7-7qhYBWDbtxdBqaTyN9krqM9rkC_p4fQn3q4-2S-Wt9kElCX-SrdMR_qXYZPz8O4BsJwM1aA8gQQji5X8CnYFWTBkBLEQuv2MaR6dDuwvZUsWuCf41YJGw4GJHmdBdDbZflvgpVmuBRwk476MqDac6jXl2VlOgQ0v0zk4M6j5Hb29uCgUFDv0aTyf24wqAZdYRsKQOSLV2Wke38K1qLvUfn99yqkBllsFpdk0DsJJBG4axiK4Kr10BhhApNJokRqIjkT8HU7w5PDRLPryFoc6kuMIuS72RhOKXxZrDu7D_fuWHseOMyVrDULSYhf_GfZEIcnFwGBcIRhhQZG-lzSs_wssCojIGjRX0J67fOZk8YCCvjeabRCbbGTbHDXZxhRL_5Niwz0V4Jgd_97pOlIsVVOgS2-IuIc4445WjpkqGk6mRplBTZEPwZV2ny3v9w3aq-W6_lasXOOmv342RTXXo-pSKaZrowkI3rQUJ_fR5y7mumdDI82C-2onxbNfWI65PUgRW5KUXVgL4RXPu0yI0wu2z7LTyNaVoLSaF9wOtOzEtLux9Pf50EYjqlfD7niQoVR8Pv9D-1fhvrFDmeAzmgBdaqmCWhJWJgZUvtN43Wv2UNjk=", - "format": "openai-responses-v1", - "id": "rs_0f89d92952e937610169a6de0d3f28819085143727619d92cb", - "index": 1 - } - ] - } - }, - "provider_specific_fields": { - "native_finish_reason": "max_output_tokens" - } - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 10, - "total_tokens": 74, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 64, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - 
"cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 0.0001305, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 0.0001305, - "upstream_inference_prompt_cost": 2.5e-06, - "upstream_inference_completions_cost": 0.000128 - } - }, - "provider": "OpenAI" - }, - "openrouter__anthropic__claude_haiku": { - "id": "gen-1772543509-FAaKhTwazzoJmVDDd3ih", - "created": 1772543509, - "model": "anthropic/claude-4.5-haiku-20251001", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello! \ud83d\udc4b How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null - } - }, - "provider_specific_fields": { - "native_finish_reason": "stop" - } - } - ], - "usage": { - "completion_tokens": 16, - "prompt_tokens": 11, - "total_tokens": 27, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 0, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 9.1e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 9.1e-05, - "upstream_inference_prompt_cost": 1.1e-05, - "upstream_inference_completions_cost": 8e-05 - } - }, - "provider": "Google" - }, - "openrouter__google__gemini3_flash": { - "id": "gen-1772543512-Mxn343CzRXITNLaWa3uw", - "created": 1772543512, - "model": "google/gemini-3-flash-preview-20251217", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "stop", - "index": 0, - "message": { - "content": "Hello, World! 
How can I help you today?", - "role": "assistant", - "tool_calls": null, - "function_call": null, - "provider_specific_fields": { - "refusal": null, - "reasoning": null, - "reasoning_details": [ - { - "type": "reasoning.encrypted", - "data": "CiEBjz1rX5mfLGj1Fml96xozj3K4fv7JeTBdOSaUUlxd96c=", - "format": "google-gemini-v1", - "index": 0 - } - ] - } - }, - "provider_specific_fields": { - "native_finish_reason": "STOP" - } - } - ], - "usage": { - "completion_tokens": 11, - "prompt_tokens": 4, - "total_tokens": 15, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 0, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 3.5e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 3.5e-05, - "upstream_inference_prompt_cost": 2e-06, - "upstream_inference_completions_cost": 3.3e-05 - } - }, - "provider": "Google" - }, - "openrouter__qwen__qwen3_30b": { - "id": "gen-1772543515-76qFgjV9ySYOE8mtplV6", - "created": 1772543515, - "model": "qwen/qwen3-30b-a3b-04-28", - "object": "chat.completion", - "system_fingerprint": null, - "choices": [ - { - "finish_reason": "length", - "index": 0, - "message": { - "content": null, - "role": "assistant", - "tool_calls": null, - "function_call": null, - "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", - "provider_specific_fields": { - "refusal": null, - "reasoning": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. 
I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like,", - "reasoning_content": "\nOkay, the user sent \"Hello, World!\" but didn't ask a specific question. I should respond politely and prompt them to ask for something. Let me make sure to keep it friendly and open-ended.\n\nI need to acknowledge their message and let them know I'm here to help. Maybe say something like," - } - }, - "provider_specific_fields": { - "native_finish_reason": "length" - } - } - ], - "usage": { - "completion_tokens": 64, - "prompt_tokens": 13, - "total_tokens": 77, - "completion_tokens_details": { - "accepted_prediction_tokens": null, - "audio_tokens": 0, - "reasoning_tokens": 75, - "rejected_prediction_tokens": null, - "text_tokens": null, - "image_tokens": 0 - }, - "prompt_tokens_details": { - "audio_tokens": 0, - "cached_tokens": 0, - "text_tokens": null, - "image_tokens": null, - "cache_write_tokens": 0, - "video_tokens": 0 - }, - "cost": 1.896e-05, - "is_byok": false, - "cost_details": { - "upstream_inference_cost": 1.896e-05, - "upstream_inference_prompt_cost": 1.04e-06, - "upstream_inference_completions_cost": 1.792e-05 - } - }, - "provider": "DeepInfra" - } -} \ No newline at end of file diff --git a/usage_tracking/api_usage_test.py b/usage_tracking/api_usage_test.py deleted file mode 100644 index 1c0a34b2..00000000 --- a/usage_tracking/api_usage_test.py +++ /dev/null @@ -1,154 +0,0 @@ -""" -Test script that calls GPT-5 mini, Claude Haiku 4.5, and Gemini 3 Flash in three -conditions each — (1) native client, (2) LiteLLM, (3) LiteLLM via OpenRouter — -plus Qwen 3 via LiteLLM+OpenRouter. Saves full response dicts to JSON for -usage/cost analysis. 
-""" - -import json -import os -import time -from pathlib import Path - -import anthropic -import litellm -import requests -from dotenv import load_dotenv -from google import genai -from google.genai import types -from openai import OpenAI - -load_dotenv() - -OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] -ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"] -GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"] -OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"] - -# LiteLLM reads OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY from env. -# For Gemini it expects GEMINI_API_KEY, so alias it. -os.environ.setdefault("GEMINI_API_KEY", GOOGLE_API_KEY) - -PROMPT = "Hello, World!" -MAX_TOKENS = 64 -TOTAL = 10 - -results = {} - - -def step(n: int, label: str): - print(f"{n}/{TOTAL} {label} ...") - - -# =========================================================================== # -# CONDITION 1 — Native SDKs (direct) -# =========================================================================== # - -# -- 1. GPT-5 mini (OpenAI) ------------------------------------------------ # -step(1, "GPT-5 mini — direct (OpenAI SDK)") -openai_client = OpenAI(api_key=OPENAI_API_KEY) -resp = openai_client.chat.completions.create( - model="gpt-5-mini", - messages=[{"role": "user", "content": PROMPT}], - max_completion_tokens=MAX_TOKENS, -) -results["direct__openai__gpt5_mini"] = resp.model_dump() -print(f" done — {resp.usage.total_tokens} tokens") - -# -- 2. Claude Haiku 4.5 (Anthropic) --------------------------------------- # -step(2, "Claude Haiku 4.5 — direct (Anthropic SDK)") -anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY) -resp = anthropic_client.messages.create( - model="claude-haiku-4-5-20251001", - max_tokens=MAX_TOKENS, - messages=[{"role": "user", "content": PROMPT}], -) -results["direct__anthropic__claude_haiku"] = resp.model_dump() -print(f" done — {resp.usage.input_tokens + resp.usage.output_tokens} tokens") - -# -- 3. 
Gemini 3 Flash (Google) -------------------------------------------- # -step(3, "Gemini 3 Flash — direct (Google GenAI SDK)") -google_client = genai.Client(api_key=GOOGLE_API_KEY) -resp = google_client.models.generate_content( - model="gemini-3-flash-preview", - contents=PROMPT, - config=types.GenerateContentConfig(max_output_tokens=MAX_TOKENS), -) -results["direct__google__gemini3_flash"] = resp.model_dump(mode="json") -total = resp.usage_metadata.total_token_count if resp.usage_metadata else "n/a" -print(f" done — {total} tokens") - - -# =========================================================================== # -# CONDITION 2 — LiteLLM (direct to providers) -# =========================================================================== # - -litellm_direct_models = { - "litellm__openai__gpt5_mini": "gpt-5-mini", - "litellm__anthropic__claude_haiku": "claude-haiku-4-5-20251001", - "litellm__google__gemini3_flash": "gemini/gemini-3-flash-preview", -} - -for i, (label, model) in enumerate(litellm_direct_models.items(), start=4): - step(i, f"{model} — LiteLLM (direct)") - resp = litellm.completion( - model=model, - messages=[{"role": "user", "content": PROMPT}], - max_tokens=MAX_TOKENS, - ) - results[label] = resp.model_dump() - usage_total = resp.usage.total_tokens if resp.usage else "n/a" - print(f" done — {usage_total} tokens") - - -# =========================================================================== # -# CONDITION 3 — LiteLLM via OpenRouter (+Qwen) -# =========================================================================== # - - -def fetch_openrouter_generation(gen_id: str) -> dict | None: - """Query OpenRouter's generation endpoint for cost metadata.""" - time.sleep(2) # brief wait for metadata to be available - r = requests.get( - f"https://openrouter.ai/api/v1/generation?id={gen_id}", - headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}, - ) - if r.status_code == 200: - return r.json() - return None - - -litellm_openrouter_models = { - 
"openrouter__openai__gpt5_mini": "openrouter/openai/gpt-5-mini", - "openrouter__anthropic__claude_haiku": "openrouter/anthropic/claude-haiku-4-5", - "openrouter__google__gemini3_flash": "openrouter/google/gemini-3-flash-preview", - "openrouter__qwen__qwen3_30b": "openrouter/qwen/qwen3-30b-a3b", -} - -for i, (label, model) in enumerate(litellm_openrouter_models.items(), start=7): - step(i, f"{model} — LiteLLM (OpenRouter)") - resp = litellm.completion( - model=model, - messages=[{"role": "user", "content": PROMPT}], - max_tokens=MAX_TOKENS, - ) - result = resp.model_dump() - - # Fetch OpenRouter generation metadata (cost, native tokens, etc.) - gen_meta = fetch_openrouter_generation(resp.id) - if gen_meta: - result["_openrouter_generation"] = gen_meta - - results[label] = result - usage_total = resp.usage.total_tokens if resp.usage else "n/a" - print(f" done — {usage_total} tokens") - - -# =========================================================================== # -# Save results -# =========================================================================== # -out_path = Path(__file__).parent / "api_usage_results.json" -with open(out_path, "w") as f: - json.dump(results, f, indent=2, default=str) - -print(f"\nResults saved to {out_path}") From aaf1662f75301fa973ca6ad7b4beb16f2ee1cc8b Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 01:35:56 +0100 Subject: [PATCH 16/19] fixed type hinting issue --- .../test_agent_integration/test_langgraph_integration.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tests/test_interface/test_agent_integration/test_langgraph_integration.py b/tests/test_interface/test_agent_integration/test_langgraph_integration.py index 4a4d5e1b..1c2e9f82 100644 --- a/tests/test_interface/test_agent_integration/test_langgraph_integration.py +++ b/tests/test_interface/test_agent_integration/test_langgraph_integration.py @@ -420,7 +420,7 @@ def agent_node(state: State) -> State: def 
test_langgraph_adapter_no_cost_without_model_id(): """Test that LangGraph adapter cannot auto-detect model_id (by design).""" from maseval.interface.agents.langgraph import LangGraphAgentAdapter - from maseval.core.usage import StaticPricingCalculator + from maseval.core.usage import TokenUsage as MasevalTokenUsage, StaticPricingCalculator from langgraph.graph import StateGraph, END from typing_extensions import TypedDict from langchain_core.messages import AIMessage @@ -448,5 +448,6 @@ def agent_node(state: State) -> State: adapter.run("Test") usage = adapter.gather_usage() + assert isinstance(usage, MasevalTokenUsage) assert usage.cost == 0.0 assert usage.input_tokens == 100 From 3103ef86a8621fcae139a0798b78efce036fe7c8 Mon Sep 17 00:00:00 2001 From: cemde Date: Mon, 16 Mar 2026 19:37:29 +0100 Subject: [PATCH 17/19] [skip ci] fixed guide --- docs/guides/usage-tracking.md | 19 ++++++------------- maseval/core/usage.py | 3 +-- 2 files changed, 7 insertions(+), 15 deletions(-) diff --git a/docs/guides/usage-tracking.md b/docs/guides/usage-tracking.md index b2adc5c8..db29bb75 100644 --- a/docs/guides/usage-tracking.md +++ b/docs/guides/usage-tracking.md @@ -16,7 +16,7 @@ MASEval tracks how much each benchmark run consumes (tokens, API calls, dollars) **Model adapters** track every `chat()` call: input tokens, output tokens, cached tokens, reasoning tokens. No setup needed. -**Agent adapters** aggregate token usage from the underlying framework's execution. Each framework adapter (`SmolAgentAdapter`, `CamelAgentAdapter`, `LangGraphAgentAdapter`, `LlamaIndexAgentAdapter`) extracts usage from its framework's native data structures (memory steps, response metadata, message annotations, or execution logs respectively). Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). +**Agent adapters** aggregate token usage from the underlying framework's execution. 
Cost is computed automatically when litellm is installed (see [Agent Cost Tracking](#agent-cost-tracking) below). **Benchmarks** collect usage from all registered components after each task and include it in reports. @@ -57,7 +57,7 @@ This works across all supported frameworks (smolagents, CAMEL, LangGraph, and Ll ### Agent Cost Tracking -Agent adapters auto-detect cost when possible. For smolagents, CAMEL, and LlamaIndex, the adapter reads the model ID from the framework's agent object and uses `LiteLLMCostCalculator` if litellm is installed. No configuration needed: +Agent adapters compute cost automatically when litellm is installed. The adapter detects the model ID from the framework's agent object and uses `LiteLLMCostCalculator` behind the scenes. No configuration needed: ```python # Cost tracking works automatically if litellm is installed @@ -66,16 +66,16 @@ adapter.run("What's the capital of France?") print(f"Cost: ${adapter.gather_usage().cost:.4f}") ``` -For **LangGraph**, the model ID cannot be auto-detected because a graph can contain multiple models across its nodes. 
Pass `model_id` explicitly: +If auto-detection doesn't work for your setup (e.g., the adapter can't find the model ID), pass `model_id` explicitly: ```python adapter = LangGraphAgentAdapter( compiled_graph, "agent", - model_id="gpt-4o-mini", # Required for cost tracking + model_id="gpt-4o-mini", ) ``` -To override auto-detection or use custom pricing, pass `cost_calculator` and/or `model_id`: +To use custom pricing instead, pass `cost_calculator` and/or `model_id`: ```python from maseval import StaticPricingCalculator @@ -91,13 +91,6 @@ adapter = SmolAgentAdapter( ) ``` -| Framework | Model ID | Cost Calculator | -|-----------|----------|-----------------| -| smolagents | Auto (`agent.model.model_id`) | Auto (`LiteLLMCostCalculator`) | -| CAMEL | Auto (`agent.model_backend.model_type`) | Auto (`LiteLLMCostCalculator`) | -| LlamaIndex | Auto (`agent.llm.metadata.model_name`) | Auto (`LiteLLMCostCalculator`) | -| LangGraph | **Manual** (`model_id=...`) | Auto (`LiteLLMCostCalculator`) | - If litellm is not installed, auto-creation of the calculator is skipped and cost stays at `0.0`. Tokens are always tracked regardless. ### In Benchmarks @@ -215,7 +208,7 @@ model_b = AnthropicModelAdapter(client=client, model_id="claude-sonnet-4-5", cos When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in priority order: -1. **Provider-reported cost**: e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. +1. **Provider-reported cost**: some providers (e.g., LiteLLM) include cost in the API response. This always wins. 2. **CostCalculator**: if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. 3. **Zero**: if neither source provides cost, `usage.cost` stays `0.0`. 
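The three-step resolution order can be sketched as a small standalone function. This is an illustrative sketch only, not MASEval's actual implementation; the `provider_cost` parameter and the per-token rates in the example pricing function are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int


def resolve_cost(
    usage: TokenUsage,
    provider_cost: Optional[float],
    calculator: Optional[Callable[[TokenUsage, str], Optional[float]]],
    model_id: str,
) -> float:
    # 1. Provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise ask the configured cost calculator, if any.
    if calculator is not None:
        calculated = calculator(usage, model_id)
        if calculated is not None:
            return calculated
    # 3. Neither source available: cost stays at 0.0.
    return 0.0


usage = TokenUsage(input_tokens=1000, output_tokens=500)
# Hypothetical flat per-token rates, standing in for a CostCalculator.
pricing = lambda u, m: u.input_tokens * 2e-6 + u.output_tokens * 8e-6
print(round(resolve_cost(usage, None, pricing, "gpt-4o-mini"), 6))  # 0.006
```

Because step 1 short-circuits, a calculator passed alongside a cost-reporting provider is simply never consulted for those calls.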
diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 651e4d6d..81bf8bda 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -15,8 +15,7 @@ ``Usage.cost`` defaults to ``0.0``, so ``Usage()`` works as a starting value for accumulation (e.g., ``sum(records, Usage())``). Cost calculators are optional — if no calculator is provided to a ``ModelAdapter``, cost stays -at ``0.0`` unless the provider reports it directly (e.g., LiteLLM's -``response._hidden_params.response_cost``). +at ``0.0`` unless the provider reports it directly. For automatic pricing via LiteLLM's bundled model database, see ``maseval.interface.usage``. """ From acd73db990756a93e89aab9b69ccf35054720610 Mon Sep 17 00:00:00 2001 From: cemde Date: Tue, 17 Mar 2026 18:08:29 +0100 Subject: [PATCH 18/19] [skip ci] fix smaller issues --- AGENTS.md | 26 ++++++------- .../five_a_day_benchmark.py | 2 +- maseval/core/agent.py | 8 ++-- maseval/core/model.py | 4 +- maseval/core/usage.py | 7 +++- maseval/interface/agents/_cost.py | 38 +++++++++++++++++++ maseval/interface/agents/camel.py | 30 ++++++++------- maseval/interface/agents/langgraph.py | 23 +++++------ maseval/interface/agents/llamaindex.py | 25 ++++++------ maseval/interface/agents/smolagents.py | 33 ++++++++-------- pyproject.toml | 8 ++-- tests/test_core/test_registry.py | 37 ------------------ 12 files changed, 121 insertions(+), 120 deletions(-) create mode 100644 maseval/interface/agents/_cost.py diff --git a/AGENTS.md b/AGENTS.md index 2ba2a935..d16d775b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -265,12 +265,11 @@ mkdocs serve 1. Create a feature branch (never commit to `main`) 2. Make changes following code style guidelines -3. Run formatters and linters: `ruff format . && ruff check . --fix` -4. Run tests: `pytest -v` -5. Update documentation if needed -6. Open PR against `main` branch -7. Request review from `cemde` -8. Ensure all CI checks pass +3. Run `just all` before committing. 
This formats, lints, typechecks, and tests in one step. See the `justfile` for all available recipes.
+4. Update documentation if needed
+5. Open PR against `main` branch
+6. Request review from `cemde`
+7. Ensure all CI checks pass
 
 **CI Pipeline:** GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.
 
@@ -301,18 +300,15 @@ Example workflow:
 ## Common Tasks Quick Reference
 
 ```bash
-# Fresh environment setup
-uv sync --all-extras --all-groups
+# Fresh environment setup / Update after pulling changes
+just install  # uv sync --all-extras --all-groups
 
-# Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
+# Before committing (format, lint, typecheck, test)
+just all
 
 # Run example
 uv run python examples/amazon_collab.py
 
-# Update after pulling changes
-uv sync --all-extras --all-groups
-
 # Add optional dependency
 uv add --optional
 
@@ -320,6 +316,8 @@ uv add --optional
 uv run pytest tests/test_core/test_agent.py -v
 ```
 
+For more commands, see the `justfile`.
+
 ## Security and Confidentiality
 
 **IMPORTANT:** This project contains confidential research material.
 
@@ -540,4 +538,4 @@
 class Evaluator: ...
 ```
 
-**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
\ No newline at end of file
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py index 77bc00ee..23ddfc62 100644 --- a/examples/five_a_day_benchmark/five_a_day_benchmark.py +++ b/examples/five_a_day_benchmark/five_a_day_benchmark.py @@ -978,7 +978,7 @@ def _fmt_usage(usage): # Group components by category if benchmark.usage_by_component: - by_category: dict[str, dict[str, object]] = defaultdict(dict) + by_category: Dict[str, Dict[str, object]] = defaultdict(dict) for key, usage in benchmark.usage_by_component.items(): category, name = key.split(":", 1) by_category[category][name] = usage diff --git a/maseval/core/agent.py b/maseval/core/agent.py index 19b68f0e..39584700 100644 --- a/maseval/core/agent.py +++ b/maseval/core/agent.py @@ -1,4 +1,5 @@ from abc import ABC, abstractmethod +from dataclasses import replace from typing import List, Any, Optional, Dict from .callback import AgentCallback @@ -28,16 +29,13 @@ class AgentAdapter(ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin): litellm is installed). This means cost tracking often works with zero configuration. - To override or disable auto-detection, pass explicit values:: + To override auto-detection, pass explicit values:: adapter = SmolAgentAdapter( agent, name="researcher", cost_calculator=StaticPricingCalculator({...}), model_id="my-custom-model", ) - - Pass ``cost_calculator=None`` explicitly to disable cost calculation - even when auto-detection would otherwise enable it. 
""" def __init__( @@ -161,7 +159,7 @@ def gather_usage(self) -> Usage: if mid: cost = calculator.calculate_cost(usage, mid) if cost is not None: - usage.cost = cost + usage = replace(usage, cost=cost) return usage def _gather_usage(self) -> Usage: diff --git a/maseval/core/model.py b/maseval/core/model.py index 86399d7f..528f5ac2 100644 --- a/maseval/core/model.py +++ b/maseval/core/model.py @@ -48,7 +48,7 @@ from __future__ import annotations from abc import ABC, abstractmethod -from dataclasses import dataclass +from dataclasses import dataclass, replace from typing import Any, Optional, Dict, List, Union from datetime import datetime import time @@ -316,7 +316,7 @@ def chat( if token_usage.cost == 0.0 and self._cost_calculator is not None: calculated = self._cost_calculator.calculate_cost(token_usage, self.model_id) if calculated is not None: - token_usage.cost = calculated + token_usage = replace(token_usage, cost=calculated) self._usage_records.append(token_usage) diff --git a/maseval/core/usage.py b/maseval/core/usage.py index 81bf8bda..6e32eae1 100644 --- a/maseval/core/usage.py +++ b/maseval/core/usage.py @@ -92,6 +92,11 @@ def __add__(self, other: Usage) -> Usage: if not isinstance(other, Usage): return NotImplemented + # Delegate to TokenUsage.__add__ when the right operand is a + # TokenUsage but self is a plain Usage, so token fields are preserved. 
+ if type(self) is Usage and isinstance(other, TokenUsage): + return TokenUsage.__add__(other, self) + cost = self.cost + other.cost # Sum units @@ -228,7 +233,7 @@ def to_dict(self) -> Dict[str, Any]: @classmethod def from_chat_response_usage( cls, - usage_dict: Dict[str, int], + usage_dict: Dict[str, Any], *, cost: float = 0.0, provider: Optional[str] = None, diff --git a/maseval/interface/agents/_cost.py b/maseval/interface/agents/_cost.py new file mode 100644 index 00000000..11f1e0f7 --- /dev/null +++ b/maseval/interface/agents/_cost.py @@ -0,0 +1,38 @@ +"""Shared cost-calculator auto-detection for agent adapters.""" + +from typing import Optional, Tuple + +from maseval.core.usage import CostCalculator + + +def resolve_auto_cost_calculator( + explicit: Optional[CostCalculator], + cached: Optional[CostCalculator], + attempted: bool, +) -> Tuple[Optional[CostCalculator], Optional[CostCalculator], bool]: + """Resolve the cost calculator, auto-creating one if litellm is available. + + Args: + explicit: The calculator passed explicitly by the user (may be ``None``). + cached: The cached auto-calculator from a previous call (``None`` if + not yet created or creation failed). + attempted: Whether auto-creation has been attempted before. + + Returns: + Tuple of ``(calculator_to_use, updated_cache, updated_attempted)``. + Callers should store the second and third elements back into + ``self._auto_calculator`` and ``self._auto_attempted``. 
+ """ + if explicit is not None: + return explicit, cached, attempted + + if not attempted: + attempted = True + try: + from maseval.interface.usage import LiteLLMCostCalculator + + cached = LiteLLMCostCalculator() + except (ImportError, Exception): + cached = None + + return cached, cached, attempted diff --git a/maseval/interface/agents/camel.py b/maseval/interface/agents/camel.py index 1c440687..d67b65dd 100644 --- a/maseval/interface/agents/camel.py +++ b/maseval/interface/agents/camel.py @@ -19,7 +19,7 @@ from maseval import AgentAdapter, MessageHistory, LLMUser, User from maseval.core.tracing import TraceableMixin from maseval.core.config import ConfigurableMixin -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = [ "CamelAgentAdapter", @@ -176,7 +176,12 @@ class CamelAgentAdapter(AgentAdapter): """ def __init__( - self, agent_instance: Any, name: str, callbacks: Optional[List[Any]] = None, cost_calculator: Any = None, model_id: Optional[str] = None + self, + agent_instance: Any, + name: str, + callbacks: Optional[List[Any]] = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, ): """Initialize the CAMEL adapter. @@ -199,7 +204,8 @@ def __init__( self.messages = None self._cost_calculator = cost_calculator self._model_id = model_id - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False # Store responses from each step() call self._responses: List[Any] = [] # Store errors that occur during execution (for comprehensive logging) @@ -429,7 +435,7 @@ def gather_traces(self) -> Dict[str, Any]: return base_traces - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from CAMEL agent. 
CAMEL's ChatAgent stores the model backend in ``model_backend`` @@ -445,18 +451,14 @@ def _resolve_model_id(self): pass return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage across all CAMEL agent responses. diff --git a/maseval/interface/agents/langgraph.py b/maseval/interface/agents/langgraph.py index a2e9fd06..63afafe4 100644 --- a/maseval/interface/agents/langgraph.py +++ b/maseval/interface/agents/langgraph.py @@ -9,7 +9,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["LangGraphAgentAdapter", "LangGraphLLMUser"] @@ -123,7 +123,7 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, config: Optional[Dict[str, Any]] = None, - cost_calculator: Any = None, + cost_calculator: Optional[CostCalculator] = None, model_id: Optional[str] = None, ): """Initialize the LangGraph adapter. 
@@ -149,7 +149,8 @@ def __init__( super().__init__(agent_instance, name, callbacks, cost_calculator=cost_calculator, model_id=model_id) self._langgraph_config = config self._last_result = None - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False def get_messages(self) -> MessageHistory: """Get message history from LangGraph. @@ -234,18 +235,14 @@ def gather_config(self) -> dict[str, Any]: return base_config - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage from LangGraph message metadata. 
diff --git a/maseval/interface/agents/llamaindex.py b/maseval/interface/agents/llamaindex.py index 5c1de402..30ce4283 100644 --- a/maseval/interface/agents/llamaindex.py +++ b/maseval/interface/agents/llamaindex.py @@ -10,7 +10,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["LlamaIndexAgentAdapter", "LlamaIndexLLMUser"] @@ -118,7 +118,7 @@ def __init__( name: str, callbacks: Optional[List[Any]] = None, max_iterations: Optional[int] = None, - cost_calculator: Any = None, + cost_calculator: Optional[CostCalculator] = None, model_id: Optional[str] = None, ): """Initialize the LlamaIndex adapter. @@ -143,7 +143,8 @@ def __init__( self._last_result = None self._message_cache: List[Dict[str, Any]] = [] self._max_iterations = max_iterations - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False def get_messages(self) -> MessageHistory: """Get message history from LlamaIndex. @@ -224,7 +225,7 @@ def gather_config(self) -> Dict[str, Any]: return base_config - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from LlamaIndex agent. 
LlamaIndex agents store their LLM in ``self.llm``, which has a @@ -244,18 +245,14 @@ def _resolve_model_id(self): pass return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage from LlamaIndex execution logs. 
diff --git a/maseval/interface/agents/smolagents.py b/maseval/interface/agents/smolagents.py index d0ae1905..25057208 100644 --- a/maseval/interface/agents/smolagents.py +++ b/maseval/interface/agents/smolagents.py @@ -7,7 +7,7 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional from maseval import AgentAdapter, MessageHistory, LLMUser -from maseval.core.usage import TokenUsage, Usage +from maseval.core.usage import CostCalculator, TokenUsage, Usage __all__ = ["SmolAgentAdapter", "SmolAgentLLMUser"] @@ -102,7 +102,14 @@ class SmolAgentAdapter(AgentAdapter): smolagents to be installed: `pip install maseval[smolagents]` """ - def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_calculator: Any = None, model_id: Optional[str] = None): + def __init__( + self, + agent_instance: Any, + name: str, + callbacks: Any = None, + cost_calculator: Optional[CostCalculator] = None, + model_id: Optional[str] = None, + ): """Initialize the Smolagent adapter. Note: We don't call super().__init__() to avoid initializing self.logs as a list, @@ -124,7 +131,8 @@ def __init__(self, agent_instance: Any, name: str, callbacks: Any = None, cost_c self.messages = None self._cost_calculator = cost_calculator self._model_id = model_id - self._auto_calculator = None # Lazy-initialized + self._auto_calculator: Optional[CostCalculator] = None + self._auto_attempted = False @property def logs(self) -> List[Dict[str, Any]]: # type: ignore[override] @@ -333,7 +341,7 @@ def gather_traces(self) -> dict: return base_logs - def _resolve_model_id(self): + def _resolve_model_id(self) -> Optional[str]: """Auto-detect model ID from smolagents agent. 
All smolagents model classes (LiteLLMModel, OpenAIServerModel, @@ -345,19 +353,14 @@ def _resolve_model_id(self): except AttributeError: return None - def _resolve_cost_calculator(self): + def _resolve_cost_calculator(self) -> Optional[CostCalculator]: """Return the cost calculator, auto-creating one if litellm is available.""" - if self._cost_calculator is not None: - return self._cost_calculator - # Lazy auto-create: try LiteLLMCostCalculator once - if self._auto_calculator is None: - try: - from maseval.interface.usage import LiteLLMCostCalculator + from maseval.interface.agents._cost import resolve_auto_cost_calculator - self._auto_calculator = LiteLLMCostCalculator() - except (ImportError, Exception): - self._auto_calculator = False # Sentinel: don't retry - return self._auto_calculator if self._auto_calculator is not False else None + calculator, self._auto_calculator, self._auto_attempted = resolve_auto_cost_calculator( + self._cost_calculator, self._auto_calculator, self._auto_attempted + ) + return calculator def _gather_usage(self) -> Usage: """Gather aggregated token usage across all agent steps. 
diff --git a/pyproject.toml b/pyproject.toml index a352a908..45805087 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -31,10 +31,10 @@ dependencies = [ # Enable optional dependencies for end users [project.optional-dependencies] # Agent frameworks -smolagents = ["smolagents>=1.21.3"] -langgraph = ["langgraph>=0.6.0"] -llamaindex = ["llama-index-core>=0.12.0"] -camel = ["camel-ai>=0.2.0"] +smolagents = ["smolagents>=1.21.3", "litellm>=1.0.0"] +langgraph = ["langgraph>=0.6.0", "litellm>=1.0.0"] +llamaindex = ["llama-index-core>=0.12.0", "litellm>=1.0.0"] +camel = ["camel-ai>=0.2.0", "litellm>=1.0.0"] # Inference engines anthropic = ["anthropic>=0.40.0"] diff --git a/tests/test_core/test_registry.py b/tests/test_core/test_registry.py index 17b30c5b..f3afd27f 100644 --- a/tests/test_core/test_registry.py +++ b/tests/test_core/test_registry.py @@ -262,43 +262,6 @@ def worker(worker_id: int): # ==================== Usage Tracking Tests ==================== -class MockUsageComponent(TraceableMixin): - """Component that implements UsageTrackableMixin for testing.""" - - def __init__(self, name: str, cost: float = 0.0, input_tokens: int = 0, output_tokens: int = 0): - super().__init__() - self._name = name - self._cost = cost - self._input_tokens = input_tokens - self._output_tokens = output_tokens - - def gather_traces(self) -> Dict[str, Any]: - return {"name": self._name} - - def gather_usage(self): - from maseval.core.usage import TokenUsage - - return TokenUsage( - cost=self._cost, - input_tokens=self._input_tokens, - output_tokens=self._output_tokens, - total_tokens=self._input_tokens + self._output_tokens, - ) - - -class MockBrokenUsageComponent(TraceableMixin): - """Component whose gather_usage raises an exception.""" - - def __init__(self): - super().__init__() - - def gather_traces(self) -> Dict[str, Any]: - return {} - - def gather_usage(self): - raise RuntimeError("Usage collection failed") - - class UsageAwareComponent(TraceableMixin, UsageTrackableMixin): 
"""Component with both tracing and usage tracking.""" From d7a8b0b9507e6b7310b420f24f520876061be3fa Mon Sep 17 00:00:00 2001 From: cemde Date: Tue, 17 Mar 2026 18:20:04 +0100 Subject: [PATCH 19/19] fixed changelog --- CHANGELOG.md | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 61e35d06..53911dc0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,13 +11,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Core** -- Usage and cost tracking as a first-class collection axis alongside tracing and configuration. `Usage` and `TokenUsage` data classes record billable resource consumption (tokens, API calls, custom units). `UsageTrackableMixin` enables automatic collection via `gather_usage()`. `ModelAdapter` tracks token usage automatically after each `chat()` call with no changes required from benchmark implementers. (PR: #45) -- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates (supports USD, EUR, credits, or any unit). Pass a `cost_calculator` to any `ModelAdapter` to fill in `Usage.cost` when the provider doesn't report it. Provider-reported cost always takes precedence. (PR: #45) -- `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's bundled model database. Supports `custom_pricing` overrides and `model_id_map` for remapping adapter model IDs to LiteLLM's naming convention. Requires `litellm`. (PR: #45) -- Cost calculation for agent adapters. `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both the model ID and cost calculator are auto-detected (model ID from the framework's agent object, calculator via `LiteLLMCostCalculator` if litellm is installed). For LangGraph, `model_id` must be passed explicitly since graphs can contain multiple models. 
Explicit `cost_calculator` and `model_id` always override auto-detection. (PR: #45) -- `UsageReporter` post-hoc analysis utility for slicing usage data from benchmark reports by task, component, or model. Create via `UsageReporter.from_reports(benchmark.reports)`. (PR: #45) -- Live usage totals accessible during benchmark execution via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Totals persist across task repetitions. (PR: #45) -- `ComponentRegistry` gains usage collection: `collect_usage()`, `total_usage`, and `usage_by_component` properties, parallel to existing trace and config collection. (PR: #45) +- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45) +- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45) +- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. 
(PR: #45) - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42) - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42) @@ -55,8 +51,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 **Documentation** -- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) covering automatic LLM tracking, cost calculators, non-LLM usage, post-hoc analysis with `UsageReporter`, and the data model. (PR: #45) -- Usage & Cost reference page (`docs/reference/usage.md`) with API documentation for all usage and cost classes. (PR: #45) +- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45) **Core**
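The summable-usage semantics described above (records with `__add__`, accumulated via `sum(records, Usage())`, with costs summing and per-unit counters merging key-wise) can be illustrated with a self-contained sketch. Field names follow the plan's data model, but this is not the library code:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Usage:
    cost: float = 0.0
    units: Dict[str, float] = field(default_factory=dict)

    def __add__(self, other: "Usage") -> "Usage":
        # Costs sum; per-unit counters merge key-wise.
        merged = dict(self.units)
        for key, value in other.units.items():
            merged[key] = merged.get(key, 0) + value
        return Usage(cost=self.cost + other.cost, units=merged)


records = [
    Usage(cost=0.01, units={"api_calls": 1}),
    Usage(cost=0.03, units={"api_calls": 2, "bytes": 1024}),
]
# Usage() as the start value makes empty record lists well-defined.
total = sum(records, Usage())
print(round(total.cost, 4))  # 0.04
print(total.units)           # {'api_calls': 3, 'bytes': 1024}
```

The frozen dataclass mirrors the `dataclasses.replace(usage, cost=...)` pattern used in the diffs above: records are never mutated in place, only recombined.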