Proposal: Add built-in tool_parameter_match evaluator #5643

Description

@fabrizioamort

🔴 Required Information

Is your feature request related to a specific problem?

ADK does not currently expose a dedicated built-in metric for tool argument quality. Existing trajectory metrics tell you whether a tool was called, but not whether the agent supplied the right parameters. This hides important failure modes:

  • correct tool name with incorrect required arguments
  • partial argument correctness — values that are close but not exact
  • repeated same-tool calls that require stable one-to-one alignment between expected and actual invocations before argument-level scoring is meaningful

For regression tracking during agent development, "the right tool was called" is necessary but not sufficient. Without argument-level scoring, agents that systematically miscall a correctly-chosen tool produce no signal in built-in metrics.

Describe the Solution You'd Like

A new built-in evaluator named tool_parameter_match that scores the quality of tool-call arguments after deterministic alignment between expected and actual invocations.

  • Metric name: tool_parameter_match
  • Criterion type: ToolParameterMatchCriterion(BaseCriterion)
  • Match modes: name_only, name_and_args, name_and_required_args (alignment behavior, mirroring trajectory scoring)
  • Argument strategies: exact, casefold_exact, numeric, contains (see the sketch after this list)
  • Per-argument strategy override: optional per_arg_strategies dict
  • Numeric tolerance: optional numeric_tolerance for the numeric strategy
  • Returns NOT_EVALUATED when reference invocations are unavailable
  • Returns per-invocation NOT_EVALUATED when an invocation has zero expected tool calls
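
For concreteness, here is a rough sketch of the four argument strategies. The helper name match_argument and its string-coercion behavior are illustrative only, not part of the proposed API:

def match_argument(expected, actual, strategy="exact", numeric_tolerance=0.0) -> float:
    """Score one argument: 1.0 on a match under the given strategy, else 0.0."""
    if strategy == "exact":
        return 1.0 if expected == actual else 0.0
    if strategy == "casefold_exact":
        # Case-insensitive comparison after string coercion.
        return 1.0 if str(expected).casefold() == str(actual).casefold() else 0.0
    if strategy == "numeric":
        try:
            return 1.0 if abs(float(expected) - float(actual)) <= numeric_tolerance else 0.0
        except (TypeError, ValueError):
            return 0.0  # non-numeric value under a numeric strategy
    if strategy == "contains":
        return 1.0 if str(expected) in str(actual) else 0.0
    raise ValueError(f"unknown strategy: {strategy}")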

Per-call scoring (for each matched expected tool call):

  • score the expected argument keys only
  • average the per-argument match scores
  • if the expected argument dict is empty, the tool-call score is 1.0

For unmatched expected calls: score 0.0.

Case score is the mean of evaluated per-invocation scores, excluding NOT_EVALUATED invocations.
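
A rough sketch of these aggregation rules, reusing match_argument from the sketch above (helper names are illustrative; call alignment is assumed to have already produced matched pairs, and None stands in for NOT_EVALUATED):

def score_tool_call(expected_args, actual_args, default_strategy="exact",
                    per_arg_strategies=None, numeric_tolerance=0.0) -> float:
    """Average of per-argument scores over the expected keys only."""
    if not expected_args:
        return 1.0  # empty expected-argument dict scores 1.0 by definition
    overrides = per_arg_strategies or {}
    scores = [
        match_argument(
            expected,
            actual_args.get(key),
            overrides.get(key, default_strategy),
            numeric_tolerance,
        )
        for key, expected in expected_args.items()
    ]
    return sum(scores) / len(scores)

def score_case(per_invocation_scores) -> float | None:
    """Mean of evaluated invocation scores; None (NOT_EVALUATED) when none were evaluated."""
    evaluated = [s for s in per_invocation_scores if s is not None]
    return sum(evaluated) / len(evaluated) if evaluated else None

An unmatched expected call would contribute a flat 0.0 alongside these per-call scores when scores are averaged within an invocation.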

Impact on your work

I am building an evaluation harness for ADK-based agents and need a deterministic, regression-friendly metric for argument-level correctness. Without one, regressions where the agent picks the right tool but supplies wrong or partial arguments are invisible to built-in scoring, which slows down iterative improvement.

No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.

Willingness to contribute

Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).


🟡 Recommended Information

Describe Alternatives You've Considered

  • Custom metric function via custom_metrics: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval-set configs the way built-in metrics can.
  • Embedding argument matching inside tool_trajectory_in_order: would conflate trajectory shape with argument quality and break existing eval sets that rely on the trajectory metric's current binary semantics.

Proposed API / Implementation

# eval_metrics.py
from typing import Literal

ArgumentStrategy = Literal["exact", "casefold_exact", "numeric", "contains"]

class ToolParameterMatchCriterion(BaseCriterion):
    match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_and_required_args"
    default_strategy: ArgumentStrategy = "exact"
    per_arg_strategies: dict[str, ArgumentStrategy] | None = None
    numeric_tolerance: float = 0.0
    ordered: bool = True

# parameter_match_evaluator.py
from typing import ClassVar

class ToolParameterMatchEvaluator(Evaluator):
    criterion_type: ClassVar = ToolParameterMatchCriterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...

Registration in _get_default_metric_evaluator_registry():

MetricInfo(
    metric_name="tool_parameter_match",
    description="Argument-level match score for deterministically aligned tool calls.",
),

Usage in an eval set config:

{
  "metric": "tool_parameter_match",
  "criterion": {
    "threshold": 0.8,
    "matchMode": "name_and_required_args",
    "defaultStrategy": "exact",
    "perArgStrategies": { "temperature": "numeric" },
    "numericTolerance": 0.5
  }
}
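
The same criterion constructed programmatically (assuming the camelCase JSON keys alias the snake_case Python fields, and that threshold is inherited from BaseCriterion):

criterion = ToolParameterMatchCriterion(
    threshold=0.8,  # assumed to be a BaseCriterion field
    match_mode="name_and_required_args",
    default_strategy="exact",
    per_arg_strategies={"temperature": "numeric"},
    numeric_tolerance=0.5,
)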

Additional Context

Filed alongside #5306 (tool_trajectory_f1). The two metrics are independent but address the same workflow gap — partial-credit, deterministic tool-use scoring for regression tracking. They are designed to compose: trajectory scoring decides whether the right tools were called; parameter scoring decides whether each call's arguments were correct.

Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_parameter_match is complementary — it provides positive argument-quality scoring rather than argument filtering.

I'd appreciate early feedback on two API questions:

  1. Is a separate ToolParameterMatchCriterion preferred, or should argument-quality scoring be folded into an existing argument-aware criterion?
  2. Should call-alignment logic be shared with tool_trajectory_in_order / tool_trajectory_f1 via a common helper, or recomputed inside each metric independently?
