Proposal: Add built-in tool_parameter_match evaluator #5643

Description

@fabrizioamort

🔴 Required Information

Is your feature request related to a specific problem?

ADK does not currently expose a dedicated built-in metric for tool argument quality. Existing trajectory metrics tell you whether a tool was called, but not whether the agent supplied the right parameters. This hides important failure modes:

  • correct tool name with incorrect required arguments
  • partial argument correctness — values that are close but not exact
  • repeated same-tool calls that require stable one-to-one alignment between expected and actual invocations before argument-level scoring is meaningful

For regression tracking during agent development, "the right tool was called" is necessary but not sufficient. Without argument-level scoring, agents that systematically miscall a correctly-chosen tool produce no signal in built-in metrics.

Describe the Solution You'd Like

A new built-in evaluator named tool_parameter_match that scores the quality of tool-call arguments after deterministic alignment between expected and actual invocations.

  • Metric name: tool_parameter_match
  • Criterion type: ToolParameterMatchCriterion(BaseCriterion)
  • Match modes: name_only, name_and_args, name_and_required_args (alignment behavior, mirroring trajectory scoring)
  • Argument strategies: exact, casefold_exact, numeric, contains (see the sketch after this list)
  • Per-argument strategy override: optional per_arg_strategies dict
  • Numeric tolerance: optional numeric_tolerance for the numeric strategy
  • Returns NOT_EVALUATED when reference invocations are unavailable
  • Returns per-invocation NOT_EVALUATED when an invocation has zero expected tool calls
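
For concreteness, here is a rough sketch of the four argument strategies. The helper name match_argument and its string-coercion behavior are illustrative only, not part of the proposed API:

def match_argument(expected, actual, strategy="exact", numeric_tolerance=0.0) -> float:
    """Score one argument: 1.0 on a match under the given strategy, else 0.0."""
    if strategy == "exact":
        return 1.0 if expected == actual else 0.0
    if strategy == "casefold_exact":
        # Case-insensitive comparison after string coercion.
        return 1.0 if str(expected).casefold() == str(actual).casefold() else 0.0
    if strategy == "numeric":
        try:
            return 1.0 if abs(float(expected) - float(actual)) <= numeric_tolerance else 0.0
        except (TypeError, ValueError):
            return 0.0  # non-numeric value under a numeric strategy
    if strategy == "contains":
        return 1.0 if str(expected) in str(actual) else 0.0
    raise ValueError(f"unknown strategy: {strategy}")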

Per-call scoring (for each matched expected tool call):

  • score the expected argument keys only
  • average the per-argument match scores
  • if the expected argument dict is empty, the tool-call score is 1.0

For unmatched expected calls: score 0.0.

Case score is the mean of evaluated per-invocation scores, excluding NOT_EVALUATED invocations.
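
A rough sketch of these aggregation rules, reusing match_argument from the sketch above (helper names are illustrative; call alignment is assumed to have already produced matched pairs, and None stands in for NOT_EVALUATED):

def score_tool_call(expected_args, actual_args, default_strategy="exact",
                    per_arg_strategies=None, numeric_tolerance=0.0) -> float:
    """Average of per-argument scores over the expected keys only."""
    if not expected_args:
        return 1.0  # empty expected-argument dict scores 1.0 by definition
    overrides = per_arg_strategies or {}
    scores = [
        match_argument(
            expected,
            actual_args.get(key),
            overrides.get(key, default_strategy),
            numeric_tolerance,
        )
        for key, expected in expected_args.items()
    ]
    return sum(scores) / len(scores)

def score_case(per_invocation_scores) -> float | None:
    """Mean of evaluated invocation scores; None (NOT_EVALUATED) when none were evaluated."""
    evaluated = [s for s in per_invocation_scores if s is not None]
    return sum(evaluated) / len(evaluated) if evaluated else None

An unmatched expected call would contribute a flat 0.0 alongside these per-call scores when scores are averaged within an invocation.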

Impact on your work

I am building an evaluation harness for ADK-based agents and need a deterministic, regression-friendly metric for argument-level correctness. Without one, regressions where the agent picks the right tool but supplies wrong or partial arguments are invisible to built-in scoring, which slows down iterative improvement.

No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.

Willingness to contribute

Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).


🟡 Recommended Information

Describe Alternatives You've Considered

  • Custom metric function via custom_metrics: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval-set configs the way built-in metrics can.
  • Embedding argument matching inside tool_trajectory_in_order: would conflate trajectory shape with argument quality and break existing eval sets that rely on the trajectory metric's current binary semantics.

Proposed API / Implementation

# eval_metrics.py
from typing import Literal

ArgumentStrategy = Literal["exact", "casefold_exact", "numeric", "contains"]

class ToolParameterMatchCriterion(BaseCriterion):
    match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_and_required_args"
    default_strategy: ArgumentStrategy = "exact"
    per_arg_strategies: dict[str, ArgumentStrategy] | None = None
    numeric_tolerance: float = 0.0
    ordered: bool = True

# parameter_match_evaluator.py
from typing import ClassVar

class ToolParameterMatchEvaluator(Evaluator):
    criterion_type: ClassVar = ToolParameterMatchCriterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...

Registration in _get_default_metric_evaluator_registry():

MetricInfo(
    metric_name="tool_parameter_match",
    description="Argument-level match score for deterministically aligned tool calls.",
),

Usage in an eval set config:

{
  "metric": "tool_parameter_match",
  "criterion": {
    "threshold": 0.8,
    "matchMode": "name_and_required_args",
    "defaultStrategy": "exact",
    "perArgStrategies": { "temperature": "numeric" },
    "numericTolerance": 0.5
  }
}
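
The same criterion constructed programmatically (assuming the camelCase JSON keys alias the snake_case Python fields, and that threshold is inherited from BaseCriterion):

criterion = ToolParameterMatchCriterion(
    threshold=0.8,  # assumed to be a BaseCriterion field
    match_mode="name_and_required_args",
    default_strategy="exact",
    per_arg_strategies={"temperature": "numeric"},
    numeric_tolerance=0.5,
)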

Additional Context

Filed alongside #5306 (tool_trajectory_f1). The two metrics are independent but address the same workflow gap — partial-credit, deterministic tool-use scoring for regression tracking. They are designed to compose: trajectory scoring decides whether the right tools were called; parameter scoring decides whether each call's arguments were correct.

Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_parameter_match is complementary — it provides positive argument-quality scoring rather than argument filtering.

I'd appreciate early feedback on two API questions:

  1. Is a separate ToolParameterMatchCriterion preferred, or should argument-quality scoring be folded into an existing argument-aware criterion?
  2. Should call-alignment logic be shared with tool_trajectory_in_order / tool_trajectory_f1 via a common helper, or recomputed inside each metric independently?
