🔴 Required Information
Is your feature request related to a specific problem?
ADK does not currently expose a dedicated built-in metric for tool argument quality. Existing trajectory metrics tell you whether a tool was called, but not whether the agent supplied the right parameters. This hides important failure modes:
- correct tool name with incorrect required arguments
- partial argument correctness — values that are close but not exact
- repeated same-tool calls that require stable one-to-one alignment between expected and actual invocations before argument-level scoring is meaningful
For regression tracking during agent development, "the right tool was called" is necessary but not sufficient. Without argument-level scoring, agents that systematically miscall a correctly-chosen tool produce no signal in built-in metrics.
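For concreteness, a hypothetical instance of the first two failure modes (tool name and values invented for illustration):

```python
# Hypothetical expected vs. actual tool calls: right tool, wrong/partial arguments.
expected_call = {"name": "get_weather", "args": {"city": "Paris", "unit": "celsius"}}
actual_call = {"name": "get_weather", "args": {"city": "paris", "unit": "fahrenheit"}}

# A name-only trajectory check counts this as a match; a strict all-or-nothing
# argument check counts it as a total miss. Neither surfaces that "city" is a
# near-match (casing only) while "unit" is genuinely wrong.
```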
Describe the Solution You'd Like
A new built-in evaluator named `tool_parameter_match` that scores the quality of tool-call arguments after deterministic alignment between expected and actual invocations.

- Metric name: `tool_parameter_match`
- Criterion type: `ToolParameterMatchCriterion(BaseCriterion)`
- Match modes: `name_only`, `name_and_args`, `name_and_required_args` (alignment behavior, mirroring trajectory scoring)
- Argument strategies: `exact`, `casefold_exact`, `numeric`, `contains` (see the sketch after this list)
- Per-argument strategy override: optional `per_arg_strategies` dict
- Numeric tolerance: optional `numeric_tolerance` for the `numeric` strategy
- Returns `NOT_EVALUATED` when reference invocations are unavailable
- Returns per-invocation `NOT_EVALUATED` when an invocation has zero expected tool calls
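A minimal sketch of how the four argument strategies could be interpreted. The `match_arg` helper is hypothetical and the exact semantics would be settled in review; this is illustration, not a spec:

```python
# Sketch: per-argument matching for the four proposed strategies.
# `numeric_tolerance` only applies to the "numeric" strategy.
def match_arg(expected, actual, strategy: str, numeric_tolerance: float = 0.0) -> float:
  if actual is None:
    return 0.0  # argument missing from the actual call
  if strategy == "exact":
    return 1.0 if expected == actual else 0.0
  if strategy == "casefold_exact":
    return 1.0 if str(expected).casefold() == str(actual).casefold() else 0.0
  if strategy == "numeric":
    try:
      return 1.0 if abs(float(expected) - float(actual)) <= numeric_tolerance else 0.0
    except (TypeError, ValueError):
      return 0.0
  if strategy == "contains":
    return 1.0 if str(expected) in str(actual) else 0.0
  raise ValueError(f"unknown strategy: {strategy}")
```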
Per-call scoring (for each matched expected tool call):
- score the expected argument keys only
- average the per-argument match scores
- if the expected argument dict is empty, the tool-call score is `1.0`

For unmatched expected calls: score `0.0`.

Case score is the mean of evaluated per-invocation scores, excluding `NOT_EVALUATED` invocations.
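A minimal sketch of that aggregation, reusing the hypothetical `match_arg` from the previous snippet; the alignment step that produces `(expected, actual-or-None)` pairs is assumed and not shown:

```python
# Sketch: per-call, per-invocation, and per-case aggregation as proposed above.
def score_tool_call(expected_args: dict, actual_args: dict, criterion) -> float:
  if not expected_args:
    return 1.0  # empty expected argument dict scores 1.0
  overrides = criterion.per_arg_strategies or {}
  per_arg = [
      match_arg(
          expected_value,
          actual_args.get(key),
          overrides.get(key, criterion.default_strategy),
          criterion.numeric_tolerance,
      )
      for key, expected_value in expected_args.items()  # expected keys only
  ]
  return sum(per_arg) / len(per_arg)


def score_invocation(aligned_pairs, criterion) -> float | None:
  # aligned_pairs: list of (expected_call, matched_actual_call_or_None).
  if not aligned_pairs:
    return None  # stands in for NOT_EVALUATED (no expected tool calls)
  scores = [
      score_tool_call(exp["args"], act["args"], criterion) if act is not None else 0.0
      for exp, act in aligned_pairs
  ]
  return sum(scores) / len(scores)


def case_score(invocation_scores: list[float | None]) -> float | None:
  evaluated = [s for s in invocation_scores if s is not None]  # drop NOT_EVALUATED
  return sum(evaluated) / len(evaluated) if evaluated else None
```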
Impact on your work
I am building an evaluation harness for ADK-based agents and need a deterministic, regression-friendly metric for argument-level correctness. Without one, regressions where the agent picks the right tool but supplies wrong or partial arguments are invisible to built-in scoring, which slows down iterative improvement.
No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.
Willingness to contribute
Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).
🟡 Recommended Information
Describe Alternatives You've Considered
- Custom metric function via `custom_metrics`: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval-set configs the way built-in metrics can.
- Embedding argument matching inside `tool_trajectory_in_order`: would conflate trajectory shape with argument quality and break existing eval sets that rely on the trajectory metric's current binary semantics.
Proposed API / Implementation
```python
# eval_metrics.py
ArgumentStrategy = Literal["exact", "casefold_exact", "numeric", "contains"]

class ToolParameterMatchCriterion(BaseCriterion):
  match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_and_required_args"
  default_strategy: ArgumentStrategy = "exact"
  per_arg_strategies: dict[str, ArgumentStrategy] | None = None
  numeric_tolerance: float = 0.0
  ordered: bool = True


# parameter_match_evaluator.py
class ToolParameterMatchEvaluator(Evaluator):
  criterion_type: ClassVar = ToolParameterMatchCriterion

  async def evaluate_invocations(
      self,
      actual_invocations,
      expected_invocations,
      conversation_scenario=None,
  ) -> EvaluationResult:
    ...
```
Registration in `_get_default_metric_evaluator_registry()`:

```python
MetricInfo(
    metric_name="tool_parameter_match",
    description="Argument-level match score for deterministically aligned tool calls.",
),
```
Usage in an eval set config:

```json
{
  "metric": "tool_parameter_match",
  "criterion": {
    "threshold": 0.8,
    "matchMode": "name_and_required_args",
    "defaultStrategy": "exact",
    "perArgStrategies": { "temperature": "numeric" },
    "numericTolerance": 0.5
  }
}
```
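The same criterion expressed in Python, assuming `threshold` is inherited from `BaseCriterion` and that the camelCase config keys alias the snake_case fields:

```python
# Equivalent criterion constructed directly (field names from the proposal above).
criterion = ToolParameterMatchCriterion(
    threshold=0.8,
    match_mode="name_and_required_args",
    default_strategy="exact",
    per_arg_strategies={"temperature": "numeric"},
    numeric_tolerance=0.5,
)
```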
Additional Context
Filed alongside #5306 (tool_trajectory_f1). The two metrics are independent but address the same workflow gap — partial-credit, deterministic tool-use scoring for regression tracking. They are designed to compose: trajectory scoring decides whether the right tools were called; parameter scoring decides whether each call's arguments were correct.
Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_parameter_match is complementary — it provides positive argument-quality scoring rather than argument filtering.
I'd appreciate early feedback on two API questions:
- Is a separate `ToolParameterMatchCriterion` preferred, or should argument-quality scoring be folded into an existing argument-aware criterion?
- Should call-alignment logic be shared with `tool_trajectory_in_order` / `tool_trajectory_f1` via a common helper, or recomputed inside each metric independently? (A rough sketch of what such a helper could look like follows below.)