16 commits
- `d019e0b` feat: Add ExecuteSkillScriptTool for running skill scripts via code e… (caohy1988, Feb 21, 2026)
- `06d995e` fix: Address Gemini Code Assist review — shell injection, shlex, chec… (caohy1988, Feb 21, 2026)
- `e83de80` fix: Address code review findings for ExecuteSkillScriptTool (caohy1988, Feb 21, 2026)
- `52b8563` docs: Add code executor enhancements design document (caohy1988, Feb 22, 2026)
- `4142c37` docs: Address 8 architectural review findings in code executor design… (caohy1988, Feb 22, 2026)
- `8ca1111` docs: Fix 6 review findings — execution_id, PID namespace, Py version… (caohy1988, Feb 22, 2026)
- `d8692ba` docs: Fix container timeout DoS, pkill scope, stale recommendations (caohy1988, Feb 22, 2026)
- `f55da53` docs: Align roadmap with Option A, unify recovery policy, fix fallbac… (caohy1988, Feb 22, 2026)
- `f4fd794` docs: Fix PermissionError kill fallback, align non-goals with Option A (caohy1988, Feb 22, 2026)
- `4bb83a0` docs: Surface cleanup failure as unhealthy state, add post-kill threa… (caohy1988, Feb 22, 2026)
- `3221ac1` docs: Add _healthy guard and post-restart readiness validation (caohy1988, Feb 22, 2026)
- `c3a003d` docs: Document _healthy lifecycle (init, failure, reinit) (caohy1988, Feb 22, 2026)
- `369bba8` docs: Add public reinitialize() method to ContainerCodeExecutor API (caohy1988, Feb 22, 2026)
- `c735183` docs: Check exit_code on post-restart readiness validation (caohy1988, Feb 22, 2026)
- `11f65f0` feat: Add SkillsBench Docker-based evaluation pipeline (caohy1988, Feb 23, 2026)
- `f9a78a6` fix: Add per-command timeout to Docker executor (caohy1988, Feb 23, 2026)
**`benchmarks/skillsbench/README.md`** (new file, +131 lines)
# SkillsBench Evaluation Harness for ADK

Evaluates ADK's `SkillToolset` against tasks adapted from the
[SkillsBench](https://github.com/benchflow-ai/skillsbench) benchmark.

## Overview

This harness adapts 8 representative SkillsBench tasks as ADK skills and
evaluates them through the ADK evaluation framework. It tests whether an
agent can discover, load, and execute skills using the `SkillToolset`
tools: `list_skills`, `load_skill`, `load_skill_resource`, and
`execute_skill_script`.

## Task Categories

| # | Category | Skill | What it tests |
|---|----------|-------|---------------|
| 1 | Data Analysis | csv-aggregation | skill discovery + script execution |
| 2 | File Processing | json-transform | load_skill_resource + script |
| 3 | Web Scraping | html-extraction | skill with references |
| 4 | API Interaction | rest-client | multi-step skill usage |
| 5 | Text Transformation | regex-replace | simple script execution |
| 6 | Code Generation | function-scaffold | skill instruction following |
| 7 | Math Computation | statistical-calc | output validation |
| 8 | System Admin | log-parsing | complex skill with metadata |

## Setup

```bash
# From repo root
uv venv --python "python3.11" ".venv"
source .venv/bin/activate
uv sync --all-extras

# Set your API key
export GOOGLE_API_KEY="your-key-here"
```

## Usage

### Run with ADK CLI

```bash
# Interactive web UI
adk web benchmarks/skillsbench

# Run evaluation via ADK eval
adk eval benchmarks/skillsbench \
    benchmarks/skillsbench/eval_sets/skillsbench_eval.json
```

### Run standalone scorer

```bash
python benchmarks/skillsbench/runner.py
python benchmarks/skillsbench/runner.py --num-runs 3
python benchmarks/skillsbench/runner.py --eval-set path/to/custom_eval.json
```

### Output format

The standalone runner produces a per-task results table and a
leaderboard-format summary:

```
============================================================
Leaderboard Summary
============================================================
Model: gemini-2.5-flash
Framework: ADK SkillToolset
Tasks: X/8 (XX.X%)
Avg Discovery: X.XX
Avg Tool Usage: X.XX
Elapsed: XX.Xs
============================================================
```

## Custom Metrics

Three metrics are provided in `metrics.py`:

- **skill_discovery_score** — 1.0 if the agent called both `list_skills`
and `load_skill`, else 0.0
- **tool_usage_score** — Fraction of expected tool calls that were made,
  matched in any order (`ANY_ORDER`)
- **skillsbench_binary_score** — 1.0 if the final response contains all
expected reference lines, else 0.0

Reference these in eval configs via their dotted paths:
```
benchmarks.skillsbench.metrics.skill_discovery_score
benchmarks.skillsbench.metrics.tool_usage_score
benchmarks.skillsbench.metrics.skillsbench_binary_score
```
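For intuition, the order-insensitive fraction that `tool_usage_score` computes can be sketched in plain Python. This is a hypothetical helper, not the actual `metrics.py` implementation, and it assumes tool calls are available as simple lists of tool names:

```python
from collections import Counter


def tool_usage_fraction(expected_calls: list[str], actual_calls: list[str]) -> float:
  """Fraction of expected tool calls present in actual calls, order-insensitive."""
  if not expected_calls:
    return 1.0  # Nothing expected, trivially satisfied.
  actual = Counter(actual_calls)
  # Each expected call matches at most as many times as it actually occurred.
  matched = sum(
      min(need, actual[name]) for name, need in Counter(expected_calls).items()
  )
  return matched / len(expected_calls)
```

Duplicate expected calls are counted individually, so an agent that calls `load_skill` once when two calls are expected earns partial credit.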

## Directory Structure

```
benchmarks/skillsbench/
├── __init__.py
├── README.md
├── agent.py                    # ADK agent with SkillToolset
├── skills/                     # 8 adapted SkillsBench tasks
│   ├── csv-aggregation/
│   ├── json-transform/
│   ├── html-extraction/
│   ├── rest-client/
│   ├── regex-replace/
│   ├── function-scaffold/
│   ├── statistical-calc/
│   └── log-parsing/
├── eval_sets/
│   └── skillsbench_eval.json   # EvalSet with 8 cases
├── metrics.py                  # Custom metric functions
└── runner.py                   # Standalone runner
```

## Adding New Tasks

1. Create a skill directory under `skills/` with a `SKILL.md` following
the [Agent Skills spec](https://github.com/benchflow-ai/skillsbench)
2. Add scripts under `skills/<name>/scripts/`
3. Add references under `skills/<name>/references/` (optional)
4. Add the skill name to `_SKILL_NAMES` in `agent.py`
5. Add a new `EvalCase` entry to `eval_sets/skillsbench_eval.json`
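As an illustration, a minimal `SKILL.md` for a new task might look like the sketch below. The skill name and frontmatter fields shown here are assumptions based on common Agent Skills conventions; consult the spec linked above for the authoritative schema:

```markdown
---
name: word-count
description: Counts words in a text file using a bundled script.
---

# word-count

Run `scripts/count.py <file>` and report the integer it prints.
```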

## Security Note

This harness uses `UnsafeLocalCodeExecutor` for skill script execution.
For production or untrusted skill scripts, use `ContainerCodeExecutor`
or `VertexAICodeExecutor` instead.
**`benchmarks/skillsbench/__init__.py`** (new file, +15 lines)
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""SkillsBench evaluation harness for ADK SkillToolset."""
**`benchmarks/skillsbench/agent.py`** (new file, +73 lines)
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""SkillsBench evaluation agent with SkillToolset and Gemini Flash.

This agent loads all skills from the skills/ directory and uses
SkillToolset to provide list_skills, load_skill, load_skill_resource,
and execute_skill_script tools. It is designed to be evaluated against
the SkillsBench benchmark tasks.

WARNING: This agent uses UnsafeLocalCodeExecutor for script execution.
For production use, prefer ContainerCodeExecutor or VertexAICodeExecutor.
"""

import pathlib

from google.adk import Agent
from google.adk.code_executors.unsafe_local_code_executor import UnsafeLocalCodeExecutor
from google.adk.skills import load_skill_from_dir
from google.adk.tools.skill_toolset import SkillToolset

_SKILLS_DIR = pathlib.Path(__file__).parent / "skills"

_SKILL_NAMES = [
    "csv-aggregation",
    "json-transform",
    "html-extraction",
    "rest-client",
    "regex-replace",
    "function-scaffold",
    "statistical-calc",
    "log-parsing",
]

_skills = [load_skill_from_dir(_SKILLS_DIR / name) for name in _SKILL_NAMES]

skill_toolset = SkillToolset(
    skills=_skills,
    code_executor=UnsafeLocalCodeExecutor(),
)

root_agent = Agent(
    model="gemini-3-flash-preview",
    name="skillsbench_agent",
    description=(
        "An agent that completes tasks by discovering and using"
        " available skills from the SkillsBench benchmark."
    ),
    instruction=(
        "You are an agent that completes tasks by discovering and using"
        " available skills. Follow this workflow:\n"
        "1. Use list_skills to find relevant skills for the task.\n"
        "2. Use load_skill to read the skill's instructions carefully.\n"
        "3. Use load_skill_resource to examine references or sample data"
        " if available.\n"
        "4. Use execute_skill_script to run the skill's scripts with"
        " appropriate arguments.\n"
        "5. Interpret the output and present a clear answer.\n\n"
        "Always check skill instructions before executing scripts."
    ),
    tools=[skill_toolset],
)