16 commits
- `d019e0b` feat: Add ExecuteSkillScriptTool for running skill scripts via code e… (caohy1988, Feb 21, 2026)
- `06d995e` fix: Address Gemini Code Assist review — shell injection, shlex, chec… (caohy1988, Feb 21, 2026)
- `e83de80` fix: Address code review findings for ExecuteSkillScriptTool (caohy1988, Feb 21, 2026)
- `52b8563` docs: Add code executor enhancements design document (caohy1988, Feb 22, 2026)
- `4142c37` docs: Address 8 architectural review findings in code executor design… (caohy1988, Feb 22, 2026)
- `8ca1111` docs: Fix 6 review findings — execution_id, PID namespace, Py version… (caohy1988, Feb 22, 2026)
- `d8692ba` docs: Fix container timeout DoS, pkill scope, stale recommendations (caohy1988, Feb 22, 2026)
- `f55da53` docs: Align roadmap with Option A, unify recovery policy, fix fallbac… (caohy1988, Feb 22, 2026)
- `f4fd794` docs: Fix PermissionError kill fallback, align non-goals with Option A (caohy1988, Feb 22, 2026)
- `4bb83a0` docs: Surface cleanup failure as unhealthy state, add post-kill threa… (caohy1988, Feb 22, 2026)
- `3221ac1` docs: Add _healthy guard and post-restart readiness validation (caohy1988, Feb 22, 2026)
- `c3a003d` docs: Document _healthy lifecycle (init, failure, reinit) (caohy1988, Feb 22, 2026)
- `369bba8` docs: Add public reinitialize() method to ContainerCodeExecutor API (caohy1988, Feb 22, 2026)
- `c735183` docs: Check exit_code on post-restart readiness validation (caohy1988, Feb 22, 2026)
- `11f65f0` feat: Add SkillsBench Docker-based evaluation pipeline (caohy1988, Feb 23, 2026)
- `f9a78a6` fix: Add per-command timeout to Docker executor (caohy1988, Feb 23, 2026)
**`benchmarks/skillsbench/README.md`** (new file, +131 lines)
# SkillsBench Evaluation Harness for ADK

Evaluates ADK's `SkillToolset` against tasks adapted from the
[SkillsBench](https://github.com/benchflow-ai/skillsbench) benchmark.

## Overview

This harness adapts 8 representative SkillsBench tasks as ADK skills and
evaluates them through the ADK evaluation framework. It tests whether an
agent can discover, load, and execute skills using the `SkillToolset`
tools: `list_skills`, `load_skill`, `load_skill_resource`, and
`execute_skill_script`.

## Task Categories

| # | Category | Skill | What it tests |
|---|----------|-------|---------------|
| 1 | Data Analysis | csv-aggregation | skill discovery + script execution |
| 2 | File Processing | json-transform | load_skill_resource + script |
| 3 | Web Scraping | html-extraction | skill with references |
| 4 | API Interaction | rest-client | multi-step skill usage |
| 5 | Text Transformation | regex-replace | simple script execution |
| 6 | Code Generation | function-scaffold | skill instruction following |
| 7 | Math Computation | statistical-calc | output validation |
| 8 | System Admin | log-parsing | complex skill with metadata |

## Setup

```bash
# From repo root
uv venv --python "python3.11" ".venv"
source .venv/bin/activate
uv sync --all-extras

# Set your API key
export GOOGLE_API_KEY="your-key-here"
```

## Usage

### Run with ADK CLI

```bash
# Interactive web UI
adk web benchmarks/skillsbench

# Run evaluation via ADK eval
adk eval benchmarks/skillsbench \
    benchmarks/skillsbench/eval_sets/skillsbench_eval.json
```

### Run standalone scorer

```bash
python benchmarks/skillsbench/runner.py
python benchmarks/skillsbench/runner.py --num-runs 3
python benchmarks/skillsbench/runner.py --eval-set path/to/custom_eval.json
```

### Output format

The standalone runner produces a per-task results table and a
leaderboard-format summary:

```
============================================================
Leaderboard Summary
============================================================
Model: gemini-2.5-flash
Framework: ADK SkillToolset
Tasks: X/8 (XX.X%)
Avg Discovery: X.XX
Avg Tool Usage: X.XX
Elapsed: XX.Xs
============================================================
```

## Custom Metrics

Three metrics are provided in `metrics.py`:

- **skill_discovery_score** — 1.0 if the agent called both `list_skills`
and `load_skill`, else 0.0
- **tool_usage_score** — Fraction of expected tool calls that were made,
  matched in any order (`ANY_ORDER`)
- **skillsbench_binary_score** — 1.0 if the final response contains all
expected reference lines, else 0.0

Reference these in eval configs via their dotted paths:
```
benchmarks.skillsbench.metrics.skill_discovery_score
benchmarks.skillsbench.metrics.tool_usage_score
benchmarks.skillsbench.metrics.skillsbench_binary_score
```
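For intuition, the order-insensitive fraction that `tool_usage_score` computes can be sketched in plain Python. This is a hypothetical helper, not the actual `metrics.py` implementation, and it assumes tool calls are available as simple lists of tool names:

```python
from collections import Counter


def tool_usage_fraction(expected_calls: list[str], actual_calls: list[str]) -> float:
  """Fraction of expected tool calls present in actual calls, order-insensitive."""
  if not expected_calls:
    return 1.0  # Nothing expected, trivially satisfied.
  actual = Counter(actual_calls)
  # Each expected call matches at most as many times as it actually occurred.
  matched = sum(
      min(need, actual[name]) for name, need in Counter(expected_calls).items()
  )
  return matched / len(expected_calls)
```

Duplicate expected calls are counted individually, so an agent that calls `load_skill` once when two calls are expected earns partial credit.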

## Directory Structure

```
benchmarks/skillsbench/
├── __init__.py
├── README.md
├── agent.py                    # ADK agent with SkillToolset
├── skills/                     # 8 adapted SkillsBench tasks
│   ├── csv-aggregation/
│   ├── json-transform/
│   ├── html-extraction/
│   ├── rest-client/
│   ├── regex-replace/
│   ├── function-scaffold/
│   ├── statistical-calc/
│   └── log-parsing/
├── eval_sets/
│   └── skillsbench_eval.json   # EvalSet with 8 cases
├── metrics.py                  # Custom metric functions
└── runner.py                   # Standalone runner
```

## Adding New Tasks

1. Create a skill directory under `skills/` with a `SKILL.md` following
the [Agent Skills spec](https://github.com/benchflow-ai/skillsbench)
2. Add scripts under `skills/<name>/scripts/`
3. Add references under `skills/<name>/references/` (optional)
4. Add the skill name to `_SKILL_NAMES` in `agent.py`
5. Add a new `EvalCase` entry to `eval_sets/skillsbench_eval.json`
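As an illustration, a minimal `SKILL.md` for a new task might look like the sketch below. The skill name and frontmatter fields shown here are assumptions based on common Agent Skills conventions; consult the spec linked above for the authoritative schema:

```markdown
---
name: word-count
description: Counts words in a text file using a bundled script.
---

# word-count

Run `scripts/count.py <file>` and report the integer it prints.
```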

## Security Note

This harness uses `UnsafeLocalCodeExecutor` for skill script execution.
For production or untrusted skill scripts, use `ContainerCodeExecutor`
or `VertexAICodeExecutor` instead.
**`benchmarks/skillsbench/__init__.py`** (new file, +15 lines)
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""SkillsBench evaluation harness for ADK SkillToolset."""
**`benchmarks/skillsbench/agent.py`** (new file, +73 lines)
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""SkillsBench evaluation agent with SkillToolset and Gemini Flash.

This agent loads all skills from the skills/ directory and uses
SkillToolset to provide list_skills, load_skill, load_skill_resource,
and execute_skill_script tools. It is designed to be evaluated against
the SkillsBench benchmark tasks.

WARNING: This agent uses UnsafeLocalCodeExecutor for script execution.
For production use, prefer ContainerCodeExecutor or VertexAICodeExecutor.
"""

import pathlib

from google.adk import Agent
from google.adk.code_executors.unsafe_local_code_executor import UnsafeLocalCodeExecutor
from google.adk.skills import load_skill_from_dir
from google.adk.tools.skill_toolset import SkillToolset

_SKILLS_DIR = pathlib.Path(__file__).parent / "skills"

_SKILL_NAMES = [
    "csv-aggregation",
    "json-transform",
    "html-extraction",
    "rest-client",
    "regex-replace",
    "function-scaffold",
    "statistical-calc",
    "log-parsing",
]

_skills = [load_skill_from_dir(_SKILLS_DIR / name) for name in _SKILL_NAMES]

skill_toolset = SkillToolset(
    skills=_skills,
    code_executor=UnsafeLocalCodeExecutor(),
)

root_agent = Agent(
    model="gemini-3-flash-preview",
    name="skillsbench_agent",
    description=(
        "An agent that completes tasks by discovering and using"
        " available skills from the SkillsBench benchmark."
    ),
    instruction=(
        "You are an agent that completes tasks by discovering and using"
        " available skills. Follow this workflow:\n"
        "1. Use list_skills to find relevant skills for the task.\n"
        "2. Use load_skill to read the skill's instructions carefully.\n"
        "3. Use load_skill_resource to examine references or sample data"
        " if available.\n"
        "4. Use execute_skill_script to run the skill's scripts with"
        " appropriate arguments.\n"
        "5. Interpret the output and present a clear answer.\n\n"
        "Always check skill instructions before executing scripts."
    ),
    tools=[skill_toolset],
)