Interactive environments for evaluating AI agents and for RL training, built on replicas of third-party APIs like Linear or Slack.
Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.
Website • Docs • Paper • Feedback
| Notebook | Description |
|---|---|
| LangChain Agent | Run AgentDiff Benchmark (LangChain Agents) |
| ReAct Agent (Paper) | Run AgentDiff Benchmark (ReAct) |
| Prime Intellect Environment | Run evals or RL training |
| Custom Evaluations Demo | Write your own assertions & evaluate agents |
Python: Python SDK docs

```bash
uv add agent-diff
```

TypeScript: TS SDK docs

```bash
npm install agent-diff
```

### Hosted
- Sign up at agentdiff.dev and get your API key
- Set environment variables:

```bash
export AGENT_DIFF_API_KEY="ad_live_sk_..."
export AGENT_DIFF_BASE_URL="https://api.agentdiff.dev"
```

### Self-Hosted
```bash
git clone https://github.com/agent-diff-bench/agent-diff.git
cd agent-diff/ops
docker-compose up --build
# Backend runs on http://localhost:8000
```

### Quickstart

```python
from agent_diff import AgentDiff

client = AgentDiff()

# Create an isolated environment from a template
env = client.init_env(
    templateService="slack",
    templateName="slack_default",
    impersonateUserId="U01AGENBOT9",
)

# Snapshot before agent runs
run = client.start_run(envId=env.environmentId)

# --- Your agent interacts with the API here ---
# SDK provides code execution proxies (Python/Bash) for OpenAI Agents, LangChain, etc.
# Agent writes normal code (e.g. requests.post('https://slack.com/api/chat.postMessage', ...))
# which is automatically intercepted and routed to the sandboxed environment.
from agent_diff import BashExecutorProxy, create_openai_tool

bash = BashExecutorProxy(env.environmentId)
tool = create_openai_tool(bash)  # also: create_langchain_tool, create_smolagents_tool

# Compute state diff and inspect changes
diff = client.diff_run(runId=run.runId)
print(diff.diff['inserts'])  # new records created by agent
print(diff.diff['updates'])  # modified records
print(diff.diff['deletes'])  # deleted records

# Clean up
client.delete_env(envId=env.environmentId)
```

See the Python SDK and TS SDK for full reference.
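The proxy tool from the quickstart can be handed straight to an agent framework. Here is a minimal sketch using the OpenAI Agents SDK (`Agent`, `Runner`, and `final_output` are that library's documented API; the agent name, instructions, and task prompt are illustrative):

```python
# Sketch: give an OpenAI Agents SDK agent the sandboxed bash tool from above.
# Assumes the openai-agents package is installed and `tool` was created with
# create_openai_tool as in the quickstart.
from agents import Agent, Runner

agent = Agent(
    name="slack-agent",  # illustrative
    instructions="Complete the task by writing bash commands that call the Slack API.",
    tools=[tool],
)

# The agent's HTTP calls to slack.com are transparently rerouted to the sandbox.
result = Runner.run_sync(agent, "Post 'hello' to the #general channel.")
print(result.final_output)
```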
| Service | Type | Endpoints | Coverage |
|---|---|---|---|
| Box | REST | 27 | Files, folders, search, comments, tags, shared links, hubs, versioning |
| Google Calendar | REST | 37 | Calendars, events, recurring series, free/busy, ACL, push notifications |
| Linear | GraphQL | 19 | Issues, teams, workflow states, labels, comments, relations, memberships |
| Slack | Web API | 25 | Conversations, messaging, reactions, threading, users, channels |
108 unique endpoints across all 4 services.
Templates are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:
- Location: Templates live in PostgreSQL schemas (e.g., `slack_default`, `box_default`, `linear_expanded`, `calendar_base`)
- Content: Seeded with realistic data — users, channels, messages, files, folders, issues, calendar events, etc.
- Seeds: box | calendar | linear | slack
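The same `init_env` call from the quickstart works against any of these seeds. A sketch using the Linear template (the impersonated user ID below is a placeholder; real IDs come from the seed data, and only the Slack seed's bot user appears in this README):

```python
# Sketch: boot an environment from the expanded Linear seed instead of Slack.
# The user ID is a placeholder, not a real ID from the linear_expanded seed.
linear_env = client.init_env(
    templateService="linear",
    templateName="linear_expanded",
    impersonateUserId="<user-id-from-linear-seed>",
)
print(linear_env.environmentId)  # use this ID in proxies, runs, and cleanup
```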
Environments are isolated, temporary copies of a template schema:
- URL: Each environment has a unique service URL (e.g., `http://localhost:8000/api/env/{env_id}/services/slack`)
- Creation: `client.init_env(templateService="slack", templateName="slack_default", impersonateUserId="U01AGENBOT9")`
- Cleanup: `client.delete_env(envId)` or auto-expires after TTL
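Because every environment is just an HTTP endpoint, you can also hit the replica directly without the SDK proxies. A sketch against a self-hosted backend (appending Slack's usual `/api/chat.postMessage` path to the environment URL is an assumption based on the reroute example in the quickstart, and whether an Authorization header is required is also an assumption; check the API reference):

```python
# Sketch: call the sandboxed Slack replica directly over HTTP.
import requests

base = f"http://localhost:8000/api/env/{env.environmentId}/services/slack"

# Assumption: replica routes mirror Slack's real Web API paths.
resp = requests.post(
    f"{base}/api/chat.postMessage",
    json={"channel": "#general", "text": "hello from the sandbox"},
)
print(resp.json())
```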
The Agent-Diff benchmark comprises 224 tasks across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks range from single-step CRUD operations to long-horizon, multi-entity workflows that require search, conditional logic, and coordinated state changes.
| Model | Box | Calendar | Linear | Slack | Overall | Pass % | Cost/test | Score/$ |
|---|---|---|---|---|---|---|---|---|
| deepseek-v3.2 | 76.6 | 87.5 | 94.8 | 86.1 | 88.1 | 76 | $0.03 | 2,938 |
| devstral-2512 | 79.0 | 80.0 | 91.5 | 85.7 | 86.0 | 74 | $0.08 | 1,075 |
| qwen3-vl-235b | 68.4 | 71.0 | 82.0 | 75.8 | 79.2 | 65 | $0.02 | 3,959 |
| kimi-k2-0905 | 66.5 | 72.3 | 88.2 | 82.2 | 75.4 | 64 | $0.04 | 1,885 |
| grok-4.1-fast | 58.5 | 75.7 | 66.0 | 77.1 | 74.9 | 52 | $0.01 | 7,489 |
| gemini-3-flash | 80.3 | 62.2 | 84.0 | 77.5 | 73.8 | 67 | $0.05 | 1,477 |
| gpt-oss-120b | 70.1 | 68.4 | 79.5 | 69.1 | 68.5 | 60 | $0.02 | 3,428 |
| claude-haiku-4.5 | 45.1 | 57.8 | 35.6 | 57.3 | 49.3 | 50 | $0.22 | 224 |
| llama-4-scout | 33.7 | 41.4 | 20.9 | 42.9 | 38.0 | 29 | $0.02 | 1,900 |
Per-service, assertion-weighted scores (95% Bayesian credible intervals). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation-ablation results are in the paper.
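The same diff payload the quickstart printed can also back custom evaluations outside the benchmark. A minimal hand-rolled check might look like the sketch below; the per-record fields (`table`, `data`) are assumptions about the diff format, and the declarative assertion DSL in the docs is the supported path:

```python
# Sketch of a hand-rolled pass/fail check on top of diff_run output.
# The 'table' and 'data' record fields are assumptions; see the assertions
# docs for the supported declarative DSL.
diff = client.diff_run(runId=run.runId)

def agent_posted_message(diff, expected_text):
    """Pass if the agent inserted a Slack message containing expected_text."""
    for record in diff.diff["inserts"]:
        if record.get("table") == "messages" and expected_text in str(record.get("data", "")):
            return True
    return False

print("PASS" if agent_posted_message(diff, "hello") else "FAIL")
```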
- Prime Intellect — Run evals or RL training with no setup required
- Colab Notebooks — Run locally with the example notebooks above
- Dataset — 224 tasks across all 4 services (80/20 train/test split). Each test defines expected state changes via declarative assertions. See the assertions docs for how they work.
- Python SDK — Full Python SDK reference
- TypeScript SDK — Full TypeScript SDK reference
- Assertions & Evaluation DSL — Write test assertions
- API Reference — REST API documentation
- Self-Hosting — Docker setup & configuration
If you use Agent-Diff in your research, please cite:
```bibtex
@article{pysklo2025agentdiff,
  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
  journal={arXiv preprint arXiv:2602.11224},
  year={2025}
}
```