Agent Diff

Interactive environments for evaluating AI agents & RL training on replicas of 3rd party APIs like Linear or Slack.

Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.

arXiv · HuggingFace

Website · Docs · Paper · Feedback

Try it now

  • LangChain Agent · Run the AgentDiff Benchmark (LangChain Agents) · Open In Colab
  • ReAct Agent (Paper) · Run the AgentDiff Benchmark (ReAct) · Open In Colab
  • Prime Intellect Environment · Run evals or RL training · Prime Intellect
  • Custom Evaluations Demo · Write your own assertions & evaluate agents · Open In Colab

Quick Start

1. Install SDK

Python: Python SDK docs

uv add agent-diff

TypeScript: TS SDK docs

npm install agent-diff

2. Configure

Hosted
  1. Sign up at agentdiff.dev and get your API key
  2. Set environment variables:
export AGENT_DIFF_API_KEY="ad_live_sk_..."
export AGENT_DIFF_BASE_URL="https://api.agentdiff.dev"
Self-Hosted
git clone https://github.com/agent-diff-bench/agent-diff.git
cd agent-diff/ops
docker-compose up --build
# Backend runs on http://localhost:8000

3. Use

from agent_diff import AgentDiff

client = AgentDiff()

# Create an isolated environment from a template
env = client.init_env(
    templateService="slack",
    templateName="slack_default",
    impersonateUserId="U01AGENBOT9",
)

# Snapshot before agent runs
run = client.start_run(envId=env.environmentId)

# --- Your agent interacts with the API here ---
# SDK provides code execution proxies (Python/Bash) for OpenAI Agents, LangChain, etc.
# Agent writes normal code (e.g. requests.post('https://slack.com/api/chat.postMessage', ...))
# which is automatically intercepted and routed to the sandboxed environment.

from agent_diff import BashExecutorProxy, create_openai_tool
bash = BashExecutorProxy(env.environmentId)
tool = create_openai_tool(bash)  # also: create_langchain_tool, create_smolagents_tool

# Compute state diff and inspect changes
diff = client.diff_run(runId=run.runId)
print(diff.diff['inserts'])   # new records created by agent
print(diff.diff['updates'])   # modified records
print(diff.diff['deletes'])   # deleted records

# Clean up
client.delete_env(envId=env.environmentId)

See the Python SDK and TS SDK for full reference.
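The diff payload shown above can be post-processed like any dict. As a minimal sketch, here is a hypothetical helper that summarizes a diff, assuming only the shape shown in the quick start (`inserts`, `updates`, and `deletes` lists); the helper name and the example record shape are illustrative, not part of the SDK.

```python
# Hypothetical helper: summarize the diff payload returned by client.diff_run().
# Assumes diff.diff is a dict with "inserts", "updates", and "deletes" lists,
# as printed in the quick start above.

def summarize_diff(diff: dict) -> str:
    """Return a one-line summary of an Agent Diff state diff."""
    counts = {key: len(diff.get(key, [])) for key in ("inserts", "updates", "deletes")}
    return f"{counts['inserts']} inserted, {counts['updates']} updated, {counts['deletes']} deleted"

example = {
    "inserts": [{"table": "messages", "row": {"text": "hello"}}],
    "updates": [],
    "deletes": [],
}
print(summarize_diff(example))  # 1 inserted, 0 updated, 0 deleted
```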

Supported APIs

| Service | Type | Endpoints | Coverage |
|---|---|---|---|
| Box | REST | 27 | Files, folders, search, comments, tags, shared links, hubs, versioning |
| Google Calendar | REST | 37 | Calendars, events, recurring series, free/busy, ACL, push notifications |
| Linear | GraphQL | 19 | Issues, teams, workflow states, labels, comments, relations, memberships |
| Slack | Web API | 25 | Conversations, messaging, reactions, threading, users, channels |
108 unique endpoints across all 4 services.

Templates, Seeds & Environments

Templates are pre-configured database schemas that serve as the starting point for test environments. Think of them as snapshots of a service's state:

  • Location: Templates live in PostgreSQL schemas (e.g., slack_default, box_default, linear_expanded, calendar_base)
  • Content: Seeded with realistic data — users, channels, messages, files, folders, issues, calendar events, etc.
  • Seeds: box | calendar | linear | slack

Environments are isolated, temporary copies of a template schema:

  • URL: Each environment has a unique service URL (e.g., http://localhost:8000/api/env/{env_id}/services/slack)
  • Creation: client.init_env(templateService="slack", templateName="slack_default", impersonateUserId="U01AGENBOT9")
  • Cleanup: client.delete_env(envId) or auto-expires after TTL
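If you want to hit an environment's service URL directly (e.g. with `requests` outside the provided proxies), the pattern above can be assembled as follows. This is a sketch: the base URL shown is the self-hosted default from the docker-compose step; substitute your hosted or custom deployment URL as needed.

```python
# Build a per-environment service URL following the pattern shown above:
#   {base}/api/env/{env_id}/services/{service}
# The default base matches the self-hosted backend on localhost:8000.

def env_service_url(env_id: str, service: str, base: str = "http://localhost:8000") -> str:
    return f"{base}/api/env/{env_id}/services/{service}"

print(env_service_url("abc123", "slack"))
# http://localhost:8000/api/env/abc123/services/slack
```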

Agent-Diff Bench

The Agent-Diff benchmark comprises 224 tasks across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks span single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.

Agent-Diff Bench Results

| Model | Box | Calendar | Linear | Slack | Overall | Pass % | Cost/test | Score/$ |
|---|---|---|---|---|---|---|---|---|
| deepseek-v3.2 | 76.6 | 87.5 | 94.8 | 86.1 | 88.1 | 76 | $0.03 | 2,938 |
| devstral-2512 | 79.0 | 80.0 | 91.5 | 85.7 | 86.0 | 74 | $0.08 | 1,075 |
| qwen3-vl-235b | 68.4 | 71.0 | 82.0 | 75.8 | 79.2 | 65 | $0.02 | 3,959 |
| kimi-k2-0905 | 66.5 | 72.3 | 88.2 | 82.2 | 75.4 | 64 | $0.04 | 1,885 |
| grok-4.1-fast | 58.5 | 75.7 | 66.0 | 77.1 | 74.9 | 52 | $0.01 | 7,489 |
| gemini-3-flash | 80.3 | 62.2 | 84.0 | 77.5 | 73.8 | 67 | $0.05 | 1,477 |
| gpt-oss-120b | 70.1 | 68.4 | 79.5 | 69.1 | 68.5 | 60 | $0.02 | 3,428 |
| claude-haiku-4.5 | 45.1 | 57.8 | 35.6 | 57.3 | 49.3 | 50 | $0.22 | 224 |
| llama-4-scout | 33.7 | 41.4 | 20.9 | 42.9 | 38.0 | 29 | $0.02 | 1,900 |

Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the paper.

Run Agent-Diff Bench

  • Prime Intellect — Run evals or RL training with no setup required
  • Colab Notebooks — Run locally with the example notebooks above
  • Dataset — 224 tasks across all 4 services (80/20 train/test split). Each test defines expected state changes via declarative assertions. See the assertions docs for how they work.
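To give a feel for how expected state changes can be checked against a diff, here is a minimal sketch of one assertion. The real declarative assertion schema is defined in the assertions docs; the shape used here (a target table plus expected field values on inserted rows) is an illustrative assumption, not the actual format.

```python
# Illustrative only: check that the agent inserted a record into `table`
# containing all the expected field values. The record shape ({"table", "row"})
# is a hypothetical stand-in for the real diff record format.

def check_insert_assertion(diff: dict, table: str, expected: dict) -> bool:
    """Pass if some inserted record in `table` contains all `expected` fields."""
    for record in diff.get("inserts", []):
        if record.get("table") == table:
            row = record.get("row", {})
            if all(row.get(key) == value for key, value in expected.items()):
                return True
    return False

diff = {"inserts": [{"table": "messages", "row": {"channel": "C01", "text": "done"}}]}
print(check_insert_assertion(diff, "messages", {"text": "done"}))  # True
```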

Documentation

Citation

If you use Agent-Diff in your research, please cite:

@article{pysklo2025agentdiff,
  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
  journal={arXiv preprint arXiv:2602.11224},
  year={2025}
}
