Add GAIA eval_infer for unified evaluation workflow #125
Draft: simonrosenberg wants to merge 43 commits into main from openhands/multi-benchmark-eval-support
+921
−35
Conversation
- Create benchmarks/gaia/eval_infer.py to compute scores from GAIA output.jsonl
- Add gaia-eval entry point to pyproject.toml
- Makes the GAIA evaluation API consistent with SWE-bench (both use the eval_infer pattern)

Co-authored-by: openhands <openhands@all-hands.dev>
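The eval_infer pattern above can be sketched in a few lines. This is a minimal illustration, not the PR's actual script: it assumes each output.jsonl record carries `model_answer` and `ground_truth` fields (the real GAIA field names and scorer may differ) and uses simple normalized exact match.

```python
import json
from pathlib import Path


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace; a simplified stand-in for GAIA's scorer."""
    return answer.strip().lower()


def score_output(path: str) -> dict:
    """Read an output.jsonl file and compute an aggregate accuracy report."""
    total = correct = 0
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        total += 1
        if normalize(record.get("model_answer", "")) == normalize(record.get("ground_truth", "")):
            correct += 1
    return {"total": total, "correct": correct,
            "accuracy": correct / total if total else 0.0}
```

The resulting dict is the kind of aggregated summary a report.json could hold.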
- Import APIRemoteWorkspace alongside DockerWorkspace
- Add conditional logic in prepare_workspace() to check metadata.workspace_type
- Use APIRemoteWorkspace when workspace_type='remote' (for Kubernetes)
- Use DockerWorkspace when workspace_type='docker' (for local)
- Matches the same pattern as the swe_bench evaluation
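The branching this commit describes might look like the sketch below. DockerWorkspace and APIRemoteWorkspace here are stand-in stubs for the SDK classes (the real ones take connection and image parameters), and prepare_workspace() only models the selection logic.

```python
# Stand-in stubs for the SDK's workspace classes; constructor
# parameters are omitted for brevity.
class DockerWorkspace:
    kind = "docker"


class APIRemoteWorkspace:
    kind = "remote"


def prepare_workspace(workspace_type: str):
    """Pick the workspace backend from metadata.workspace_type."""
    if workspace_type == "remote":
        return APIRemoteWorkspace()  # Kubernetes pods via the Runtime API
    if workspace_type == "docker":
        return DockerWorkspace()     # local Docker containers
    raise ValueError(f"unsupported workspace_type: {workspace_type}")
```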
The EvalMetadata was missing workspace_type=args.workspace, causing it to always default to 'docker' regardless of the --workspace argument passed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
GAIA doesn't need prebuilt images or remote workspace pods like SWE-bench does:
- SWE-bench requires instance-specific environments (different repos, dependencies)
- GAIA uses the same base environment for all instances (just Q&A with files)

This commit adds support for workspace_type='local', which runs commands directly on the host filesystem within the evaluation pod. This eliminates:
- The need to spin up remote runtime pods
- The need to build and push GAIA-specific images
- Complex infrastructure overhead

Benefits:
- Simpler architecture: everything runs in the same pod
- Faster execution: no pod creation/cleanup overhead
- Lower resource usage: no additional pods needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The argument parser was only accepting 'docker' and 'remote' as valid workspace types, but we added support for the 'local' workspace in GAIA. This fixes the error:

    gaia-infer: error: argument --workspace: invalid choice: 'local' (choose from 'docker', 'remote')

Now allows: local, docker, remote

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The Pydantic model was only accepting 'docker' and 'remote' as valid workspace types, causing a validation error:

    Input should be 'docker' or 'remote' [type=literal_error, input_value='local', input_type=str]

Now accepts: local, docker, remote

Updated the description to clarify the workspace types:
- 'local': In-process execution (commands run on the host filesystem)
- 'docker': Local Docker containers
- 'remote': Remote Kubernetes pods via Runtime API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
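One way to keep the argparse choices and the model-level validation in sync is to derive both from a single Literal type. A sketch under that assumption (names are illustrative, not the benchmark's actual code):

```python
import argparse
from typing import Literal, get_args

# Single source of truth for the three workspace types described above.
WorkspaceType = Literal["local", "docker", "remote"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gaia-infer")
    # choices is derived from the Literal, so the CLI and the
    # Pydantic/typing side can't drift apart.
    parser.add_argument("--workspace", choices=get_args(WorkspaceType),
                        default="docker")
    return parser
```

A Pydantic model field annotated with the same `WorkspaceType` would then accept exactly the values the CLI does.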
- Created benchmarks/gaia/build_images.py to build universal GAIA agent server image
- Created .github/workflows/build-gaia-images.yml for automated image builds
- Updated benchmarks/gaia/run_infer.py to use GAIA agent server image with remote workspace
- Removed LocalWorkspace and DockerWorkspace support from GAIA (only remote supported now)
- Updated SDK submodule to a55325c (latest main with updated build logic)
GAIA now uses a single universal agent server image (ghcr.io/openhands/eval-agent-server:{sdk_sha}-gaia-binary-minimal)
instead of per-instance images like SWE-bench, since all GAIA instances share the same Python+Node.js environment.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
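The tag scheme above can be captured in a small helper. The pattern comes straight from the commit message; the function name and defaults are illustrative:

```python
def agent_server_image(sdk_sha: str, benchmark: str = "gaia",
                       target: str = "binary-minimal") -> str:
    """Compose the universal agent server image tag:
    ghcr.io/openhands/eval-agent-server:{sdk_sha}-{benchmark}-{target}
    """
    return f"ghcr.io/openhands/eval-agent-server:{sdk_sha}-{benchmark}-{target}"
```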
Added dataset, split, max-workers, and n-limit inputs to build-gaia-images.yml for compatibility with the orchestration script (orchestrate_eval.py). These inputs are ignored since GAIA builds only one universal agent server image, unlike SWE-bench, which builds per-instance images.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: openhands <openhands@all-hands.dev>
The SDK updated DockerWorkspace to deprecate base_image and target parameters. Switch to DockerDevWorkspace which supports building images on-the-fly from a base image. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Since GAIA only builds one universal image (unlike SWE-bench, which builds per-instance images), simplified the workflow to remove unnecessary complexity:
- Reduced workflow inputs from 8 to 2 parameters (sdk-commit, target)
- Removed dataset, split, max-workers, n-limit (not needed for a single image)
- Simplified the build summary to check a single image instead of counting multiple
- Simplified the tracker comment to show a single image tag instead of parsing lists

The workflow now reflects the fundamental difference between GAIA (one universal Python+Node.js image) and SWE-bench (many per-repository images).
GAIA builds a single universal image, not multiple images. Using singular filename to match this architecture and differentiate from SWE-bench which uses plural (build-swe-bench-images.yml) for its many images.
- Fixed multi-line Python code that confused the YAML parser
- Fixed a heredoc that wasn't properly indented for YAML
- Replaced the heredoc with a simple multi-line string

Co-authored-by: openhands <openhands@all-hands.dev>
Lambda functions cannot be pickled for multiprocessing, so the tag callable was replaced with the module-level function gaia_tag_fn() to fix the build process.

Co-authored-by: openhands <openhands@all-hands.dev>
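The pickling constraint is easy to demonstrate: multiprocessing pickles the callables it sends to worker processes, and a lambda cannot be pickled by reference, while a module-level function can. Only the name gaia_tag_fn comes from the commit; its body here is illustrative.

```python
import pickle


def gaia_tag_fn(instance_id: str) -> str:
    """Module-level tag function; picklable because pickle records it
    by qualified name (illustrative body)."""
    return f"gaia-{instance_id}"


def is_picklable(obj) -> bool:
    """True if pickle can serialize obj, as multiprocessing requires."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```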
When Blacksmith builder fails, it falls back to the local docker driver which doesn't support cache export to registry. This adds a fallback that sets up a docker-container driver to support cache export. Co-authored-by: openhands <openhands@all-hands.dev>
Blacksmith internally falls back to local docker driver when it fails, which doesn't support cache export. This change unconditionally sets up a docker-container driver to ensure cache export always works. Co-authored-by: openhands <openhands@all-hands.dev>
Commenting out the Tavily MCP server and browser tools to test the end-to-end evaluation flow without requiring TAVILY_API_KEY. This is temporary and should be reverted once the API key is configured.

Changes:
- Disabled browser tools (enable_browser=False)
- Commented out the TAVILY_API_KEY assertion
- Commented out the Tavily MCP server configuration
- Kept the fetch MCP server for basic web content retrieval

Co-authored-by: openhands <openhands@all-hands.dev>
These scripts generate unified markdown messages for both Slack and GitHub PR notifications. Each benchmark now owns its own message formatting logic.
- Remove dependency on results_summary.json (intermediate file)
- GAIA: Compute metrics directly from output.jsonl
- SWE-bench: Read report.json directly for metrics
- Remove metadata_url and results_url parameters (no longer generated)
- Simplifies data flow: formatters use raw evaluation outputs

Co-authored-by: openhands <openhands@all-hands.dev>
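A unified formatter in the spirit of these commits might render one markdown summary reused for both the Slack message and the GitHub PR comment. The layout and parameter names below are assumptions, not the scripts' actual output.

```python
def format_summary(benchmark: str, total: int, correct: int) -> str:
    """Render one markdown body shared by the Slack and GitHub
    notifications (illustrative layout)."""
    accuracy = correct / total if total else 0.0
    return "\n".join([
        f"## {benchmark} evaluation results",
        f"- Instances evaluated: {total}",
        f"- Correct: {correct}",
        f"- Accuracy: {accuracy:.1%}",
    ])
```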
- Add Dockerfile.gaia to build a derived image with mcp-server-fetch pre-cached
- Add a GitHub Actions workflow to automate image building
- Update run_infer.py to use the MCP-enabled image
- Add comprehensive documentation of the fix

This eliminates 1-18 minute conversation creation delays by caching the MCP server package in the Docker image, reducing startup time to under 10 seconds.

Root cause: uvx downloads mcp-server-fetch on demand during agent initialization, causing highly variable delays.

Solution: Pre-install during the Docker build, so the package is cached and ready at runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Now that TAVILY_API_KEY is available, restore:
- Browser tools (enable_browser=True)
- The Tavily API key assertion
- The Tavily MCP server configuration

This completes the GAIA evaluation setup with all required tools.

Co-authored-by: openhands <openhands@all-hands.dev>
Document current state, blockers, and resolution options for completing the MCP fix workflow end-to-end. Co-authored-by: openhands <openhands@all-hands.dev>
Extend build-gaia-image.yml to build both base and MCP-enhanced images. This allows the workflow to be triggered from the main branch and build both:
- The base GAIA image (existing)
- The MCP-enhanced GAIA image with pre-cached mcp-server-fetch (new)

This eliminates the need for build-gaia-mcp-image.yml to be on the main branch.

Co-authored-by: openhands <openhands@all-hands.dev>
Document successful SDK workflow completion and next steps for PR review. Co-authored-by: openhands <openhands@all-hands.dev>
MCP build is now integrated into build-gaia-image.yml, making this standalone workflow unnecessary. Co-authored-by: openhands <openhands@all-hands.dev>
…nclude Chromium

- Change the default build target from binary-minimal to binary in build-gaia-eval-image.yml
- Update run_infer.py to look for the gaia-binary-with-mcp image instead of gaia-binary-minimal-with-mcp
- This ensures Chromium and other browser dependencies are available for GAIA tasks
- Resolves "500 Internal Server Error: 'Chromium is required for browser operations'"
- Delete NEXT_STEPS.md, WORKFLOW_RUN_SUMMARY.md, WORKFLOW_STATUS.md (MCP fix documentation)
- Delete benchmarks/gaia/README_MCP_FIX.md (MCP implementation details)

These files were related to the MCP pre-caching experiments and are not needed for the Chromium fix.

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove benchmarks/gaia/load_hf_dataset.py (test script, not used in the workflow)
- Remove the 'local' workspace type from choices (not implemented/used)
- Hardcode the 'binary' target in the GAIA workflow (the only option that includes Chromium)
- Remove misleading workflow input options (binary-minimal and source-minimal don't work)

GAIA evaluations require Chromium for browser operations, which is only available in the 'binary' target. Exposing other targets as options was misleading since they would fail at runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
Resolved conflicts:
- benchmarks/gaia/run_infer.py: Kept the feature branch version (remote workspace)
- benchmarks/swt_bench/run_infer.py: Kept the feature branch version (minor comment diff)
- uv.lock: Used the main version
- vendor/software-agent-sdk: Updated to main's SDK version (37c4b350)

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
… needed Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Follow the same pattern as SWE-bench to support both docker and remote workspace types. This allows local testing and debugging without requiring the remote runtime.
- Add a DockerWorkspace import
- Support workspace_type='docker' with optional building
- Keep the remote workspace as the default for the kube workflow
- Use the universal gaia-with-mcp image for all instances

Co-authored-by: openhands <openhands@all-hands.dev>
What does this PR do?
This PR adds GAIA benchmark support to the unified evaluation workflow AND fixes critical timeout issues by pre-caching the MCP server.
Changes
Original: Unified Evaluation Workflow
- `benchmarks/gaia/eval_infer.py`: New evaluation script that computes scores from GAIA `output.jsonl` files
- `eval_infer.py` produces `report.json` with aggregated statistics
- `gaia-eval` entry point: Updated `pyproject.toml` to register the `gaia-eval` CLI command

NEW: MCP Server Timeout Fix ⚡
Problem: GAIA evaluations were experiencing 30-50% timeout rates due to MCP server initialization taking 1-18 minutes per conversation.
Solution: Pre-cache `mcp-server-fetch` in a derived Docker image.

Changes:
- `benchmarks/gaia/Dockerfile.gaia`: 5-line Dockerfile that extends the base SDK image and pre-caches the MCP server
- `.github/workflows/build-gaia-image.yml`: Enhanced to build both base and MCP-enhanced images
- `benchmarks/gaia/run_infer.py`: Updated to use the `-with-mcp` image suffix

Impact:
Images Produced:
- `ghcr.io/openhands/eval-agent-server:f715937-gaia-binary-minimal`
- `ghcr.io/openhands/eval-agent-server:f715937-gaia-binary-minimal-with-mcp` ⚡

Why?
Unified Workflow
The GAIA benchmark originally used `get_score` instead of `eval_infer`, which was inconsistent with SWE-bench's evaluation pattern. This PR makes both benchmarks use the same API.

MCP Timeout Fix
MCP server downloads were causing severe evaluation delays and timeouts. Pre-caching eliminates this bottleneck entirely.
Testing Plan
After merge:
- Trigger the `build-gaia-image.yml` workflow with `sdk-commit: f715937`

Related PRs
This is part of a multi-repo change to support multiple benchmarks: