@simonrosenberg
Collaborator

Summary

This PR adds the GitHub Actions workflow for building GAIA benchmark agent server images.

Background

GAIA (General AI Assistants) benchmark evaluation requires agent server images, as SWE-bench does. Unlike SWE-bench, which builds per-instance images (one per repository), GAIA uses a single universal image for all instances.

What's Included

  • Workflow for building the GAIA agent server image

Workflow Features

  • Minimal configuration: Requires only sdk-commit, plus an optional target parameter
  • Universal image: Builds one image based on nikolaik/python-nodejs:python3.12-nodejs22
  • Simple architecture: No need for per-instance builds or complex configuration
  • Image tag format: ghcr.io/openhands/eval-agent-server:{SDK_SHA}-gaia-binary-minimal
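The tag format above can be illustrated with a minimal Python sketch. The helper name `gaia_image_tag` and the placeholder SHA are hypothetical and not part of the workflow; only the registry path and the `gaia-binary-minimal` suffix come from this description. Whether the workflow uses a short or a full commit SHA is not stated here, so the sketch accepts whatever string it is given.

```python
# Registry path and suffix are copied from the tag format in this PR description.
IMAGE_REPO = "ghcr.io/openhands/eval-agent-server"
TAG_SUFFIX = "gaia-binary-minimal"


def gaia_image_tag(sdk_sha: str) -> str:
    """Return the image reference for an SDK commit SHA (hypothetical helper)."""
    sha = sdk_sha.strip()
    if not sha:
        raise ValueError("sdk_sha must be a non-empty commit SHA")
    # Format: ghcr.io/openhands/eval-agent-server:{SDK_SHA}-gaia-binary-minimal
    return f"{IMAGE_REPO}:{sha}-{TAG_SUFFIX}"


# "abc1234" is a placeholder, not a real SDK commit.
print(gaia_image_tag("abc1234"))
```

Because the tag depends only on the SDK commit, every GAIA instance evaluated against the same SDK commit would resolve to the same image.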

Triggering

The workflow can be triggered in two ways:

  1. workflow_dispatch: Manually via GitHub Actions UI or API
  2. PR label: Add build-gaia label to any PR
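For the API route, a dispatch could look roughly like the sketch below, which builds (but does not send) a request to GitHub's "create a workflow dispatch event" REST endpoint. The workflow filename is the one named in the commit message; the input names `sdk-commit` and `target` are read off this description and are assumed to match the workflow's declared inputs. The function name, owner, repository, and token values are hypothetical placeholders.

```python
import json
import urllib.request

# Filename as given in the commit message for this PR.
WORKFLOW_FILE = "build-gaia-image.yml"


def build_dispatch_request(owner, repo, token, sdk_commit, ref="main", target=None):
    """Build a workflow_dispatch request (hypothetical helper; nothing is sent)."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{WORKFLOW_FILE}/dispatches"
    )
    # Assumption: the workflow inputs are named "sdk-commit" and "target".
    inputs = {"sdk-commit": sdk_commit}
    if target is not None:
        inputs["target"] = target
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder owner, repository, token, and SHA for illustration only.
req = build_dispatch_request("example-org", "example-repo", "TOKEN", "abc1234")
print(req.get_method(), req.full_url)
```

Passing the returned request to `urllib.request.urlopen` with a token that has permission to run Actions on the repository would start the build. The label route needs no API call: adding the build-gaia label to a PR is enough.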

Next Steps

Once this is merged to main:

  1. The workflow will be available for dispatch via GitHub API
  2. Evaluation orchestration can dispatch builds before running GAIA evaluations
  3. This unblocks end-to-end GAIA evaluation on Kubernetes infrastructure

Related

Note

This PR intentionally includes only the workflow file so that it can be merged quickly. The GAIA evaluation implementation itself is in PR #125.

This workflow builds a universal agent server image for GAIA benchmark evaluation.

Unlike SWE-bench, which requires per-instance images with specific repository
environments, GAIA uses a single universal image for all instances, since they
share the same Python+Node.js environment (nikolaik/python-nodejs:python3.12-nodejs22).

Workflow features:
- Minimal configuration: requires only sdk-commit, plus an optional target parameter
- Builds one universal image tagged as: ghcr.io/openhands/eval-agent-server:{SDK_SHA}-gaia-binary-minimal
- Can be triggered via workflow_dispatch or by adding 'build-gaia' label to PRs
- Posts build status to issue #81 for tracking

Note: The workflow filename is singular (build-gaia-image.yml) to reflect that it
builds a single universal image, unlike SWE-bench, which uses the plural
(build-swe-bench-images.yml) for its many per-instance images.

This is a prerequisite for enabling GAIA benchmark evaluation on the
Kubernetes evaluation infrastructure.

@simonrosenberg force-pushed the add-gaia-build-workflow branch from 1749c49 to 7d8ec59 on December 3, 2025 23:22

@openhands-ai

openhands-ai bot commented Dec 3, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • .github/workflows/build-gaia-image.yml

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #129 at branch `add-gaia-build-workflow`

Feel free to include any additional details that might help me get this PR into a better state.

Collaborator

@juanmichelini left a comment

LGTM!

@simonrosenberg merged commit 5ee6679 into main on Dec 4, 2025
2 checks passed