@NicolasAG (Collaborator) commented Nov 6, 2025

Introduce the webarena_verified benchmark.

  • Tasks are registered under the name template webarena_verified.{intent_template_id}.{task_id}.
  • A new WebArenaVerifiedTask class overrides the setup() method of GenericWebArenaTask to:
    • use the webarena_verified evaluator
    • load extra HTTP headers if PW_EXTRA_HEADERS is set (used to pass the secret keys needed to access self-hosted webarena instances)
    • append a hint to the goal so that the model returns its response in the format the webarena_verified evaluator expects
  • A new WebArenaVerifiedEvaluator class calls webarena_verified.api.WebArenaVerifiedEvaluator from the platform-labs-webarena-verified package.
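
For reviewers, the naming template and the optional header loading can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names task_name and load_extra_headers are hypothetical, and it assumes PW_EXTRA_HEADERS holds a JSON object mapping header names to values, which the description does not state.

```python
import json
import os


def task_name(intent_template_id: int, task_id: int) -> str:
    """Build a task name following the template given in the description."""
    return f"webarena_verified.{intent_template_id}.{task_id}"


def load_extra_headers(env_var: str = "PW_EXTRA_HEADERS") -> dict[str, str]:
    """Return extra HTTP headers read from the environment, or {} if unset.

    Assumption: the variable contains a JSON object such as
    '{"X-Api-Key": "..."}'. The actual format used by the PR may differ.
    """
    raw = os.environ.get(env_var)
    if not raw:
        # Variable unset or empty: no extra headers are applied.
        return {}
    headers = json.loads(raw)
    if not isinstance(headers, dict):
        raise ValueError(f"{env_var} must contain a JSON object")
    # Normalize keys and values to strings, as HTTP headers are textual.
    return {str(name): str(value) for name, value in headers.items()}
```

Under these assumptions, a setup() override would only need to call load_extra_headers() and pass a non-empty result to the browser context before navigating to the self-hosted instance.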
