Skip to content

ColBench#37

Open
ahmedheakl wants to merge 2 commits intoparameterlab:mainfrom
ahmedheakl:agent-collab
Open

ColBench#37
ahmedheakl wants to merge 2 commits intoparameterlab:mainfrom
ahmedheakl:agent-collab

Conversation

@ahmedheakl
Copy link

Description

Add ColBench (Collaborative Agent Bench).

Modules added (maseval/benchmark/colbench/):

  • ColBenchBenchmark: orchestrates the task loop
  • ColBenchUser: LLM-backed human simulator
  • ColBenchAgentAdapter / ColBenchAgentInner: agent under test
  • ColBenchEnvironment: task state holder
  • ColBenchCodeEvaluator: unit-test scoring with sandboxed execution
  • OpenAIModelAdapter: ModelAdapter implementation for OpenAI-compatible APIs (vLLM, TGI, etc.)

(examples/colbench_benchmark):

  • colbench.py: CLI runner matching the original sweet_rl workflow

Output format is backward-compatible with the original [sweet_rl](https://github.com/facebookresearch/sweet_rl) evaluation scripts.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

Documentation

  • Added/updated docstrings for new/modified functions as instructed [CONTRIBUTING.md](CONTRIBUTING.md)
  • Updated relevant documentation in docs/ (if applicable)
  • Tag github issue with this PR (if applicable)

Changelog

  • Added entry to CHANGELOG.md under [Unreleased] section
    • Use Added section for new features
    • Use Changed section for modifications to existing functionality
    • Use Fixed section for bug fixes
    • Use Removed section for deprecated/removed features
  • OR this is a documentation-only change (no changelog needed)

Example:
- Add ColBench benchmark for multi-turn collaborative agent evaluation

Architecture (if applicable)

  • Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
  • Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

  • Requires openai package (already an optional dependency) and a running vLLM server.
  • Tested with meta-llama/Llama-3.1-8B-Instruct on both agent and simulator sides.

@ahmedheakl ahmedheakl changed the title Agent collab ColBench Feb 24, 2026
@cemde cemde added enhancement New feature or request benchmarks regarding the `maseval/benchmark` package labels Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmarks regarding the `maseval/benchmark` package enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants