Add OMATS (OpenClaw Multi-Agent Test Suite) as a benchmark #40

@ThinkOffApp

Description

OMATS is a benchmark for evaluating LLMs in multi-agent room environments. It tests failure modes that don't show up in single-agent benchmarks: agents echoing each other, ignoring stop orders, leaking system prompts, planning instead of acting, and compounding each other's guardrails.

The suite has 28 scripted scenarios across three capability stages. Stage 3 tests single-agent discipline (loop avoidance, idle management, personality consistency). Stage 4 tests multi-agent communication (stop order compliance, echo resistance, indirect address parsing, social pressure resistance). Stage 5 tests agent management (task delegation, noise control, conflict resolution, escalation judgment).

Scoring is continuous (0.0–1.0) with auto-fail gates for prompt leakage, impersonation, and silence violations. We've run 10 models so far and the results differentiate well between model tiers.
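To make the gating behavior concrete, here is a minimal sketch of how the auto-fail gates compose with the continuous score (field and function names here are illustrative, not the suite's actual API):

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    """Outcome of one scripted scenario run (names are hypothetical)."""
    score: float                  # continuous score in [0.0, 1.0]
    leaked_prompt: bool = False   # auto-fail gate: system prompt leakage
    impersonated: bool = False    # auto-fail gate: agent impersonation
    broke_silence: bool = False   # auto-fail gate: silence violation

def final_score(result: ScenarioResult) -> float:
    """Apply auto-fail gates, then clamp the continuous score to [0.0, 1.0]."""
    if result.leaked_prompt or result.impersonated or result.broke_silence:
        return 0.0  # any gate violation zeroes the scenario, regardless of score
    return max(0.0, min(1.0, result.score))
```

The point of the gates is that certain failure modes (leaking the system prompt, impersonating another agent) invalidate the run entirely rather than merely lowering its score.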

Repo: https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite

This would complement MASEval's existing benchmarks (GAIA, AgentBench) by adding room-based multi-agent communication evaluation, which none of the current integrations cover.
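As a rough sketch of how OMATS could slot in alongside the existing integrations, assuming a thin adapter layer (all class and method names below are hypothetical; MASEval's actual benchmark interface may differ):

```python
# Hypothetical adapter sketch; not MASEval's real API.
from typing import Callable

class OMATSBenchmark:
    """Wraps the 28 scripted OMATS scenarios behind a minimal run/score interface."""

    STAGES = {
        3: "single-agent discipline",
        4: "multi-agent communication",
        5: "agent management",
    }

    def __init__(self, run_scenario: Callable[[str], float]):
        # run_scenario executes one scripted scenario against the model
        # under test and returns its gated score in [0.0, 1.0]
        self.run_scenario = run_scenario

    def evaluate(self, scenario_ids: list[str]) -> dict[str, float]:
        """Run each requested scenario and collect per-scenario scores."""
        return {sid: self.run_scenario(sid) for sid in scenario_ids}
```

Happy to adapt this to whatever interface the `maseval/benchmark` package actually expects.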

Metadata

    Labels

benchmarks (regarding the `maseval/benchmark` package), enhancement (New feature or request)
