The Problem
First paired benchmark against a live RimWorld colony shows agents are not helping:
Agent: 0.801 ± 0.03
Baseline: 0.830 ± 0.00
Delta: -0.029 (p = 0.37)
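The unmanaged colony (RimWorld's built-in pawn AI) scores higher than our 6-agent team, although at p = 0.37 the gap is not statistically significant. Either way, the agents add no measurable value and look net-negative: they issue actions that fail or disrupt colonist routines.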
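The issue does not say which test produced the p-value above. As a minimal sketch of one way to compare paired runs, here is an exact sign-flip permutation test using only the standard library. The function name and the sample scores are hypothetical, not the benchmark's actual code or data.

```python
from itertools import product
from statistics import mean


def paired_permutation_test(agent_scores, baseline_scores):
    """Return (mean delta, two-sided p-value) for paired scores.

    Exact sign-flip permutation test: under the null hypothesis the sign
    of each paired difference is arbitrary, so every sign pattern is
    enumerated. Only practical for a small number of pairs (2**n patterns).
    """
    if len(agent_scores) != len(baseline_scores) or not agent_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(agent_scores, baseline_scores)]
    observed = mean(diffs)
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        # Small tolerance guards against floating-point ties.
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total


# Hypothetical scores from three paired runs (not real benchmark data).
delta, p_value = paired_permutation_test([0.9, 0.9, 0.9], [0.8, 0.8, 0.8])
```

With only a handful of pairs the smallest achievable p-value is 2 / 2**n, so a run count this small cannot reach p < 0.05 no matter how large the delta is.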
Why Agents Are Losing
1. High action failure rate
- set_growing_zone → RIMAPI 500 every time (fork bug, tracked separately)
- place_blueprint → agent doesn't include x,z coordinates
- toggle_power → agent sends building_id=0 (no valid IDs in state)
- haul_resource → RIMAPI rejects the job assignment
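Agents propose ~14 actions per tick but only ~6 execute, so roughly 8 of every 14 (about 57%) fail. The failures are silent: nothing is reported back to the agent, and each failed action wastes part of the tick without benefit.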
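Two of the failures listed above (missing x,z coordinates and an invalid building_id) could be caught before anything is sent to RIMAPI. The following is a sketch only: the action dictionary layout, the REQUIRED_PARAMS table, and both function names are assumptions for illustration, not the project's real schema.

```python
# Hypothetical schema: parameters each action type needs before it is sent.
REQUIRED_PARAMS = {
    "place_blueprint": ("x", "z"),
    "toggle_power": ("building_id",),
}


def validate_action(action, known_building_ids):
    """Return a list of problems; an empty list means the action may be sent."""
    problems = []
    params = action.get("params", {})
    for name in REQUIRED_PARAMS.get(action.get("type"), ()):
        if name not in params:
            problems.append(f"missing parameter: {name}")
    building_id = params.get("building_id")
    if building_id is not None and building_id not in known_building_ids:
        problems.append(f"unknown building_id: {building_id}")
    return problems


def split_actions(actions, known_building_ids):
    """Separate sendable actions from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for action in actions:
        problems = validate_action(action, known_building_ids)
        if problems:
            rejected.append((action, problems))
        else:
            valid.append(action)
    return valid, rejected
```

Returning the rejection reasons, rather than dropping actions quietly, gives the agents something to correct on the next tick instead of repeating the same silent failure.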
2. Agents disrupt productive colonist behavior
- RimWorld's built-in AI already assigns colonists to work, eat, sleep, haul
- Our agents override work priorities, draft colonists away from tasks, reassign researchers
- If the override is wrong or the action fails, the colonist is worse off than if we'd done nothing
3. No understanding of what's already working
- Agents see a snapshot of colony state but don't know what colonists are currently doing
- They propose "set_work_priority growing=1" but the colonist is already growing
- The action succeeds but adds no value — and may disrupt the colonist's current task queue
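One way to avoid the redundant set_work_priority case described above is to filter proposals against what each colonist is already doing. This sketch assumes colonist state carries a current_job field, which depends on RIMAPI exposing it. The field names, the action layout, and the function names are all hypothetical.

```python
def is_redundant(action, colonist):
    """True if the colonist is already doing the work the action asks for."""
    if action.get("type") != "set_work_priority":
        return False
    work_type = action.get("params", {}).get("work_type")
    # current_job is an assumed field; absent data never marks an action redundant.
    return work_type is not None and work_type == colonist.get("current_job")


def drop_redundant(actions, colonists_by_id):
    """Keep only actions that would change what a colonist is doing."""
    kept = []
    for action in actions:
        colonist = colonists_by_id.get(action.get("colonist_id"), {})
        if not is_redundant(action, colonist):
            kept.append(action)
    return kept
```

When the current job is unknown the filter keeps the action, so it only removes proposals that are provably no-ops.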
4. 10-second tick interval means minimal game progression
- Colony runs for 10 seconds between deliberation cycles
- Not enough time for actions to have measurable impact before the next override
What Needs to Change
Fix action reliability first
Make agents aware of current colonist activity
- Add current_activity or current_job to colonist state (if RIMAPI exposes it)
Increase tick interval for meaningful progression
Add "do no harm" principle to agent prompts
Success Criteria
The target benchmark result is:
Agent: 0.85 ± 0.05
Baseline: 0.75 ± 0.03
Delta: +0.10 (p < 0.05)
Agents must demonstrably improve colony outcomes. Until then, the benchmark is failing honestly.
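The criteria above reduce to a positive delta at a significance level of 0.05. A small sketch of that check follows. The function name and signature are hypothetical, and the p-value is taken as an input because the issue does not state how it is computed.

```python
def meets_success_criteria(agent_mean, baseline_mean, p_value, alpha=0.05):
    """True only if agents score higher and the gap is significant at alpha."""
    delta = agent_mean - baseline_mean
    return delta > 0 and p_value < alpha


# Current result from this issue: negative delta, p = 0.37.
current_passes = meets_success_criteria(0.801, 0.830, 0.37)
```

Checking both conditions matters: a positive delta with a large p-value would be as inconclusive as the current negative one.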
How to Reproduce
# Requires: RimWorld running, RIMAPI mod, LM Studio with Nemotron Nano 4B
# Save a Crashlanded colony as "rle_crashlanded_v1"

# Agent run
python scripts/run_scenario.py crashlanded_survival \
    --provider openai --model nvidia/nemotron-3-nano-4b \
    --base-url http://localhost:1234/v1 \
    --no-think --ticks 10

# Baseline run (no agents)
python scripts/run_scenario.py crashlanded_survival --no-agent --ticks 10
Compare the two final scores. The agent score must be higher than the baseline score.