Agents must beat unmanaged baseline #6

## The Problem

First paired benchmark against a live RimWorld colony shows agents are **not helping**:

```
Agent:    0.801 ± 0.03
Baseline: 0.830 ± 0.00
Delta:    -0.029 (p = 0.37)
```

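The unmanaged colony (RimWorld's built-in pawn AI) scores 0.029 higher than our 6-agent team. The gap is not statistically significant (p = 0.37), so the benchmark does not prove the agents are harmful, but it shows no benefit either. The likely cause is that agents issue actions that fail or disrupt colonist routines.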

## Why Agents Are Losing

### 1. High action failure rate
- `set_growing_zone` → RIMAPI 500 every time (fork bug, tracked separately)
- `place_blueprint` → agent omits the `x`, `z` coordinates
- `toggle_power` → agent sends `building_id=0` because the filtered state contains no valid building IDs
- `haul_resource` → RIMAPI rejects the job assignment

Agents propose ~14 actions per tick but only ~6 execute (about 43%). The rest fail silently, so those proposals consume the tick without any benefit.
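The execution rate can be tallied per action type to see which calls drag it down. The following is a minimal sketch, assuming each action result is logged as a dict with hypothetical `action` and `ok` fields; the project's real log format may differ, and the sample breakdown is made up purely to reproduce the 6-of-14 ratio.

```python
from collections import Counter


def execution_rates(results):
    """Return (overall_rate, per_action_rates) from a list of result records.

    Each record is assumed to look like {"action": "toggle_power", "ok": False}.
    This shape is hypothetical, not the project's actual log format.
    """
    attempted = Counter(r["action"] for r in results)
    succeeded = Counter(r["action"] for r in results if r["ok"])
    overall = sum(succeeded.values()) / len(results) if results else 0.0
    per_action = {name: succeeded[name] / count for name, count in attempted.items()}
    return overall, per_action


# Illustrative tick only: 14 proposed actions, 6 executed.
sample = (
    [{"action": "set_work_priority", "ok": True}] * 6
    + [{"action": "set_growing_zone", "ok": False}] * 4
    + [{"action": "toggle_power", "ok": False}] * 4
)
overall, per_action = execution_rates(sample)
print(f"{overall:.0%}")  # 43%
```

Logging failures explicitly like this would also stop them from being silent.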

### 2. Agents disrupt productive colonist behavior
- RimWorld's built-in AI already assigns colonists to work, eat, sleep, haul
- Our agents override work priorities, draft colonists away from tasks, reassign researchers
- If the override is wrong or the action fails, the colonist is worse off than if we'd done nothing

### 3. No understanding of what's already working
- Agents see a snapshot of colony state but don't know what colonists are currently doing
- They propose "set_work_priority growing=1" but the colonist is already growing
- The action succeeds but adds no value, and may disrupt the colonist's current task queue

### 4. 10-second tick interval means minimal game progression
- Colony runs for 10 seconds between deliberation cycles
- Not enough time for actions to have measurable impact before the next override

## What Needs to Change

### Fix action reliability first
- [ ] Fix `set_growing_zone` RIMAPI fork bug
- [ ] Teach agents to include coordinates for blueprints
- [ ] Expose building IDs in filtered state for `toggle_power`
- [ ] Get execution rate from 43% to 90%+
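Two of the failures above (missing blueprint coordinates, invalid building IDs) could be caught before the request reaches RIMAPI. Below is a rough sketch of such a pre-flight check. The action dict shape (`type`, `x`, `z`, `building_id`) and the `known_building_ids` set are assumptions for illustration, not the project's actual schema.

```python
def validate_action(action, known_building_ids):
    """Return a list of problems; an empty list means the action looks sendable.

    The dict keys used here are hypothetical and may not match the real schema.
    """
    problems = []
    kind = action.get("type")
    if kind == "place_blueprint":
        # Blueprints need both map coordinates.
        for axis in ("x", "z"):
            if not isinstance(action.get(axis), int):
                problems.append(f"place_blueprint missing integer '{axis}'")
    elif kind == "toggle_power":
        # The ID must be one that the filtered state actually exposed.
        building_id = action.get("building_id")
        if building_id not in known_building_ids:
            problems.append(f"toggle_power has unknown building_id={building_id!r}")
    return problems


print(validate_action({"type": "toggle_power", "building_id": 0}, {12, 15}))
```

Rejected actions could be fed back to the agent as an error message instead of being sent and failing silently.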

### Make agents aware of current colonist activity
- [ ] Add `current_activity` or `current_job` to colonist state (if RIMAPI exposes it)
- [ ] Agents should propose NO_ACTION when colonists are already doing the right thing
- [ ] Penalize unnecessary overrides in the scoring
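If colonist state does gain a `current_job` field, redundant proposals could be filtered before execution. This is only a sketch: the `colonist_id` and `work_type` keys, the `NO_ACTION` dict, and the idea that `current_job` values are directly comparable to work types are all assumptions.

```python
NO_ACTION = {"type": "NO_ACTION"}


def drop_redundant(proposal, colonists):
    """Replace a work-priority proposal with NO_ACTION if it would change nothing.

    `colonists` maps a hypothetical colonist ID to that colonist's state dict.
    """
    if proposal.get("type") != "set_work_priority":
        return proposal
    colonist = colonists.get(proposal.get("colonist_id"))
    # Assumes current_job uses the same vocabulary as work types.
    if colonist and colonist.get("current_job") == proposal.get("work_type"):
        return NO_ACTION
    return proposal


colonists = {1: {"current_job": "growing"}, 2: {"current_job": "hauling"}}
proposal = {"type": "set_work_priority", "colonist_id": 1, "work_type": "growing"}
print(drop_redundant(proposal, colonists))  # {'type': 'NO_ACTION'}
```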

### Increase tick interval for meaningful progression
- [ ] Test with 30-60 second tick intervals so colony state actually changes between ticks
- [ ] Prefer fewer, higher-quality interventions over many disruptive ones

### Add "do no harm" principle to agent prompts
- [ ] System prompt: "Only propose actions that improve on the colony's current trajectory. If colonists are already productive, propose NO_ACTION."
- [ ] Weight NO_ACTION higher in the conflict resolver when no crisis exists
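One way the NO_ACTION weighting might look in the conflict resolver is sketched below. The `(action_name, score)` proposal format, the `crisis` flag, and the bonus value of 0.2 are placeholders chosen for illustration; the existing resolver's interface is not described in this issue.

```python
def resolve(proposals, crisis, no_action_bonus=0.2):
    """Pick the highest-scoring proposal, favoring NO_ACTION when there is no crisis.

    proposals: list of (action_name, score) pairs (hypothetical format).
    """
    if not proposals:
        return "NO_ACTION"

    def weighted(item):
        name, score = item
        # Only boost NO_ACTION while the colony is calm.
        if name == "NO_ACTION" and not crisis:
            return score + no_action_bonus
        return score

    return max(proposals, key=weighted)[0]


proposals = [("set_work_priority", 0.5), ("NO_ACTION", 0.4)]
print(resolve(proposals, crisis=False))  # NO_ACTION
print(resolve(proposals, crisis=True))   # set_work_priority
```

The bonus would need tuning against the benchmark so that agents still intervene when it matters.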

## Success Criteria

The benchmark should report a positive, statistically significant delta, for example:

```
Agent:    0.85 ± 0.05
Baseline: 0.75 ± 0.03
Delta:    +0.10** (p < 0.05)
```

Agents must demonstrably improve colony outcomes. Until then, the benchmark is failing honestly.
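The pass condition (positive delta with p < 0.05 on paired runs) can be checked mechanically. The sketch below uses an exact sign-flip permutation test on the paired differences, which is a swapped-in technique that needs only the standard library; the issue does not say which test produced the reported p-values. The function names are hypothetical, and the enumeration is only practical for a small number of paired runs.

```python
from itertools import product
from statistics import mean


def paired_deltas(agent, baseline):
    """Per-run score differences (agent minus baseline) for paired runs."""
    if not agent or len(agent) != len(baseline):
        raise ValueError("need equal, non-zero numbers of paired scores")
    return [a - b for a, b in zip(agent, baseline)]


def sign_flip_p_value(deltas):
    """Exact two-sided sign-flip permutation p-value for the mean difference."""
    observed = abs(mean(deltas))
    hits = 0
    total = 0
    # Enumerate all 2^n sign assignments; fine for small n.
    for signs in product((1, -1), repeat=len(deltas)):
        total += 1
        flipped = [s * d for s, d in zip(signs, deltas)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


def beats_baseline(agent, baseline, alpha=0.05):
    """True only if the mean delta is positive and significant at alpha."""
    deltas = paired_deltas(agent, baseline)
    return mean(deltas) > 0 and sign_flip_p_value(deltas) < alpha
```

With this test, at least six paired runs are needed before p can drop below 0.05, since the smallest achievable two-sided p-value with n pairs is 2 / 2^n.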

## How to Reproduce

```bash
# Requires: RimWorld running, RIMAPI mod, LM Studio with Nemotron Nano 4B
# Save a Crashlanded colony as "rle_crashlanded_v1"

# Agent run: 6-agent team drives the colony for 10 ticks
python scripts/run_scenario.py crashlanded_survival \
  --provider openai --model nvidia/nemotron-3-nano-4b \
  --base-url http://localhost:1234/v1 \
  --no-think --ticks 10

# Baseline run: unmanaged colony (built-in pawn AI only) for 10 ticks
python scripts/run_scenario.py crashlanded_survival --no-agent --ticks 10
```

Compare the two final scores: the agent run must score higher than the baseline run.
