evolution: an invalid model id in phantom.yaml silently breaks every scheduled run #153

## What happened

On 2026-06-13 the value `model: claude-fable-5` (a model that does not exist) landed in `config/phantom.yaml`. From that point every scheduled job that fired errored at the SDK boundary:

```
Error: Claude Code returned an error result: There's an issue with the selected
model (claude-fable-5). It may not exist or you may not have access to it.
Run --model to pick a different model.
```

It went unnoticed for ~2 days. Eight orchestration crons (community-presence, truffle-co-daily-digest, truffle-maintains-promote, truffleagent-site-iterate, phantom-contribution, retro, …) failed every fire in that window. Resetting `model` to a valid id restored them.

## Two gaps, not one

**1. No validation on the model id.** A bad `model:` value is accepted and persisted to `phantom.yaml` with no check that the id resolves. The failure only surfaces later, per job, at run time, far from the write that caused it.

**2. The fleet degrades silently.** Cron fires are spaced out in time, so each job's `consecutive_errors` climbs slowly and rarely reaches `MAX_CONSECUTIVE_ERRORS` (10) before the next reset. A whole fleet of jobs can sit at `last_run_status='error'` for days without any one of them flipping to `failed`, so nothing alerts. The only signal is the per-job consecutive-error threshold; there is no "N jobs erroring across the fleet" signal.
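To make the difference between the two signals concrete, here is a minimal sketch. The `Job` record, the `fleet_degraded` helper, its `min_erroring` default, and the error counts are all hypothetical and chosen for illustration; only `MAX_CONSECUTIVE_ERRORS`, `last_run_status`, `consecutive_errors`, and the job names come from this issue.

```python
from dataclasses import dataclass
from typing import List

MAX_CONSECUTIVE_ERRORS = 10  # per-job threshold named in this issue


@dataclass
class Job:
    # Hypothetical in-memory view of a scheduled job row.
    name: str
    last_run_status: str  # "ok" or "error"
    consecutive_errors: int


def per_job_alerts(jobs: List[Job]) -> List[str]:
    """Per-job signal: names of jobs that crossed the consecutive-error threshold."""
    return [j.name for j in jobs if j.consecutive_errors >= MAX_CONSECUTIVE_ERRORS]


def fleet_degraded(jobs: List[Job], min_erroring: int = 3) -> bool:
    """Hypothetical fleet signal: True when at least `min_erroring` jobs
    currently report last_run_status == 'error', whatever their counters say."""
    erroring = [j for j in jobs if j.last_run_status == "error"]
    return len(erroring) >= min_erroring


# Illustrative counts only: every job is erroring, none is near the threshold.
jobs = [
    Job("community-presence", "error", 2),
    Job("retro", "error", 1),
    Job("phantom-contribution", "error", 4),
]

print(per_job_alerts(jobs))   # no job has reached the per-job threshold
print(fleet_degraded(jobs))   # the fleet-wide check still trips
```

In this sketch the per-job check stays empty while the fleet check fires, which is the gap described above.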

A minor secondary issue: the manual revival path resets `status`/`consecutive_errors`/`next_run_at` but leaves `last_run_status` and `last_run_error` populated, so a healthy, re-armed job still reads `error` until its next fire, which is misleading when triaging.
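A revival that also clears the stale run-outcome fields might look roughly like the sketch below. The `JobRow` class, the `revive` function, and the `"active"` status value are assumptions made for illustration; only the five field names come from this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobRow:
    # Hypothetical record; field names follow the columns named in this issue.
    status: str
    consecutive_errors: int
    next_run_at: Optional[datetime]
    last_run_status: Optional[str]
    last_run_error: Optional[str]


def revive(job: JobRow, next_run_at: datetime) -> JobRow:
    """Re-arm a job and drop the outcome of its last (failed) run."""
    job.status = "active"          # assumed name for the healthy state
    job.consecutive_errors = 0
    job.next_run_at = next_run_at
    job.last_run_status = None     # cleared so the job no longer reads 'error'
    job.last_run_error = None
    return job


row = JobRow("failed", 10, None, "error", "There's an issue with the selected model")
revive(row, datetime(2026, 6, 16, tzinfo=timezone.utc))
print(row.last_run_status, row.last_run_error)
```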

## Repro

1. Set `model:` in `config/phantom.yaml` to a non-existent id.
2. Let any scheduled job fire (or trigger one).
3. The run errors with the message above; `last_run_status='error'`, `consecutive_errors` increments by 1.
4. Other crons keep erroring on their own cadence; none reaches the `failed` threshold, so no alert fires.
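The arithmetic behind step 4 can be sketched as follows. The cadences below are hypothetical (the issue does not list each job's schedule), the count ignores where in the window the first fire lands, and the sketch assumes every fire errors and nothing resets the counter in between.

```python
MAX_CONSECUTIVE_ERRORS = 10  # per-job threshold named in this issue


def errors_in_window(window_hours: int, interval_hours: int) -> int:
    """Approximate number of failing fires for a job that runs every
    `interval_hours` during an outage lasting `window_hours`."""
    return window_hours // interval_hours


def reaches_failed(window_hours: int, interval_hours: int) -> bool:
    """True if the job would accumulate enough consecutive errors to flip to failed."""
    return errors_in_window(window_hours, interval_hours) >= MAX_CONSECUTIVE_ERRORS


WINDOW_HOURS = 48  # roughly the 2 days the outage went unnoticed

for interval in (24, 6, 4):  # hypothetical cadences, in hours
    print(interval, errors_in_window(WINDOW_HOURS, interval),
          reaches_failed(WINDOW_HOURS, interval))
```

Under these assumptions a daily job accumulates about 2 errors and a 6-hourly job about 8, so only a job firing every 4 hours or more often would cross the threshold within the window.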

## Suggestion

- Validate the model id when it is written to `phantom.yaml` (reject, or fall back to the prior known-good value and log), so an evolution cycle cannot persist a model that does not resolve.
- Add a fleet-level health signal that fires when several jobs share `last_run_status='error'`, independent of any single job's consecutive-error count.
- On revival, clear `last_run_status`/`last_run_error` so a re-armed job does not read as failed.
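As a rough illustration of the first suggestion, the sketch below gates a model change on a resolver check and falls back to the prior value. Everything here is hypothetical: `apply_model_change`, the logger name, the `resolves` callable, and the id `known-good-model` are stand-ins, and the config is treated as an already-parsed dict. How an id is actually resolved (for example, a lookup against the provider's model list) is left to the injected callable.

```python
import logging
from typing import Callable, Dict

log = logging.getLogger("phantom.config")  # hypothetical logger name


def apply_model_change(
    config: Dict[str, str],
    candidate: str,
    resolves: Callable[[str], bool],
) -> Dict[str, str]:
    """Return the config to persist: the candidate model if it resolves,
    otherwise the unchanged config with a logged warning."""
    previous = config.get("model")
    if resolves(candidate):
        return {**config, "model": candidate}
    log.warning(
        "rejecting model %r: id does not resolve; keeping %r", candidate, previous
    )
    return dict(config)


# Stand-in resolver: a fixed set of ids assumed to be valid.
known_ids = {"known-good-model"}
current = {"model": "known-good-model"}

updated = apply_model_change(current, "claude-fable-5", known_ids.__contains__)
print(updated["model"])  # the bad id was rejected, the prior value is kept
```

Injecting the resolver keeps the write path testable without network access; whether to reject outright or fall back, as here, is the choice raised in the first bullet.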
