What happened
On 2026-06-13 the value model: claude-fable-5 (a model that does not exist) landed in config/phantom.yaml. From that point every scheduled job that fired errored at the SDK boundary:
Error: Claude Code returned an error result: There's an issue with the selected
model (claude-fable-5). It may not exist or you may not have access to it.
Run --model to pick a different model.
It went unnoticed for ~2 days. Eight orchestration crons (community-presence, truffle-co-daily-digest, truffle-maintains-promote, truffleagent-site-iterate, phantom-contribution, retro, …) failed every fire in that window. Resetting model to a valid id restored them.
Two gaps, not one
1. No validation on the model id. A bad model: value is accepted and persisted to phantom.yaml with no check that the id resolves. The failure only surfaces later, per-job, at run time — far from the write.
2. The fleet degrades silently. Cron errors are time-spaced, so consecutive_errors climbs slowly and rarely reaches MAX_CONSECUTIVE_ERRORS (10) before the next reset. A whole fleet of jobs can sit at last_run_status='error' for days without any one of them flipping to failed, so nothing alerts. There is no "N jobs erroring across the fleet" signal, only the per-job consecutive-error threshold.
Minor secondary: the manual revival path resets status/consecutive_errors/next_run_at but leaves last_run_status and last_run_error populated, so a healthy re-armed job still reads error until its next fire — misleading when triaging.
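To make the second gap concrete, here is a rough calculation, assuming a job fails on every fire, its first failing fire comes one interval after the bad write, and nothing resets its counter in between. The helper names and the example intervals are hypothetical; only the threshold of 10 comes from this report.

```python
from datetime import timedelta

MAX_CONSECUTIVE_ERRORS = 10  # threshold quoted in this report


def errors_in_window(interval: timedelta, window: timedelta) -> int:
    """Number of failing fires inside `window`.

    Assumes the first fire happens one `interval` after the bad write
    and every fire fails.
    """
    return window // interval


def time_until_failed(interval: timedelta,
                      threshold: int = MAX_CONSECUTIVE_ERRORS) -> timedelta:
    """Time from the bad write until the job's `threshold`-th consecutive error."""
    return interval * threshold


# Hypothetical cadences; the real cron intervals are not stated in this report.
two_days = timedelta(days=2)
daily = timedelta(days=1)
six_hourly = timedelta(hours=6)

daily_errors = errors_in_window(daily, two_days)        # 2 errors in the window
six_hourly_errors = errors_in_window(six_hourly, two_days)  # 8 errors in the window
daily_time_to_failed = time_until_failed(daily)         # 10 days
```

Under these assumptions neither cadence reaches 10 consecutive errors inside a 2-day window, which matches the behavior described above: every job errors, none flips to failed.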
Repro
- Set model: in config/phantom.yaml to a non-existent id.
- Let any scheduled job fire (or trigger one).
- The run errors with the message above; last_run_status='error', consecutive_errors increments by 1.
- Other crons keep erroring on their own cadence; none reaches the failed threshold, so no alert fires.
Suggestion
- Validate the model id when it is written to phantom.yaml (reject, or fall back to the prior known-good value and log), so an evolution cycle cannot persist a model that does not resolve.
- Add a fleet-level health signal that fires when several jobs share last_run_status='error', independent of any single job's consecutive-error count.
- On revival, clear last_run_status/last_run_error so a re-armed job does not read as failed.
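The three suggestions could look roughly like the sketch below. It is not phantom's code: the function names, the `resolves` callback, the `Job` dataclass, the "active" status value, and the `min_erroring` default are all assumptions for illustration. Only the field names (status, consecutive_errors, last_run_status, last_run_error) come from this report.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

log = logging.getLogger("phantom.sketch")  # hypothetical logger name


def choose_model_to_persist(candidate: str, previous: str,
                            resolves: Callable[[str], bool]) -> str:
    """Return the model id that should be written to the config.

    `resolves` is a hypothetical check (for example, a lookup against the
    provider's model list); it is not an existing phantom or SDK function.
    """
    if resolves(candidate):
        return candidate
    # Fall back to the prior known-good value and log, as suggested above.
    log.warning("model id %r does not resolve; keeping %r", candidate, previous)
    return previous


@dataclass
class Job:
    """Minimal stand-in for a scheduled job row (next_run_at omitted)."""
    name: str
    status: str = "active"  # assumed value for a healthy, armed job
    consecutive_errors: int = 0
    last_run_status: Optional[str] = None
    last_run_error: Optional[str] = None


def fleet_is_degraded(jobs: Iterable[Job], min_erroring: int = 3) -> bool:
    """True when at least `min_erroring` jobs last ran with an error,
    regardless of any single job's consecutive_errors count."""
    erroring = sum(1 for job in jobs if job.last_run_status == "error")
    return erroring >= min_erroring


def revive(job: Job) -> None:
    """Re-arm a job and clear the stale error fields as well."""
    job.status = "active"
    job.consecutive_errors = 0
    job.last_run_status = None
    job.last_run_error = None
```

The fleet check deliberately ignores consecutive_errors, so it can fire while every individual job is still far below the per-job threshold. The right value for `min_erroring` (or a ratio of the fleet instead of a count) would depend on how many jobs normally fail for unrelated reasons.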