evolution: an invalid model id in phantom.yaml silently breaks every scheduled run #153

@truffle-dev

What happened

On 2026-06-13 the value model: claude-fable-5 (a model that does not exist) landed in config/phantom.yaml. From that point every scheduled job that fired errored at the SDK boundary:

Error: Claude Code returned an error result: There's an issue with the selected
model (claude-fable-5). It may not exist or you may not have access to it.
Run --model to pick a different model.

It went unnoticed for ~2 days. Eight orchestration crons (community-presence, truffle-co-daily-digest, truffle-maintains-promote, truffleagent-site-iterate, phantom-contribution, retro, …) failed on every fire in that window. Resetting model to a valid id restored them.

Two gaps, not one

1. No validation on the model id. A bad model: value is accepted and persisted to phantom.yaml with no check that the id resolves. The failure only surfaces later, per-job, at run time — far from the write.

2. The fleet degrades silently. Cron errors are time-spaced, so consecutive_errors climbs slowly and rarely reaches MAX_CONSECUTIVE_ERRORS (10) before the next reset. A whole fleet of jobs can sit at last_run_status='error' for days without any one of them flipping to failed, so nothing alerts. There is no "N jobs erroring across the fleet" signal, only the per-job consecutive-error threshold.
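To make gap 2 concrete, here is a minimal sketch of a fleet-level check next to the existing per-job check. The `Job` shape, `FLEET_ERROR_THRESHOLD`, both function names, and the example error counts are assumptions for illustration; only `last_run_status`, `consecutive_errors`, and `MAX_CONSECUTIVE_ERRORS` (10) come from this report.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CONSECUTIVE_ERRORS = 10  # per-job threshold named in this issue
FLEET_ERROR_THRESHOLD = 3    # hypothetical: erroring jobs needed to raise a fleet alert


@dataclass
class Job:
    name: str
    last_run_status: Optional[str]
    consecutive_errors: int


def per_job_alerts(jobs: List[Job]) -> List[str]:
    """Jobs the per-job consecutive-error threshold would flag."""
    return [j.name for j in jobs if j.consecutive_errors >= MAX_CONSECUTIVE_ERRORS]


def fleet_alert(jobs: List[Job], threshold: int = FLEET_ERROR_THRESHOLD) -> List[str]:
    """Names of erroring jobs when at least `threshold` share status 'error', else []."""
    erroring = [j.name for j in jobs if j.last_run_status == "error"]
    return erroring if len(erroring) >= threshold else []


# Illustrative counts only: every job is erroring, none is near the per-job threshold.
jobs = [
    Job("community-presence", "error", 2),
    Job("retro", "error", 1),
    Job("phantom-contribution", "error", 3),
]
print(per_job_alerts(jobs))  # []
print(fleet_alert(jobs))     # ['community-presence', 'retro', 'phantom-contribution']
```

The per-job check stays quiet while the fleet check fires, which is the signal that was missing here.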

A minor secondary issue: the manual revival path resets status, consecutive_errors, and next_run_at but leaves last_run_status and last_run_error populated, so a healthy re-armed job still reads as error until its next fire, which is misleading when triaging.
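A sketch of what a revival that also clears the stale fields might look like. The job is modeled as a plain dict, and the `revive` helper and the `"active"` status value are assumptions; the field names are the ones mentioned in this report.

```python
from datetime import datetime, timezone


def revive(job: dict, next_run_at: datetime) -> dict:
    """Return a re-armed copy of `job` with the stale error fields cleared."""
    revived = dict(job)
    revived.update(
        status="active",        # assumed name for the healthy state
        consecutive_errors=0,
        next_run_at=next_run_at,
        last_run_status=None,   # cleared so the job no longer reads as error
        last_run_error=None,
    )
    return revived


stale = {
    "name": "retro",
    "status": "failed",
    "consecutive_errors": 10,
    "next_run_at": None,
    "last_run_status": "error",
    "last_run_error": "There's an issue with the selected model",
}
fresh = revive(stale, datetime(2026, 6, 16, tzinfo=timezone.utc))
print(fresh["last_run_status"], fresh["last_run_error"])  # None None
```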

Repro

  1. Set model: in config/phantom.yaml to a non-existent id.
  2. Let any scheduled job fire (or trigger one).
  3. The run errors with the message above; last_run_status='error', consecutive_errors increments by 1.
  4. Other crons keep erroring on their own cadence; none reaches the failed threshold, so no alert fires.
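The arithmetic behind step 4 can be sketched as follows. The cadences below are hypothetical (the real schedules are not listed in this report), and the sketch assumes each counter starts at 0 and every fire in the window fails.

```python
MAX_CONSECUTIVE_ERRORS = 10  # per-job threshold named in this issue
WINDOW_HOURS = 48            # the roughly two-day window from this report

# Hypothetical cadences (hours between fires), for illustration only.
cadence_hours = {"daily-job": 24, "twice-daily-job": 12, "six-hourly-job": 6}


def consecutive_errors_after(window_hours: int, interval_hours: int) -> int:
    """Errors accumulated if every fire inside the window fails."""
    return window_hours // interval_hours


counts = {
    name: consecutive_errors_after(WINDOW_HOURS, hours)
    for name, hours in cadence_hours.items()
}
flipped = [name for name, n in counts.items() if n >= MAX_CONSECUTIVE_ERRORS]
print(counts)   # {'daily-job': 2, 'twice-daily-job': 4, 'six-hourly-job': 8}
print(flipped)  # []
```

Under these assumed cadences even the most frequent job stays below the threshold after two days, so nothing flips to failed.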

Suggestion

  • Validate the model id when it is written to phantom.yaml (reject, or fall back to the prior known-good value and log), so an evolution cycle cannot persist a model that does not resolve.
  • Add a fleet-level health signal that fires when several jobs share last_run_status='error', independent of any single job's consecutive-error count.
  • On revival, clear last_run_status/last_run_error so a healthy re-armed job does not still read as error.
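The first suggestion could look roughly like the sketch below. `apply_model_update`, the `resolves` callback, and the placeholder ids in `KNOWN_GOOD` are all hypothetical; how the project would actually check that an id resolves (a static allowlist, a lookup against the provider, or something else) is left open.

```python
from typing import Callable, Tuple


def apply_model_update(
    config: dict, new_model: str, resolves: Callable[[str], bool]
) -> Tuple[dict, bool]:
    """Return (config, accepted); keep the prior model when the new id does not resolve."""
    if resolves(new_model):
        return {**config, "model": new_model}, True
    # Fall back to the prior known-good value and log, rather than persisting a bad id.
    print(f"rejected model id {new_model!r}; keeping {config.get('model')!r}")
    return dict(config), False


# Placeholder ids for illustration; not real model names.
KNOWN_GOOD = {"known-good-model-a", "known-good-model-b"}
current = {"model": "known-good-model-a"}

updated, accepted = apply_model_update(current, "claude-fable-5", KNOWN_GOOD.__contains__)
print(updated["model"], accepted)  # known-good-model-a False
```

Running the check at write time puts the failure next to the change that caused it, instead of at each job's next fire.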
