Supervisor remains alive with missing worker process; due scheduled jobs stop being dispatched until restart #751

@sapandiwakar

Description

We hit a production incident where Solid Queue stopped processing due scheduled jobs even though Mission Control still showed active workers with recent heartbeats.

Restarting the Solid Queue jobs process immediately fixed the issue.

This appears to be a supervisor, dispatcher, or process-liveness issue rather than an application job failure. Before the restart, the process table was missing one expected worker, the remaining workers continued heartbeating, and scheduled jobs accumulated. The supervisor did not appear to replace the missing worker or prune it as dead.

Environment

  • solid_queue: 1.4.0
  • Rails: 8.1.3
  • Ruby: 3.4.9
  • Queue DB: SQLite
  • App DB: PostgreSQL
  • Mode: fork
  • Started via bin/jobs
  • Deployment: Kamal container
  • Worker config:
dispatchers:
  - polling_interval: 1
    batch_size: 500

workers:
  - queues: [high_priority, medium_priority, default]
    threads: 3
    processes: 1
    polling_interval: 0.1

  - queues: [high_priority_transport, medium_priority_transport, default_transport]
    threads: 6
    processes: 3
    polling_interval: 0.1

Effective production env:

JOB_CONCURRENCY=1
JOB_THREADS=3
TRANSPORT_JOB_CONCURRENCY=3
TRANSPORT_JOB_THREADS=6
DB_POOL_SIZE=40

Expected behavior

The supervisor should keep the configured process set alive:

  • 1 Dispatcher
  • 1 Scheduler
  • 1 Supervisor(fork)
  • 4 Workers

If a worker exits, the supervisor should replace it or mark/prune it clearly.
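As a rough illustration of where the expected counts come from, the sketch below derives the dispatcher and worker totals from the config quoted above and reports any kind that has fewer processes than configured. The helper names `expected_processes` and `missing_processes` are hypothetical and not part of Solid Queue; the scheduler and supervisor are left out because the quoted config does not determine them.

```ruby
require "yaml"

# Dispatcher and worker settings as quoted in this issue.
CONFIG = YAML.safe_load(<<~CONFIG)
  dispatchers:
    - polling_interval: 1
      batch_size: 500
  workers:
    - queues: [high_priority, medium_priority, default]
      threads: 3
      processes: 1
      polling_interval: 0.1
    - queues: [high_priority_transport, medium_priority_transport, default_transport]
      threads: 6
      processes: 3
      polling_interval: 0.1
CONFIG

# Hypothetical helper: expected process count per kind, from the config.
# A worker entry without an explicit "processes" key counts as one process.
def expected_processes(config)
  {
    "Dispatcher" => config.fetch("dispatchers", []).size,
    "Worker" => config.fetch("workers", []).sum { |w| w.fetch("processes", 1) }
  }
end

# Hypothetical helper: kinds with fewer observed processes than expected,
# mapped to the size of the shortfall.
def missing_processes(expected, observed)
  expected.each_with_object({}) do |(kind, want), missing|
    have = observed.fetch(kind, 0)
    missing[kind] = want - have if have < want
  end
end
```

For this config the expected worker total is 4 (1 + 3). Comparing that against an observed worker count of 3 would report a shortfall of one worker.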

Due scheduled jobs should be moved from solid_queue_scheduled_executions to solid_queue_ready_executions and then picked up by workers.

Actual behavior

Mission Control showed only 3 workers before restart:

worker 416 PID 229965
worker 417 PID 236467
worker 418 PID 236473

All three had recent heartbeats.

At the same time:

  • Scheduled jobs accumulated: 264 scheduled jobs, many overdue by 20 hours or more
  • In progress jobs: 0
  • Blocked jobs: 0
  • Some transport jobs had recently finished, but old delayed scheduled jobs remained stuck
  • Restarting the jobs container restored the expected process shape:
{
  "Dispatcher" => 1,
  "Scheduler" => 1,
  "Supervisor(fork)" => 1,
  "Worker" => 4
}

After restart, job processing resumed.

Log evidence

The old worker log showed continuous heartbeats and periodic prune runs from the supervisor, but every prune run found nothing to prune, and there was no obvious process replacement:

SolidQueue-1.4.0 Prune dead processes (...) size: 0

This repeated every 5 minutes.

The log also showed that work stopped moving after approximately 2026-06-24 11:06:28 Europe/Zurich. After that, the process kept heartbeating/pruning but did not perform jobs.

Aggregated log counts by minute near the incident:

2026-06-24T11:05 enq=41 ready=12 performing=22 performed=19 hb=5 prune=0 err=18
2026-06-24T11:06 enq=81 ready=19 performing=32 performed=34 hb=5 prune=0 err=33
2026-06-24T11:07 enq=0  ready=0  performing=0  performed=0  hb=5 prune=0 err=0
...
2026-06-24T11:29 enq=0  ready=0  performing=0  performed=0  hb=6 prune=1 err=0
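The stall point in rows like these can be located mechanically. The following is a minimal sketch that parses the aggregated row format shown above (a minute followed by `key=value` pairs) and returns the first minute in which heartbeats continue but no job is started or finished. The helper names `parse_row` and `first_stalled_minute` are made up for illustration; the script that produced the original counts is not shown in this report.

```ruby
# Parse one aggregated row, e.g.
# "2026-06-24T11:05 enq=41 ready=12 performing=22 performed=19 hb=5 prune=0 err=18",
# into [minute, { "enq" => 41, ... }].
def parse_row(line)
  minute, *pairs = line.split
  counts = pairs.to_h do |pair|
    key, value = pair.split("=", 2)
    [key, Integer(value)]
  end
  [minute, counts]
end

# Hypothetical helper: first minute where the process still heartbeats
# but neither starts nor finishes any job. Returns nil if none is found.
def first_stalled_minute(lines)
  lines.each do |line|
    # Skip elisions ("...") and anything that is not an aggregated row.
    next unless line.match?(/\A\d{4}-\d{2}-\d{2}T\d{2}:\d{2}\s/)

    minute, counts = parse_row(line)
    stalled = counts.fetch("hb", 0) > 0 &&
              counts.fetch("performing", 0).zero? &&
              counts.fetch("performed", 0).zero?
    return minute if stalled
  end
  nil
end
```

Applied to the four rows above, this returns `"2026-06-24T11:07"`, the first full minute after work stopped moving.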

I also searched the log for replacement/termination signals and did not find any replace_fork, process-exit, shutdown-timeout, or prune signal explaining the missing worker.

Why this looks like a Solid Queue liveness issue

From the README, workers process jobs from solid_queue_ready_executions, while dispatchers move due scheduled jobs from solid_queue_scheduled_executions to ready executions.

In this incident:

  • due scheduled jobs were present and delayed
  • workers/supervisor were still heartbeating
  • one configured worker process was missing
  • no replacement/prune log was emitted
  • restart restored the missing process and resumed processing

So the failure mode appears to be: the supervisor remained alive, but the configured process set was incomplete and/or dispatching stopped, without recovery or visible error.
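Because heartbeats alone did not reveal this state, an external check on dispatch lag might have caught it earlier. The sketch below is one possible shape for such a check, not something Solid Queue provides: `dispatch_lag_seconds`, `dispatch_stalled?`, and the `tolerance` multiplier are hypothetical. It assumes the caller supplies the `scheduled_at` of the oldest row in solid_queue_scheduled_executions (in a Rails console this might be read with `SolidQueue::ScheduledExecution.minimum(:scheduled_at)`, which is an assumption about the model rather than something confirmed in this report).

```ruby
# Hypothetical helper: how many seconds the oldest scheduled job is overdue.
# Returns 0.0 when nothing is scheduled or the oldest job is not yet due.
def dispatch_lag_seconds(oldest_scheduled_at, now: Time.now)
  return 0.0 if oldest_scheduled_at.nil? || oldest_scheduled_at > now

  now - oldest_scheduled_at
end

# Hypothetical helper: treat dispatching as stalled when the oldest due job
# has waited longer than `tolerance` dispatcher polling intervals.
# The default tolerance of 60 is an arbitrary illustrative value.
def dispatch_stalled?(oldest_scheduled_at, polling_interval:, tolerance: 60, now: Time.now)
  dispatch_lag_seconds(oldest_scheduled_at, now: now) > polling_interval * tolerance
end
```

With the dispatcher `polling_interval: 1` from the config above, a job overdue by 20 hours would exceed the threshold by a wide margin, while an empty scheduled table or a job due in the future would not trigger the check.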

This looks similar in shape to #204, but we are seeing it on Solid Queue 1.4.0.
