Stop long-running scheduled procedures from starving scheduled reducers#5224
Open
Leonardo-Rocha wants to merge 4 commits into
Open
Stop long-running scheduled procedures from starving scheduled reducers#5224Leonardo-Rocha wants to merge 4 commits into
Leonardo-Rocha wants to merge 4 commits into
Conversation
ce369bb to
27e7bcf
Compare
…uled reducers The scheduler actor (`SchedulerActor::handle_queued`) awaited every scheduled function to completion before pulling the next due item from its `DelayQueue`. A scheduled `#[procedure]` that runs for a long time (e.g. one that calls `ctx.sleep_until` in a loop) therefore parked the actor and prevented every other due scheduled function -- reducers and procedures alike -- from being dispatched for as long as the procedure was alive. No error was logged; the reducer's schedule row simply never fired. Procedures already execute on their own pooled instances (`call_pooled`), separate from the main reducer executor, so awaiting them inline in the scheduler bought nothing but head-of-line blocking. Dispatch scheduled procedures on their own `tokio::spawn`ed task and let the actor loop keep draining the queue. Interval-scheduled procedures route their reschedule back to the actor through a new `SchedulerMessage::Reschedule`, since the spawned task cannot touch the actor-owned `queue`/`key_map`. Reducers keep their inline-await path (they cannot yield and run on the main executor, so this preserves their dispatch ordering). `Schedule` and the new `Reschedule` now share `enqueue_scheduled`, which removes any existing queued entry for the id before inserting -- without this, a row update or reschedule racing an already-queued entry would leak an orphaned `DelayQueue` entry and fire a duplicate dispatch. Consequence: scheduled procedures now run concurrently with scheduled reducers and with each other rather than strictly one-at-a-time. Transactional correctness is still enforced by the datastore's serializable isolation; only dispatch ordering relaxes. Concurrent execution remains bounded by the procedure instance pool. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
27e7bcf to
5521e64
Compare
Leonardo-Rocha
commented
Jun 4, 2026
| effective_at, | ||
| real_at, | ||
| } => { | ||
| // Incase of row update, remove the existing entry from queue first |
Author
There was a problem hiding this comment.
this function was just encapsulated in enqueue_scheduled to be reused
Leonardo-Rocha
commented
Jun 4, 2026
| // than one-at-a-time; the datastore's serializable isolation still applies.) | ||
| ScheduledFunctionKind::Procedure => { | ||
| let tx = self.tx.clone(); | ||
| tokio::spawn(async move { |
Author
There was a problem hiding this comment.
This tokio::spawn runs on the host/control runtime, not the per-database executor — but it's only the await-coordinator. The actual wasm work is dispatched to the database's SingleThreadedExecutor inside call_scheduled_procedure → call_pooled → run_async_job. So procedure execution stays on the per-DB pool (bounded by the procedure-instance semaphore); this task just waits for the result and forwards the interval reschedule.
…dure-starvation # Conflicts: # crates/core/src/host/scheduler.rs
The scheduler now dispatches scheduled procedures concurrently, so a long-running scheduled procedure no longer starves a scheduled reducer whose deadline falls during the procedure's sleep. Flip the assertion to expect interleaving (before < scheduled_reducer < after) and rename the test scheduled_procedure_scheduled_reducer_not_interleaved -> scheduled_procedure_scheduled_reducer_interleaves (plus its run selector and client handler) to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[Assisted by Claude Opus 4.8 (1M)] Disclaimer: I'm not well versed in the codebase and just wanted to give it a spin to figure if the problem was easy to solve or not. Feel free to disregard the PR if the claims don't make sense.
Description of Changes
Stop long-running scheduled procedures from starving scheduled reducers
The scheduler actor (
SchedulerActor::handle_queued) awaited every scheduled function to completion before pulling the next due item from itsDelayQueue. A scheduled#[procedure]that runs for a long time (e.g. one that callsctx.sleep_untilin a loop) therefore parked the actor and prevented every other due scheduled function -- reducers and procedures alike -- from being dispatched for as long as the procedure was alive. No error was logged; the reducer's schedule row simply never fired.Procedures already execute on their own pooled instances (
call_pooled), separate from the main reducer executor, so awaiting them inline in the scheduler bought nothing but head-of-line blocking. Dispatch scheduled procedures on their owntokio::spawned task and let the actor loop keep draining the queue. Interval-scheduled procedures route their reschedule back to the actor through a newSchedulerMessage::Reschedule, since the spawned task cannot touch the actor-ownedqueue/key_map.Reducers keep their inline-await path (they cannot yield and run on the main executor, so this preserves their dispatch ordering).
Scheduleand the newReschedulenow shareenqueue_scheduled, which removes any existing queued entry for the id before inserting -- without this, a row update or reschedule racing an already-queued entry would leak an orphanedDelayQueueentry and fire a duplicate dispatch.Consequence: scheduled procedures now run concurrently with scheduled reducers and with each other rather than strictly one-at-a-time. Transactional correctness is still enforced by the datastore's serializable isolation; only dispatch ordering relaxes. Concurrent execution remains bounded by the procedure instance pool.
API and ABI breaking changes
None.
Expected complexity level and risk
3 — localized to the scheduler actor, but it changes scheduled-execution concurrency semantics.
Testing
Automated (SDK procedure-concurrency suite)
#4955 has since merged, so this branch is updated from
masterand now flips the SDK test that encoded the starvation as expected behavior. Insdks/rust/tests(modrust_procedure_concurrency):scheduled_procedure_scheduled_reducer_not_interleaved→scheduled_procedure_scheduled_reducer_interleaves(test fn,make_testrun selector, and the client handler).before < after < scheduled_reducer(procedure runs to completion first) tobefore < scheduled_reducer < after(the reducer interleaves between the procedure's two inserts), and updated the docstrings to match.The scenario: a scheduled procedure inserts
scheduled_procedure_before, sleeps, then insertsscheduled_procedure_after; a scheduled reducer comes due during the sleep and insertsscheduled_reducer. The insertion order pins down whether the reducer was starved.Verified both directions (TDD), built against a
spacetimedb-standalonefrom this branch:cargo test -p spacetimedb-sdk --test test rust_procedure_concurrency::scheduled_procedure_scheduled_reducer_interleaves→ok. The full mod passes:test result: ok. 4 passed; 0 failed.awaitlike a reducer instead oftokio::spawn): the same test fails withgot 1 < 3 < 2— i.e.scheduled_procedure_before(1) < scheduled_procedure_after(2) < scheduled_reducer(3), the reducer starved to last. Confirms the renamed test actually gates the fix rather than passing vacuously.Manual reproduction
Reproduction: https://github.com/Qilvo-Tech/spacetimedb-scheduler-starvation-repro
That repo is a minimal module with two scheduled tables: a
#[procedure]that loops onctx.sleep_untilat a 500 ms cadence, and a#[reducer]on a 200 ms interval. The procedure is deliberately the slower ticker, so the reducer's deadline is always sooner — ruling out earliest-deadline or CPU-saturation explanations.procedure_loop iter=logs ~30 times as expected, whilereducer_tick firedlogs 0 times — the reducer is completely starved for as long as the procedure is alive, with no error emitted.spacetimedb-standalonefrom this branch, published the repro module to it):procedure_loopkeeps its 500 ms cadence andreducer_tick firednow logs at the full 5 Hz, interleaved with the procedure — e.g. 46 procedure ticks and 110 reducer fires over the same window. Log tail shows them interleaving cleanly.See #4954.