Conversation

@dt dt commented Dec 24, 2025

Typically, goroutines that have yielded their execution to the runtime until spare capacity becomes available are resumed in the order in which they yielded, i.e. FIFO. That said, when or even whether this happens is unpredictable: a goroutine opting to yield does so with the explicit knowledge that it could be paused indefinitely under sustained overload, and as such, yielding is only done cooperatively by the calling goroutine at locations it knows are safe places to pause, i.e. not in critical sections or while holding locks. In general, the runtime scheduler does not offer user-facing facilities for prioritizing the relative scheduling of goroutines, as doing so opens it up to complicated questions of priority inversion; voluntary yields, however, let the caller decide when and where it is safe to yield, avoiding these issues.
Thus, within the narrowly defined scope of the resumption order of goroutines that have explicitly yielded -- where it is already assumed that deprioritizing them for the duration of a yield will not introduce priority inversions -- introducing relative priority between goroutines, insofar as it allows control over the order in which they are resumed as capacity becomes available, can be feasible.
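
For concreteness, here is a minimal sketch of the cooperative pattern described above, assuming the `runtime.Yield` call this change builds on; the worker loop and `processBatch` helper are illustrative only, not part of the change:

```go
package bgwork

import "runtime"

// backgroundWorker processes batches of low-priority work, yielding between
// batches so foreground goroutines can run first under overload.
func backgroundWorker(batches <-chan []byte) {
	for b := range batches {
		processBatch(b) // a bounded unit of work

		// Yield only at a known-safe point: no locks held, no critical
		// section in progress. Under sustained overload this call may
		// park the goroutine indefinitely until capacity frees up.
		runtime.Yield()
	}
}

func processBatch(b []byte) {
	_ = b // placeholder for the actual per-batch work
}
```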

@dt dt requested a review from sumeerbhola December 24, 2025 22:47

dt commented Dec 24, 2025

This is an experiment/seed for discussion for now. I wanted to see how messy it'd be as much as anything.

I've been doing some benchmarking with yield enabled and have observed that under sustained overload it works as intended to prioritize foreground work, doing a good job of minimizing scheduling delays. However, I've also observed that in an overloaded cluster running a foreground workload along with two background operations -- such as rangefeeds replicating out to another cluster, and a schema change or IMPORT -- we see goroutines from both background operations in the yield queue in roughly equal measure. This makes sense: both call yield, both resume when able, and both yield again as needed, so during spikes they both end up in the queue.

Stepping back, however, and looking at the bigger picture of the roles these tasks play in cluster operation, it seems we would prefer to preferentially allocate capacity to one over the other: if a rangefeed is attempting to stream changes out of a cluster in near-real time, while a schema change or IMPORT is making lots of changes that then need streaming out, and overload forces us to delay one of them, we would likely prefer to delay the writer rather than the reader. If we delay the writer and the reader gets a chance to run instead, perhaps it finishes reading what the writer has written so far, and then has nothing left to do, freeing up capacity for the writer to write more on its own. If we do the opposite, the writer keeps writing and gets further and further ahead of the starved reader. Given that our goal with rangefeeds is near-real-time emission of writes, if we're rationing what we run we likely want to run them first, and only then run the writers that will create more work for them.

I originally thought we'd do this kind of cross-task prioritization in CRDB admission queues rather than in the runtime, but after some brief pondering I didn't see how we could do so without suffering the same under-utilization we observed without a scheduler-drained background work queue -- hence this idea of making that queue's order user-controlled instead.

I also waffled a bit about how to do a runtime-driven approach here: pass runtime.Yield a priority? Where would it stash it for later comparison -- in g? Or should yieldq nodes be changed to a wrapper of g+priority (non-trivial, since the queue uses g as its concrete type)? If it is stored on g, should I instead just have a setter that sets it once when a task launches a worker (as I do here), rather than plumbing it through every Yield call? And if there is a priority on g now, should it also participate in deciding who parks and who stays in the local runq in Yield()?
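
To make that trade-off concrete, here is a rough sketch of the set-once-at-launch option; the name `runtime.SetYieldPriority` and the integer priority values are placeholders I've made up for illustration, not the actual API in this change:

```go
package bgwork

import "runtime"

// Illustrative priorities: higher values are resumed from the yield queue
// first as capacity becomes available.
const (
	yieldPriorityRangefeed = 10 // the "reader" streaming changes out
	yieldPriorityBulkWrite = 1  // the "writer" (schema change, IMPORT)
)

// startWorker launches a goroutine whose yield priority is set once up
// front, so individual Yield call sites need no extra plumbing.
func startWorker(priority int, run func()) {
	go func() {
		runtime.SetYieldPriority(priority) // hypothetical setter stored on g
		run()
	}()
}
```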

Anyway, I'm going to play with this a bit more, but I'm also finding I want more visibility into Yield() delays. I think that too would need to be done in the runtime, since we can't afford to check the clock around the no-op calls -- though I guess we could have Yield return a duration?
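
If Yield did report how long the goroutine was parked, callers could aggregate the delay themselves without timing the call; a tiny sketch, assuming a hypothetical `runtime.Yield() time.Duration` signature:

```go
package bgwork

import (
	"runtime"
	"time"
)

// yieldDelay accumulates time spent parked in Yield, e.g. to feed a
// per-task metric.
var yieldDelay time.Duration

func yieldAndRecord() {
	// Assumes Yield returns how long the goroutine was parked (zero for
	// the common no-op case), so the caller never reads the clock itself.
	yieldDelay += runtime.Yield()
}
```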
