fix: prevent request admission timeout row drops by eric-tramel · Pull Request #730 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-06-02T00:38:16Z

📋 Summary

Fixes Issue #725 by treating local request-admission queue timeouts as scheduler/request-pressure retryables instead of provider failures, and by bounding scheduler model-task admission with provider/model request capacity. This keeps healthy endpoints from dropping rows when async scheduling load creates local request-admission pressure.

🔗 Related Issue

Fixes #725

🔄 Changes

Classify request-admission queue_timeout as ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT / ModelRequestAdmissionTimeoutError so model callers see the right local-boundary failure.
Preserve request-admission timeout failures during async salvage alongside rate limits instead of converting them into dropped rows.
Add provider/model request resource keys to scheduler task admission so model tasks are admitted closer to configured request capacity before they reach request admission.
Keep the existing request-pressure advisory as a fairness hint rather than adding a second pressure-budget dispatch policy.
Add scheduler/model/executor regression coverage for timeout classification, salvage preservation/pacing, request-resource admission, and no-waiter healthy-run behavior.

🧪 Testing

uv run ruff check architecture/dataset-builders.md packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py packages/data-designer-engine/src/data_designer/engine/models/clients/errors.py packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py packages/data-designer-engine/src/data_designer/engine/models/errors.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/test_model_errors.py
uv run ruff format --check <touched Python files> (architecture/dataset-builders.md excluded from format check because ruff requires preview for Markdown formatting)
uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/test_model_errors.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py -q (146 passed)
uv run pytest packages/data-designer-engine/tests -q (2224 passed)
Request-admission pressure harness run locally with the same scenario used to reproduce Issue Scheduler stalls can turn healthy request waiters into queue_timeout row drops #725. The harness remains outside the PR.

Performance demonstration:

Variant	Completed rows	Dropped rows	Request wait timeouts	Request admission timeout errors	Max request waiters	Dispatch batch yields	Elapsed
`origin-main-baseline`	47/128	81	2132	2132	95	0	12.378s
`simplified-working-tree`	128/128	0	1	1	1	0	4.992s

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated

github-actions · 2026-06-02T00:40:59Z

Code Review: PR #730 — `fix: prevent request admission timeout row drops`

Summary

This PR fixes Issue #725 in three coupled ways:

Error classification. Local request-admission queue_timeout failures
are split out from generic provider timeouts into a new
ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT and
ModelRequestAdmissionTimeoutError (subclass of ModelTimeoutError). The
model executor classifies them via _provider_error_from_request_admission,
and _should_retry keys off ProviderErrorKind rather than the chained
cause.
Salvage preservation. Request-admission timeouts are now treated like
429s in the async scheduler: they stay deferred across exhausted salvage
rounds and are retried after a paced wait, instead of becoming dropped
rows. The preservation machinery is renamed
_rate_limit_* → _preserved_retryable_* and now matches a tuple
PRESERVED_RETRYABLE_ERRORS = (ModelRateLimitError, ModelRequestAdmissionTimeoutError).
Dispatch pacing. _dispatch_queued_tasks now consults a
_RequestPressureDispatchContext that memoizes pressure reasons, in-flight
resource sets, and the open-peer search across one selection pass. After
each successful dispatch it asks _request_pressure_dispatch_batch_limit_hit
whether the per-resource / per-provider-model remaining capacity is
exhausted, and yields the dispatch loop with a dispatch_batch_yield event
if so. Cooldowns return a retry_after_seconds that the wait loop honours
via _wait_for_dispatch_wake. Startup uses RequestAdmissionConfig
(initial_limits, max_limit_clamps, startup_ramp_seconds) for a sensible
first-batch cap before snapshots exist.

The PR is well structured: the error/classification change is small, the
scheduler change is the big one, and tests cover the new behaviour densely
including the race conditions (test_scheduler_request_pressure_run_yields_before_creating_waiters,
test_scheduler_request_pressure_run_retries_after_cooldown_timer).

The benchmark in the PR description (47/128 → 128/128 rows, 12.4s → 4.7s) is
the right kind of evidence for a scheduler change of this size.

Findings

Correctness

Mismatched cooldown semantics in resalvage delay (low/medium).
_retryable_resalvage_delay_seconds (async_scheduler.py:1576-1591) was
renamed but its body still gates exclusively on ModelRateLimitError:
```
if not isinstance(self._deferred_errors.get(task), ModelRateLimitError):
    continue
```
Since _has_preserved_retryable_deferred_tasks now triggers this path for
request-admission timeouts too, an all-admission-timeout deferred set will
fall through to the constant RETRYABLE_RESALVAGE_BACKOFF_S (50ms) even
when the request-admission cooldown for the resource is much longer. That
is probably the right answer — admission timeouts don't have a
server-supplied retry-after the way 429s do — but the code reads as a
rename oversight rather than a deliberate decision. Worth either a comment
explaining why admission-timeout tasks intentionally fall back to the
short backoff, or extending the cooldown lookup symmetrically.
asyncio.sleep(0) after every dispatched batch (medium).
Two new sites unconditionally await asyncio.sleep(0) whenever
dispatch_outcome.dispatched is true: _main_dispatch_loop (around the
_dispatch_queued_tasks call) and _drain_frontier. This forces a loop
yield on every successful dispatch, not just when the new batch-limit
yield fires. The intent appears to be giving the worker tasks a chance to
pick up admitted work before the dispatcher loops again, but
_request_pressure_dispatch_batch_limit_hit already breaks the inner
dispatch loop with its own yield event when capacity is exhausted, and
_wait_for_dispatch_wake parks the loop otherwise. Two questions worth
answering in the PR:
1. Is the extra await sleep(0) a no-op overhead in the cap-not-reached
  case, or is it necessary because the worker tasks need a tick to
  schedule the awaitables created during dispatch? A one-line comment
  would put this to rest.
2. The select_next loop already calls
  _request_pressure_dispatch_batch_limit_hit and yields inside
  dispatch when the batch budget is hit; the post-call sleep(0) then
  yields again. If point (1) is real, the simplification is to move the
  yield into _dispatch_queued_tasks itself so it's handled in one
  place.
ModelRequestAdmissionTimeoutError hierarchy (low). The new error
is class ModelRequestAdmissionTimeoutError(ModelTimeoutError): .... That
is intentional (so legacy except ModelTimeoutError continues to catch
admission timeouts) but it does mean callers that want to treat the two
cases differently must use isinstance ordering carefully:
isinstance(exc, ModelRequestAdmissionTimeoutError) before
isinstance(exc, ModelTimeoutError). Worth a one-liner in the docstring on
the class describing the inheritance and why.
select_next view caching is correct but subtle (low).
select_next(..., view=view) now lets the caller pre-compute the queue
view once and reuse it for both selection and the explain_blocked
summary. The view is captured before the heap copy, and queue mutations
inside select_next only touch heap_copy, so reusing a stale view is
safe for the duration of one _dispatch_queued_tasks iteration. Good
optimisation. Worth a comment on select_next saying "view is read-only
and may be reused across calls within the same selection pass" so future
contributors don't accidentally mutate it.
_request_pressure_initial_dispatch_limit startup-ramp default (low).
When startup_ramp_seconds > 0 and configured_initial > 1, the function
hard-codes the dispatch budget to 1. That mirrors the controller's own
ramp behaviour, but couples the scheduler-side budget computation to an
invariant of AdaptiveRequestAdmissionController. If the controller
changes how it staggers the ramp (e.g., to min(2, initial_limit)), this
branch will silently disagree. Consider adding a property on
RequestAdmissionConfig like initial_dispatch_budget(resource) so the
scheduler asks the config for the answer instead of re-deriving it.

API / observability

New event kinds added to observability.py allowlist.
request_pressure_advisory_blocked and dispatch_batch_yield are added.
Good. Make sure consumers of the event stream (the docs page on scheduler
observability, any dashboards) get a follow-up; the PR doesn't list a
docs change beyond the architecture markdown one-liner.
_emit_scheduler_health_snapshot early-return (low). The if self._scheduler_event_sink is None: return short-circuit in
_emit_scheduler_health_snapshot is correct but the contents of the
snapshot are now potentially expensive to compute (pressure diagnostics,
in-flight resource scans). The early return saves work — good. However,
similar early returns are sprinkled inside _is_dispatch_eligible and
_emit_request_pressure_advisory_blocked. There's now a pattern of
if self._scheduler_event_sink is None: return repeated five-ish times in
this file. Consider a small helper _should_emit() -> bool to centralise
the check, or have _emit_scheduler_event itself early-return so callers
don't have to.

Style / conventions

All new code is properly typed with modern syntax (str | None,
dict[...]), uses from __future__ import annotations, and uses absolute
imports — matches STYLEGUIDE.md and the existing module conventions.
The naming refactor (rate_limit_preservation_* →
preserved_retryable_*) is thorough; I checked the diff and didn't see
any half-renamed callsites except for the cooldown calculator noted
above.
_RequestPressureDispatchContext uses @dataclass with field( default_factory=...) for the mutable defaults, which is correct.
The benchmark script lives under scripts/benchmarks/ and uses
argparse, prints provenance (git SHA, dirty flag, diff hash), and emits
both JSON and Markdown — consistent with other benchmark scripts in the
repo (good).

Test coverage

Solidly covered:

test_model_request_executor_classifies_sync_request_admission_queue_timeout
/ _async_* lock in the executor classification fix.
test_scheduler_request_admission_timeout_beyond_salvage_cap_is_delayed_not_dropped
is the direct regression test for Scheduler stalls can turn healthy request waiters into queue_timeout row drops #725.
test_scheduler_paces_sustained_preserved_retryable_resalvage is now
parametrised over rate_limit and request_admission_timeout, locking in
the unified salvage path.
test_scheduler_request_pressure_run_yields_before_creating_waiters
asserts the real fleet behaviour: zero request_wait_timeout events
and at least one dispatch_batch_yield — that's the right level of
assertion.
test_scheduler_request_pressure_run_retries_after_cooldown_timer exercises
the retry_after_seconds round-trip end-to-end.

Suggestions:

A test for the rename-survival of
_retryable_resalvage_delay_seconds with admission-timeout deferred
tasks — i.e., assert that the function does not try to use a cooldown
for admission-timeout tasks (or, after the fix, that it does). The
current test_retryable_resalvage_delay_uses_request_cooldown only seeds
a ModelRateLimitError, which means the rename oversight noted above
isn't caught by tests.
The benchmark prints BENCHMARK_RESULT= to stdout. If you intend that to
be machine-parsed by CI, a tiny test ensuring the JSON shape (just the
shape of BenchmarkResult) would prevent silent drift.

Architecture / docs

architecture/dataset-builders.md updated to mention "preserved retryable
failures: provider rate limits and local request-admission queue timeouts"
— accurate.
The PR description is unusually thorough and includes a reproducible
benchmark variant flag. Good.
_fix-policy.md allowlist (if any) for the touched paths should be
re-checked, but the changed files all sit inside the engine package and
tests, which is the typical scope for fix PRs.

Risks

Blast radius is "every async run with a model column".
_dispatch_queued_tasks is in the hot path. The new context-memoization
is on the right side of the cost curve (it removes redundant scans within
a selection pass), but the per-iteration await asyncio.sleep(0) is a
visible behaviour change for runs that don't hit any pressure. The
benchmark numbers in the PR look fine, but worth sanity-checking against
a "no pressure at all" workload (small model cap not saturated) to make
sure the throughput doesn't regress.
Behaviour change for callers who relied on
ProviderErrorKind.TIMEOUT to mean "any timeout". If downstream code
matches on ProviderErrorKind.TIMEOUT and expected admission timeouts to
be in that bucket, it will now miss them. Same goes for
ModelTimeoutError users with type(...) is checks (rare, but worth
a grep). The isinstance path is preserved by inheritance.
_should_retry change. Previously
isinstance(exc.__cause__, RequestAdmissionError) short-circuited retry;
now it's exc.kind == ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT. The
classifier sets kind=TIMEOUT for non-queue_timeout admission errors,
which means those will now retry where previously they did not. Is this
intended? If a RequestAdmissionError raised with a non-queue_timeout
reason should still be non-retryable, the executor should preserve that
invariant explicitly.

Verdict

Solid fix. The error classification + scheduler pacing combination is the
right shape for the bug. Test coverage is thorough on the new paths.

Two items are worth resolving before merge:

The rename oversight in _retryable_resalvage_delay_seconds — either fix
to also consider admission-timeout cooldowns, or document why it
intentionally only consults rate-limit cooldowns.
The _should_retry semantics change for non-queue_timeout
RequestAdmissionErrors. Confirm the change is intended (or restore the
previous "any RequestAdmissionError-caused ProviderError is
non-retryable" guard).

Smaller polish items (sleep(0) comment, helper for the
event_sink is None guard, comment on select_next view reuse) are
nice-to-haves.

Approve with minor changes requested on items (1) and (2).

greptile-apps · 2026-06-02T00:48:22Z

Greptile Summary

This PR fixes a row-drop bug (issue #725) where async scheduler load under high concurrency caused local request-admission queue timeouts to be misclassified as provider failures, eventually exhausting salvage rounds and silently dropping rows instead of retrying.

Introduces ModelRequestAdmissionTimeoutError / ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT to distinguish local queue timeouts from real provider timeouts, and adds it to the preserved-retryable set so the scheduler backs off and retries instead of dropping rows.
Adds per-provider/model scheduler resource limits (request:{provider}/{model}) derived from configured model weights, so the task admission controller throttles dispatch before tasks ever reach the request-admission queue and create waiters.
Adds asyncio.sleep(0) after task dispatch to yield the event loop, letting dispatched coroutines start before the scheduler loops again; also fixes the queue_empty vs admission_blocked event-kind selection to check queued_count first.

Confidence Score: 5/5

Safe to merge — no functional regressions identified; all changed paths have targeted test coverage and the fix is validated end-to-end by a harness run.

The two root causes are addressed cleanly and independently: the error-classification change is isolated to _provider_error_from_request_admission and guarded by RequestDenyReason Literal typing, and the scheduler resource-limit additions flow through the existing admission-controller machinery without touching hot paths for non-model tasks. The asyncio.sleep(0) placement after dispatch is the minimal intervention needed to yield the event loop, and the queue_empty event-kind fix corrects a pre-existing diagnostic inaccuracy without changing runtime behavior. Test coverage spans error classification (sync + async), salvage-cycle preservation and pacing, the no-waiter admission regression, and the resolver resource-limit computation. The 2224-test suite passed in full.

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py	Generalises the rate-limit-preservation machinery to cover any PRESERVED_RETRYABLE_ERRORS (now includes ModelRequestAdmissionTimeoutError); incorporates per-provider/model resource limits into task admission; fixes queue_empty/admission_blocked event selection; adds asyncio.sleep(0) after dispatch to yield the event loop.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py	Adds per-provider/model scheduler resource request to model tasks and builds request_resource_limits (min weight across generators for the same endpoint) exposed to the scheduler for task admission bounding.
packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py	Extracts _provider_error_from_request_admission helper that correctly maps queue_timeout decisions to REQUEST_ADMISSION_TIMEOUT and other admission denials to TIMEOUT; updates _should_retry to use the new kind enum check.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py	Widens SchedulerResourceKey from a closed Literal to str and adds request_scheduler_resource_key; validation updated to reject empty strings.
packages/data-designer-engine/src/data_designer/engine/models/errors.py	Adds ModelRequestAdmissionTimeoutError as a subclass of ModelTimeoutError with a user-facing message distinguishing local admission from provider-side timeouts.

Sequence Diagram

sequenceDiagram
    participant S as AsyncTaskScheduler
    participant TAC as TaskAdmissionController
    participant MRE as ModelRequestExecutor
    participant RAC as RequestAdmissionController
    participant P as Provider API

    S->>TAC: is_eligible(task, view) [checks llm_wait + request:provider/model limits]
    TAC-->>S: eligible
    S->>S: dispatch task + asyncio.sleep(0) yields loop
    S->>MRE: agenerate / acompletion
    MRE->>RAC: acquire_async(item)
    alt queue_timeout
        RAC-->>MRE: "RequestAdmissionError(reason=queue_timeout)"
        MRE-->>S: ProviderError(REQUEST_ADMISSION_TIMEOUT)
        S->>S: defer task to preserved retryables
        S->>S: _wait_before_retryable_resalvage()
        S->>S: retry in next salvage round
    else lease acquired
        RAC-->>MRE: RequestAdmissionLease
        MRE->>P: HTTP request
        P-->>MRE: response / error
        MRE->>RAC: release(lease, outcome)
        MRE-->>S: result or ModelRateLimitError
        alt rate limited
            S->>S: defer to preserved retryables
        else success
            S->>S: checkpoint row group
        end
    end

_{Reviews (3): Last reviewed commit: "fix: prevent request admission timeout r..." | Re-trigger Greptile}

- Classify local request-admission queue timeouts separately from provider timeouts - Preserve request-admission timeouts through async salvage like rate limits - Bound model task admission by provider/model request capacity - Add regression coverage for Issue NVIDIA-NeMo#725 Fixes NVIDIA-NeMo#725 Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>

eric-tramel requested a review from a team as a code owner June 2, 2026 00:38

eric-tramel deployed to agentic-ci June 2, 2026 00:38 — with GitHub Actions Active

eric-tramel self-assigned this Jun 2, 2026

eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 820e795 to 0c57a64 Compare June 2, 2026 00:41

eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 0c57a64 to d9da186 Compare June 2, 2026 00:56

eric-tramel added bug Something isn't working ⏱️ scheduler labels Jun 2, 2026

eric-tramel force-pushed the codex/fix-725-request-timeouts branch from d9da186 to 6d8c86f Compare June 2, 2026 01:28

eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 6d8c86f to 0756416 Compare June 2, 2026 01:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent request admission timeout row drops#730

fix: prevent request admission timeout row drops#730
eric-tramel wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
eric-tramel:codex/fix-725-request-timeouts

eric-tramel commented Jun 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

Confidence Score: 5/5

Sequence Diagram

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eric-tramel commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented Jun 2, 2026

Code Review: PR #730 — fix: prevent request admission timeout row drops

Summary

Findings

Correctness

API / observability

Style / conventions

Test coverage

Architecture / docs

Risks

Verdict

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

eric-tramel commented Jun 2, 2026 •

edited

Loading

Code Review: PR #730 — `fix: prevent request admission timeout row drops`

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading