Skip to content

fix: prevent request admission timeout row drops#730

Open
eric-tramel wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
eric-tramel:codex/fix-725-request-timeouts
Open

fix: prevent request admission timeout row drops#730
eric-tramel wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
eric-tramel:codex/fix-725-request-timeouts

Conversation

@eric-tramel
Copy link
Copy Markdown
Contributor

@eric-tramel eric-tramel commented Jun 2, 2026

📋 Summary

Fixes Issue #725 by treating local request-admission queue timeouts as scheduler/request-pressure retryables instead of provider failures, and by bounding scheduler model-task admission with provider/model request capacity. This keeps healthy endpoints from dropping rows when async scheduling load creates local request-admission pressure.

🔗 Related Issue

Fixes #725

🔄 Changes

  • Classify request-admission queue_timeout as ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT / ModelRequestAdmissionTimeoutError so model callers see the right local-boundary failure.
  • Preserve request-admission timeout failures during async salvage alongside rate limits instead of converting them into dropped rows.
  • Add provider/model request resource keys to scheduler task admission so model tasks are admitted closer to configured request capacity before they reach request admission.
  • Keep the existing request-pressure advisory as a fairness hint rather than adding a second pressure-budget dispatch policy.
  • Add scheduler/model/executor regression coverage for timeout classification, salvage preservation/pacing, request-resource admission, and no-waiter healthy-run behavior.

🧪 Testing

  • uv run ruff check architecture/dataset-builders.md packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py packages/data-designer-engine/src/data_designer/engine/models/clients/errors.py packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py packages/data-designer-engine/src/data_designer/engine/models/errors.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/test_model_errors.py
  • uv run ruff format --check <touched Python files> (architecture/dataset-builders.md excluded from format check because ruff requires preview for Markdown formatting)
  • uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/test_model_errors.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py -q (146 passed)
  • uv run pytest packages/data-designer-engine/tests -q (2224 passed)
  • Request-admission pressure harness run locally with the same scenario used to reproduce Issue Scheduler stalls can turn healthy request waiters into queue_timeout row drops #725. The harness remains outside the PR.

Performance demonstration:

Variant Completed rows Dropped rows Request wait timeouts Request admission timeout errors Max request waiters Dispatch batch yields Elapsed
origin-main-baseline 47/128 81 2132 2132 95 0 12.378s
simplified-working-tree 128/128 0 1 1 1 0 4.992s

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated

@eric-tramel eric-tramel requested a review from a team as a code owner June 2, 2026 00:38
@eric-tramel eric-tramel deployed to agentic-ci June 2, 2026 00:38 — with GitHub Actions Active
@eric-tramel eric-tramel self-assigned this Jun 2, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Code Review: PR #730fix: prevent request admission timeout row drops

Summary

This PR fixes Issue #725 in three coupled ways:

  1. Error classification. Local request-admission queue_timeout failures
    are split out from generic provider timeouts into a new
    ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT and
    ModelRequestAdmissionTimeoutError (subclass of ModelTimeoutError). The
    model executor classifies them via _provider_error_from_request_admission,
    and _should_retry keys off ProviderErrorKind rather than the chained
    cause.
  2. Salvage preservation. Request-admission timeouts are now treated like
    429s in the async scheduler: they stay deferred across exhausted salvage
    rounds and are retried after a paced wait, instead of becoming dropped
    rows. The preservation machinery is renamed
    _rate_limit_*_preserved_retryable_* and now matches a tuple
    PRESERVED_RETRYABLE_ERRORS = (ModelRateLimitError, ModelRequestAdmissionTimeoutError).
  3. Dispatch pacing. _dispatch_queued_tasks now consults a
    _RequestPressureDispatchContext that memoizes pressure reasons, in-flight
    resource sets, and the open-peer search across one selection pass. After
    each successful dispatch it asks _request_pressure_dispatch_batch_limit_hit
    whether the per-resource / per-provider-model remaining capacity is
    exhausted, and yields the dispatch loop with a dispatch_batch_yield event
    if so. Cooldowns return a retry_after_seconds that the wait loop honours
    via _wait_for_dispatch_wake. Startup uses RequestAdmissionConfig
    (initial_limits, max_limit_clamps, startup_ramp_seconds) for a sensible
    first-batch cap before snapshots exist.

The PR is well structured: the error/classification change is small, the
scheduler change is the big one, and tests cover the new behaviour densely
including the race conditions (test_scheduler_request_pressure_run_yields_before_creating_waiters,
test_scheduler_request_pressure_run_retries_after_cooldown_timer).

The benchmark in the PR description (47/128 → 128/128 rows, 12.4s → 4.7s) is
the right kind of evidence for a scheduler change of this size.

Findings

Correctness

  • Mismatched cooldown semantics in resalvage delay (low/medium).
    _retryable_resalvage_delay_seconds (async_scheduler.py:1576-1591) was
    renamed but its body still gates exclusively on ModelRateLimitError:

    if not isinstance(self._deferred_errors.get(task), ModelRateLimitError):
        continue

    Since _has_preserved_retryable_deferred_tasks now triggers this path for
    request-admission timeouts too, an all-admission-timeout deferred set will
    fall through to the constant RETRYABLE_RESALVAGE_BACKOFF_S (50ms) even
    when the request-admission cooldown for the resource is much longer. That
    is probably the right answer — admission timeouts don't have a
    server-supplied retry-after the way 429s do — but the code reads as a
    rename oversight rather than a deliberate decision. Worth either a comment
    explaining why admission-timeout tasks intentionally fall back to the
    short backoff, or extending the cooldown lookup symmetrically.

  • asyncio.sleep(0) after every dispatched batch (medium).
    Two new sites unconditionally await asyncio.sleep(0) whenever
    dispatch_outcome.dispatched is true: _main_dispatch_loop (around the
    _dispatch_queued_tasks call) and _drain_frontier. This forces a loop
    yield on every successful dispatch, not just when the new batch-limit
    yield fires. The intent appears to be giving the worker tasks a chance to
    pick up admitted work before the dispatcher loops again, but
    _request_pressure_dispatch_batch_limit_hit already breaks the inner
    dispatch loop with its own yield event when capacity is exhausted, and
    _wait_for_dispatch_wake parks the loop otherwise. Two questions worth
    answering in the PR:

    1. Is the extra await sleep(0) a no-op overhead in the cap-not-reached
      case, or is it necessary because the worker tasks need a tick to
      schedule the awaitables created during dispatch? A one-line comment
      would put this to rest.
    2. The select_next loop already calls
      _request_pressure_dispatch_batch_limit_hit and yields inside
      dispatch when the batch budget is hit; the post-call sleep(0) then
      yields again. If point (1) is real, the simplification is to move the
      yield into _dispatch_queued_tasks itself so it's handled in one
      place.
  • ModelRequestAdmissionTimeoutError hierarchy (low). The new error
    is class ModelRequestAdmissionTimeoutError(ModelTimeoutError): .... That
    is intentional (so legacy except ModelTimeoutError continues to catch
    admission timeouts) but it does mean callers that want to treat the two
    cases differently must use isinstance ordering carefully:
    isinstance(exc, ModelRequestAdmissionTimeoutError) before
    isinstance(exc, ModelTimeoutError). Worth a one-liner in the docstring on
    the class describing the inheritance and why.

  • select_next view caching is correct but subtle (low).
    select_next(..., view=view) now lets the caller pre-compute the queue
    view once and reuse it for both selection and the explain_blocked
    summary. The view is captured before the heap copy, and queue mutations
    inside select_next only touch heap_copy, so reusing a stale view is
    safe for the duration of one _dispatch_queued_tasks iteration. Good
    optimisation. Worth a comment on select_next saying "view is read-only
    and may be reused across calls within the same selection pass" so future
    contributors don't accidentally mutate it.

  • _request_pressure_initial_dispatch_limit startup-ramp default (low).
    When startup_ramp_seconds > 0 and configured_initial > 1, the function
    hard-codes the dispatch budget to 1. That mirrors the controller's own
    ramp behaviour, but couples the scheduler-side budget computation to an
    invariant of AdaptiveRequestAdmissionController. If the controller
    changes how it staggers the ramp (e.g., to min(2, initial_limit)), this
    branch will silently disagree. Consider adding a property on
    RequestAdmissionConfig like initial_dispatch_budget(resource) so the
    scheduler asks the config for the answer instead of re-deriving it.

API / observability

  • New event kinds added to observability.py allowlist.
    request_pressure_advisory_blocked and dispatch_batch_yield are added.
    Good. Make sure consumers of the event stream (the docs page on scheduler
    observability, any dashboards) get a follow-up; the PR doesn't list a
    docs change beyond the architecture markdown one-liner.

  • _emit_scheduler_health_snapshot early-return (low). The if self._scheduler_event_sink is None: return short-circuit in
    _emit_scheduler_health_snapshot is correct but the contents of the
    snapshot are now potentially expensive to compute (pressure diagnostics,
    in-flight resource scans). The early return saves work — good. However,
    similar early returns are sprinkled inside _is_dispatch_eligible and
    _emit_request_pressure_advisory_blocked. There's now a pattern of
    if self._scheduler_event_sink is None: return repeated five-ish times in
    this file. Consider a small helper _should_emit() -> bool to centralise
    the check, or have _emit_scheduler_event itself early-return so callers
    don't have to.

Style / conventions

  • All new code is properly typed with modern syntax (str | None,
    dict[...]), uses from __future__ import annotations, and uses absolute
    imports — matches STYLEGUIDE.md and the existing module conventions.
  • The naming refactor (rate_limit_preservation_*
    preserved_retryable_*) is thorough; I checked the diff and didn't see
    any half-renamed callsites except for the cooldown calculator noted
    above.
  • _RequestPressureDispatchContext uses @dataclass with field( default_factory=...) for the mutable defaults, which is correct.
  • The benchmark script lives under scripts/benchmarks/ and uses
    argparse, prints provenance (git SHA, dirty flag, diff hash), and emits
    both JSON and Markdown — consistent with other benchmark scripts in the
    repo (good).

Test coverage

Solidly covered:

  • test_model_request_executor_classifies_sync_request_admission_queue_timeout
    / _async_* lock in the executor classification fix.
  • test_scheduler_request_admission_timeout_beyond_salvage_cap_is_delayed_not_dropped
    is the direct regression test for Scheduler stalls can turn healthy request waiters into queue_timeout row drops #725.
  • test_scheduler_paces_sustained_preserved_retryable_resalvage is now
    parametrised over rate_limit and request_admission_timeout, locking in
    the unified salvage path.
  • test_scheduler_request_pressure_run_yields_before_creating_waiters
    asserts the real fleet behaviour: zero request_wait_timeout events
    and at least one dispatch_batch_yield — that's the right level of
    assertion.
  • test_scheduler_request_pressure_run_retries_after_cooldown_timer exercises
    the retry_after_seconds round-trip end-to-end.

Suggestions:

  • A test for the rename-survival of
    _retryable_resalvage_delay_seconds with admission-timeout deferred
    tasks
    — i.e., assert that the function does not try to use a cooldown
    for admission-timeout tasks (or, after the fix, that it does). The
    current test_retryable_resalvage_delay_uses_request_cooldown only seeds
    a ModelRateLimitError, which means the rename oversight noted above
    isn't caught by tests.
  • The benchmark prints BENCHMARK_RESULT= to stdout. If you intend that to
    be machine-parsed by CI, a tiny test ensuring the JSON shape (just the
    shape of BenchmarkResult) would prevent silent drift.

Architecture / docs

  • architecture/dataset-builders.md updated to mention "preserved retryable
    failures: provider rate limits and local request-admission queue timeouts"
    — accurate.
  • The PR description is unusually thorough and includes a reproducible
    benchmark variant flag. Good.
  • _fix-policy.md allowlist (if any) for the touched paths should be
    re-checked, but the changed files all sit inside the engine package and
    tests, which is the typical scope for fix PRs.

Risks

  • Blast radius is "every async run with a model column".
    _dispatch_queued_tasks is in the hot path. The new context-memoization
    is on the right side of the cost curve (it removes redundant scans within
    a selection pass), but the per-iteration await asyncio.sleep(0) is a
    visible behaviour change for runs that don't hit any pressure. The
    benchmark numbers in the PR look fine, but worth sanity-checking against
    a "no pressure at all" workload (small model cap not saturated) to make
    sure the throughput doesn't regress.
  • Behaviour change for callers who relied on
    ProviderErrorKind.TIMEOUT to mean "any timeout".
    If downstream code
    matches on ProviderErrorKind.TIMEOUT and expected admission timeouts to
    be in that bucket, it will now miss them. Same goes for
    ModelTimeoutError users with type(...) is checks (rare, but worth
    a grep). The isinstance path is preserved by inheritance.
  • _should_retry change. Previously
    isinstance(exc.__cause__, RequestAdmissionError) short-circuited retry;
    now it's exc.kind == ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT. The
    classifier sets kind=TIMEOUT for non-queue_timeout admission errors,
    which means those will now retry where previously they did not. Is this
    intended? If a RequestAdmissionError raised with a non-queue_timeout
    reason should still be non-retryable, the executor should preserve that
    invariant explicitly.

Verdict

Solid fix. The error classification + scheduler pacing combination is the
right shape for the bug. Test coverage is thorough on the new paths.

Two items are worth resolving before merge:

  1. The rename oversight in _retryable_resalvage_delay_seconds — either fix
    to also consider admission-timeout cooldowns, or document why it
    intentionally only consults rate-limit cooldowns.
  2. The _should_retry semantics change for non-queue_timeout
    RequestAdmissionErrors. Confirm the change is intended (or restore the
    previous "any RequestAdmissionError-caused ProviderError is
    non-retryable" guard).

Smaller polish items (sleep(0) comment, helper for the
event_sink is None guard, comment on select_next view reuse) are
nice-to-haves.

Approve with minor changes requested on items (1) and (2).

@eric-tramel eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 820e795 to 0c57a64 Compare June 2, 2026 00:41
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR fixes a row-drop bug (issue #725) where async scheduler load under high concurrency caused local request-admission queue timeouts to be misclassified as provider failures, eventually exhausting salvage rounds and silently dropping rows instead of retrying.

  • Introduces ModelRequestAdmissionTimeoutError / ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT to distinguish local queue timeouts from real provider timeouts, and adds it to the preserved-retryable set so the scheduler backs off and retries instead of dropping rows.
  • Adds per-provider/model scheduler resource limits (request:{provider}/{model}) derived from configured model weights, so the task admission controller throttles dispatch before tasks ever reach the request-admission queue and create waiters.
  • Adds asyncio.sleep(0) after task dispatch to yield the event loop, letting dispatched coroutines start before the scheduler loops again; also fixes the queue_empty vs admission_blocked event-kind selection to check queued_count first.

Confidence Score: 5/5

Safe to merge — no functional regressions identified; all changed paths have targeted test coverage and the fix is validated end-to-end by a harness run.

The two root causes are addressed cleanly and independently: the error-classification change is isolated to _provider_error_from_request_admission and guarded by RequestDenyReason Literal typing, and the scheduler resource-limit additions flow through the existing admission-controller machinery without touching hot paths for non-model tasks. The asyncio.sleep(0) placement after dispatch is the minimal intervention needed to yield the event loop, and the queue_empty event-kind fix corrects a pre-existing diagnostic inaccuracy without changing runtime behavior. Test coverage spans error classification (sync + async), salvage-cycle preservation and pacing, the no-waiter admission regression, and the resolver resource-limit computation. The 2224-test suite passed in full.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Generalises the rate-limit-preservation machinery to cover any PRESERVED_RETRYABLE_ERRORS (now includes ModelRequestAdmissionTimeoutError); incorporates per-provider/model resource limits into task admission; fixes queue_empty/admission_blocked event selection; adds asyncio.sleep(0) after dispatch to yield the event loop.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py Adds per-provider/model scheduler resource request to model tasks and builds request_resource_limits (min weight across generators for the same endpoint) exposed to the scheduler for task admission bounding.
packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py Extracts _provider_error_from_request_admission helper that correctly maps queue_timeout decisions to REQUEST_ADMISSION_TIMEOUT and other admission denials to TIMEOUT; updates _should_retry to use the new kind enum check.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py Widens SchedulerResourceKey from a closed Literal to str and adds request_scheduler_resource_key; validation updated to reject empty strings.
packages/data-designer-engine/src/data_designer/engine/models/errors.py Adds ModelRequestAdmissionTimeoutError as a subclass of ModelTimeoutError with a user-facing message distinguishing local admission from provider-side timeouts.

Sequence Diagram

sequenceDiagram
    participant S as AsyncTaskScheduler
    participant TAC as TaskAdmissionController
    participant MRE as ModelRequestExecutor
    participant RAC as RequestAdmissionController
    participant P as Provider API

    S->>TAC: is_eligible(task, view) [checks llm_wait + request:provider/model limits]
    TAC-->>S: eligible
    S->>S: dispatch task + asyncio.sleep(0) yields loop
    S->>MRE: agenerate / acompletion
    MRE->>RAC: acquire_async(item)
    alt queue_timeout
        RAC-->>MRE: "RequestAdmissionError(reason=queue_timeout)"
        MRE-->>S: ProviderError(REQUEST_ADMISSION_TIMEOUT)
        S->>S: defer task to preserved retryables
        S->>S: _wait_before_retryable_resalvage()
        S->>S: retry in next salvage round
    else lease acquired
        RAC-->>MRE: RequestAdmissionLease
        MRE->>P: HTTP request
        P-->>MRE: response / error
        MRE->>RAC: release(lease, outcome)
        MRE-->>S: result or ModelRateLimitError
        alt rate limited
            S->>S: defer to preserved retryables
        else success
            S->>S: checkpoint row group
        end
    end
Loading

Reviews (3): Last reviewed commit: "fix: prevent request admission timeout r..." | Re-trigger Greptile

@eric-tramel eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 0c57a64 to d9da186 Compare June 2, 2026 00:56
@eric-tramel eric-tramel added bug Something isn't working ⏱️ scheduler labels Jun 2, 2026
@eric-tramel eric-tramel force-pushed the codex/fix-725-request-timeouts branch from d9da186 to 6d8c86f Compare June 2, 2026 01:28
- Classify local request-admission queue timeouts separately from provider timeouts
- Preserve request-admission timeouts through async salvage like rate limits
- Bound model task admission by provider/model request capacity
- Add regression coverage for Issue NVIDIA-NeMo#725

Fixes NVIDIA-NeMo#725

Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>
@eric-tramel eric-tramel force-pushed the codex/fix-725-request-timeouts branch from 6d8c86f to 0756416 Compare June 2, 2026 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ⏱️ scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduler stalls can turn healthy request waiters into queue_timeout row drops

1 participant