fix: prevent request admission timeout row drops #730
Code Review: PR #730 —
Greptile Summary

This PR fixes a row-drop bug (issue #725): under high concurrency, async scheduler load caused local request-admission queue timeouts to be misclassified as provider failures. Those failures eventually exhausted the salvage rounds, and rows were silently dropped instead of being retried.
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Generalises the rate-limit-preservation machinery to cover any PRESERVED_RETRYABLE_ERRORS (now includes ModelRequestAdmissionTimeoutError); incorporates per-provider/model resource limits into task admission; fixes queue_empty/admission_blocked event selection; adds asyncio.sleep(0) after dispatch to yield the event loop. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py | Adds per-provider/model scheduler resource request to model tasks and builds request_resource_limits (min weight across generators for the same endpoint) exposed to the scheduler for task admission bounding. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py | Extracts _provider_error_from_request_admission helper that correctly maps queue_timeout decisions to REQUEST_ADMISSION_TIMEOUT and other admission denials to TIMEOUT; updates _should_retry to use the new kind enum check. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py | Widens SchedulerResourceKey from a closed Literal to str and adds request_scheduler_resource_key; validation updated to reject empty strings. |
| packages/data-designer-engine/src/data_designer/engine/models/errors.py | Adds ModelRequestAdmissionTimeoutError as a subclass of ModelTimeoutError with a user-facing message distinguishing local admission from provider-side timeouts. |
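
The classification change described in the table can be illustrated with a minimal sketch. Only the names `ModelTimeoutError`, `ModelRequestAdmissionTimeoutError`, `ProviderErrorKind.TIMEOUT`, `ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT`, and the `queue_timeout` reason come from this PR; the helper functions, enum values, and messages below are hypothetical stand-ins, not the engine's actual code.

```python
from enum import Enum


class ModelTimeoutError(Exception):
    """Stand-in for the engine's timeout error (the real class is not shown here)."""


class ModelRequestAdmissionTimeoutError(ModelTimeoutError):
    """Local admission-queue timeout; still a timeout, so existing handlers keep working."""


class ProviderErrorKind(Enum):
    # Only the two member names come from the PR; the string values are made up.
    TIMEOUT = "timeout"
    REQUEST_ADMISSION_TIMEOUT = "request_admission_timeout"


def kind_for_admission_denial(reason: str) -> ProviderErrorKind:
    """Hypothetical helper: queue timeouts get their own kind, other denials stay TIMEOUT."""
    if reason == "queue_timeout":
        return ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT
    return ProviderErrorKind.TIMEOUT


def exception_for_kind(kind: ProviderErrorKind) -> ModelTimeoutError:
    """Hypothetical helper: translate the error kind into the caller-facing exception."""
    if kind is ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT:
        return ModelRequestAdmissionTimeoutError(
            "Request timed out waiting in the local admission queue, not at the provider."
        )
    return ModelTimeoutError("Request timed out.")
```

Because the new error subclasses the timeout error, callers that catch the broader type are unaffected, while the scheduler can single out the local case.
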
Sequence Diagram
sequenceDiagram
participant S as AsyncTaskScheduler
participant TAC as TaskAdmissionController
participant MRE as ModelRequestExecutor
participant RAC as RequestAdmissionController
participant P as Provider API
S->>TAC: is_eligible(task, view) [checks llm_wait + request:provider/model limits]
TAC-->>S: eligible
S->>S: dispatch task + asyncio.sleep(0) yields loop
S->>MRE: agenerate / acompletion
MRE->>RAC: acquire_async(item)
alt queue_timeout
RAC-->>MRE: "RequestAdmissionError(reason=queue_timeout)"
MRE-->>S: ProviderError(REQUEST_ADMISSION_TIMEOUT)
S->>S: defer task to preserved retryables
S->>S: _wait_before_retryable_resalvage()
S->>S: retry in next salvage round
else lease acquired
RAC-->>MRE: RequestAdmissionLease
MRE->>P: HTTP request
P-->>MRE: response / error
MRE->>RAC: release(lease, outcome)
MRE-->>S: result or ModelRateLimitError
alt rate limited
S->>S: defer to preserved retryables
else success
S->>S: checkpoint row group
end
end
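
The retry path in the diagram can be sketched as a small salvage loop. This is a simplified illustration under stated assumptions, not the scheduler's implementation: `run_with_salvage`, its parameters, and the per-task failure counter are hypothetical, and the exception classes are local stand-ins. The only ideas taken from the PR are that errors in `PRESERVED_RETRYABLE_ERRORS` are deferred and retried rather than counted as provider failures, and that the loop yields with `asyncio.sleep(0)` after each dispatch.

```python
import asyncio


class ModelRateLimitError(Exception):
    """Stand-in for the engine's rate-limit error."""


class ModelRequestAdmissionTimeoutError(Exception):
    """Stand-in for the local admission-queue timeout error."""


PRESERVED_RETRYABLE_ERRORS = (ModelRateLimitError, ModelRequestAdmissionTimeoutError)


async def run_with_salvage(payloads, run_task, max_salvage_rounds=2, retry_wait_s=0.0):
    """Run tasks, retrying preserved errors without spending salvage budget.

    Payloads must be hashable. Returns (completed_results, dropped_payloads).
    Note: this sketch retries preserved errors indefinitely; a real scheduler
    would presumably bound or back off that loop.
    """
    pending = list(payloads)
    failures = {payload: 0 for payload in pending}
    completed, dropped = [], []
    while pending:
        next_round = []
        saw_preserved = False
        for payload in pending:
            try:
                completed.append(await run_task(payload))
            except PRESERVED_RETRYABLE_ERRORS:
                # Local pressure or rate limiting: defer, do not count as a failure.
                saw_preserved = True
                next_round.append(payload)
            except Exception:
                # Any other failure consumes one unit of salvage budget.
                failures[payload] += 1
                if failures[payload] > max_salvage_rounds:
                    dropped.append(payload)
                else:
                    next_round.append(payload)
            await asyncio.sleep(0)  # yield the event loop after each dispatch
        if saw_preserved:
            await asyncio.sleep(retry_wait_s)  # pause before the next salvage round
        pending = next_round
    return completed, dropped
```

With this shape, a task that repeatedly hits local admission timeouts is still completed once capacity frees up, whereas a task that keeps failing for other reasons is dropped after its salvage budget is spent.
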
Reviews (3): Last reviewed commit: "fix: prevent request admission timeout r..."
- Classify local request-admission queue timeouts separately from provider timeouts
- Preserve request-admission timeouts through async salvage like rate limits
- Bound model task admission by provider/model request capacity
- Add regression coverage for Issue NVIDIA-NeMo#725

Fixes NVIDIA-NeMo#725

Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>
📋 Summary
Fixes Issue #725 by treating local request-admission queue timeouts as scheduler/request-pressure retryables instead of provider failures, and by bounding scheduler model-task admission with provider/model request capacity. This keeps healthy endpoints from dropping rows when async scheduling load creates local request-admission pressure.
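
To make the capacity bound concrete, here is a minimal sketch of how per-provider/model limits might be derived and checked. It assumes a key format of `request:<provider>/<model>` (as suggested by the sequence diagram) and takes the minimum limit across generators that share an endpoint (as described in the review summary). The function names `request_key`, `build_request_resource_limits`, and `is_eligible`, and their signatures, are hypothetical and stand in for the real `request_scheduler_resource_key` and `request_resource_limits` machinery.

```python
def request_key(provider: str, model: str) -> str:
    """Build a scheduler resource key for one provider/model endpoint (assumed format)."""
    if not provider or not model:
        # Mirrors the PR's validation idea: empty strings are rejected.
        raise ValueError("provider and model must be non-empty strings")
    return f"request:{provider}/{model}"


def build_request_resource_limits(generators):
    """Collapse (provider, model, limit) triples into one limit per endpoint.

    When several generators target the same endpoint, the smallest limit wins.
    """
    limits = {}
    for provider, model, limit in generators:
        key = request_key(provider, model)
        limits[key] = min(limit, limits.get(key, limit))
    return limits


def is_eligible(key: str, in_flight: dict, limits: dict) -> bool:
    """Admit a task only while the endpoint has spare request capacity."""
    return in_flight.get(key, 0) < limits.get(key, float("inf"))


limits = build_request_resource_limits(
    [("nvidia", "model-a", 8), ("nvidia", "model-a", 4), ("openai", "model-b", 16)]
)
```

Bounding task admission this way keeps the scheduler from dispatching more model tasks than the request layer can admit, which is what created the local queue timeouts in the first place.
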
🔗 Related Issue
Fixes #725
🔄 Changes
- Classify `queue_timeout` as `ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT` / `ModelRequestAdmissionTimeoutError` so model callers see the right local-boundary failure.

🧪 Testing
- `uv run ruff check architecture/dataset-builders.md packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py packages/data-designer-engine/src/data_designer/engine/models/clients/errors.py packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py packages/data-designer-engine/src/data_designer/engine/models/errors.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/test_model_errors.py`
- `uv run ruff format --check <touched Python files>` (`architecture/dataset-builders.md` excluded from the format check because ruff requires preview for Markdown formatting)
- `uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/test_model_errors.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py -q` (146 passed)
- `uv run pytest packages/data-designer-engine/tests -q` (2224 passed)

Performance demonstration:
[Comparison: origin-main-baseline vs. simplified-working-tree]

✅ Checklist