feat(waterdata): Add multi-value GET-parameter chunker for OGC API#280
Draft
thodson-usgs wants to merge 1 commit into
Pull request overview
Adds multi-value GET-parameter chunking to Water Data OGC getters to keep requests under the server’s URL byte limit (preventing HTTP 414), while coordinating with existing CQL filter chunking and adding a quota safety abort mechanism.
Changes:
- Introduces `dataretrieval.waterdata.chunking` with a greedy chunk planner, `RequestTooLarge`/`QuotaExhausted`, and the `@multi_value_chunked` decorator.
- Wraps `utils._fetch_once` with `@multi_value_chunked` outside `@filters.chunked`, and updates pagination/response metadata aggregation to reflect last headers + cumulative elapsed.
- Adds extensive unit tests for planning/coordination/quota behavior and documents the behavior change in NEWS + `get_daily` docs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `dataretrieval/waterdata/chunking.py` | New multi-value chunking decorator, planner, and quota guard exceptions. |
| `dataretrieval/waterdata/utils.py` | Wires the new chunker into `_fetch_once`; aggregates paginated response headers/elapsed for accurate metadata and quota inspection. |
| `dataretrieval/waterdata/filters.py` | Shares encoding-ratio helper; aggregates chunked responses using first URL + last headers + summed elapsed. |
| `tests/waterdata_test.py` | Adds planner/decorator/quota guard tests using a deterministic fake request builder. |
| `dataretrieval/waterdata/api.py` | Updates `get_daily` docstring with a chained-query example relying on transparent chunking. |
| `NEWS.md` | Documents the new chunking behavior and metadata semantics changes. |
9d46c40 to 1c450dd
1c450dd to 12b16f2
For multi-value waterdata queries (e.g. monitoring_location_id with ~300+ sites), the GET URL produced by PR DOI-USGS#233 blows past the server's ~8 KB nginx buffer and the API returns HTTP 414. This PR adds a chunker that transparently splits long list params across sub-requests so each URL fits the byte budget.

The chunker is a decorator applied to ``_fetch_once`` outside the existing ``@filters.chunked`` (CQL chunker), so list-chunking is the outer loop and filter-chunking is the inner loop:

    @chunking.multi_value_chunked(build_request=_construct_api_requests)
    @filters.chunked(build_request=_construct_api_requests)
    def _fetch_once(args): ...

Key design points:

- ``_plan_chunks`` greedy-halves the largest chunk across all dimensions until the worst-case sub-request fits ``url_limit`` (URL + body, via ``_request_bytes``, so POST routes are sized correctly). Cartesian product of per-dim partitions becomes the sub-request set; capped at ``max_chunks=1000``.
- ``_filter_aware_probe_args`` coordinates with ``filters.chunked``: the planner probes URL length using a synthetic clause that matches the inner filter chunker's bail-floor size (longest single clause, scaled by worst-case URL encoding ratio). Without this coordination, the outer planner would raise ``RequestTooLarge`` on combinations the stacked chunkers can actually handle.
- ``QuotaExhausted`` mid-call guard reads ``x-ratelimit-remaining`` after each sub-request; if it drops below ``quota_safety_floor=50``, the wrapper raises with the partial frame, completed-chunk offset, and last observed remaining quota, letting callers salvage or resume after the rate-limit window resets, rather than crash into a silent mid-pagination 429.
- ``RequestTooLarge`` is raised when the smallest reducible plan still exceeds ``url_limit`` (every multi-value param at a singleton chunk and any chunkable filter at the inner chunker's bail floor) or when the cartesian product exceeds ``max_chunks``.
- All defaults (``url_limit``, ``max_chunks``, ``quota_safety_floor``) resolve at call time, so monkey-patching ``filters._WATERDATA_URL_BYTE_LIMIT`` for tests / non-default quotas affects the decorator uniformly.

Public additions:

- ``dataretrieval.waterdata.chunking.multi_value_chunked``
- ``dataretrieval.waterdata.chunking.RequestTooLarge``
- ``dataretrieval.waterdata.chunking.QuotaExhausted`` (carries ``partial_frame``, ``partial_response``, ``completed_chunks``, ``total_chunks``, ``remaining``)

Tests (30 new):

- ``_filter_aware_probe_args`` worst-case-clause modelling
- ``_plan_chunks`` greedy halving, RequestTooLarge floor, filter-chunker coordination, ``max_chunks`` cap, lazy-default reads
- ``multi_value_chunked`` pass-through, cartesian-product shape, end-to-end with stacked filter chunker
- ``QuotaExhausted`` header parsing, mid-call abort, last-chunk no-abort, zero-floor disable
- ``RequestTooLarge`` message contents and triggering conditions

End-to-end correctness verified against the live API: identical per-site cell-for-cell output between unchunked (single call) and chunked (forced fan-out via patched limit) paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
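The greedy-halving plan described in the commit message can be illustrated with a minimal sketch. This is not the PR's `_plan_chunks`: the names `plan_chunks` and `request_bytes`, the toy query-string size model, and the local `RequestTooLarge` class are all assumptions made for illustration, and filter coordination is left out.

```python
from itertools import product
from math import prod


class RequestTooLarge(ValueError):
    """Local stand-in for illustration; not the library's exception class."""


def request_bytes(params):
    """Toy size model (an assumption): length of a naive query string."""
    return len("&".join(f"{k}={','.join(v)}" for k, v in params.items()))


def plan_chunks(params, url_limit, max_chunks=1000):
    """Halve the largest chunk until the worst-case sub-request fits url_limit."""
    # Start with one chunk per dimension holding every value.
    plan = {name: [list(values)] for name, values in params.items()}
    while True:
        # Worst case: the longest chunk of every dimension in one request.
        worst = {
            name: max(chunks, key=lambda c: len(",".join(c)))
            for name, chunks in plan.items()
        }
        if request_bytes(worst) <= url_limit:
            break
        # Pick the largest chunk that can still be split.
        candidates = [
            (len(",".join(chunk)), name, i)
            for name, chunks in plan.items()
            for i, chunk in enumerate(chunks)
            if len(chunk) > 1
        ]
        if not candidates:
            raise RequestTooLarge("does not fit even with singleton chunks")
        _, name, i = max(candidates)
        chunk = plan[name].pop(i)
        mid = len(chunk) // 2
        plan[name][i:i] = [chunk[:mid], chunk[mid:]]
        # The sub-request count is the product of per-dimension chunk counts.
        count = prod(len(chunks) for chunks in plan.values())
        if count > max_chunks:
            raise RequestTooLarge(f"{count} sub-requests exceeds {max_chunks}")
    names = list(plan)
    return [
        dict(zip(names, combo))
        for combo in product(*(plan[n] for n in names))
    ]


params = {
    "site": [f"USGS-{i:08d}" for i in range(8)],
    "param": ["00060", "00065"],
}
sub_requests = plan_chunks(params, url_limit=60)
```

Under this toy size model the eight sites end up in four chunks of two while the short `param` list stays whole, so the plan yields four sub-requests that each fit the 60-byte budget.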
12b16f2 to cd70929
Splits multi-value list params across sub-requests so each URL fits the server's ~8 KB byte cap. Without this, `get_daily(monitoring_location_id=[300+ sites])` and similar workloads blow past the nginx buffer and the API returns HTTP 414.

The chunker is a decorator wrapped outside the existing `@filters.chunked` (CQL chunker) on `_fetch_once`, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop.

The two coordinate via a filter-aware probe so the outer planner doesn't reject combinations the stacked chunkers can actually handle.
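The outer/inner loop nesting that results from this decorator order can be seen with two toy decorators. Everything here is hypothetical: `outer_list_chunker` and `inner_filter_chunker` only mimic the stacking order of `multi_value_chunked` and `filters.chunked`, with fixed split rules instead of byte-budget planning.

```python
from functools import wraps

calls = []  # records the order in which leaf requests are issued


def outer_list_chunker(func):
    """Toy stand-in for the list chunker: splits the site list in two."""
    @wraps(func)
    def wrapper(sites, clauses):
        mid = max(1, len(sites) // 2)
        rows = []
        for part in (sites[:mid], sites[mid:]):
            if part:
                rows.extend(func(part, clauses))
        return rows
    return wrapper


def inner_filter_chunker(func):
    """Toy stand-in for the CQL chunker: one request per filter clause."""
    @wraps(func)
    def wrapper(sites, clauses):
        rows = []
        for clause in clauses:
            rows.extend(func(sites, [clause]))
        return rows
    return wrapper


@outer_list_chunker      # applied last, so it runs first (outer loop)
@inner_filter_chunker    # applied first, so it runs inside (inner loop)
def fetch_once(sites, clauses):
    calls.append((tuple(sites), clauses[0]))
    return [(site, clauses[0]) for site in sites]


rows = fetch_once(["A", "B", "C", "D"], ["x>1", "y<2"])
```

Each site chunk is visited once, and all filter clauses are exhausted for it before the next site chunk starts, which is the ordering the description calls outer list-chunking with inner filter-chunking.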
Planner
`_plan_chunks` greedy-halves the largest chunk across all dimensions until the worst-case sub-request fits `url_limit` (URL + body, via `_request_bytes`, so POST routes size correctly). The Cartesian product of per-dim partitions becomes the sub-request set; capped at `max_chunks=1000`.

Quota guard
Mid-call, the wrapper reads `x-ratelimit-remaining` after each sub-request. If it drops below `quota_safety_floor=50`, the wrapper raises `QuotaExhausted` carrying:
- `partial_frame`: rows collected so far
- `completed_chunks` / `total_chunks`: resume offset
- `remaining`: last observed quota

Callers can either salvage the partial result or resume after the rate-limit window resets, rather than crash into a 429.
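A minimal sketch of the guard and of a caller-side salvage/resume, under stated assumptions: `run_chunks`, `make_fake_fetch`, and the local `QuotaExhausted` class are hypothetical stand-ins, the partial result is a plain list rather than a DataFrame, and the fake fetcher simply counts a quota down by one per call.

```python
class QuotaExhausted(RuntimeError):
    """Local stand-in carrying the fields named in the description."""

    def __init__(self, partial_frame, completed_chunks, total_chunks, remaining):
        super().__init__(
            f"quota at {remaining} after {completed_chunks}/{total_chunks} chunks"
        )
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining


def run_chunks(chunks, fetch, quota_safety_floor=50):
    """Fetch each chunk, aborting early if the reported quota drops too low."""
    rows = []
    total = len(chunks)
    for done, chunk in enumerate(chunks, start=1):
        data, headers = fetch(chunk)
        rows.extend(data)
        raw = headers.get("x-ratelimit-remaining")
        remaining = int(raw) if raw is not None and raw.isdigit() else None
        is_last = done == total  # nothing left to protect after the last chunk
        if (
            quota_safety_floor > 0  # a zero floor disables the guard
            and not is_last
            and remaining is not None
            and remaining < quota_safety_floor
        ):
            raise QuotaExhausted(rows, done, total, remaining)
    return rows


def make_fake_fetch(start_quota):
    """Fake fetcher (assumption): each call spends one unit of quota."""
    state = {"remaining": start_quota}

    def fetch(chunk):
        state["remaining"] -= 1
        return list(chunk), {"x-ratelimit-remaining": str(state["remaining"])}

    return fetch


chunks = [["A"], ["B"], ["C"], ["D"]]
try:
    result = run_chunks(chunks, make_fake_fetch(52))
except QuotaExhausted as exc:
    salvaged = exc.partial_frame
    resume_from = exc.completed_chunks
    last_remaining = exc.remaining
    # After the rate-limit window resets, fetch only the chunks not yet done.
    rest = run_chunks(chunks[resume_from:], make_fake_fetch(1000))
    result = salvaged + rest
```

Because the exception carries the completed-chunk offset, the caller can slice the original chunk list and continue without refetching what it already has.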
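Failure modes

| Condition | Behavior |
|---|---|
| Smallest reducible plan still exceeds `url_limit` | `RequestTooLarge` with explicit message |
| Cartesian product exceeds `max_chunks` | `RequestTooLarge` with the actual count |
| `x-ratelimit-remaining` drops below floor mid-call | `QuotaExhausted` with `partial_frame`, offsets |

Public API
- `dataretrieval.waterdata.chunking.multi_value_chunked`
- `dataretrieval.waterdata.chunking.RequestTooLarge`
- `dataretrieval.waterdata.chunking.QuotaExhausted`
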
Tests
30 new tests (offline, deterministic fake `build_request`) covering:
- `_filter_aware_probe_args` worst-case-clause modelling
- `_plan_chunks` greedy halving, `RequestTooLarge` floor, filter-chunker coordination, `max_chunks` cap, lazy-default reads
- `multi_value_chunked` pass-through, cartesian-product shape, end-to-end with stacked filter chunker
- `QuotaExhausted` header parsing, mid-call abort, last-chunk no-abort, zero-floor disable

End-to-end correctness was verified against the live API: per-site cell-for-cell identical output between an unchunked single call and a forced-fan-out chunked call (patched `_WATERDATA_URL_BYTE_LIMIT=300` → 30 HTTP sub-requests for the same query).

Known limitations (deferred)
- `max_chunks` caps the chunk count, not the HTTP request count. A chunked call with 1000 chunks, each paginating 5×, issues 5000 HTTP requests. Lower `max_chunks` proportionally for heavy-pagination workloads, or rely on the `QuotaExhausted` runtime guard.
- `QuotaExhausted`. Caller decides whether to wait + resume or fail loudly.

Provenance
Supersedes #276 with squashed history. The original PR carried 21 commits across iterative simplify / review-response passes; this one rolls them into a single coherent change atop the now-merged #233 (multi-value GET routing) and #279 (mid-pagination failure surfacing).
Refs #276, #279