feat(waterdata): Add multi-value GET-parameter chunker for OGC API #280

Draft
thodson-usgs wants to merge 1 commit into DOI-USGS:main from thodson-usgs:chunker-multivalue

@thodson-usgs

Splits multi-value list params across sub-requests so each URL fits the server's ~8 KB byte cap. Without this, get_daily(monitoring_location_id=[300+ sites]) and similar workloads blow past the nginx buffer and the API returns HTTP 414.

The chunker is a decorator wrapped outside the existing @filters.chunked (CQL chunker) on _fetch_once, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop:

@chunking.multi_value_chunked(build_request=_construct_api_requests)   # outer
@filters.chunked(build_request=_construct_api_requests)                # inner
def _fetch_once(args): ...

The two coordinate via a filter-aware probe so the outer planner doesn't reject combinations the stacked chunkers can actually handle.
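The loop nesting implied by the decorator order can be illustrated with a toy sketch. Both decorators below are hypothetical stand-ins (they are not the PR's `multi_value_chunked` or `filters.chunked`), and the `sites` / `filter` argument names are assumptions made for the illustration only.

```python
from functools import wraps


def outer_list_chunker(fn):
    """Hypothetical stand-in for the list chunker: splits `sites` in two."""
    @wraps(fn)
    def wrapper(args):
        sites = args["sites"]
        mid = max(1, len(sites) // 2)
        results = []
        for part in (sites[:mid], sites[mid:]):
            if part:
                results.extend(fn({**args, "sites": part}))
        return results
    return wrapper


def inner_filter_chunker(fn):
    """Hypothetical stand-in for the filter chunker: one call per OR clause."""
    @wraps(fn)
    def wrapper(args):
        results = []
        for clause in args["filter"].split(" OR "):
            results.extend(fn({**args, "filter": clause}))
        return results
    return wrapper


@outer_list_chunker     # applied last, so it runs first: outer loop
@inner_filter_chunker   # applied first, so it runs inside: inner loop
def fetch_once(args):
    # Record which (site chunk, filter clause) pair this call received.
    return [(tuple(args["sites"]), args["filter"])]


calls = fetch_once({"sites": ["A", "B", "C", "D"], "filter": "x=1 OR x=2"})
for call in calls:
    print(call)
```

Each list chunk is visited once, and every filter clause is issued inside it before the next list chunk starts.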

Planner

_plan_chunks greedy-halves the largest chunk across all dimensions until the worst-case sub-request fits url_limit (measured as URL + body via _request_bytes, so POST routes are sized correctly). The Cartesian product of the per-dimension partitions becomes the sub-request set, capped at max_chunks=1000.
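A minimal sketch of greedy halving, under stated assumptions: `request_bytes` here is a hypothetical sizing function for a comma-joined GET URL on a made-up host, "largest" is measured by item count, and `RequestTooLarge` is a local stand-in. The PR's actual `_plan_chunks` may differ in all of these details.

```python
from itertools import product
from math import prod


class RequestTooLarge(ValueError):
    """Local stand-in for the exception named in the PR."""


def request_bytes(args):
    # Hypothetical sizing function: length of a comma-joined GET URL.
    query = "&".join(f"{k}={','.join(v)}" for k, v in args.items())
    return len("https://example.test/items?" + query)


def plan_chunks(params, url_limit, max_chunks=1000):
    # Start with one chunk per dimension holding the whole list.
    plan = {name: [list(values)] for name, values in params.items()}
    names = list(plan)

    def sub_requests():
        # Cartesian product of the per-dimension partitions.
        combos = product(*(plan[n] for n in names))
        return [dict(zip(names, combo)) for combo in combos]

    while max(request_bytes(r) for r in sub_requests()) > url_limit:
        # Locate the largest chunk (by item count) across all dimensions.
        name, idx = max(
            ((n, i) for n in names for i in range(len(plan[n]))),
            key=lambda ni: len(plan[ni[0]][ni[1]]),
        )
        chunk = plan[name][idx]
        if len(chunk) == 1:
            raise RequestTooLarge("singleton chunks still exceed url_limit")
        mid = len(chunk) // 2
        plan[name][idx:idx + 1] = [chunk[:mid], chunk[mid:]]
        total = prod(len(plan[n]) for n in names)
        if total > max_chunks:
            raise RequestTooLarge(f"plan needs {total} sub-requests")
    return sub_requests()


params = {
    "site": [f"USGS-{i:08d}" for i in range(40)],
    "pcode": ["00060", "00065"],
}
plan = plan_chunks(params, url_limit=200)
print(len(plan), max(request_bytes(r) for r in plan))
```

The sketch recomputes the full product on every iteration for clarity; a production planner would more plausibly probe only the worst-case combination.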

Quota guard

Mid-call, the wrapper reads x-ratelimit-remaining after each sub-request. If it drops below quota_safety_floor=50, the wrapper raises QuotaExhausted carrying:

  • partial_frame — rows collected so far
  • completed_chunks / total_chunks — resume offset
  • remaining — last observed quota

Callers can either salvage or resume after the rate-limit window resets, rather than crash into a 429.
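The guard and the caller-side salvage path might look roughly like the sketch below. It assumes a plain list in place of the partial frame, a fake `fetch` that returns `(rows, headers)`, and a locally defined `QuotaExhausted` whose attribute names follow the list above; `run_chunks` and `make_fake_fetch` are hypothetical helpers, not part of the PR.

```python
class QuotaExhausted(RuntimeError):
    """Local stand-in; attribute names follow the PR description."""

    def __init__(self, partial_frame, completed_chunks, total_chunks, remaining):
        super().__init__(
            f"quota floor reached after {completed_chunks}/{total_chunks} chunks"
        )
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining


def run_chunks(chunks, fetch, quota_safety_floor=50):
    rows, total = [], len(chunks)
    for done, chunk in enumerate(chunks, start=1):
        data, headers = fetch(chunk)
        rows.extend(data)
        raw = headers.get("x-ratelimit-remaining")
        # Skip the check when the header is missing, the floor is zero
        # (guard disabled), or this was the last chunk (nothing to protect).
        if raw is None or not quota_safety_floor or done == total:
            continue
        if int(raw) < quota_safety_floor:
            raise QuotaExhausted(rows, done, total, int(raw))
    return rows


def make_fake_fetch(start_quota):
    # Hypothetical server: every call consumes one unit of quota.
    state = {"quota": start_quota}

    def fetch(chunk):
        state["quota"] -= 1
        return [f"row-{chunk}"], {"x-ratelimit-remaining": str(state["quota"])}

    return fetch


chunks = ["a", "b", "c", "d", "e"]
outcome, pending = None, []
try:
    run_chunks(chunks, make_fake_fetch(52))
except QuotaExhausted as exc:
    outcome = exc
    # Salvage what was fetched; resume later from the recorded offset.
    pending = chunks[exc.completed_chunks:]
    print(len(exc.partial_frame), "rows salvaged;", len(pending), "chunks pending")
```

Slicing the chunk list at `completed_chunks` is one way to resume after the rate-limit window resets; whether the real wrapper exposes its chunk plan for this purpose is not stated in the PR.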

Failure modes

| Scenario | Result |
| --- | --- |
| URL fits without chunking | Decorator passes through; one probe, no overhead |
| Lists need splitting | Greedy halve → cartesian product of N sub-requests |
| Filter chunking also needed | Outer mv-loop × inner filter-loop |
| Filter has clauses bigger than the budget | RequestTooLarge with explicit message |
| Cartesian product > max_chunks | RequestTooLarge with the actual count |
| x-ratelimit-remaining drops below floor mid-call | QuotaExhausted with partial_frame, offsets |

Public API

dataretrieval.waterdata.chunking.multi_value_chunked   # decorator
dataretrieval.waterdata.chunking.RequestTooLarge       # exception
dataretrieval.waterdata.chunking.QuotaExhausted        # exception

Tests

30 new tests (offline, deterministic fake build_request) covering:

  • _filter_aware_probe_args worst-case-clause modelling
  • _plan_chunks greedy halving, RequestTooLarge floor, filter-chunker coordination, max_chunks cap, lazy-default reads
  • multi_value_chunked pass-through, cartesian-product shape, end-to-end with stacked filter chunker
  • QuotaExhausted header parsing, mid-call abort, last-chunk no-abort, zero-floor disable

End-to-end correctness was verified against the live API: per-site cell-for-cell identical output between an unchunked single call and a forced-fan-out chunked call (patched _WATERDATA_URL_BYTE_LIMIT=300 → 30 HTTP sub-requests for the same query).
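The forced fan-out check can be sketched offline. Everything below is illustrative: `filters` is a `SimpleNamespace` stand-in for the real module, the 8000-byte default is an assumed value for the "~8 KB" cap, and the toy `chunked` decorator packs values greedily rather than halving. What it shows is the lazy-default pattern: because the limit is read at call time, patching the module attribute changes the fan-out without re-decorating.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the real filters module (assumed default value).
filters = SimpleNamespace(_WATERDATA_URL_BYTE_LIMIT=8000)


def chunked(url_limit=None):
    def decorator(fn):
        def wrapper(values):
            # Resolve the default lazily so patches applied later take effect.
            limit = filters._WATERDATA_URL_BYTE_LIMIT if url_limit is None else url_limit
            out, group = [], []
            for value in values:
                if group and len(",".join(group + [value])) > limit:
                    out.extend(fn(group))
                    group = []
                group.append(value)
            if group:
                out.extend(fn(group))
            return out
        return wrapper
    return decorator


calls = []


@chunked()
def fetch(values):
    calls.append(list(values))
    return [v.lower() for v in values]


sites = [f"USGS-{i:04d}" for i in range(50)]

baseline = fetch(sites)           # fits the default limit: single call
unchunked_calls = len(calls)

calls.clear()
with mock.patch.object(filters, "_WATERDATA_URL_BYTE_LIMIT", 100):
    fanned_out = fetch(sites)     # same query, forced to split
chunked_calls = len(calls)

print(unchunked_calls, chunked_calls, baseline == fanned_out)
```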

Known limitations (deferred)

  • max_chunks caps the chunk count, not the HTTP request count. A call with 1000 chunks where each chunk paginates 5× issues 5000 HTTP requests. Lower max_chunks proportionally for heavy-pagination workloads, or rely on the QuotaExhausted runtime guard.
  • No auto-retry on QuotaExhausted. Caller decides whether to wait + resume or fail loudly.
  • Sub-requests run sequentially. Parallelism is a follow-up (draft PR #278, perf(waterdata): Optional parallel chunk processing in multi_value_chunked).
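The arithmetic behind the first limitation can be written out as a small helper. Both function names are hypothetical, and the pages-per-chunk figure is something a caller would have to estimate for their own workload; the sketch also assumes every chunk paginates the same number of times.

```python
def http_request_estimate(chunks, pages_per_chunk):
    # Each chunk issues one HTTP request per page it paginates through.
    return chunks * pages_per_chunk


def max_chunks_for_budget(request_budget, pages_per_chunk):
    # Largest chunk count whose paginated fan-out stays within the budget.
    return max(1, request_budget // pages_per_chunk)


print(http_request_estimate(1000, 5))
print(max_chunks_for_budget(1000, 5))
```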

Provenance

Supersedes #276 with squashed history. The original PR carried 21 commits across iterative simplify / review-response passes; this one rolls them into a single coherent change atop the now-merged #233 (multi-value GET routing) and #279 (mid-pagination failure surfacing).

Refs #276, #279

Copilot AI left a comment

Pull request overview

Adds multi-value GET-parameter chunking to Water Data OGC getters to keep requests under the server’s URL byte limit (preventing HTTP 414), while coordinating with existing CQL filter chunking and adding a quota safety abort mechanism.

Changes:

  • Introduces dataretrieval.waterdata.chunking with a greedy chunk planner, RequestTooLarge / QuotaExhausted, and the @multi_value_chunked decorator.
  • Wraps utils._fetch_once with @multi_value_chunked outside @filters.chunked, and updates pagination/response metadata aggregation to reflect last headers + cumulative elapsed.
  • Adds extensive unit tests for planning/coordination/quota behavior and documents the behavior change in NEWS + get_daily docs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| dataretrieval/waterdata/chunking.py | New multi-value chunking decorator, planner, and quota guard exceptions. |
| dataretrieval/waterdata/utils.py | Wires the new chunker into _fetch_once; aggregates paginated response headers/elapsed for accurate metadata and quota inspection. |
| dataretrieval/waterdata/filters.py | Shares encoding-ratio helper; aggregates chunked responses using first URL + last headers + summed elapsed. |
| tests/waterdata_test.py | Adds planner/decorator/quota guard tests using a deterministic fake request builder. |
| dataretrieval/waterdata/api.py | Updates get_daily docstring with a chained-query example relying on transparent chunking. |
| NEWS.md | Documents the new chunking behavior and metadata semantics changes. |

Comment thread dataretrieval/waterdata/chunking.py
@thodson-usgs force-pushed the chunker-multivalue branch 2 times, most recently from 9d46c40 to 1c450dd (May 17, 2026 15:47)
@thodson-usgs requested a review from Copilot (May 17, 2026 15:48)
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment thread dataretrieval/waterdata/filters.py Outdated
Comment thread NEWS.md
Comment thread dataretrieval/waterdata/chunking.py
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

For multi-value waterdata queries (e.g. monitoring_location_id with
~300+ sites), the GET URL produced by PR DOI-USGS#233 blows past the server's
~8 KB nginx buffer and the API returns HTTP 414. This PR adds a
chunker that transparently splits long list params across sub-requests
so each URL fits the byte budget.

The chunker is a decorator applied to ``_fetch_once`` outside the
existing ``@filters.chunked`` (CQL chunker), so list-chunking is the
outer loop and filter-chunking is the inner loop:

  @chunking.multi_value_chunked(build_request=_construct_api_requests)
  @filters.chunked(build_request=_construct_api_requests)
  def _fetch_once(args): ...

Key design points:

- ``_plan_chunks`` greedy-halves the largest chunk across all
  dimensions until the worst-case sub-request fits ``url_limit``
  (URL + body, via ``_request_bytes``, so POST routes are sized
  correctly). Cartesian product of per-dim partitions becomes the
  sub-request set; capped at ``max_chunks=1000``.

- ``_filter_aware_probe_args`` coordinates with ``filters.chunked``:
  the planner probes URL length using a synthetic clause that matches
  the inner filter chunker's bail-floor size (longest single clause,
  scaled by worst-case URL encoding ratio). Without this coordination,
  the outer planner would raise ``RequestTooLarge`` on combinations
  the stacked chunkers can actually handle.

- ``QuotaExhausted`` mid-call guard reads ``x-ratelimit-remaining``
  after each sub-request; if it drops below ``quota_safety_floor=50``,
  the wrapper raises with the partial frame, completed-chunk offset,
  and last observed remaining quota — letting callers salvage or
  resume after the rate-limit window resets, rather than crash into a
  silent mid-pagination 429.

- ``RequestTooLarge`` is raised when the smallest reducible plan
  still exceeds ``url_limit`` (every multi-value param at a singleton
  chunk and any chunkable filter at the inner chunker's bail floor)
  or when the cartesian product exceeds ``max_chunks``.

- All defaults (``url_limit``, ``max_chunks``, ``quota_safety_floor``)
  resolve at call time, so monkey-patching
  ``filters._WATERDATA_URL_BYTE_LIMIT`` for tests / non-default quotas
  affects the decorator uniformly.

Public additions:

- ``dataretrieval.waterdata.chunking.multi_value_chunked``
- ``dataretrieval.waterdata.chunking.RequestTooLarge``
- ``dataretrieval.waterdata.chunking.QuotaExhausted`` (carries
  ``partial_frame``, ``partial_response``, ``completed_chunks``,
  ``total_chunks``, ``remaining``)

Tests (30 new):

- ``_filter_aware_probe_args`` worst-case-clause modelling
- ``_plan_chunks`` greedy halving, RequestTooLarge floor, filter-
  chunker coordination, ``max_chunks`` cap, lazy-default reads
- ``multi_value_chunked`` pass-through, cartesian-product shape,
  end-to-end with stacked filter chunker
- ``QuotaExhausted`` header parsing, mid-call abort, last-chunk no-
  abort, zero-floor disable
- ``RequestTooLarge`` message contents and triggering conditions

End-to-end correctness verified against the live API: identical
per-site cell-for-cell output between unchunked (single call) and
chunked (forced fan-out via patched limit) paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>