Add multi-value GET-parameter chunker for waterdata OGC API #276

thodson-usgs wants to merge 21 commits into
Conversation
The OGC API now supports comma-separated values for fields like monitoring_location_id, parameter_code, and statistic_id, making POST+CQL2 unnecessary for most services. Update _construct_api_requests to join list params with commas and use GET for daily, continuous, latest-daily, latest-continuous, field-measurements, time-series-metadata, and channel-measurements. The monitoring-locations endpoint does not yet support comma-separated GET parameters (returns 400); it retains the POST+CQL2 path. Closes DOI-USGS#210. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
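A minimal sketch of the comma-join behavior described above (the helper name `join_multi_value` is hypothetical; the real change lives in `_construct_api_requests`):

```python
from urllib.parse import quote_plus

def join_multi_value(params):
    """Join list-valued params into comma-separated strings (hypothetical helper)."""
    out = {}
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            out[key] = ",".join(str(v) for v in value)  # str() handles ints too
        else:
            out[key] = value
    return out

query = join_multi_value({"parameter_code": ["00060", "00065"],
                          "water_year": [2020, 2021]})
# quote_plus later encodes each comma as %2C in the final URL
encoded = "&".join(f"{k}={quote_plus(str(v))}" for k, v in query.items())
```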
…nit tests
Style alignment with the rest of `waterdata/utils.py`:
- Hoist `_cql2_required_services` from a function-local lowercase `set`
to a module-level `_CQL2_REQUIRED_SERVICES = frozenset(...)` to match
the convention of `_DATE_RANGE_PARAMS`, `_NO_NORMALIZE_PARAMS`,
`_MONITORING_LOCATION_ID_RE`, etc.
- Drop the "Legacy path:" prefix in the inline comment. POST/CQL2 is
still the current and required path for monitoring-locations — the
API team hasn't promised to add comma-GET there. Rephrased the two
branches symmetrically ("POST with CQL2 JSON" / "GET with comma-
separated values") so neither reads as deprecated.
New unit tests:
- `test_construct_api_requests_single_value_stays_get` — confirms a
scalar `monitoring_location_id="USGS-..."` still produces a clean GET
with no `%2C`, i.e. existing single-site callers see no change.
- `test_construct_api_requests_numeric_list_joins_with_str` — pins down
that `water_year=[2020, 2021]` reaches the URL as `water_year=2020%2C2021`,
exercising the `str(x) for x in v` generator that exists specifically
to handle non-string list params (without it, `",".join` on a list of
ints would TypeError).
- `test_construct_api_requests_two_element_date_list_becomes_interval` —
pins down the contract that a two-element date list (`time=["2024-01-01",
"2024-01-31"]`) is interpreted as start/end of an OGC datetime interval
(joined with `/`), NOT as two discrete dates. The OGC `datetime`
parameter doesn't support "these N specific dates" — that would
require a CQL filter. Test exists so this semantic choice can't be
silently changed.
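The two-element interval semantics that last test pins down can be sketched as follows (function name invented; the shipped logic lives in the request-construction path):

```python
def normalize_time_param(value):
    # Two-element date list -> OGC datetime interval "start/end",
    # NOT two discrete dates (those would need a CQL filter).
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return "/".join(str(v) for v in value)
    return str(value)

interval = normalize_time_param(["2024-01-01", "2024-01-31"])
```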
Wraps `_fetch_once` with a cartesian-product chunker that sits OUTSIDE `@filters.chunked`. Splits multi-value list params (`monitoring_location_id`, `parameter_code`, `statistic_id`, etc.) across sub-requests so each URL fits the server's ~8 KB byte limit.

Coordination with `@filters.chunked`: the planner's URL probe substitutes the filter with its longest top-level OR-clause via `_filter_aware_probe_args`, modeling the per-sub-request URL the inner filter chunker will actually emit. Without this coordination, a long OR-filter plus multi-value lists triggered premature `RequestTooLarge` even when the combined chunkers would have made things fit.

Two safety guards:
- `max_chunks=1000` cap on cartesian-product size (matches USGS API hourly quota; raises `RequestTooLarge` with the actual count when exceeded).
- `QuotaExhausted` abort: between sub-requests, reads `x-ratelimit-remaining`; if below `quota_safety_floor` (default 50), raises with the partial frame and chunk offset so callers can resume instead of crashing into a mid-call HTTP 429.

30 unit tests cover the planner, filter-aware coordination, the cap, and the quota-aware abort. Live tests in /tmp verify a 3-dim equivalence case (chunked == unchunked, 16 sub-requests, all axes split), 6 edge-case stress scenarios, and 3 mv/filter composition regimes.

Depends on DOI-USGS#273 (paginated silent-truncation fix) — this PR multiplies the frequency at which the silent-truncation bug class would have surfaced. Merge order: DOI-USGS#273 -> DOI-USGS#233 -> this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
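The greedy cartesian-product planning described above can be sketched like this (names and the cost model are simplified assumptions, not the shipped implementation):

```python
import math
from itertools import product

def plan_chunks(lists, url_limit, max_chunks=1000):
    """Greedy-halving sketch: repeatedly split the dimension whose largest
    chunk contributes the most bytes, until the worst case fits url_limit."""
    def chunk_bytes(chunk):
        return len(",".join(map(str, chunk)))

    parts = {k: [list(v)] for k, v in lists.items()}  # one chunk per dim to start

    def worst(k):
        return max(parts[k], key=chunk_bytes)

    while sum(chunk_bytes(worst(k)) for k in parts) > url_limit:
        k = max(parts, key=lambda d: chunk_bytes(worst(d)))
        big = worst(k)
        if len(big) == 1:
            raise ValueError("RequestTooLarge: irreducible singleton")
        i = parts[k].index(big)
        mid = len(big) // 2
        parts[k][i:i + 1] = [big[:mid], big[mid:]]  # halve the largest chunk
        if math.prod(len(p) for p in parts.values()) > max_chunks:
            raise ValueError("RequestTooLarge: cartesian product exceeds cap")

    keys = list(parts)
    return [dict(zip(keys, combo)) for combo in product(*parts.values())]

plans = plan_chunks({"site": ["USGS-01", "USGS-02", "USGS-03", "USGS-04"]},
                    url_limit=16)
```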
The filter chunker (`filters.chunked`) splits a filter into chunks each ≤ the per-sub-request budget, but bails (returns the full filter unchanged) when ANY single OR-clause exceeds the budget. So the smallest filter size the inner chunker can guarantee to emit per sub-request is bounded below by the LARGEST single clause, not the smallest.

The original implementation used `min(parts)` to model the chunker's output floor. For filters with uniform clause sizes (all my prior tests), min == max and the bug was hidden. For filters with lopsided clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would let the planner falsely declare a plan feasible. The inner chunker would then bail on the large clause, the real per-sub-request URL would carry the full filter, and the request would 414 server-side.

Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL limit, the inner chunker's budget is ≥ max(parts), so all clauses fit individually and chunking succeeds. If singleton+max-clause doesn't fit, the planner correctly raises `RequestTooLarge` instead of producing an unservable plan.

Regression test: `test_plan_chunks_probes_with_max_clause_not_min`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
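The min-vs-max distinction reduces to a one-liner; this toy example (clause values invented) shows why `min` underestimates the inner chunker's floor:

```python
parts = ["id='1'", "id='abcdefghijklmnopqrstuvwxyz-long'"]

# The filter chunker can never emit a sub-filter smaller than its largest
# single OR-clause, so the probe must model max(parts), not min(parts).
floor_clause = max(parts, key=len)
wrong_clause = min(parts, key=len)  # the old, falsely-optimistic probe
```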
The motivating user story for this PR is the same one R `dataRetrieval` covers in #870: pull a long monitoring-location list from one getter, feed it to another. Before chunking this fails with HTTP 414 once the URL grows past the server's ~8 KB limit; after it transparently fans out. - chunking.py: prepend a docstring example showing the Ohio-stream-sites → daily-discharge chained call, so readers landing on the module file see the motivating scenario immediately. - api.py get_daily: add the same chained example to the Examples block (where similar single-site and multi-site examples already live), so the most-used getter's docstring shows what just became possible. - NEWS.md: user-visible entry framing the change in terms of "this now works" — chained queries, transparent chunking, max_chunks cap, and QuotaExhausted resume. References R PR #870 as the analogous change. No code changes; pure docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….prod
Three small simplifications, no behavior change:
- Extract _chunk_bytes(chunk) helper for len(",".join(map(str, chunk))).
Used in both _worst_case_args and _plan_chunks; the helper documents
the cost model the planner compares chunks under.
- Name the magic sentinel 10**9 as _QUOTA_UNKNOWN. _read_remaining
returns it on missing/malformed x-ratelimit-remaining headers; having
one definition prevents the value from drifting between branches.
- Use math.prod for the cartesian-product cardinality calculation in
_plan_chunks (max_chunks check) and the wrapper (quota-floor loop
bound). Replaces an open-coded multiply-loop in two places.
All 25 chunker tests and 88 filter tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	NEWS.md
#	dataretrieval/waterdata/utils.py
#	tests/waterdata_test.py
Pull request overview
Adds a new multi-value GET-parameter chunking layer for Water Data OGC requests to prevent HTTP 414 failures when comma-joined list parameters exceed the server’s URL byte limit, and wires it into the existing waterdata OGC fetch pipeline (composing with the existing CQL filter chunker).
Changes:
- Introduces `dataretrieval.waterdata.chunking` with a greedy planner, `multi_value_chunked` decorator, and `RequestTooLarge`/`QuotaExhausted` exceptions.
- Updates OGC request construction and wraps `_fetch_once` with `@chunking.multi_value_chunked` outside `@filters.chunked`.
- Adds extensive unit tests for request routing and chunker behavior; updates NEWS and `get_daily` docs to describe the new behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dataretrieval/waterdata/chunking.py | New multi-value list chunking planner/decorator + quota guard and new exceptions. |
| dataretrieval/waterdata/utils.py | Routes multi-value params via comma-joined GET (except POST+CQL2 services) and composes the new chunker with existing filter chunking on `_fetch_once`. |
| tests/waterdata_test.py | Adds unit tests covering GET/POST routing, planning/coordination logic, caps, and quota-abort behavior. |
| dataretrieval/waterdata/api.py | Updates `get_daily` docstring examples to include the new transparent chunking behavior. |
| NEWS.md | Changelog entry documenting the new URL-size chunking and quota-abort behavior. |
PR 233 routes monitoring-locations through POST with the multi-value
list embedded in a CQL2 JSON body — the URL stays ~200 bytes
regardless of how many sites are passed. The chunker was probing
url length only, so it concluded "no need to split" for any number
of monitoring_location_id values and let the request go out
unchunked. The server then rejected it with HTTP 403 ("Query request
denied. Possible reasons include query exceeding server limits")
once the CQL2 body grew past its own server-side limit.
Empirical: get_monitoring_locations(monitoring_location_id=[671
CAMELS gauges], properties=[...]) failed; bisection on 100 / 250 /
500 / 671 sites showed the boundary between 100 (PASS) and 250
(FAIL).
Add a small _request_bytes() helper that sums URL and body lengths,
and route both planner probes (the initial "fits?" check and the
greedy-halving loop) through it. For GET routes (body is None) this
reduces to the previous URL-only sizing — no behavior change. For
POST routes, the body bytes now drive the chunking decision.
The test _FakeReq fixture grows a body slot defaulting to None to
keep its GET-shape contract while satisfying the new probe.
Verified against live API: the same four monitoring-locations calls
now succeed (100 / 250 / 500 / 671 sites).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Annotate ``req`` as ``requests.PreparedRequest`` (the only caller flow
is ``build_request(...).prepare()``; ``requests`` is already imported).
- ``_cql2_param`` returns ``str``, which ``requests.Request(data=...)``
carries through to ``req.body``. The hot path on POST routes was
``str(body).encode("utf-8")``; ``str(<str>)`` is a no-op, so drop it
and let ``body.encode("utf-8")`` allocate once.
- Trim docstring: replaces the rotting "PR 233" / "currently only
monitoring-locations" anchors with a behavioral description that
doesn't rely on which routes happen to be POST today.
No behavior change. 30 chunker unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… current quota
The ``multi_value_chunked`` decorator reads ``x-ratelimit-remaining``
from the response returned by ``fetch_once(sub_args)`` to honor its
documented ``QuotaExhausted`` safety floor. That response was two
layers stale:
1. ``_walk_pages`` captured ``initial_response = resp`` before
pagination and returned it, so any sub-request with N > 1 pages
bubbled up only the first page's headers — the loop already kept
overwriting ``resp`` each iteration; we just weren't returning the
latest.
2. ``_combine_chunk_responses`` returned ``responses[0]`` with summed
``elapsed``, so when ``filters.chunked`` fanned out a long
OR-filter into N sub-chunks the outer wrapper only saw the first
sub-chunk's headers.
Composed, the staleness gap per outer chunk was
``inner_chunks × pages_per_inner_chunk − 1`` HTTP requests of quota
consumption the chunker was blind to. For the canonical workload
(chained query, long site list, paginated filter) that gap easily
exceeds the default floor of 50, so the guard never tripped — users
hit ``RuntimeError("429: Too many requests...")`` from
``_raise_for_non_200`` instead of the structured ``QuotaExhausted``
with ``partial_frame``/``completed_chunks`` they were promised.
Fix both layers: ``_walk_pages`` returns the latest ``resp`` (which
the loop was already maintaining), and ``_combine_chunk_responses``
returns ``responses[-1]`` (with ``elapsed`` summed onto it instead
of onto ``responses[0]``). Both changes match
``QuotaExhausted.partial_response``'s docstring ("metadata for the
last successful sub-request"). Same fix applied to the parallel
pagination loop in the stats helper for consistency.
No behavior change for single-page mocked tests (initial == latest).
209 waterdata unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
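The `_walk_pages` half of the fix can be sketched self-containedly with fake responses (class and helper names invented):

```python
class FakeResp:
    def __init__(self, data, next_url, remaining):
        self.data = data
        self.next_url = next_url
        self.headers = {"x-ratelimit-remaining": str(remaining)}

def walk_pages(fetch, url):
    resp = fetch(url)
    pages = [resp.data]
    while resp.next_url:
        resp = fetch(resp.next_url)  # the loop already maintains the latest resp
        pages.append(resp.data)
    return pages, resp  # fixed: return the LATEST response, not the first

store = {
    "page1": FakeResp([1, 2], "page2", remaining=100),
    "page2": FakeResp([3], None, remaining=99),
}
pages, last = walk_pages(store.get, "page1")
# last.headers now reflects the final page's quota state
```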
``_filter_aware_probe_args`` substituted the LONGEST raw OR-clause into the URL probe, but the inner ``filters._effective_filter_budget`` computes its bail floor as ``len(longest) * max(per_clause_encoding_ratio)`` — the worst per-call ratio across all clauses, not the ratio of the longest one. Under lopsided encoding (e.g. a long alphanumeric clause alongside short clauses heavy in ``%27`` / ``%2C`` / non-ASCII), ``encoding_ratio_max`` exceeds ``ratio_of_longest`` and the planner could approve a plan the inner chunker then refuses to emit, leaving the actual URL over the limit.

Mirror the inner chunker's model: synthesize an ASCII probe clause of length ``ceil(len(longest) * encoding_ratio_max)``. ASCII has 1:1 URL encoding, so the URL builder sees exactly the bail-floor byte count and the planner's check coincides with the inner chunker's bail condition.

Dormant in practice for typical USGS CQL filters (``field='value'`` encoding ratios all cluster between 1.16 and 1.67), but the docstring claimed a categorical guarantee that was technically false. This restores that guarantee.

Test ``test_filter_aware_probe_args_substitutes_longest_or_clause`` was renamed and rewritten to verify the new contract: the probe filter is a synthetic ASCII string whose length matches the bail floor model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
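The bail-floor model described above can be sketched as follows (helper name hypothetical; the real code lives in ``_filter_aware_probe_args`` and ``filters._effective_filter_budget``):

```python
import math
from urllib.parse import quote_plus

def synthetic_probe_clause(parts):
    """ASCII probe whose encoded length equals len(longest) * max ratio."""
    ratios = [len(quote_plus(p)) / len(p) for p in parts]
    longest = max(parts, key=len)
    floor = math.ceil(len(longest) * max(ratios))
    # Plain ASCII letters URL-encode 1:1, so the URL builder sees exactly
    # `floor` bytes -- the inner chunker's bail-floor byte count.
    return "x" * floor

# "id='1'" encodes 6 -> 12 bytes (ratio 2.0); the longer clause's own
# ratio is smaller, so the max ratio comes from the SHORT clause.
parts = ["id='1'", "name='abcdefghijklmnop'"]
probe = synthetic_probe_clause(parts)
```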
```python
def _request_bytes(req: requests.PreparedRequest) -> int:
    """Total bytes of a prepared request: URL + body.

    GET routes have ``body=None`` and reduce to URL length. POST routes
    (CQL2 JSON body) need body bytes — the URL stays short regardless of
    payload, so URL-only sizing would underestimate the request and skip
    chunking when it's needed.
    """
    url_len = len(req.url)
    body = req.body
    if body is None:
        return url_len
    if isinstance(body, (bytes, bytearray)):
        return url_len + len(body)
    return url_len + len(body.encode("utf-8"))
```
Acknowledged — leaving as a follow-up rather than fixing here. The reasoning:
The current len(url) + len(body) model is conservative-safe (no false passes — every plan it approves actually fits the server's combined limit) but imprecise on the POST path: it will sometimes chunk a monitoring-locations request that would have fit unchunked at the body's actual ceiling. The opposite failure (under-chunking and 414/403) is what the find-bugs review surfaced and the bug PR #276 closed, so we landed on the side of over-chunking.
A proper fix needs two pieces I don't have data for:
- A defensible POST body byte limit. The 8KB nginx URL ceiling is well-documented; the CQL2 POST body limit isn't. The integration session showed a bisection boundary between 100 sites (PASS) and 250 sites (FAIL) for monitoring-locations, which suggests an IN-list cardinality cap or a body byte limit somewhere in that range — but I haven't pinned the exact threshold or its source.
- A separate `_WATERDATA_BODY_BYTE_LIMIT` constant and either a `_request_bytes` that returns `(url_len, body_len)` with the planner comparing each to its own ceiling, or a method-aware probe.

Both are scope-expansive enough that they belong in a follow-up PR with its own measurement of the upstream POST limits, rather than guessing here. The conservative behavior is documented in `_request_bytes`'s docstring; users who hit it on POST routes will see it's well-formed chunking, not a 414.
Partial follow-up: 585b2a6 just landed and now documents the URL+body sizing explicitly in both _plan_chunks and multi_value_chunked docstrings, so the conservative-over-chunking on POSTs is no longer hidden from users (was the silent-footgun part of this concern). The underlying optimization — separate _WATERDATA_BODY_BYTE_LIMIT constant after measuring actual upstream POST limits — remains pending as originally noted. Leaving this thread open as the tracker for that follow-up.
… fixes

Five fixes from the PR review:

- ``_plan_chunks`` checks ``total > max_chunks`` inside the halving loop now: each split only grows the cartesian product, so once the cap is crossed it can never come back under. Continuing to halve the URL just wastes work.
- ``_plan_chunks``'s ``max_chunks`` default becomes ``int | None = None`` and resolves to ``_DEFAULT_MAX_CHUNKS`` at call time. The previous ``max_chunks: int = _DEFAULT_MAX_CHUNKS`` bound the constant at module-import time, defeating the documented monkeypatch path for direct callers (the wrapper already resolved lazily, but ``_plan_chunks`` direct calls saw the import-time value).
- ``_chunk_bytes`` docstring no longer claims the URL-encoded comma overhead is "constant per chunk" — it scales with ``2 * (len - 1)``. The function still uses raw ``,`` length because the planner only needs a monotone comparator across dims, but the wording was wrong.
- ``QuotaExhausted.partial_response`` docstring now says "last completed sub-request" to match the bug_001 fix in ``_combine_chunk_responses``.
- Module-level docstring drops the chained-query example (duplicated from ``get_daily``'s docstring) and points readers there.

No behavior change for existing callers. 209 waterdata tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ``_chunk_bytes`` now returns URL-encoded length via ``quote_plus`` instead of raw ``,``-join length. The function is the planner's biggest-chunk comparator and indirect URL contribution estimate; values containing URL-special chars (``%``, ``+``, ``/``, ``&``, etc.) expand under encoding and could mis-rank chunks under the raw-length form. For typical USGS multi-value workloads (alphanumeric IDs and codes) the two are equal, but the encoded form is always correct and matches what ``_request_bytes`` sees.
- ``filters.chunked``'s docstring now says "last chunk's URL/headers" to match what ``_combine_chunk_responses`` returns after the bug_001 fix, with a note about why (rate-limit state).
- Module docstring rewrapped so identifiers (``filters._effective_filter_budget``, ``per-clause encoding ratio``) don't break across line endings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ive elapsed

The bug_001 fix made paginated and filter-chunked calls return the LAST response, which gave the chunker's QuotaExhausted guard access to current x-ratelimit-remaining — but it also clobbered BaseMetadata.url with a pagination cursor or final sub-chunk URL, harming reproducibility for users who inspect md.url to re-issue or debug a query.

Split the fields:
- ``md.url`` — first (original-request) response's URL, unchanged from pre-bug_001 behavior; matches the user's submitted query.
- ``md.header`` — last completed sub-request/page's headers, so ``x-ratelimit-remaining`` reflects current quota state and the chunker's QuotaExhausted guard works correctly. This is a behavior change from pre-bug_001 (used to be first-page headers).
- ``md.query_time`` (response.elapsed) — cumulative across all pages/sub-requests, not just one. Also a small behavior change.

Implemented in three sites:
- ``_walk_pages`` (utils.py): track ``initial_response`` and ``total_elapsed``; on exit, copy ``resp.headers`` and ``total_elapsed`` onto ``initial_response`` and return it.
- The parallel pagination loop in the stats helper: same pattern.
- ``_combine_chunk_responses`` (filters.py): return ``responses[0]`` but with ``headers`` from ``responses[-1]`` and summed ``elapsed``.

NEWS entry updated to call out the metadata-behavior change. Docstrings on ``QuotaExhausted.partial_response`` and ``filters.chunked`` refreshed to match. All 209 waterdata tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
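The field split can be sketched with plain objects (`finalize_paginated_response` here mirrors, but is not, the shipped helper):

```python
from datetime import timedelta
from types import SimpleNamespace

def finalize_paginated_response(initial, last, total_elapsed):
    # md.url stays the original request's URL; headers come from the
    # last completed page; elapsed is cumulative across pages.
    initial.headers = last.headers
    initial.elapsed = total_elapsed
    return initial

first = SimpleNamespace(url="https://api.example/q?page=1",
                        headers={"x-ratelimit-remaining": "100"},
                        elapsed=timedelta(seconds=1))
last = SimpleNamespace(url="https://api.example/q?cursor=abc",
                       headers={"x-ratelimit-remaining": "97"},
                       elapsed=timedelta(seconds=2))
md = finalize_paginated_response(first, last, timedelta(seconds=3))
```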
`filters._WATERDATA_URL_BYTE_LIMIT` was split across a line break (`filters._WATERDATA_` / `URL_BYTE_LIMIT`), which renders awkwardly in Sphinx and copies poorly. Rewrap the line so the identifier sits intact on a single line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
dataretrieval/waterdata/chunking.py:268
- In `_plan_chunks`' docstring, "filter … reduced to its smallest OR-clause" doesn't match `_filter_aware_probe_args`, which probes using the longest OR clause scaled by the max per-clause encoding ratio (i.e., the inner filter chunker's bail floor). Adjust the wording so callers understand the planner is checking worst-case per-sub-request filter size, not the shortest clause.
```python
    no chunkable lists). Raises ``RequestTooLarge`` when:

    - every multi-value param is already a singleton chunk AND the
      filter (if any) is already at its smallest OR-clause and the URL
      still exceeds ``url_limit`` (irreducible), or
    - the cartesian-product plan exceeds ``max_chunks`` sub-requests
      (the hourly API budget); checked after each split so we bail
      promptly once the cap is unreachable.
```
Five targeted cleanups from review, no behavior change:

- Drop the duplicate ``_FetchOnce`` TypeVar in chunking.py; import the one already defined in filters.py. The two had identical bodies.
- Extract ``_max_per_clause_encoding_ratio(parts)`` in filters.py. Both ``_effective_filter_budget`` and the outer ``_filter_aware_probe_args`` need the same worst-case ratio formula; pinning it in one place keeps them from drifting.
- Replace the manual ``best: tuple | None`` sentinel + nested-loop scan in ``_plan_chunks`` with a generator + ``max(..., key=..., default=None)``. Removes the sentinel, the conditional-update branch, and the post-loop ``if best is None`` check.
- Extract ``_finalize_paginated_response`` in utils.py so the 4-line "carry last page's headers + cumulative elapsed onto the initial response" pattern lives in one spot instead of duplicated across ``_walk_pages`` and the stats helper.
- Tighten parametrized type hints from ``dict[str, list]`` to ``dict[str, list[Any]]`` (and the planner's return type) per PEP 585.

Also trimmed the 17-line ``_filter_aware_probe_args`` docstring to 9 lines; the substance is preserved, the prose is leaner.

All 209 waterdata tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two strays left after the previous /simplify pass:

- ``_chunk_bytes(chunk: list)`` → ``chunk: list[Any]``. The function calls ``map(str, chunk)``, so elements just need ``__str__``; ``list[Any]`` matches what the planner actually passes (mixed ``list[str]`` for IDs/codes and ``list[int]`` for ``water_year``).
- ``_finalize_paginated_response(..., total_elapsed)`` had no annotation. The caller accumulates ``resp.elapsed`` (a ``datetime.timedelta``); add ``total_elapsed: timedelta`` and import ``timedelta`` alongside ``datetime`` in utils.py.

No behavior change; 209 tests pass and ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… build
Three corrections from PR review:
- ``RequestTooLarge``'s docstring said the second irreducible case is
"any chunkable filter reduced to its smallest top-level OR-clause."
The planner actually probes at the inner chunker's bail-floor size,
which is bounded below by the LONGEST clause (after URL-encoding),
not the shortest. Rewrite the case to describe what the planner
actually does.
- ``test_plan_chunks_coordinates_with_filter_chunker``'s docstring
said the planner models per-sub-request URL as
``worst-dim-chunk + shortest-clause``. Same direction error;
corrected to ``longest-clause-after-encoding`` with the rationale
(inner chunker's bail floor, not its happy-path output).
- ``_fake_build`` test fixture used raw ``len(",".join(...))`` for
list params, but the real ``_construct_api_requests`` URL goes
through ``quote_plus``. For the all-alphanumeric values these
tests use, the gap is 2 bytes per comma — small but enough to let
a test pass against the fake while production would have a
larger URL. Pull ``quote_plus`` into the fake so its byte count
matches what the chunker's ``_request_bytes`` actually measures.
No behavior change to the production chunker; 209 waterdata tests
pass with no other tunings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five docstring fixes, no behavior change:
1. ``_plan_chunks`` docstring described the first ``RequestTooLarge``
case as "filter (if any) is already at its smallest OR-clause" —
same wrong-direction error already fixed on ``RequestTooLarge``.
The planner probes at the inner chunker's BAIL floor (longest
clause's URL contribution), not the smallest. Rewrite to "the
smallest reducible plan" with the bail-floor clarification.
2. The ``RequestTooLarge`` raised inside the halving loop carried
a matching wrong-direction phrase ("any chunkable filter reduced
to one OR-clause"). Rewrite to "any chunkable filter at the
inner chunker's bail-floor size" and broaden the user advice to
include "shorten the filter".
3. ``_chunk_bytes`` docstring claimed it's "indirectly used as the
URL contribution estimate" — that's not what the code does; the
function is a comparator only. Trim the misleading sentence and
keep the rationale for ``quote_plus`` over raw join length.
4. ``_worst_case_args`` docstring's "with the filter already reduced
to its filter-chunker floor" was oblique. Rewrite to make the
chain explicit: caller passes ``probe_args`` (already through
``_filter_aware_probe_args``), and this function uses each dim's
largest chunk against that.
5. The ``_DEFAULT_MAX_CHUNKS`` module comment said it's "read lazily
in the wrapper" — stale; ``_plan_chunks`` now also resolves
lazily. Update to "both the decorator wrapper and ``_plan_chunks``".
209 waterdata tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs
left a comment
Medium: dataretrieval/waterdata/utils.py now finalizes paginated responses with whatever object is currently in resp after the pagination loop. In the mid-pagination non-200 path, client.request(...) assigns the failed response to resp, _raise_for_non_200(resp) raises, the broad except swallows it as best-effort, and _finalize_paginated_response(initial_response, resp, total_elapsed) then copies the failed page headers onto the successful initial response.
Scenario: page 1 succeeds with data and x-ratelimit-remaining: 100, page 2 returns 503 or 429 with x-ratelimit-remaining: 0. The returned DataFrame contains only page 1, but BaseMetadata.header reports the failed page headers. When this runs under multi_value_chunked, the quota guard can also treat a failed/incomplete page as the “last completed” response and raise QuotaExhausted with a partial frame that already includes a truncated sub-request.
Suggestion: track a separate last_completed_response and update it only after the page passes _raise_for_non_200(...) and its body has been appended. Finalize with that object instead of the loop variable resp. The same pattern applies to get_stats_data.
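The suggested pattern, sketched with fakes (all names invented; the real sites are `_walk_pages` and `get_stats_data`):

```python
class FakeResp:
    def __init__(self, status, data, next_url, remaining):
        self.status = status
        self.data = data
        self.next_url = next_url
        self.headers = {"x-ratelimit-remaining": str(remaining)}

def raise_for_non_200(resp):
    if resp.status != 200:
        raise RuntimeError(f"{resp.status}: request failed")

def walk_pages_safe(fetch, url):
    resp = fetch(url)
    raise_for_non_200(resp)
    pages = [resp.data]
    last_completed = resp  # only ever points at a page that fully succeeded
    while last_completed.next_url:
        resp = fetch(last_completed.next_url)
        try:
            raise_for_non_200(resp)
        except RuntimeError:
            break  # best-effort: keep page-1 data, do NOT adopt failed headers
        pages.append(resp.data)
        last_completed = resp
    return pages, last_completed

store = {
    "p1": FakeResp(200, [1, 2], "p2", remaining=100),
    "p2": FakeResp(429, None, None, remaining=0),
}
pages, last = walk_pages_safe(store.get, "p1")
```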
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
dataretrieval/waterdata/chunking.py:347
- The `multi_value_chunked` docstring describes `url_limit` as a URL byte limit, but planning uses `_request_bytes` (URL + body) for POST-routed requests. To avoid confusing callers (especially for `monitoring-locations`), consider clarifying here that the decorator currently budgets against total request bytes (URL + body) even though the default comes from `_WATERDATA_URL_BYTE_LIMIT`.
```python
    """Decorator that splits multi-value list params across sub-requests
    so each URL fits ``url_limit`` bytes (defaults to
    ``filters._WATERDATA_URL_BYTE_LIMIT``) and the cartesian-product
    plan stays ≤ ``max_chunks`` sub-requests (defaults to
    ``_DEFAULT_MAX_CHUNKS``). All defaults are resolved at call time so
```
Medium: Concrete failure: … Because this PR is scoped to URL-length protection, budgeting POST body bytes against …
Copilot review surfaced that ``_plan_chunks`` and ``multi_value_chunked`` docstrings advertised the budget as URL length, but the implementation gates on ``_request_bytes`` (URL + body) so POST routes are sized correctly. The public ``url_limit`` parameter name is kept for API stability; only the prose is updated to honestly describe what's measured. Also tightened ``_worst_case_args`` wording from "URL probe" to "request-size probe" for consistency. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
585b2a6 to 0e10c50
```python
# - every multi-value list contributes ``len(",".join(map(str, v)))``,
# - the ``filter`` kwarg contributes ``len(filter)``.
```
Summary
Adds a multi-value GET-parameter chunker (`dataretrieval/waterdata/chunking.py`) that splits cartesian combinations of multi-value list params across sub-requests so each URL fits the server's ~8 KB byte limit. Sits OUTSIDE `@filters.chunked` on `_fetch_once`, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop. The two chunkers coordinate via a filter-aware probe.

Depends on #273. That PR fixes the pre-existing silent-truncation bug in `_walk_pages` whose blast radius this PR materially expands. Merge order: #273 → #233 → this PR. See "Merge ordering" below.

Why

PR #233 routes multi-value waterdata queries through GET with comma-separated values (e.g. `monitoring_location_id=USGS-A,USGS-B,…`). For real-world workloads (~300+ sites or pcodes), the URL blows past the server's 8 KB nginx buffer and the API returns HTTP 414. This PR makes long list params transparently fan out.

Design
- Planner (`_plan_chunks`): greedy halving of the largest chunk across all dims until the worst-case sub-request URL fits the byte limit. Cartesian product of per-dim partitions becomes the sub-request set. O(log N) per dim.
- Coordination with `filters.chunked`: when planning, the URL probe substitutes the filter with its longest top-level OR-clause via `_filter_aware_probe_args`. This models the per-sub-request URL that `filters.chunked` will actually emit — the filter chunker can shrink the filter down to one clause per sub-request but no smaller (it bails if any clause exceeds the budget). Without this coordination, a long OR-filter combined with multi-value lists triggered premature `RequestTooLarge` exceptions even when the combined chunkers would have made things fit.
- Chunk cap (`max_chunks`, default 1000): the cartesian product won't exceed this. Note: this caps the count of chunks, not the count of HTTP requests, because each chunk may paginate (each pagination is a separate HTTP request that counts against the hourly quota).
- Quota safety floor (`quota_safety_floor`, default 50): after every sub-request the wrapper reads `x-ratelimit-remaining`. If it drops below the floor, the wrapper raises `QuotaExhausted` carrying the combined partial frame, the chunk offset, and the last-observed `remaining` — letting callers either salvage what they have or resume from the next chunk after the hourly window resets. This is critical because of the pagination silent-truncation bug fixed in #273 (Fix silent / misleading failures in paginated waterdata pagination); without quota awareness the chunker could otherwise drive the API into 429 territory mid-call and (pre-#273) silently lose pagination pages.

Failure modes (post-fix)
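The quota-floor guard can be sketched as follows (exception fields match the PR's description; the loop shape and helper names are assumptions):

```python
class QuotaExhausted(Exception):
    def __init__(self, partial_frame, completed_chunks, total_chunks, remaining):
        super().__init__(f"quota floor hit with {remaining} requests remaining")
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining

def run_chunks(fetch_once, sub_args, quota_safety_floor=50):
    frames = []
    for i, args in enumerate(sub_args):
        frame, headers = fetch_once(args)
        frames.append(frame)
        remaining = int(headers.get("x-ratelimit-remaining", 10**9))
        if remaining < quota_safety_floor and i + 1 < len(sub_args):
            # Abort BEFORE the next sub-request would risk a mid-call 429,
            # carrying everything needed to salvage or resume.
            raise QuotaExhausted(frames, i + 1, len(sub_args), remaining)
    return frames

responses = iter([("f0", {"x-ratelimit-remaining": "100"}),
                  ("f1", {"x-ratelimit-remaining": "10"}),
                  ("f2", {"x-ratelimit-remaining": "5"})])
try:
    run_chunks(lambda args: next(responses), [1, 2, 3])
except QuotaExhausted as exc:
    result = (exc.completed_chunks, exc.total_chunks, exc.remaining)
```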
| Condition | Result |
|---|---|
| Plan irreducible (singleton chunks + bail-floor filter still over the limit) | `RequestTooLarge` with explicit message |
| Cartesian product exceeds `max_chunks` | `RequestTooLarge` with the actual count |
| `x-ratelimit-remaining` drops below floor mid-call | `QuotaExhausted` with `.partial_frame`, `.completed_chunks`, `.total_chunks` |

Coordination empirically verified
Validated against `api.waterdata.usgs.gov` (worktree contains the three live test scripts):
- `url_limit=220` …
- `url_limit=220` …
- `RequestTooLarge` without the filter-aware probe

Test plan
- `_filter_aware_probe_args` (8 cases)
- `_plan_chunks`: greedy halving, `RequestTooLarge` floor, coordination with filter chunker, `max_chunks` cap
- `multi_value_chunked`: pass-through, cartesian product shape, lazy URL-limit reads, `max_chunks` override
- `QuotaExhausted` machinery (8 cases: header parsing, default floor, mid-call abort with partial frame, last-chunk-no-abort, zero-floor disable, message contents)
- `filter` and `filter_lang` excluded from `_NEVER_CHUNK` so a caller passing `filter` as a list can't produce malformed CQL
- existing `_construct_api_requests_*` and `_check_*` tests still pass

30 new + existing tests; runs offline (uses a deterministic fake `build_request`). Live scripts in `/tmp` of the development worktree, not in the test suite.

Merge ordering
#273 (Fix silent / misleading failures in paginated waterdata pagination) fixes the silent truncation in `_walk_pages`. This chunker amplifies the frequency of the latent silent-truncation bug — without #273, every paginated sub-request inside the cartesian product becomes a chance for silent data loss when the hourly quota is exhausted. The chunker's `QuotaExhausted` guard mitigates this somewhat by aborting before the 429, but it's belt-and-suspenders, not a substitute.

If the team prefers to land in a different order, this PR can ship without #273 — but the chunker should be considered "production-ready" only once #273 is merged.
Files changed
- `dataretrieval/waterdata/chunking.py` (new) — decorator, planner, `RequestTooLarge`, `QuotaExhausted`
- `dataretrieval/waterdata/utils.py` — wire `@chunking.multi_value_chunked` outside `@filters.chunked` on `_fetch_once`; updated docstring
- `tests/waterdata_test.py` — 30 new tests covering planner, coordination, cap, quota machinery

Public API additions
- `dataretrieval.waterdata.chunking.RequestTooLarge` — raised when planning can't make the URL fit
- `dataretrieval.waterdata.chunking.QuotaExhausted` — raised mid-call when the quota safety floor is breached; carries `.partial_frame`, `.partial_response`, `.completed_chunks`, `.total_chunks`, `.remaining`
- `dataretrieval.waterdata.chunking.multi_value_chunked` — decorator, in case downstream wraps a custom `_fetch_once`
- `_DEFAULT_MAX_CHUNKS = 1000`, `_DEFAULT_QUOTA_SAFETY_FLOOR = 50` (underscored — not part of the stable public API; available for monkey-patching by users with non-default quotas)

Known limitations (deferred)
- … scale `max_chunks` proportionally for paginating workloads, or rely on the `QuotaExhausted` abort to catch overruns at runtime.
- … `QuotaExhausted`. Caller decides. Could add `on_quota_exhausted="sleep"` mode in a follow-up if desired.

🤖 Generated with Claude Code