Commits
21 commits
7fe353b
Use GET with comma-separated values for multi-value waterdata queries
thodson-usgs Apr 14, 2026
fa78869
Polish PR 233: module-level constant, fix misleading comment, add 3 u…
thodson-usgs May 14, 2026
22a09c7
Add multi-value GET-parameter chunker for waterdata OGC API
thodson-usgs May 15, 2026
c7ef181
Probe with longest OR-clause, not shortest
thodson-usgs May 15, 2026
ed216dd
Document the chained-query use case in chunker, get_daily, NEWS
thodson-usgs May 15, 2026
0768245
Tidy chunking.py: extract _chunk_bytes, name quota sentinel, use math…
thodson-usgs May 15, 2026
84d1d33
Merge remote-tracking branch 'upstream/main' into multivalue-chunker
thodson-usgs May 15, 2026
2941328
Probe URL + body bytes (not just URL) to chunk POST-routed services
thodson-usgs May 15, 2026
a9cf2d7
Tighten _request_bytes: type the param and drop redundant str() wrap
thodson-usgs May 15, 2026
7aeacd5
Return latest paginated/chunked response so QuotaExhausted floor sees…
thodson-usgs May 15, 2026
13d5e0c
Probe filter URL with encoding-ratio-weighted size, not just raw length
thodson-usgs May 15, 2026
df25e0d
Address PR review: lazy max_chunks default, early-exit cap check, doc…
thodson-usgs May 15, 2026
16f02fd
Address PR review round 2: URL-encoded chunk sizing, doc fixes
thodson-usgs May 15, 2026
10d4156
Preserve original-query URL in md.url; carry latest headers + cumulat…
thodson-usgs May 15, 2026
82087ad
Rewrap multi_value_chunked docstring so identifier stays intact
thodson-usgs May 16, 2026
aeb0f0a
Simplify chunking module: shared helpers, idiomatic max(), tighter types
thodson-usgs May 16, 2026
954fbcb
Parametrize remaining bare type hints
thodson-usgs May 16, 2026
3b11ce5
Fix doc/test wording around longest-vs-shortest clause and align fake…
thodson-usgs May 16, 2026
526a7e4
Tighten chunking docstrings for accuracy and clarity
thodson-usgs May 16, 2026
37e2d2a
Add optional parallel chunk processing via ThreadPoolExecutor
thodson-usgs May 16, 2026
0453011
Merge branch 'main' into chunker-async-experiment
thodson-usgs May 17, 2026
2 changes: 2 additions & 0 deletions NEWS.md
@@ -1,3 +1,5 @@
**05/17/2026:** The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) now transparently chunk requests whose URLs would otherwise exceed the server's ~8 KB byte limit. A common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — previously failed with HTTP 414 once the resulting URL grew past the limit; it now fans out across multiple sub-requests under the hood and returns one combined DataFrame. The chunker coordinates with the existing CQL `filter` chunker (long top-level-`OR` filters still split correctly when used alongside long multi-value lists), caps cartesian-product plans at 1000 sub-requests (the default USGS hourly quota), and aborts mid-call with a structured `QuotaExhausted` exception — carrying the partial result and a resume offset — if `x-ratelimit-remaining` drops below a safety floor. Mirrors R `dataRetrieval`'s [#870](https://github.com/DOI-USGS/dataRetrieval/pull/870), generalized to N dimensions. Note one metadata-behavior change for paginated/chunked calls: `BaseMetadata.url` still reflects the user's original query (unchanged), but `BaseMetadata.header` now carries the *last* page's / sub-request's headers (so `x-ratelimit-remaining` is current) rather than the first's, and `BaseMetadata.query_time` is now the cumulative wall-clock time across pages instead of the first page's elapsed time.
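The abort-and-resume contract described above can be sketched with stand-ins. The attribute names `.partial` and `.resume_offset` here are illustrative, not necessarily the real fields on `QuotaExhausted`; check the exception class for the actual names:

```python
# Sketch of the resume pattern. The exception class and its attributes
# are stand-ins for dataretrieval's structured QuotaExhausted exception.
class QuotaExhausted(Exception):
    """Raised when the rate-limit safety floor is hit mid-call."""
    def __init__(self, partial, resume_offset):
        super().__init__("hourly quota safety floor reached")
        self.partial = partial              # rows fetched before aborting
        self.resume_offset = resume_offset  # index of first unfetched chunk

def fetch_chunks(chunks, quota, start=0):
    """Fetch each chunk until quota runs out, raising with partial results."""
    rows = []
    for i in range(start, len(chunks)):
        if quota[0] <= 0:
            raise QuotaExhausted(rows, i)
        quota[0] -= 1
        rows.append(f"data-for-{chunks[i]}")
    return rows

quota = [3]  # pretend only 3 requests remain this hour
try:
    result = fetch_chunks(["a", "b", "c", "d", "e"], quota)
except QuotaExhausted as exc:
    result = exc.partial  # keep what we got
    quota[0] = 10         # ...later, once the quota has refilled:
    result += fetch_chunks(["a", "b", "c", "d", "e"], quota,
                           start=exc.resume_offset)
```

The point of the structured exception is that nothing is lost on abort: the caller keeps the partial frame and can continue from the exact sub-request where the run stopped.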

**05/16/2026:** Fixed silent truncation in the paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`). Mid-pagination failures (HTTP 429, 5xx, network error) were previously swallowed — pagination would quietly stop and the function would return whatever rows it had collected, leaving callers with truncated DataFrames they had no way to detect. The loops now status-check every page like the initial request and raise `RuntimeError` on any failure, with the upstream exception chained as `__cause__` and a short menu of recovery actions (wait and retry, reduce the request, or obtain an API token) in the message. **Behavior change**: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception; retry the call (possibly with a smaller `limit` or narrower query).
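The new loop behavior amounts to status-checking every page and chaining the upstream error. A minimal sketch with made-up helper names (the real loops are `_walk_pages` and `get_stats_data`):

```python
# Illustrative only: a pagination loop that raises instead of silently
# returning a truncated result when a mid-pagination fetch fails.
def fetch(page):
    """Stand-in page fetcher that fails on page 2."""
    if page == 2:
        raise ConnectionError("HTTP 429")
    return [f"row-{page}"]

def walk_pages(pages):
    rows = []
    for page in pages:
        try:
            rows.extend(fetch(page))
        except Exception as err:
            # Chain the upstream exception so callers can inspect __cause__,
            # and suggest recovery actions in the message.
            raise RuntimeError(
                f"pagination failed on page {page}; wait and retry, "
                "reduce the request, or obtain an API token"
            ) from err
    return rows

try:
    walk_pages([1, 2, 3])
except RuntimeError as err:
    cause = err.__cause__   # the original ConnectionError
    msg = str(err)
```

Previously the equivalent of `walk_pages` would have returned `["row-1"]` with no indication that pages 2 and 3 were missing.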

**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
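The bump itself is a small manifest change; a minimal sketch of the relevant `pyproject.toml` fields (the surrounding contents of the real file are not shown here):

```toml
[project]
requires-python = ">=3.9"

[tool.ruff]
target-version = "py39"
```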
15 changes: 15 additions & 0 deletions dataretrieval/waterdata/api.py
@@ -230,6 +230,21 @@ def get_daily(
... parameter_code="00060",
... last_modified="P7D",
... )

>>> # Chain queries: pull all stream sites in a state, then their
>>> # daily discharge for the last week. The site list can be hundreds
>>> # of values long — the request is transparently chunked across
>>> # multiple sub-requests so the URL stays under the server's byte
>>> # limit. Combined output looks like a single query.
>>> sites_df, _ = dataretrieval.waterdata.get_monitoring_locations(
... state_name="Ohio",
... site_type="Stream",
... )
>>> df, md = dataretrieval.waterdata.get_daily(
... monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
... parameter_code="00060",
... time="P7D",
... )
"""
service = "daily"
output_id = "daily_id"
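For readers curious how the fan-out works mechanically, here is a standalone sketch (not the library's actual implementation) of packing a long multi-value list into comma-joined groups whose URL-encoded size stays under a byte budget, assuming an ~8 KB server limit:

```python
# Illustrative sketch of multi-value chunking: greedily pack values into
# comma-separated groups that stay under a URL-encoded byte budget.
from urllib.parse import quote

def chunk_values(values, budget=8000):
    """Return comma-joined groups of `values`, each under `budget` bytes
    when URL-encoded (a comma encodes to the 3-byte "%2C")."""
    groups, current, size = [], [], 0
    for v in values:
        encoded = len(quote(v, safe="")) + 3  # value plus encoded comma
        if current and size + encoded > budget:
            groups.append(current)
            current, size = [], 0
        # A single oversized value still gets its own group.
        current.append(v)
        size += encoded
    if current:
        groups.append(current)
    return [",".join(g) for g in groups]

# 2000 synthetic site ids, packed under a deliberately small budget:
params = chunk_values([f"USGS-{i:08d}" for i in range(2000)], budget=1000)
```

Each element of `params` can then be sent as one sub-request's parameter value, with the results concatenated into a single DataFrame, which is the behavior the NEWS entry above describes.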