feat(waterdata): add chunk_granularity to control OGC chunk fan-out by thodson-usgs · Pull Request #341 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-07-01T14:08:23Z

Summary

Adds waterdata.chunk_granularity(level) — a context manager to control how finely the OGC waterdata (and NGWMN) getters split multi-value requests into chunked sub-requests.

Today the chunker splits a request only as far as the server's ~8 KB URL-byte limit forces, which yields the fewest sub-requests. That is the safe default, but it can be needlessly conservative. Because every sub-request paginates, splitting a large result further is usually quota-neutral: a ten-state pull sent as one under-limit request pages about as many times as ten per-state requests would. In that situation finer chunks buy smoother progress, more even concurrency, and a smaller unit of retry/resume, at no meaningful extra quota cost.

The library can't tell in advance whether a query is large (ten states over a short window might fit in a single page, where extra chunks would only burn quota), so this is a deliberate, scoped knob the user sets with their own judgment — not automatic, and not a process-wide env var (which would be a quota footgun). Scoping it to a with block keeps an aggressive setting from leaking into unrelated calls.
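The quota argument above comes down to page-count arithmetic. The sketch below is illustrative only: the page size and row counts are made-up numbers, not the service's actual limits, and it assumes every page costs one request against the quota.

```python
import math

PAGE_SIZE = 10_000  # assumed page size, for illustration only


def pages(rows: int, page_size: int = PAGE_SIZE) -> int:
    """Requests needed to page through `rows` results (always at least one)."""
    return max(1, math.ceil(rows / page_size))


# Large pull: ten states with 250,000 rows each (hypothetical numbers).
large = [250_000] * 10
large_one_request = pages(sum(large))           # one request, many pages
large_per_state = sum(pages(r) for r in large)  # ten requests, fewer pages each

# Small pull: ten states with 300 rows each (hypothetical numbers).
small = [300] * 10
small_one_request = pages(sum(small))           # everything fits in one page
small_per_state = sum(pages(r) for r in small)  # one page per state

print(large_one_request, large_per_state)  # 250 250
print(small_one_request, small_per_state)  # 1 10
```

With these numbers the large pull costs the same either way, while the small pull costs ten times as much when split, which is why the choice is left to the caller.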

from dataretrieval import waterdata

# `many_sites` stands in for a long list of monitoring-location IDs.
# Default: chunk only as much as the URL limit needs.
df, md = waterdata.get_daily(monitoring_location_id=many_sites)

# Opt into a finer split for a pull you know is large:
with waterdata.chunk_granularity("high"):
    df, md = waterdata.get_daily(
        monitoring_location_id=many_sites, parameter_code="00060"
    )

The dial

chunk_granularity(level) takes one of three levels, typed as waterdata.GranularityLevel (a typing.Literal["low", "medium", "high"]) so a type checker rejects anything else at the call site, and an invalid string raises ValueError at the with:

| `level` | per-axis sub-chunk cap |
| --- | --- |
| `"low"` | 2 |
| `"medium"` | 8 |
| `"high"` | 32 |

Each axis is split into min(len(values), cap) pieces. There is no "off" level — not entering the block is off.
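As a rough illustration of the min(len(values), cap) rule, the sketch below splits one list of values into contiguous, near-even pieces. `split_axis` is a hypothetical helper written for this example, not the library's `_refine`, and it ignores the URL-byte pass that runs first in the real chunker.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_axis(values: Sequence[T], cap: int) -> List[List[T]]:
    """Split `values` into min(len(values), cap) contiguous, near-even pieces.

    Hypothetical helper for illustration; a cap of 0 leaves the axis unsplit.
    """
    if cap <= 0 or not values:
        return [list(values)]
    n = min(len(values), cap)
    base, extra = divmod(len(values), n)
    pieces: List[List[T]] = []
    start = 0
    for i in range(n):
        # The first `extra` pieces take one additional value each.
        size = base + (1 if i < extra else 0)
        pieces.append(list(values[start:start + size]))
        start += size
    return pieces


sites = [f"site-{i}" for i in range(10)]  # placeholder IDs
print([len(p) for p in split_axis(sites, 2)])  # [5, 5]
print([len(p) for p in split_axis(sites, 8)])  # [2, 2, 1, 1, 1, 1, 1, 1]
print(len(split_axis(sites, 32)))              # 10
```

A short axis saturates at one value per piece, so a cap larger than the list length has no further effect.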

The ceiling is a dedicated granularity constant, deliberately decoupled from concurrency. How finely a query splits (fan-out volume) is orthogonal to how many sub-requests run at once (API_USGS_CONCURRENT), so the cap is its own _GRANULARITY_MAX_CHUNKS = 32 rather than a fraction of the concurrency width; the three levels are spaced 4× apart (32 / 8 / 2) and derived from that one constant so they move together if it changes. Capping the aggressive end at 32 is the guardrail: an accidental "high" on a 10 000-item list can't explode into thousands of sub-requests. (With several multi-value arguments the per-argument counts still multiply.)
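The relationship between the single ceiling, the three levels, and the multi-argument caveat can be sketched as follows. The integer-division derivation and the `max_sub_requests` helper are assumptions made for this example (the PR only states the values and the 4× spacing), and the count ignores any extra splits the URL-byte pass may force.

```python
import math
from typing import Dict, Iterable

GRANULARITY_MAX_CHUNKS = 32  # ceiling value taken from the PR text

# Assumed derivation: each level is a quarter of the one above it.
CAPS: Dict[str, int] = {
    "high": GRANULARITY_MAX_CHUNKS,
    "medium": GRANULARITY_MAX_CHUNKS // 4,
    "low": GRANULARITY_MAX_CHUNKS // 16,
}


def max_sub_requests(axis_lengths: Iterable[int], level: str) -> int:
    """Sub-request count implied by the cap alone (hypothetical helper).

    Each multi-value argument contributes min(length, cap) pieces, and the
    per-argument counts multiply.
    """
    cap = CAPS[level]
    return math.prod(min(n, cap) for n in axis_lengths)


print(CAPS)                                   # {'high': 32, 'medium': 8, 'low': 2}
print(max_sub_requests([10_000], "high"))     # 32
print(max_sub_requests([10_000, 50], "high")) # 1024
print(max_sub_requests([3, 50], "medium"))    # 24
```

The second and third calls show both sides of the guardrail: one long axis stays at 32, but two long axes at "high" can still reach 32 × 32 sub-requests.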

Exported as waterdata.chunk_granularity / waterdata.GranularityLevel and, for parity with ChunkInterrupted, at the top level as dataretrieval.chunk_granularity / dataretrieval.GranularityLevel.

Implementation

- ChunkPlan._refine(max_chunks_per_axis): a soft pass that runs after the existing hard byte pass (_plan). It only ever splits chunks further (via the shared _split_at primitive), so the url_limit invariant always holds and it never raises. A no-op at cap 0, so the default path is byte-for-byte unchanged (passthrough preserved). Where _plan splits by URL bytes, _refine splits by atom count, evening out cardinality for smooth fan-out.
- The resolved integer cap is read from an Ambient (contextvar) set by the context manager, at plan-construction time inside multi_value_chunked's wrapper, so a later resume() (which re-issues already-planned sub-requests) needs no extra snapshot.
- _resolve_granularity maps the level name → cap and is the single validation boundary; ChunkPlan only ever sees a plain int. Valid levels come from get_args(GranularityLevel), so the Literal stays the single source of truth (mirrors _VALID_ON_TIE / _VALID_FILE_TYPES in sibling modules).
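The scoping and validation pattern described in this list can be sketched with the standard library alone. This is not the PR's code: the `Ambient` wrapper is not shown in the description, so a plain `contextvars.ContextVar` stands in for it, and `resolve_granularity`, `current_cap`, and `_cap_var` are hypothetical names.

```python
import contextlib
import contextvars
from typing import Iterator, Literal, get_args

GranularityLevel = Literal["low", "medium", "high"]

# The Literal is the single source of truth for the valid level names.
_VALID_LEVELS = get_args(GranularityLevel)
_CAPS = {"low": 2, "medium": 8, "high": 32}

# 0 means "no refinement", i.e. the default path outside any `with` block.
_cap_var = contextvars.ContextVar("granularity_cap", default=0)


def resolve_granularity(level: object) -> int:
    """Map a level name to its integer cap, rejecting anything else."""
    # Check the type first so unhashable inputs fail with ValueError too.
    if not isinstance(level, str) or level not in _VALID_LEVELS:
        raise ValueError(f"level must be one of {_VALID_LEVELS}, got {level!r}")
    return _CAPS[level]


@contextlib.contextmanager
def chunk_granularity(level: GranularityLevel) -> Iterator[None]:
    """Set the cap for the duration of the `with` block, then restore it."""
    token = _cap_var.set(resolve_granularity(level))
    try:
        yield
    finally:
        _cap_var.reset(token)


def current_cap() -> int:
    """What a planner would read at plan-construction time."""
    return _cap_var.get()


with chunk_granularity("high"):
    print(current_cap())  # 32
print(current_cap())      # 0
```

Because the generator body only runs on entry, an invalid level raises at the `with` statement, and resetting the token restores whatever cap an enclosing block had set.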

Tests & checks

- 29 granularity unit + end-to-end tests in tests/waterdata_chunking_test.py, plus an export-surface test. They cover the cap→pieces ramp/saturation (with cover-partition checks), the level ordering + 4× spacing (low < medium < high, high == the granularity ceiling), the guardrail on long axes, byte-budget preservation, filter-axis + multi-axis behavior, level resolution + rejection of every non-level shape (old int/keyword/None/wrong-case/whitespace/unhashable), context-manager scoping/validation, and the passthrough-unchanged default.
- ruff check, ruff format --check, and mypy --strict all clean.
- NEWS.md + a userguide section updated.

Note

Earlier revisions of this branch used an off/1–5/max dial, then briefly derived the caps from the concurrency width; it's now the fixed "low"/"medium"/"high" enum with a dedicated granularity ceiling decoupled from concurrency. Still a draft — happy to adjust the level names or the spacing.

🤖 Generated with Claude Code

The OGC getters chunk a multi-value request only as far as the server's ~8 KB URL limit forces, which yields the fewest sub-requests. But because every sub-request paginates, splitting a large result further is usually quota-neutral, so that conservative default can be needlessly coarse: ten states pulled as one under-limit request page just as many times as ten per-state requests would.

Add `waterdata.chunk_granularity(level)`, a context manager that lets a caller who knows their pull is large opt into a finer split, trading the same pages for more, smaller sub-requests (smoother progress, more even concurrency, a smaller unit of retry/resume).

The level is "low", "medium", or "high" (typed as `GranularityLevel`, a Literal, so a type checker rejects anything else; an invalid string raises ValueError at the `with`). Each level caps how many sub-chunks a multi-value argument is split into, derived from the default fan-out concurrency (`API_USGS_CONCURRENT`): high = the full width, medium a quarter, low a sixteenth (32 / 8 / 2 by default). Capping the aggressive end at the concurrency width bounds the blast radius so an accidental "high" on a huge list can't explode into thousands of sub-requests. There is no "off" level: not entering the block is off.

It is a scoped `with` block, not an env var, because the library can't tell in advance whether a query is large (a short-window query might fit one page, where extra chunks only burn quota).

Implementation: a soft `ChunkPlan._refine` pass runs after the hard byte pass; it only ever splits further, so the url_limit invariant holds and it never raises. The resolved per-axis cap is read from a contextvar (Ambient) set by the context manager at plan-construction time. Exported (with the `GranularityLevel` type) from `dataretrieval.waterdata` and the top-level `dataretrieval` package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thodson-usgs force-pushed the feat/chunk-granularity branch 2 times, most recently from ec8269d to b47e5cc · July 1, 2026 15:12

thodson-usgs force-pushed the feat/chunk-granularity branch from b47e5cc to 0195113 · July 1, 2026 16:25

thodson-usgs wants to merge 1 commit into DOI-USGS:main from thodson-usgs:feat/chunk-granularity
