Skip to content

feat(waterdata): add chunk_granularity to control OGC chunk fan-out#341

Draft
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:feat/chunk-granularity
Draft

feat(waterdata): add chunk_granularity to control OGC chunk fan-out#341
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:feat/chunk-granularity

Conversation

@thodson-usgs

@thodson-usgs thodson-usgs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds waterdata.chunk_granularity(level) — a context manager to control how finely the OGC waterdata (and NGWMN) getters split multi-value requests into chunked sub-requests.

Today the chunker splits a request only as much as the server's ~8 KB URL-byte limit forces — the fewest sub-requests. That is the safe default, but it can be needlessly conservative. Because every sub-request paginates, splitting a large result further is usually quota-neutral: ten states pulled as one under-limit request page just as many times as ten per-state requests would. In that situation finer chunks buy smoother progress, more even concurrency, and a smaller unit of retry/resume — at no extra quota cost.

The library can't tell in advance whether a query is large (ten states over a short window might fit in a single page, where extra chunks would only burn quota), so this is a deliberate, scoped knob the user sets with their own judgment — not automatic, and not a process-wide env var (which would be a quota footgun). Scoping it to a with block keeps an aggressive setting from leaking into unrelated calls.

from dataretrieval import waterdata

# Default: chunk only as much as the URL limit needs.
df, md = waterdata.get_daily(monitoring_location_id=many_sites)

# Opt into a finer split for a pull you know is large:
with waterdata.chunk_granularity("high"):
    df, md = waterdata.get_daily(
        monitoring_location_id=many_sites, parameter_code="00060"
    )

The dial

chunk_granularity(level) takes one of three levels, typed as waterdata.GranularityLevel (a typing.Literal["low", "medium", "high"]) so a type checker rejects anything else at the call site, and an invalid string raises ValueError at the with:

level per-axis sub-chunk cap
"low" 2
"medium" 8
"high" 32

Each axis is split into min(len(values), cap) pieces. There is no "off" level — not entering the block is off.

The ceiling is a dedicated granularity constant, deliberately decoupled from concurrency. How finely a query splits (fan-out volume) is orthogonal to how many sub-requests run at once (API_USGS_CONCURRENT), so the cap is its own _GRANULARITY_MAX_CHUNKS = 32 rather than a fraction of the concurrency width; the three levels are spaced 4× apart (32 / 8 / 2) and derived from that one constant so they move together if it changes. Capping the aggressive end at 32 is the guardrail: an accidental "high" on a 10 000-item list can't explode into thousands of sub-requests. (With several multi-value arguments the per-argument counts still multiply.)

Exported as waterdata.chunk_granularity / waterdata.GranularityLevel and, for parity with ChunkInterrupted, at the top level as dataretrieval.chunk_granularity / dataretrieval.GranularityLevel.

Implementation

  • ChunkPlan._refine(max_chunks_per_axis) — a soft pass that runs after the existing hard byte pass (_plan). It only ever splits chunks further (via the shared _split_at primitive), so the url_limit invariant always holds and it never raises. A no-op at cap 0, so the default path is byte-for-byte unchanged (passthrough preserved). Where _plan splits by URL bytes, _refine splits by atom count — evening out cardinality for smooth fan-out.
  • The resolved integer cap is read from an Ambient (contextvar) set by the context manager, at plan-construction time inside multi_value_chunked's wrapper — so a later resume() (which re-issues already-planned sub-requests) needs no extra snapshot.
  • _resolve_granularity maps the level name → cap and is the single validation boundary; ChunkPlan only ever sees a plain int. Valid levels come from get_args(GranularityLevel), so the Literal stays the single source of truth (mirrors _VALID_ON_TIE / _VALID_FILE_TYPES in sibling modules).

Tests & checks

  • 29 granularity unit + end-to-end tests in tests/waterdata_chunking_test.py, plus an export-surface test; covers the cap→pieces ramp/saturation (with cover-partition checks), the level ordering + 4× spacing (low < medium < high, high == the granularity ceiling), the guardrail on long axes, byte-budget preservation, filter-axis + multi-axis behavior, level resolution + rejection of every non-level shape (old int/keyword/None/wrong-case/whitespace/unhashable), context-manager scoping/validation, and the passthrough-unchanged default.
  • ruff check, ruff format --check, and mypy --strict all clean.
  • NEWS.md + a userguide section updated.

Note

Earlier revisions of this branch used an off/15/max dial, then briefly derived the caps from the concurrency width; it's now the fixed "low"/"medium"/"high" enum with a dedicated granularity ceiling decoupled from concurrency. Still a draft — happy to adjust the level names or the spacing.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the feat/chunk-granularity branch 2 times, most recently from ec8269d to b47e5cc Compare July 1, 2026 15:12
The OGC getters chunk a multi-value request only as far as the server's
~8 KB URL limit forces — the fewest sub-requests. But because every
sub-request paginates, splitting a large result further is usually
quota-neutral, so that conservative default can be needlessly coarse: ten
states pulled as one under-limit request page just as many times as ten
per-state requests would.

Add `waterdata.chunk_granularity(level)`, a context manager that lets a
caller who knows their pull is large opt into a finer split — trading the
same pages for more, smaller sub-requests (smoother progress, more even
concurrency, a smaller unit of retry/resume). The level is "low", "medium",
or "high" (typed as `GranularityLevel`, a Literal, so a type checker rejects
anything else; an invalid string raises ValueError at the `with`). Each level
caps how many sub-chunks a multi-value argument is split into, derived from
the default fan-out concurrency (`API_USGS_CONCURRENT`): high = the full
width, medium a quarter, low a sixteenth (32 / 8 / 2 by default). Capping the
aggressive end at the concurrency width bounds the blast radius so an
accidental "high" on a huge list can't explode into thousands of sub-requests.
There is no "off" level — not entering the block is off. It is a scoped `with`
block, not an env var, because the library can't tell in advance whether a
query is large (a short-window query might fit one page, where extra chunks
only burn quota).

Implementation: a soft `ChunkPlan._refine` pass runs after the hard byte
pass; it only ever splits further, so the url_limit invariant holds and it
never raises. The resolved per-axis cap is read from a contextvar (Ambient)
set by the context manager at plan-construction time. Exported (with the
`GranularityLevel` type) from `dataretrieval.waterdata` and the top-level
`dataretrieval` package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the feat/chunk-granularity branch from b47e5cc to 0195113 Compare July 1, 2026 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant