feat(waterdata): add chunk_granularity to control OGC chunk fan-out#341
Draft
thodson-usgs wants to merge 1 commit into
Draft
feat(waterdata): add chunk_granularity to control OGC chunk fan-out#341thodson-usgs wants to merge 1 commit into
thodson-usgs wants to merge 1 commit into
Conversation
ec8269d to
b47e5cc
Compare
The OGC getters chunk a multi-value request only as far as the server's ~8 KB URL limit forces — the fewest sub-requests. But because every sub-request paginates, splitting a large result further is usually quota-neutral, so that conservative default can be needlessly coarse: ten states pulled as one under-limit request page just as many times as ten per-state requests would. Add `waterdata.chunk_granularity(level)`, a context manager that lets a caller who knows their pull is large opt into a finer split — trading the same pages for more, smaller sub-requests (smoother progress, more even concurrency, a smaller unit of retry/resume). The level is "low", "medium", or "high" (typed as `GranularityLevel`, a Literal, so a type checker rejects anything else; an invalid string raises ValueError at the `with`). Each level caps how many sub-chunks a multi-value argument is split into, derived from the default fan-out concurrency (`API_USGS_CONCURRENT`): high = the full width, medium a quarter, low a sixteenth (32 / 8 / 2 by default). Capping the aggressive end at the concurrency width bounds the blast radius so an accidental "high" on a huge list can't explode into thousands of sub-requests. There is no "off" level — not entering the block is off. It is a scoped `with` block, not an env var, because the library can't tell in advance whether a query is large (a short-window query might fit one page, where extra chunks only burn quota). Implementation: a soft `ChunkPlan._refine` pass runs after the hard byte pass; it only ever splits further, so the url_limit invariant holds and it never raises. The resolved per-axis cap is read from a contextvar (Ambient) set by the context manager at plan-construction time. Exported (with the `GranularityLevel` type) from `dataretrieval.waterdata` and the top-level `dataretrieval` package. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b47e5cc to
0195113
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
waterdata.chunk_granularity(level)— a context manager to control how finely the OGCwaterdata(and NGWMN) getters split multi-value requests into chunked sub-requests.Today the chunker splits a request only as much as the server's ~8 KB URL-byte limit forces — the fewest sub-requests. That is the safe default, but it can be needlessly conservative. Because every sub-request paginates, splitting a large result further is usually quota-neutral: ten states pulled as one under-limit request page just as many times as ten per-state requests would. In that situation finer chunks buy smoother progress, more even concurrency, and a smaller unit of retry/resume — at no extra quota cost.
The library can't tell in advance whether a query is large (ten states over a short window might fit in a single page, where extra chunks would only burn quota), so this is a deliberate, scoped knob the user sets with their own judgment — not automatic, and not a process-wide env var (which would be a quota footgun). Scoping it to a
withblock keeps an aggressive setting from leaking into unrelated calls.The dial
chunk_granularity(level)takes one of three levels, typed aswaterdata.GranularityLevel(atyping.Literal["low", "medium", "high"]) so a type checker rejects anything else at the call site, and an invalid string raisesValueErrorat thewith:level"low""medium""high"Each axis is split into
min(len(values), cap)pieces. There is no"off"level — not entering the block is off.The ceiling is a dedicated granularity constant, deliberately decoupled from concurrency. How finely a query splits (fan-out volume) is orthogonal to how many sub-requests run at once (
API_USGS_CONCURRENT), so the cap is its own_GRANULARITY_MAX_CHUNKS = 32rather than a fraction of the concurrency width; the three levels are spaced 4× apart (32/8/2) and derived from that one constant so they move together if it changes. Capping the aggressive end at 32 is the guardrail: an accidental"high"on a 10 000-item list can't explode into thousands of sub-requests. (With several multi-value arguments the per-argument counts still multiply.)Exported as
waterdata.chunk_granularity/waterdata.GranularityLeveland, for parity withChunkInterrupted, at the top level asdataretrieval.chunk_granularity/dataretrieval.GranularityLevel.Implementation
ChunkPlan._refine(max_chunks_per_axis)— a soft pass that runs after the existing hard byte pass (_plan). It only ever splits chunks further (via the shared_split_atprimitive), so theurl_limitinvariant always holds and it never raises. A no-op at cap 0, so the default path is byte-for-byte unchanged (passthrough preserved). Where_plansplits by URL bytes,_refinesplits by atom count — evening out cardinality for smooth fan-out.Ambient(contextvar) set by the context manager, at plan-construction time insidemulti_value_chunked's wrapper — so a laterresume()(which re-issues already-planned sub-requests) needs no extra snapshot._resolve_granularitymaps the level name → cap and is the single validation boundary;ChunkPlanonly ever sees a plain int. Valid levels come fromget_args(GranularityLevel), so theLiteralstays the single source of truth (mirrors_VALID_ON_TIE/_VALID_FILE_TYPESin sibling modules).Tests & checks
tests/waterdata_chunking_test.py, plus an export-surface test; covers the cap→pieces ramp/saturation (with cover-partition checks), the level ordering + 4× spacing (low < medium < high,high == the granularity ceiling), the guardrail on long axes, byte-budget preservation, filter-axis + multi-axis behavior, level resolution + rejection of every non-level shape (old int/keyword/None/wrong-case/whitespace/unhashable), context-manager scoping/validation, and the passthrough-unchanged default.ruff check,ruff format --check, andmypy --strictall clean.Note
Earlier revisions of this branch used an
off/1–5/maxdial, then briefly derived the caps from the concurrency width; it's now the fixed"low"/"medium"/"high"enum with a dedicated granularity ceiling decoupled from concurrency. Still a draft — happy to adjust the level names or the spacing.🤖 Generated with Claude Code