feat(waterdata): drop hash-valued ID columns by default by thodson-usgs · Pull Request #281 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-17T21:53:47Z

Summary

The waterdata services previously returned hash-valued ID columns (UUIDs and hex digests) in every response across three distinct code paths. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like monitoring_location_id (AGENCY-ID format), parameter_code, statistic_id, and time are sufficient to pin a row.

This PR drops the hash columns by default across every public get_* function and adds include_hash_ids: bool = False for opt-in. Stable AGENCY-ID and code-style identifiers are unaffected.

What gets dropped

Path	Function(s)	Columns dropped by default
OGC	`get_daily`, `get_continuous`, `get_latest_*`, `get_peaks`, `get_field_measurements`, `get_field_measurements_metadata`, `get_channel`, `get_time_series_metadata`, `get_combined_metadata`, `get_monitoring_locations`, `get_reference_table`	`daily_id`, `continuous_id`, `latest_*_id`, `peak_id`, `channel_measurements_id`, `field_measurement_id`, `field_visit_id`, `field_measurements_series_id`, `field_series_id`, `time_series_id`, `parent_time_series_id`, `combined_meta_id`
Statistics	`get_stats_por`, `get_stats_date_range`	`computation_id`, `parent_time_series_id`
Samples CSV	`get_samples`	`Activity_ActivityIdentifier`, `Result_MeasureIdentifier`

get_monitoring_locations is unchanged — the id (USGS-…) is the AGENCY-ID format, not a hash.

Implementation

A single _drop_hash_columns(df, include_hash_ids, keep=None) helper handles the post-processing drop across all three paths via set(df.columns) & _HASH_ID_COLUMNS - keep. The constants live in dataretrieval/waterdata/utils.py:

_HASH_ID_COLUMNS — the set of column names whose values are server-generated hashes.
_queryables_cache — per-service queryables lookup, one HTTP fetch per service per process.
_default_props_cache — memoized non-hash property whitelist by (service, output_id).

OGC path (get_ogc_data): trims server-side via the OGC properties= query parameter. _default_non_hash_properties(service, output_id) builds the whitelist by subtracting _HASH_ID_COLUMNS from the service's queryables. _arrange_cols calls _drop_hash_columns as a client-side safety net for the queryables-fetch-failed fallback path.

Statistics path (get_stats_data): server doesn't accept properties=, so _drop_hash_columns runs client-side after _expand_percentiles (which still needs computation_id as a join key while it explodes percentile lists into rows).

Samples CSV path (get_samples): _drop_hash_columns runs after CSV parse.

All three honor include_hash_ids: bool = False.

Performance

Measured against the live API on get_continuous(USGS-02238500, 00060, 2023-01-01/2024-01-01) — ~35,000 rows, ~one server page. Methodology: warmup pair + 5 alternating rounds, queryables cache left warm across rounds (mirrors real-world callers that issue multiple queries per process).

Metric	`include_hash_ids=True`	default	Δ
Server payload bytes (5 rounds, identical each time)	25,330,519	20,775,705	−18.0%
DataFrame memory (deterministic)	16.6 MB	11.0 MB	−33.5%
Peak traced memory (deterministic)	118.3 MB	97.5 MB	−17.6%
Local parse + DataFrame (offline, best of 5)	1.35 s	1.20 s	−11.0%
End-to-end wall (live, median of 5)	5.23 s	6.41 s	+22.6%

The wall-clock picture is honest but mixed. Bytes-on-wire are reliably 18% smaller and the local parse/DataFrame work is reliably 11% faster (offline measurements have tight variance: ±0.05s). But repeated bare-HTTP measurements show USGS's server takes ~16% longer to serialize a filtered response than a full one, and on a fast home connection that extra server time exceeds the byte-transfer savings. Online wall has high run-to-run variance (5–10× spread for the same query); the median favors the legacy path by ~1.2s here.

Reliable wins: memory (33% smaller DataFrame, 17–21% smaller peak), bytes-on-wire (18% smaller), and the UX cleanup of not dragging UUID columns around by default. Wall-clock: roughly neutral on a fast connection (server-side filter cost ≈ byte-transfer savings), wins on bandwidth-constrained ones.

Verification

I ran a live-API column-content sweep against every public dataretrieval.waterdata function — call each, scan every column's first 50 non-null values against UUID / 32-char-hex / 16+-char-hex regexes, flag any column whose first values look hash-shaped. The sweep found two leaks beyond the OGC path (get_stats_* and get_samples), both fixed in this PR.

Function	Verified	Notes
11 OGC `get_*` services	✅ Live	no hash-shaped columns
`get_reference_table` (parameter-codes, agency-codes)	✅ Live
`get_codes(states)`	✅ Live
`get_samples` (1,494 rows non-empty)	✅ Live	post-fix verified clean
`get_samples_summary`	✅ (fixture)	8 cols, no UUIDs
`get_stats_por` (default)	✅ Live	22 × 17 cols (was 19) — hash IDs gone
`get_stats_por (include_hash_ids=True)`	✅ Live	22 × 19 cols — opt-in restores them
`get_stats_date_range`	✅ Live	3 × 18 cols, post-fix verified clean
`get_ratings`	✅	Returns dict of rating tables (`INDEP`/`DEP`/`STOR`/`SHIFT`) — no hashes by definition
`get_nearest_continuous`	✅	Thin wrapper over `get_continuous` (verified)

Migration

Default behavior changes — callers relying on hash columns need to either:

Pass include_hash_ids=True, or
List the hash column explicitly in properties= (OGC services only).

Existing tests that asserted hash columns were present have been updated; new tests document the opt-in path on each surface.

Test plan

Unit tests: 157 pass in waterdata_utils_test.py + filters/nearest/ratings; 274 pass across the whole non-live tree.
Live API tests: 67 pass in waterdata_test.py.
Live-API hash-column sweep across every public get_* function — no leaks remain.
Opt-in path: include_hash_ids=True returns the legacy columns on every surface, verified live for stats and samples.
Explicit override: listing a hash column in properties= keeps it even with the default.

🤖 Generated with Claude Code

The waterdata OGC services previously returned per-record UUID columns (``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash columns (``time_series_id``, ``parent_time_series_id``, ``field_visit_id``, ``field_measurements_series_id``) in every response. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like ``monitoring_location_id`` (AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time`` are sufficient to pin a row. Drop the hash columns by default and add ``include_hash_ids: bool = False`` to every OGC ``get_*`` function for opt-in. Implementation trims server-side via the OGC ``properties=`` query parameter (cached per service from one queryables fetch) so the payload itself is smaller, with a client-side drop as a safety net. ``monitoring_location_id`` and other AGENCY-ID / code-style identifiers are unaffected. Offline benchmark on a synthetic 30,000-row payload (mirrors the on-wire shape and per-row size of a 1-year ``get_continuous`` query): - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller) - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller) - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller) - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster) Network savings stack on top of the local speedup. For very small queries (≲1k rows) the one-time queryables fetch overhead can dominate the savings; large queries are the target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e_range The OGC ``get_*`` functions in the prior commit drop hash columns through ``get_ogc_data``. The statistics services (which return JSON through ``get_stats_data`` rather than OGC features) bypassed that path, so ``get_stats_por`` and ``get_stats_date_range`` were still returning ``computation_id`` (UUID) and ``parent_time_series_id`` (hex hash) by default. This commit: - Adds ``computation_id`` to ``_HASH_ID_COLUMNS`` (``parent_time_series_id`` was already there). - Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``, ``get_stats_por``, and ``get_stats_date_range``. - Drops the hash columns at the end of ``get_stats_data``, after ``_expand_percentiles`` (which still needs ``computation_id`` as a join key while it explodes the percentile lists into rows). - Updates ``test_get_stats_por_expanded_false`` / ``test_get_stats_date_range`` to reflect the new column count and adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in. Discovered while running a live-API sweep across every public waterdata ``get_*`` function — the OGC services now pass, the stats ones used to leak, and this commit closes that gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two mocked-response tests for ``get_stats_data``: - ``test_get_stats_data_drops_hash_ids_by_default`` asserts ``computation_id`` and ``parent_time_series_id`` are removed when ``include_hash_ids=False`` (the new default). - ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the opt-in path preserves them, matching the legacy behavior. Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the fake response only needs to carry the column shape, not the full nested-percentile body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A live-API column-content scan found that the Samples CSV service returns two UUID-valued columns by default: - ``Activity_ActivityIdentifier`` (per-activity UUID) - ``Result_MeasureIdentifier`` (per-measurement UUID) These weren't covered by the prior OGC / stats commits because ``get_samples`` parses CSV directly without going through ``get_ogc_data`` or ``get_stats_data``. This commit: - Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``. - Plumbs ``include_hash_ids: bool = False`` through ``get_samples`` and drops the named columns from the parsed CSV before returning. - Updates ``test_mock_get_samples`` to reflect the new column count (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for the opt-in path. - Updates ``test_samples_results`` and ``test_samples_activity`` similarly. Stable identifiers (``Org_Identifier``, ``Location_Identifier``, ``Project_Identifier``, ``USGSpcode``, …) are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Post-PR-281 cleanup from a code review pass. No behavior change for users; the three code paths that drop hash columns (OGC, stats, samples) now share one helper, and a few smaller wins. - New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper replaces three near-identical drop blocks across ``_arrange_cols``, ``get_stats_data``, and ``get_samples``. Uses ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection) in place of the per-call list-comprehension + ``if drop_cols:`` truthiness guard. ``df.drop`` accepts an empty Index, so the guard was unnecessary. - ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples branch now calls the helper instead of touching the constant. - New ``_properties_unspecified(properties)`` extracts the ``properties is None or all(pd.isna(properties))`` predicate so the five call sites that need it stop drifting between ``not properties`` and ``properties is None`` variants. - ``_default_non_hash_properties`` memoizes its result by ``(service, output_id)`` via ``_default_props_cache``. The previous version rebuilt a ~30–100-item list on every OGC call after the queryables cache was warm; this saves the per-call list rebuild on the hot path. - ``get_ogc_data`` flattens its nested ``try/except + if/else`` branches into a single ``use_server_trim`` flag, removing the duplicated ``_switch_properties_id`` assignment. - The benchmark no longer clears ``_queryables_cache`` between measured rounds. Clearing per round only penalizes the default path (the legacy path doesn't consult the cache), so the previous comparison was pessimistic against the default. Real-world callers pay the queryables fetch once per process. - The benchmark's ``capture_fixtures`` now calls ``_default_non_hash_properties`` directly instead of reimplementing the filter, so the "trimmed" fixture matches what the runtime actually sends bit-for-bit. All 157 unit and mock tests pass after the refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The ``benchmarks/`` directory was useful during development for measuring the hash-ID-drop impact, but it's not part of the runtime or test surface and the numbers are captured in the PR description. Removing to keep the diff focused on the library change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thodson-usgs and others added 6 commits May 17, 2026 16:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(waterdata): drop hash-valued ID columns by default#281

feat(waterdata): drop hash-valued ID columns by default#281
thodson-usgs wants to merge 6 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids

thodson-usgs commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thodson-usgs commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What gets dropped

Implementation

Performance

Verification

Migration

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thodson-usgs commented May 17, 2026 •

edited

Loading