Skip to content

feat(waterdata): drop hash-valued ID columns by default#281

Draft
thodson-usgs wants to merge 6 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids
Draft

feat(waterdata): drop hash-valued ID columns by default#281
thodson-usgs wants to merge 6 commits into
DOI-USGS:mainfrom
thodson-usgs:worktree-waterdata-drop-hash-ids

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 17, 2026

Summary

The waterdata services previously returned hash-valued ID columns (UUIDs and hex digests) in every response across three distinct code paths. These IDs are unstable across record refreshes and not human-meaningful — stable identifiers like monitoring_location_id (AGENCY-ID format), parameter_code, statistic_id, and time are sufficient to pin a row.

This PR drops the hash columns by default across every public get_* function and adds include_hash_ids: bool = False for opt-in. Stable AGENCY-ID and code-style identifiers are unaffected.

What gets dropped

Path Function(s) Columns dropped by default
OGC get_daily, get_continuous, get_latest_*, get_peaks, get_field_measurements, get_field_measurements_metadata, get_channel, get_time_series_metadata, get_combined_metadata, get_monitoring_locations, get_reference_table daily_id, continuous_id, latest_*_id, peak_id, channel_measurements_id, field_measurement_id, field_visit_id, field_measurements_series_id, field_series_id, time_series_id, parent_time_series_id, combined_meta_id
Statistics get_stats_por, get_stats_date_range computation_id, parent_time_series_id
Samples CSV get_samples Activity_ActivityIdentifier, Result_MeasureIdentifier

get_monitoring_locations is unchanged — the id (USGS-…) is the AGENCY-ID format, not a hash.

Implementation

A single _drop_hash_columns(df, include_hash_ids, keep=None) helper handles the post-processing drop across all three paths via set(df.columns) & _HASH_ID_COLUMNS - keep. The constants live in dataretrieval/waterdata/utils.py:

  • _HASH_ID_COLUMNS — the set of column names whose values are server-generated hashes.
  • _queryables_cache — per-service queryables lookup, one HTTP fetch per service per process.
  • _default_props_cache — memoized non-hash property whitelist by (service, output_id).

OGC path (get_ogc_data): trims server-side via the OGC properties= query parameter. _default_non_hash_properties(service, output_id) builds the whitelist by subtracting _HASH_ID_COLUMNS from the service's queryables. _arrange_cols calls _drop_hash_columns as a client-side safety net for the queryables-fetch-failed fallback path.

Statistics path (get_stats_data): server doesn't accept properties=, so _drop_hash_columns runs client-side after _expand_percentiles (which still needs computation_id as a join key while it explodes percentile lists into rows).

Samples CSV path (get_samples): _drop_hash_columns runs after CSV parse.

All three honor include_hash_ids: bool = False.

Performance

Measured against the live API on get_continuous(USGS-02238500, 00060, 2023-01-01/2024-01-01) — ~35,000 rows, ~one server page. Methodology: warmup pair + 5 alternating rounds, queryables cache left warm across rounds (mirrors real-world callers that issue multiple queries per process).

Metric include_hash_ids=True default Δ
Server payload bytes (5 rounds, identical each time) 25,330,519 20,775,705 −18.0%
DataFrame memory (deterministic) 16.6 MB 11.0 MB −33.5%
Peak traced memory (deterministic) 118.3 MB 97.5 MB −17.6%
Local parse + DataFrame (offline, best of 5) 1.35 s 1.20 s −11.0%
End-to-end wall (live, median of 5) 5.23 s 6.41 s +22.6%

The wall-clock picture is honest but mixed. Bytes-on-wire are reliably 18% smaller and the local parse/DataFrame work is reliably 11% faster (offline measurements have tight variance: ±0.05s). But repeated bare-HTTP measurements show USGS's server takes ~16% longer to serialize a filtered response than a full one, and on a fast home connection that extra server time exceeds the byte-transfer savings. Online wall has high run-to-run variance (5–10× spread for the same query); the median favors the legacy path by ~1.2s here.

Reliable wins: memory (33% smaller DataFrame, 17–21% smaller peak), bytes-on-wire (18% smaller), and the UX cleanup of not dragging UUID columns around by default. Wall-clock: roughly neutral on a fast connection (server-side filter cost ≈ byte-transfer savings), wins on bandwidth-constrained ones.

Verification

I ran a live-API column-content sweep against every public dataretrieval.waterdata function — call each, scan every column's first 50 non-null values against UUID / 32-char-hex / 16+-char-hex regexes, flag any column whose first values look hash-shaped. The sweep found two leaks beyond the OGC path (get_stats_* and get_samples), both fixed in this PR.

Function Verified Notes
11 OGC get_* services ✅ Live no hash-shaped columns
get_reference_table (parameter-codes, agency-codes) ✅ Live
get_codes(states) ✅ Live
get_samples (1,494 rows non-empty) ✅ Live post-fix verified clean
get_samples_summary ✅ (fixture) 8 cols, no UUIDs
get_stats_por (default) ✅ Live 22 × 17 cols (was 19) — hash IDs gone
get_stats_por (include_hash_ids=True) ✅ Live 22 × 19 cols — opt-in restores them
get_stats_date_range ✅ Live 3 × 18 cols, post-fix verified clean
get_ratings Returns dict of rating tables (INDEP/DEP/STOR/SHIFT) — no hashes by definition
get_nearest_continuous Thin wrapper over get_continuous (verified)

Migration

Default behavior changes — callers relying on hash columns need to either:

  1. Pass include_hash_ids=True, or
  2. List the hash column explicitly in properties= (OGC services only).

Existing tests that asserted hash columns were present have been updated; new tests document the opt-in path on each surface.

Test plan

  • Unit tests: 157 pass in waterdata_utils_test.py + filters/nearest/ratings; 274 pass across the whole non-live tree.
  • Live API tests: 67 pass in waterdata_test.py.
  • Live-API hash-column sweep across every public get_* function — no leaks remain.
  • Opt-in path: include_hash_ids=True returns the legacy columns on every surface, verified live for stats and samples.
  • Explicit override: listing a hash column in properties= keeps it even with the default.

🤖 Generated with Claude Code

thodson-usgs and others added 6 commits May 17, 2026 16:53
The waterdata OGC services previously returned per-record UUID columns
(``daily_id``, ``continuous_id``, ``peak_id``, …) plus secondary hash
columns (``time_series_id``, ``parent_time_series_id``,
``field_visit_id``, ``field_measurements_series_id``) in every
response. These IDs are unstable across record refreshes and not
human-meaningful — stable identifiers like ``monitoring_location_id``
(AGENCY-ID format), ``parameter_code``, ``statistic_id`` and ``time``
are sufficient to pin a row.

Drop the hash columns by default and add ``include_hash_ids: bool =
False`` to every OGC ``get_*`` function for opt-in. Implementation
trims server-side via the OGC ``properties=`` query parameter (cached
per service from one queryables fetch) so the payload itself is
smaller, with a client-side drop as a safety net. ``monitoring_location_id``
and other AGENCY-ID / code-style identifiers are unaffected.

Offline benchmark on a synthetic 30,000-row payload (mirrors the
on-wire shape and per-row size of a 1-year ``get_continuous`` query):

  - Server payload: 14,310,081 → 10,230,081 bytes (28.5% smaller)
  - DataFrame memory: 14.2 MB → 9.4 MB (33.5% smaller)
  - Peak traced memory: 94.1 MB → 73.9 MB (21.5% smaller)
  - Local parse + DataFrame construction: 1.05s → 0.94s (10.8% faster)

Network savings stack on top of the local speedup. For very small
queries (≲1k rows) the one-time queryables fetch overhead can dominate
the savings; large queries are the target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_range

The OGC ``get_*`` functions in the prior commit drop hash columns
through ``get_ogc_data``. The statistics services (which return JSON
through ``get_stats_data`` rather than OGC features) bypassed that
path, so ``get_stats_por`` and ``get_stats_date_range`` were still
returning ``computation_id`` (UUID) and ``parent_time_series_id``
(hex hash) by default.

This commit:
- Adds ``computation_id`` to ``_HASH_ID_COLUMNS``
  (``parent_time_series_id`` was already there).
- Plumbs ``include_hash_ids: bool = False`` through ``get_stats_data``,
  ``get_stats_por``, and ``get_stats_date_range``.
- Drops the hash columns at the end of ``get_stats_data``, after
  ``_expand_percentiles`` (which still needs ``computation_id`` as a
  join key while it explodes the percentile lists into rows).
- Updates ``test_get_stats_por_expanded_false`` /
  ``test_get_stats_date_range`` to reflect the new column count and
  adds ``test_get_stats_por_include_hash_ids`` documenting the opt-in.

Discovered while running a live-API sweep across every public
waterdata ``get_*`` function — the OGC services now pass, the stats
ones used to leak, and this commit closes that gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two mocked-response tests for ``get_stats_data``:
- ``test_get_stats_data_drops_hash_ids_by_default`` asserts
  ``computation_id`` and ``parent_time_series_id`` are removed when
  ``include_hash_ids=False`` (the new default).
- ``test_get_stats_data_keeps_hash_ids_when_opted_in`` asserts the
  opt-in path preserves them, matching the legacy behavior.

Both use ``monkeypatch`` to stub ``_handle_stats_nesting`` so the
fake response only needs to carry the column shape, not the full
nested-percentile body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A live-API column-content scan found that the Samples CSV service
returns two UUID-valued columns by default:
  - ``Activity_ActivityIdentifier`` (per-activity UUID)
  - ``Result_MeasureIdentifier`` (per-measurement UUID)

These weren't covered by the prior OGC / stats commits because
``get_samples`` parses CSV directly without going through
``get_ogc_data`` or ``get_stats_data``.

This commit:
- Adds the two CamelCase column names to ``_HASH_ID_COLUMNS``.
- Plumbs ``include_hash_ids: bool = False`` through ``get_samples``
  and drops the named columns from the parsed CSV before returning.
- Updates ``test_mock_get_samples`` to reflect the new column count
  (187 → 185) and adds ``test_mock_get_samples_include_hash_ids`` for
  the opt-in path.
- Updates ``test_samples_results`` and ``test_samples_activity``
  similarly.

Stable identifiers (``Org_Identifier``, ``Location_Identifier``,
``Project_Identifier``, ``USGSpcode``, …) are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-PR-281 cleanup from a code review pass. No behavior change for
users; the three code paths that drop hash columns (OGC, stats,
samples) now share one helper, and a few smaller wins.

- New ``_drop_hash_columns(df, include_hash_ids, keep=None)`` helper
  replaces three near-identical drop blocks across ``_arrange_cols``,
  ``get_stats_data``, and ``get_samples``. Uses
  ``set(df.columns) & _HASH_ID_COLUMNS - keep`` (one Index intersection)
  in place of the per-call list-comprehension + ``if drop_cols:``
  truthiness guard. ``df.drop`` accepts an empty Index, so the guard
  was unnecessary.

- ``_HASH_ID_COLUMNS`` no longer leaks into ``api.py`` — the samples
  branch now calls the helper instead of touching the constant.

- New ``_properties_unspecified(properties)`` extracts the
  ``properties is None or all(pd.isna(properties))`` predicate so the
  five call sites that need it stop drifting between ``not properties``
  and ``properties is None`` variants.

- ``_default_non_hash_properties`` memoizes its result by
  ``(service, output_id)`` via ``_default_props_cache``. The previous
  version rebuilt a ~30–100-item list on every OGC call after the
  queryables cache was warm; this saves the per-call list rebuild on
  the hot path.

- ``get_ogc_data`` flattens its nested ``try/except + if/else``
  branches into a single ``use_server_trim`` flag, removing the
  duplicated ``_switch_properties_id`` assignment.

- The benchmark no longer clears ``_queryables_cache`` between
  measured rounds. Clearing per round only penalizes the default path
  (the legacy path doesn't consult the cache), so the previous
  comparison was pessimistic against the default. Real-world callers
  pay the queryables fetch once per process.

- The benchmark's ``capture_fixtures`` now calls
  ``_default_non_hash_properties`` directly instead of reimplementing
  the filter, so the "trimmed" fixture matches what the runtime
  actually sends bit-for-bit.

All 157 unit and mock tests pass after the refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``benchmarks/`` directory was useful during development for
measuring the hash-ID-drop impact, but it's not part of the runtime
or test surface and the numbers are captured in the PR description.
Removing to keep the diff focused on the library change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant