waterdata GeoDataFrames are returned with .crs = None (should be EPSG:4326) #342

@matinguku

Summary

waterdata getters return GeoDataFrames whose .crs is None, even though the
data is published in EPSG:4326 (WGS84) and the docstrings say so. The GeoDataFrame
is built via gpd.GeoDataFrame.from_features(...) without a crs= argument, so the
coordinate reference system is never attached.

This affects the modern, primary Water Data API path — the one most users hit.

Where

  • dataretrieval/ogc/shaping.py:143, in _get_resp_data:
    df = gpd.GeoDataFrame.from_features(
        [f if "geometry" in f else {**f, "geometry": None} for f in features]
    )  # no crs=
  • dataretrieval/waterdata/stats.py:113 — same pattern in the stats path.

Meanwhile the docstring for get_monitoring_locations states coordinates are
published in EPSG:4326 (dataretrieval/waterdata/api.py:638), so the returned
object contradicts the documentation.

Expected vs. actual

  • Expected: gdf.crs == "EPSG:4326".
  • Actual: gdf.crs is None.
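
The difference can also be seen offline, without calling the API. The following is a minimal sketch, assuming only geopandas is installed; the two features are made-up stand-ins for an API response, not real monitoring locations. It uses the same normalization as the snippet under "Where" and compares from_features with and without crs=.

```python
import geopandas as gpd

# Made-up GeoJSON-style features standing in for an API response.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-76.6, 39.3]},
        "properties": {"id": "A"},
    },
    {"type": "Feature", "properties": {"id": "B"}},  # no "geometry" key at all
]

# Same normalization as the library snippet: fill in a null geometry where missing.
normalized = [f if "geometry" in f else {**f, "geometry": None} for f in features]

# Without crs= the frame is "naive": geometries exist but carry no CRS.
without_crs = gpd.GeoDataFrame.from_features(normalized)

# With crs= the documented CRS is attached at construction time.
with_crs = gpd.GeoDataFrame.from_features(normalized, crs="EPSG:4326")

print(without_crs.crs)
print(with_crs.crs.to_epsg())
```

Passing crs= only labels the coordinates; it does not reproject them, which is appropriate here because the data is already published in EPSG:4326.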

Reproduction

from dataretrieval import waterdata

df, md = waterdata.get_monitoring_locations(state="Maryland", site_type_code="ST")
print(type(df))   # geopandas.GeoDataFrame
print(df.crs)     # -> None   (expected: EPSG:4326)

# Any CRS-aware operation now fails or warns:
df.to_crs("EPSG:3857")
# ValueError: Cannot transform naive geometries.
#   Please set a crs on the object first.

Impact

Because the CRS is missing, standard GeoPandas workflows break or emit warnings:

  • .to_crs(...) raises ValueError: Cannot transform naive geometries.
  • .explore() / folium mapping requires a manual .set_crs(4326) first — the
    repo's own map demo has to do this (acknowledged in the pyproject.toml
    comment on the doc extra, which notes .set_crs().explore() is needed).
  • Spatial joins and distance computations against other layers are unreliable
    without a defined CRS.
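
Until the CRS is attached by the library, callers can declare it themselves. The sketch below is a workaround under the assumption that the coordinates really are EPSG:4326, as the docstrings state; the single point is a made-up placeholder for a frame returned by a waterdata getter.

```python
import geopandas as gpd
from shapely.geometry import Point

# Placeholder for a frame returned with .crs = None (coordinates are made up).
naive = gpd.GeoDataFrame({"id": ["A"]}, geometry=[Point(-76.6, 39.3)])

# Reprojecting a naive frame fails, as described in the issue.
error = None
try:
    naive.to_crs("EPSG:3857")
except ValueError as exc:
    error = str(exc)

# set_crs declares the CRS without moving any coordinates ...
fixed = naive.set_crs("EPSG:4326")

# ... after which CRS-aware operations such as to_crs work.
projected = fixed.to_crs("EPSG:3857")

print(error)
print(projected.crs.to_epsg())
```

set_crs returns a new frame by default, so the original naive object is left unchanged.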

Inconsistency across modules

The CRS policy is applied inconsistently across the package:

Module               Sets CRS?  Value
nldi                 Yes        EPSG:4326 (dataretrieval/nldi.py:15,48)
nwis (legacy)        Yes        EPSG:4269 / NAD83 (dataretrieval/nwis.py:43,162)
waterdata (primary)  No         None

The most-used module is the only one that omits the CRS.

Proposed fix

Attach the documented CRS where the GeoDataFrame is constructed, e.g. define a
module-level _CRS = "EPSG:4326" (mirroring nldi/nwis) and pass it:

_CRS = "EPSG:4326"  # module-level constant; the CRS the data is published in

df = gpd.GeoDataFrame.from_features(
    [f if "geometry" in f else {**f, "geometry": None} for f in features],
    crs=_CRS,
)

Apply the same in dataretrieval/waterdata/stats.py:113. A regression test should
assert gdf.crs == "EPSG:4326" for a getter that returns geometry, and that the
CRS stays consistent through pagination (pd.concat of empty + non-empty pages).

Note: when skip_geometry=True (or all geometries are null) the result is a plain
DataFrame with no geometry, so the CRS assertion should only apply to the
geometry-bearing case.
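
One possible shape for such a regression test is sketched below. build_frame is a hypothetical stand-in for the library's response-shaping code, not the actual function; it only mimics the behavior described in this issue (CRS attached when geometry is present, plain DataFrame when every geometry is null). The feature values are made up, and the pagination case is not covered.

```python
import geopandas as gpd
import pandas as pd

_CRS = "EPSG:4326"


def build_frame(features):
    """Hypothetical stand-in for the shaping code, with the proposed fix applied."""
    # No usable geometry: return a plain DataFrame of the properties.
    if all(f.get("geometry") is None for f in features):
        return pd.DataFrame([f["properties"] for f in features])
    # Otherwise build a GeoDataFrame and attach the documented CRS.
    return gpd.GeoDataFrame.from_features(
        [f if "geometry" in f else {**f, "geometry": None} for f in features],
        crs=_CRS,
    )


def test_geometry_result_has_crs():
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-76.6, 39.3]},
            "properties": {"id": "A"},
        }
    ]
    gdf = build_frame(features)
    assert isinstance(gdf, gpd.GeoDataFrame)
    assert gdf.crs == "EPSG:4326"


def test_no_geometry_result_is_plain_dataframe():
    features = [{"type": "Feature", "geometry": None, "properties": {"id": "B"}}]
    df = build_frame(features)
    # The CRS assertion does not apply here: there is no geometry column.
    assert not isinstance(df, gpd.GeoDataFrame)
    assert list(df.columns) == ["id"]
```

Comparing gdf.crs with == rather than `is` matters: the attribute is a CRS object, so an identity check against a string would never pass.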
