
Spark 4.1: Read and write geometry and geography values in Parquet #17073

Open
huan233usc wants to merge 2 commits into apache:main from huan233usc:geo-parquet-internal

Conversation

@huan233usc

Contributor

Follow-up to the geo type work: the Spark type mapping (#16851) and Iceberg's
own Parquet value path (#16982) are in place, but the Spark Parquet
reader/writer did not handle geometry/geography values.

Geometry and geography columns carry a Parquet LogicalTypeAnnotation with no
legacy OriginalType. SparkParquetReaders and SparkParquetWriters dispatch on
the OriginalType / logical-type paths, neither of which handled geo, so:

  • the reader fell through to the physical BINARY case and returned a raw
    byte[], which is the wrong in-memory type for a geo column (Spark's
    InternalRow.getGeometry / getGeography expect GeometryVal / GeographyVal);
  • the writer hit the unsupported-logical-type branch and threw.
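The two failure modes above can be illustrated with a toy model of the dispatch. This is only a sketch: `GeoDispatchSketch`, its enums, and the returned strings are hypothetical stand-ins, not the Parquet, Iceberg, or Spark classes, and the real visitors are structured differently.

```java
/** Toy model of the dispatch gap. None of these types are Parquet, Iceberg, or Spark classes. */
class GeoDispatchSketch {
  enum Physical { BINARY, INT32 }
  enum Legacy { UTF8 }                          // stand-in for OriginalType
  enum Logical { STRING, GEOMETRY, GEOGRAPHY }  // stand-in for LogicalTypeAnnotation

  static final class Column {
    final Physical physical;
    final Legacy legacy;    // null when the column has no legacy annotation
    final Logical logical;  // null when the column has no logical annotation

    Column(Physical physical, Legacy legacy, Logical logical) {
      this.physical = physical;
      this.legacy = legacy;
      this.logical = logical;
    }
  }

  /** Reader before the fix: geo has no legacy annotation, so it falls through to BINARY. */
  static String readerBefore(Column c) {
    if (c.legacy == Legacy.UTF8) {
      return "UTF8String";
    }
    return c.physical == Physical.BINARY ? "byte[]" : "int";
  }

  /** Reader after the fix: geo annotations are checked before the physical fallback. */
  static String readerAfter(Column c) {
    if (c.logical == Logical.GEOMETRY) {
      return "GeometryVal";
    }
    if (c.logical == Logical.GEOGRAPHY) {
      return "GeographyVal";
    }
    return readerBefore(c);
  }

  /** Writer before the fix: an unknown logical annotation is rejected outright. */
  static String writerBefore(Column c) {
    if (c.logical == null) {
      return "physical writer";
    }
    if (c.logical == Logical.STRING) {
      return "string writer";
    }
    throw new UnsupportedOperationException("Unsupported logical type: " + c.logical);
  }
}
```

In this model a geometry column is `new Column(Physical.BINARY, null, Logical.GEOMETRY)`: `readerBefore` returns the raw `byte[]` case, `writerBefore` throws, and `readerAfter` picks the geo branch.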

This reads a WKB BINARY column into Spark's GeometryVal / GeographyVal and
writes those values back as their WKB bytes, mirroring the existing binary
handling. Geo values are stored as pure WKB, so no transformation is needed
beyond wrapping/unwrapping the byte payload.
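For readers unfamiliar with the payload, here is a minimal sketch of what a WKB value looks like for a 2D point, following the standard WKB layout (1-byte byte-order flag, 4-byte geometry type, two 8-byte doubles). `WkbPointSketch` and its methods are hypothetical helpers for illustration only and are not part of this change.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: encodes and decodes a 2D point as little-endian WKB. */
class WkbPointSketch {
  static final int POINT_TYPE = 1;   // WKB geometry type code for Point
  static final int POINT_BYTES = 21; // 1 (flag) + 4 (type) + 8 (x) + 8 (y)

  static byte[] encodePoint(double x, double y) {
    ByteBuffer buf = ByteBuffer.allocate(POINT_BYTES).order(ByteOrder.LITTLE_ENDIAN);
    buf.put((byte) 1);      // byte-order flag: 1 = little-endian
    buf.putInt(POINT_TYPE); // geometry type
    buf.putDouble(x);
    buf.putDouble(y);
    return buf.array();
  }

  static double[] decodePoint(byte[] wkb) {
    // The first byte says how the rest of the payload is ordered.
    ByteOrder order = wkb[0] == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
    ByteBuffer buf = ByteBuffer.wrap(wkb).order(order);
    buf.get(); // skip the byte-order flag
    int type = buf.getInt();
    if (type != POINT_TYPE) {
      throw new IllegalArgumentException("Not a WKB Point, type code: " + type);
    }
    double x = buf.getDouble();
    double y = buf.getDouble();
    return new double[] {x, y};
  }
}
```

The bytes returned by `encodePoint` are the kind of payload the Parquet BINARY column holds for a point; the reader and writer only need to move them in and out of the Spark value.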

Testing:

  • Enables the shared geospatial DataTest coverage for the Spark Parquet
    reader (supportsGeospatial()), exercising geometry and geography read
    round-trips through SparkParquetReaders.
  • Adds a Spark writer round-trip test (TestSparkParquetWriter) that writes
    GeometryVal / GeographyVal through SparkParquetWriters and reads them
    back, including null values.

Vectorized (Arrow) geo reads are out of scope and remain a follow-up.

The Spark type mapping (apache#16851) and Iceberg's own Parquet value path
(apache#16982) are in place, but the Spark Parquet reader/writer did not handle
geo values: geometry/geography carry a LogicalTypeAnnotation with no legacy
OriginalType, so the reader fell through to a raw byte[] (mis-typed for a
GeometryVal/GeographyVal column) and the writer threw on the
unsupported-logical-type path.

Read a geo WKB BINARY column into Spark's GeometryVal/GeographyVal and write
those values back as their WKB bytes, mirroring the existing binary handling.
Enable the shared geospatial DataTest coverage for the Spark Parquet reader
and add a Spark writer round-trip test, including null values.

@github-actions bot added the spark label Jul 3, 2026

Iceberg stores geometry and geography as pure WKB, but Spark's GeometryVal
and GeographyVal wrap [SRID | WKB]. The initial value-path support treated
GeometryVal bytes as pure WKB, so a geo column read through Spark's scan
surfaced the WKB prefix as a bogus SRID (GEO_ENCODER_SRID_MISMATCH_ERROR)
and writes would have persisted the SRID header on disk.

Convert the 4-byte SRID header at the boundary: the writer strips it with
STUtils.stAsBinary before writing pure WKB, and the reader attaches the
column's SRID (derived from the CRS for geometry; geography always uses the
default SRID) with STUtils.stGeomFromWKB / stGeogFromWKB. Add an end-to-end test
that reads geo WKB back through a Spark scan, and update the value-path tests
to build and compare Spark geo values through the SRID header.
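
The boundary conversion described above can be sketched in plain Java. This does not call STUtils and is not the code in this change: `SridHeaderSketch` and its methods are hypothetical, and the little-endian byte order of the 4-byte header is an assumption made for illustration only. The sketch also shows why pure WKB misread as [SRID | WKB] yields a bogus SRID.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical stand-in for the [SRID | WKB] conversion at the Spark/Iceberg boundary. */
class SridHeaderSketch {
  static final int HEADER_BYTES = 4;
  // Assumption: the header byte order is not stated in the source; little-endian is a guess.
  static final ByteOrder HEADER_ORDER = ByteOrder.LITTLE_ENDIAN;

  /** Read side: prepend the column's SRID to the pure WKB read from Parquet. */
  static byte[] attachSrid(int srid, byte[] wkb) {
    ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + wkb.length);
    buf.order(HEADER_ORDER).putInt(srid);
    buf.put(wkb);
    return buf.array();
  }

  /** Write side: drop the header so only pure WKB is persisted. */
  static byte[] stripSrid(byte[] value) {
    return Arrays.copyOfRange(value, HEADER_BYTES, value.length);
  }

  /** Interpret the first four bytes of a value as its SRID. */
  static int readSrid(byte[] value) {
    return ByteBuffer.wrap(value).order(HEADER_ORDER).getInt();
  }
}
```

Under that assumption, a little-endian WKB point starts with the bytes `01 01 00 00`, so calling `readSrid` on pure WKB returns 257 instead of a real SRID, which is the kind of mismatch the scan reported. Attaching the SRID on read and stripping it on write keeps the two layouts from leaking into each other.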

@huan233usc force-pushed the geo-parquet-internal branch from 29de293 to d34780e on July 5, 2026 18:09