Spark 4.1: Read and write geometry and geography values in Parquet by huan233usc · Pull Request #17073 · apache/iceberg

huan233usc · 2026-07-03T21:47:12Z

Follow-up to the geo type work: the Spark type mapping (#16851) and Iceberg's
own Parquet value path (#16982) are in place, but the Spark Parquet
reader/writer did not handle geometry/geography values.

Geometry and geography columns carry a Parquet LogicalTypeAnnotation with no
legacy OriginalType. SparkParquetReaders and SparkParquetWriters dispatch on
the OriginalType / logical-type paths, neither of which had a geo case, so:

- the reader fell through to the physical BINARY case and returned a raw
  byte[], which is the wrong in-memory type for a geo column (Spark's
  InternalRow.getGeometry / getGeography expect GeometryVal / GeographyVal);
- the writer hit the unsupported-logical-type branch and threw.
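
The fall-through described above can be illustrated with a toy model. This is only a sketch: GeoDispatchSketch, its enums, and the returned strings are hypothetical stand-ins, not the actual SparkParquetReaders / SparkParquetWriters code or Parquet's type classes. It assumes a reader keyed on the legacy type with a physical-type fallback, and a writer that rejects logical types it does not recognize.

```java
/** Toy model of the dispatch problem; every name here is hypothetical. */
final class GeoDispatchSketch {
  enum Physical { BINARY, INT32 }
  enum Legacy { UTF8 }                          // stand-in for a legacy OriginalType
  enum Logical { STRING, GEOMETRY, GEOGRAPHY }  // stand-in for LogicalTypeAnnotation

  static final class Column {
    final Physical physical;
    final Legacy legacy;    // null when the column has no legacy type
    final Logical logical;  // null when the column has no logical type

    Column(Physical physical, Legacy legacy, Logical logical) {
      this.physical = physical;
      this.legacy = legacy;
      this.logical = logical;
    }
  }

  private GeoDispatchSketch() {}

  /** Reader without a geo case: a geo column has no legacy type, so it lands on BINARY. */
  static String readerBefore(Column c) {
    if (c.legacy == Legacy.UTF8) {
      return "UTF8String";
    }
    switch (c.physical) {
      case BINARY:
        return "byte[]";
      case INT32:
        return "int";
      default:
        throw new UnsupportedOperationException("Unsupported physical type: " + c.physical);
    }
  }

  /** Reader with a geo case checked before the physical fallback. */
  static String readerAfter(Column c) {
    if (c.logical == Logical.GEOMETRY) {
      return "GeometryVal";
    }
    if (c.logical == Logical.GEOGRAPHY) {
      return "GeographyVal";
    }
    return readerBefore(c);
  }

  /** Writer without a geo case: an unrecognized logical type is rejected. */
  static String writerBefore(Column c) {
    if (c.logical == null) {
      return "physical writer";
    }
    if (c.logical == Logical.STRING) {
      return "string writer";
    }
    throw new UnsupportedOperationException("Unsupported logical type: " + c.logical);
  }
}
```

In this model a column built as `new Column(Physical.BINARY, null, Logical.GEOMETRY)` yields "byte[]" from readerBefore, "GeometryVal" from readerAfter, and an exception from writerBefore.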

This reads a WKB BINARY column into Spark's GeometryVal / GeographyVal and
writes those values back as their WKB bytes, mirroring the existing binary
handling. Geo values are stored as pure WKB, so no transformation is needed
beyond wrapping/unwrapping the byte payload.

Testing:

- Enables the shared geospatial DataTest coverage for the Spark Parquet
  reader (supportsGeospatial()), exercising geometry and geography read
  round-trips through SparkParquetReaders.
- Adds a Spark writer round-trip test (TestSparkParquetWriter) that writes
  GeometryVal / GeographyVal through SparkParquetWriters and reads them
  back, including null values.

Vectorized (Arrow) geo reads are out of scope and remain a follow-up.

The Spark type mapping (apache#16851) and Iceberg's own Parquet value path (apache#16982) are in place, but the Spark Parquet reader/writer did not handle geo values: geometry/geography carry a LogicalTypeAnnotation with no legacy OriginalType, so the reader fell through to a raw byte[] (mis-typed for a GeometryVal/GeographyVal column) and the writer threw on the unsupported-logical-type path. Read a geo WKB BINARY column into Spark's GeometryVal/GeographyVal and write those values back as their WKB bytes, mirroring the existing binary handling. Enable the shared geospatial DataTest coverage for the Spark Parquet reader and add a Spark writer round-trip test, including null values.

Iceberg stores geometry and geography as pure WKB, but Spark's GeometryVal and GeographyVal wrap [SRID | WKB]. The initial value-path support treated GeometryVal bytes as pure WKB, so a geo column read through Spark's scan surfaced the WKB prefix as a bogus SRID (GEO_ENCODER_SRID_MISMATCH_ERROR) and writes would have persisted the SRID header on disk. Convert the 4-byte SRID header at the boundary: the writer strips it with STUtils.stAsBinary before writing pure WKB, and the reader attaches the column's SRID (derived from the geometry CRS; geography is always the default) with STUtils.stGeomFromWKB / stGeogFromWKB. Add an end-to-end test that reads geo WKB back through a Spark scan, and update the value-path tests to build and compare Spark geo values through the SRID header.
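
The header conversion at the boundary can be sketched with plain byte arrays. This is an illustration under stated assumptions, not the PR's code: SridHeaderSketch and its methods are hypothetical helpers standing in for the STUtils calls named above, and the little-endian byte order of the header is an assumption (the commit message only says the header is 4 bytes).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical helper modelling the [SRID | WKB] layout; not part of Spark or Iceberg. */
final class SridHeaderSketch {
  static final int HEADER_BYTES = 4;
  // ASSUMPTION: the header byte order is not stated in the source; little-endian is a guess.
  static final ByteOrder ASSUMED_ORDER = ByteOrder.LITTLE_ENDIAN;

  private SridHeaderSketch() {}

  /** Reader side: prepend the column's SRID to the pure WKB read from Parquet. */
  static byte[] attachSrid(int srid, byte[] wkb) {
    return ByteBuffer.allocate(HEADER_BYTES + wkb.length)
        .order(ASSUMED_ORDER)
        .putInt(srid)
        .put(wkb)
        .array();
  }

  /** Interpret the first four bytes of a value as its SRID. */
  static int readSrid(byte[] value) {
    requireHeader(value);
    return ByteBuffer.wrap(value).order(ASSUMED_ORDER).getInt();
  }

  /** Writer side: drop the header so only pure WKB is persisted. */
  static byte[] stripSrid(byte[] value) {
    requireHeader(value);
    return Arrays.copyOfRange(value, HEADER_BYTES, value.length);
  }

  private static void requireHeader(byte[] value) {
    if (value.length < HEADER_BYTES) {
      throw new IllegalArgumentException("Value is shorter than the SRID header");
    }
  }
}
```

Calling readSrid directly on pure WKB reproduces the kind of failure described above: the leading WKB bytes (byte-order flag and geometry type) are misread as an SRID. Round-tripping through attachSrid and stripSrid returns the original WKB unchanged.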

github-actions Bot added the spark label Jul 3, 2026

huan233usc force-pushed the geo-parquet-internal branch from 29de293 to d34780e Compare July 5, 2026 18:09

huan233usc wants to merge 2 commits into apache:main from huan233usc:geo-parquet-internal
