Spark 4.1: Read and write geometry and geography values in Parquet#17073
Open
huan233usc wants to merge 2 commits into
Open
Spark 4.1: Read and write geometry and geography values in Parquet#17073huan233usc wants to merge 2 commits into
huan233usc wants to merge 2 commits into
Conversation
The Spark type mapping (apache#16851) and Iceberg's own Parquet value path (apache#16982) are in place, but the Spark Parquet reader/writer did not handle geo values: geometry/geography carry a LogicalTypeAnnotation with no legacy OriginalType, so the reader fell through to a raw byte[] (mis-typed for a GeometryVal/GeographyVal column) and the writer threw on the unsupported-logical-type path. Read a geo WKB BINARY column into Spark's GeometryVal/GeographyVal and write those values back as their WKB bytes, mirroring the existing binary handling. Enable the shared geospatial DataTest coverage for the Spark Parquet reader and add a Spark writer round-trip test, including null values.
Iceberg stores geometry and geography as pure WKB, but Spark's GeometryVal and GeographyVal wrap [SRID | WKB]. The initial value-path support treated GeometryVal bytes as pure WKB, so a geo column read through Spark's scan surfaced the WKB prefix as a bogus SRID (GEO_ENCODER_SRID_MISMATCH_ERROR) and writes would have persisted the SRID header on disk. Convert the 4-byte SRID header at the boundary: the writer strips it with STUtils.stAsBinary before writing pure WKB, and the reader attaches the column's SRID (derived from the geometry CRS; geography is always the default) with STUtils.stGeomFromWKB / stGeogFromWKB. Add an end-to-end test that reads geo WKB back through a Spark scan, and update the value-path tests to build and compare Spark geo values through the SRID header.
29de293 to
d34780e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to the geo type work: the Spark type mapping (#16851) and Iceberg's
own Parquet value path (#16982) are in place, but the Spark Parquet
reader/writer did not handle geometry/geography values.
Geometry and geography columns carry a Parquet
LogicalTypeAnnotationwith nolegacy
OriginalType.SparkParquetReadersandSparkParquetWritersdispatchgeo through the
OriginalType/ logical-type paths, so:BINARYcase and returned a rawbyte[], which is the wrong in-memory type for a geo column (Spark'sInternalRow.getGeometry/getGeographyexpectGeometryVal/GeographyVal);This reads a WKB
BINARYcolumn into Spark'sGeometryVal/GeographyValandwrites those values back as their WKB bytes, mirroring the existing binary
handling. Geo values are stored as pure WKB, so no transformation is needed
beyond wrapping/unwrapping the byte payload.
Testing:
DataTestcoverage for the Spark Parquetreader (
supportsGeospatial()), exercising geometry and geography readround-trips through
SparkParquetReaders.TestSparkParquetWriter) that writesGeometryVal/GeographyValthroughSparkParquetWritersand reads themback, including null values.
Vectorized (Arrow) geo reads are out of scope and remain a follow-up.