[GH-3089] Dispose rasters in RS_DWithin join path and fix empty-envelope filter #3090
Draft
jiayuasu wants to merge 1 commit into
…-envelope filter

Three related issues in the raster distance-join machinery introduced with RS_DWithin (apacheGH-3089):

1. Off-heap raster leak: RS_DWithin.eval and the WGS84 envelope builders in TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD and BroadcastIndexJoinExec.streamShapeToExpandedEnvelope deserialize GridCoverage2D rasters but never dispose them. On long-running executors a raster distance join leaks native memory per row. Wrap each deserialization in try/finally and call raster.dispose(true), mirroring the discipline already used by RS_Predicate.evaluator.
2. Empty-envelope false positive: expandRasterFilterEnvelope expanded the degenerate envelope of the empty GeometryCollection substituted for a NULL raster/geometry, producing a non-empty filter geometry that spuriously matched rows the per-row predicate would reject. Return the empty shape unchanged so the coarse R-tree filter excludes it.
3. Misleading EXPLAIN output: the raster distance branch of BroadcastIndexJoinExec.simpleString printed "RS_Distance(left, right) < r", a non-existent function. Print "RS_DWithin(left, right, r)" to match the actual predicate.
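For reviewers who want the shape of item 1 at a glance, here is a minimal, self-contained sketch of the try/finally dispose discipline. It is not the Sedona code: FakeRaster, deserialize, and withRaster are hypothetical stand-ins introduced for illustration, and only the dispose(true) call is meant to mirror what the commit message describes for GridCoverage2D.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a raster that owns off-heap memory. */
final class FakeRaster {
    /** Counts rasters that have been created but not yet disposed. */
    static final AtomicInteger LIVE = new AtomicInteger();
    private boolean disposed;

    FakeRaster() {
        LIVE.incrementAndGet();
    }

    double width() {
        if (disposed) {
            throw new IllegalStateException("raster already disposed");
        }
        return 10.0;
    }

    /** Same call shape as dispose(true) in the PR; idempotent in this sketch. */
    boolean dispose(boolean force) {
        if (!disposed) {
            disposed = true;
            LIVE.decrementAndGet();
        }
        return true;
    }
}

final class DisposeSketch {
    /** Hypothetical deserializer; the real one would decode the raster bytes. */
    static FakeRaster deserialize(byte[] bytes) {
        return new FakeRaster();
    }

    /**
     * Deserializes a raster, applies body to it, and always disposes the raster,
     * whether body returns normally or throws.
     */
    static <T> T withRaster(byte[] bytes, Function<FakeRaster, T> body) {
        FakeRaster raster = deserialize(bytes);
        try {
            return body.apply(raster);
        } finally {
            raster.dispose(true);
        }
    }

    public static void main(String[] args) {
        double w = withRaster(new byte[0], FakeRaster::width);
        System.out.println("width=" + w + " live=" + FakeRaster.LIVE.get());
    }
}
```

The point of the finally block is that the release happens on the exception path too, so a failing row does not leave a raster behind. The value returned from body must not hold a reference to the raster, since the raster is no longer usable once withRaster returns.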
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes RS_DWithin / raster distance join leaks GridCoverage2D rasters (plus empty-envelope and EXPLAIN issues) #3089
What changes were proposed in this PR?
This PR fixes three related issues in the raster distance-join machinery built around
RS_DWithin, reported in #3089. All changes are confined to spark/common.
Native raster memory leak.
RS_DWithin.eval and the WGS84 envelope builders used by the join planner deserialize GridCoverage2D rasters but never dispose them, so a raster distance join leaks off-heap memory on every row, a real problem on long-running executors. The affected sites are:
RS_DWithin.eval (RasterPredicates.scala)
TraitJoinQueryBase.toExpandedWGS84EnvelopeRDD
the BroadcastIndexJoinExec paths that build the expanded WGS84 envelope
Each deserialized raster is now wrapped in
try/finally and released with raster.dispose(true), mirroring the discipline already used by RS_Predicate.evaluator.
Empty-envelope false positive. When a raster or geometry input is
NULL, the join substitutes an empty GeometryCollection. expandRasterFilterEnvelope then expanded that geometry's degenerate envelope by distance, yielding a non-empty filter geometry that spuriously matched rows the per-row predicate would reject. It now returns the empty shape unchanged so the coarse R-tree filter excludes it.
Misleading EXPLAIN output. The raster distance branch of
BroadcastIndexJoinExec.simpleString printed RS_Distance(left, right) < r, a function that does not exist in Sedona. It now prints RS_DWithin(left, right, r), naming the actual predicate that drives the join.
How was this patch tested?
sedona-spark-common compiles cleanly with the changes. The fixes are minimal, mechanical resource-management changes within existing code paths exercised by the existing RasterJoinSuite and BroadcastIndexJoinSuite.
Did this PR include necessary documentation updates?