[VL][Delta] Add scan-scoped JVM-serialized deletion vector handoff for native Delta scans #11963
malinjawi wants to merge 43 commits into apache:main
What changes are proposed in this pull request?
This PR implements a scan-scoped JVM-serialized deletion vector (DV) handoff for native Delta scans.
The earlier native DV PoC showed that native DV consume/apply during scan is the hot path, while fully native DV IO/materialization did not justify the extra complexity as the default design. This PR keeps the native scan-time path and moves DV materialization/loading to the JVM side.
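As a rough illustration of the split described above (the JVM materializes and serializes the DV, the scan only consumes it), here is a minimal sketch. All names (`DvHandoffSketch`, `serialize`, `isDeleted`, `buildScanPayloads`) and the flat little-endian layout are assumptions made for illustration; they are not taken from this PR. Delta's actual deletion vectors use a compressed bitmap encoding, so the sorted array here is a deliberate simplification, and `isDeleted` is only a Java stand-in for whatever the native reader does.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a scan-scoped DV handoff; not the PR's actual code. */
public class DvHandoffSketch {

    /** JVM side: pack deleted-row indexes as [int count][sorted long rows...]. */
    public static byte[] serialize(long[] deletedRows) {
        long[] sorted = deletedRows.clone();
        Arrays.sort(sorted);
        ByteBuffer buf = ByteBuffer
                .allocate(Integer.BYTES + sorted.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(sorted.length);
        for (long row : sorted) {
            buf.putLong(row);
        }
        return buf.array();
    }

    /** Stand-in for the consumer: binary search over the serialized payload. */
    public static boolean isDeleted(byte[] payload, long rowIndex) {
        ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt();
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long value = buf.getLong(Integer.BYTES + mid * Long.BYTES);
            if (value == rowIndex) {
                return true;
            }
            if (value < rowIndex) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }

    /** Scan-scoped: one payload per data file, built for a single scan only. */
    public static Map<String, byte[]> buildScanPayloads(Map<String, long[]> dvByFile) {
        Map<String, byte[]> payloads = new LinkedHashMap<>();
        dvByFile.forEach((file, rows) -> payloads.put(file, serialize(rows)));
        return payloads;
    }

    public static void main(String[] args) {
        Map<String, long[]> dvs = new LinkedHashMap<>();
        dvs.put("part-0000.parquet", new long[] {7, 2, 5}); // made-up file name
        byte[] payload = buildScanPayloads(dvs).get("part-0000.parquet");
        System.out.println(isDeleted(payload, 5)); // true
        System.out.println(isDeleted(payload, 3)); // false
    }
}
```

The point of the sketch is the ownership boundary: everything up to `buildScanPayloads` runs on the JVM, and the consumer needs only the bytes plus a row index.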
Final ownership split:
This PR also:
Current scope / behavior:
DeltaScanTransformer
_delta_log scans still fall back and are not offloaded
This keeps the supported native path simple and maintainable, while leaving the door open for a future native DV IO layer if later DV-heavy workloads justify it.
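The `_delta_log` fallback rule can be pictured as a simple path check. The sketch below is an assumption about how such a guard might look; the class and method names are hypothetical and the PR's actual condition is not shown in this description.

```java
/** Hypothetical guard: scans that read the Delta transaction log are not offloaded. */
public class DeltaLogFallbackSketch {

    /** True when any path segment is exactly "_delta_log". */
    public static boolean isDeltaLogPath(String path) {
        for (String segment : path.split("/")) {
            if (segment.equals("_delta_log")) {
                return true;
            }
        }
        return false;
    }

    /** Offload only scans whose paths all lie outside the transaction log. */
    public static boolean canOffload(Iterable<String> scanPaths) {
        for (String path : scanPaths) {
            if (isDeltaLogPath(path)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Paths below are made up for illustration.
        System.out.println(canOffload(java.util.List.of("/tbl/part-0000.parquet")));
        System.out.println(canOffload(java.util.List.of("/tbl/_delta_log/00000000000000000000.json")));
    }
}
```

Matching on a whole path segment, rather than a substring, avoids misclassifying a data directory whose name merely contains `_delta_log`.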
How was this patch tested?
Build / validation:
./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-3.5 -Pdelta -DskipTests compile
Focused local validation:
org.apache.spark.sql.delta.DeltaDeletionVectorHandoffSuite
org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQueryNameColumnMappingSuite
org.apache.spark.sql.delta.DeltaDeletionVectorHandoffSuite
org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQueryNameColumnMappingSuite
Functional validation:
DeltaScanTransformer for supported Spark versions
_delta_log scans stay on fallback
CI validation:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: IBM BOB
issue: #11901
Follow-up work
This PR intentionally keeps the scope limited to the supported scan path. Reasonable follow-ups are: