[VL][Delta] Add scan-scoped JVM-serialized deletion vector handoff for native Delta scans #11963

Open
malinjawi wants to merge 43 commits into apache:main from malinjawi:delta-dv-java-materialized-handoff-clean

@malinjawi (Contributor) commented Apr 19, 2026

What changes are proposed in this pull request?

This PR implements a scan-scoped JVM-serialized deletion vector (DV) handoff for native Delta scans.

The earlier native DV PoC showed that consuming and applying the DV during the native scan is the hot path, whereas fully native DV IO and materialization did not justify the extra complexity as the default design. This PR therefore keeps the native scan-time path and moves DV loading and materialization to the JVM side.

Final ownership split:

  • JVM / Delta side materializes the deletion vector
  • JVM / Delta side serializes the DV payload
  • JVM / Delta side attaches DV scan metadata for native consumption
  • native / Velox side deserializes the serialized DV payload
  • native / Velox side applies DV filtering during scan
  • native / Velox side keeps Delta split / connector / datasource integration in the scan path
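
The split above can be illustrated with a minimal round-trip sketch. It is not the PR's code and not Delta's DV storage format: the class name `DvHandoffSketch`, the method names, and the payload layout (a little-endian int32 count followed by sorted int64 row indices) are all assumptions made for illustration. In the PR the consumer is native Velox code; Java is used for both halves here only to keep the sketch in one language.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical sketch of a serialized DV handoff; not the PR's actual payload format. */
final class DvHandoffSketch {
    private DvHandoffSketch() {}

    /** Producer side (JVM in the PR): pack deleted row indices into a byte payload. */
    static byte[] serialize(long[] deletedRows) {
        long[] sorted = deletedRows.clone();
        Arrays.sort(sorted);
        ByteBuffer buf = ByteBuffer
            .allocate(Integer.BYTES + sorted.length * Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(sorted.length);
        for (long row : sorted) {
            buf.putLong(row);
        }
        return buf.array();
    }

    /** Consumer side (native in the PR): unpack the payload into sorted row indices. */
    static long[] deserialize(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt();
        long[] rows = new long[count];
        for (int i = 0; i < count; i++) {
            rows[i] = buf.getLong();
        }
        return rows;
    }

    /** Consumer side: return the row indices in [0, numRows) that are not deleted. */
    static long[] liveRows(long numRows, long[] sortedDeleted) {
        long[] live = new long[(int) numRows];
        int n = 0;
        for (long row = 0; row < numRows; row++) {
            // binarySearch returns a negative value when the row is absent from the DV.
            if (Arrays.binarySearch(sortedDeleted, row) < 0) {
                live[n++] = row;
            }
        }
        return Arrays.copyOf(live, n);
    }
}
```

The point of the sketch is the contract, not the encoding: the producer owns loading and serialization, and the consumer only needs to decode bytes and filter rows during the scan.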

This PR also:

  • removes native DV file/inline/materialization ownership from this path
  • strips the synthetic Spark DV predicate and internal DV columns from the offloaded plan so DV is not applied twice
  • keeps the change scoped to Delta scan metadata handling
  • adds focused Delta DV handoff tests for Spark 3.5 and Spark 4.0
  • keeps unsupported or internal Delta scan paths on fallback
  • does not introduce global fallback-reporting or query-listener behavior changes
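
The plan cleanup described in the list above (stripping the synthetic DV predicate and internal columns) can be sketched on plain strings. This is only an illustration: the real change operates on Spark plan nodes, the class and method names are hypothetical, and the two column names are assumed to match the internal names Delta uses for its DV bookkeeping columns.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: models plan output columns and filters as plain strings. */
final class DvPlanCleanupSketch {
    private DvPlanCleanupSketch() {}

    // Assumed internal column names; treat them as placeholders if your Delta version differs.
    static final Set<String> INTERNAL_DV_COLUMNS =
        Set.of("__delta_internal_is_row_deleted", "__delta_internal_row_index");

    /** Remove internal DV columns from the scan output. */
    static List<String> stripInternalColumns(List<String> outputColumns) {
        return outputColumns.stream()
            .filter(col -> !INTERNAL_DV_COLUMNS.contains(col))
            .collect(Collectors.toList());
    }

    /** Remove filters that mention an internal DV column; user filters are kept. */
    static List<String> stripDvFilters(List<String> filters) {
        return filters.stream()
            .filter(f -> INTERNAL_DV_COLUMNS.stream().noneMatch(f::contains))
            .collect(Collectors.toList());
    }
}
```

If the synthetic predicate stayed in the offloaded plan, deleted rows would be filtered once by the native scan and the plan would then filter on a column the native scan no longer produces, which is why both pieces are removed together.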

Current scope / behavior:

  • Spark 3.5 and Spark 4.0 Delta DV scans offload through DeltaScanTransformer
  • Spark 3.4 DV scans still fall back for correctness
  • _delta_log scans still fall back and are not offloaded
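
The scope rules above amount to a small gate in front of the offload. The following sketch encodes them under stated assumptions: the class `DvOffloadGateSketch`, its method names, and the string-based version and path checks are hypothetical and do not reflect how the shims or `DeltaScanTransformer` actually make this decision.

```java
import java.util.Arrays;
import java.util.Set;

/** Hypothetical sketch of the DV scan offload gate described in the PR scope. */
final class DvOffloadGateSketch {
    private DvOffloadGateSketch() {}

    // Spark major.minor versions whose DV scans are offloaded, per the PR description.
    static final Set<String> DV_OFFLOAD_VERSIONS = Set.of("3.5", "4.0");

    /** True when any path segment is the Delta transaction log directory. */
    static boolean isDeltaLogPath(String path) {
        return Arrays.asList(path.split("/")).contains("_delta_log");
    }

    /** True when a DV-backed scan may be offloaded; false means fall back. */
    static boolean canOffloadDvScan(String sparkVersion, String path) {
        if (isDeltaLogPath(path)) {
            return false;
        }
        String[] parts = sparkVersion.split("\\.");
        if (parts.length < 2) {
            return false;
        }
        return DV_OFFLOAD_VERSIONS.contains(parts[0] + "." + parts[1]);
    }
}
```

Defaulting to fallback for anything unrecognized mirrors the PR's stated preference for correctness over coverage on unsupported paths.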

This keeps the supported native path simple and maintainable, while leaving the door open for a future native DV IO layer if later DV-heavy workloads justify it.

How was this patch tested?

Build / validation:

  • ./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-3.5 -Pdelta -DskipTests compile

Focused local validation:

  • Spark 3.5 org.apache.spark.sql.delta.DeltaDeletionVectorHandoffSuite
  • Spark 3.5 org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQueryNameColumnMappingSuite
  • Spark 4.0 org.apache.spark.sql.delta.DeltaDeletionVectorHandoffSuite
  • Spark 4.0 org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQueryNameColumnMappingSuite

Functional validation:

  • verified offload through DeltaScanTransformer for supported Spark versions
  • verified the synthetic Spark DV columns are not left in the offloaded plan
  • verified result correctness matches vanilla Spark on DV-backed reads
  • verified Spark 3.4 stays on fallback for DV scans
  • verified _delta_log scans stay on fallback

CI validation:

  • DV-relevant Spark 3.4, Spark 3.5, and Spark 4.0 jobs passed on the PR branch

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

issue: #11901

Follow-up work

This PR intentionally keeps the scope limited to the supported scan path. Reasonable follow-ups are:

  • enable native DV scan on Spark 3.4 after adding dedicated end-to-end correctness coverage for that path
  • evaluate whether any remaining Spark-version-specific Delta scan preparation can be unified across shims
  • add broader perf coverage for DV-heavy workloads to decide whether a future native DV IO/materialization layer is worth the extra complexity
  • keep tightening Delta-specific scan tests so regressions are caught by focused suites before showing up in broad CI shards

@github-actions

Run Gluten Clickhouse CI on x86

@malinjawi changed the title from "[VL][Delta] Add JVM-materialized deletion vector handoff for Delta scans" to "[VL][Delta] Add JVM-serialized deletion vector handoff for native Delta scans" on Apr 19, 2026
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from a8cb672 to 8833daa on April 19, 2026 21:34
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from 8833daa to f9e04fc on April 19, 2026 21:50
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from f9e04fc to ef15fef on April 19, 2026 22:01
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from ef15fef to 06c8ada on April 19, 2026 22:08
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from 06c8ada to 19c47c7 on April 19, 2026 22:25
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from 19c47c7 to e532ba3 on April 19, 2026 22:32
@malinjawi force-pushed the delta-dv-java-materialized-handoff-clean branch from e532ba3 to ba4765f on April 20, 2026 10:09
@malinjawi changed the title from "[VL][Delta] Add JVM-serialized deletion vector handoff for native Delta scans" to "[VL][Delta] Add scan-scoped JVM-serialized deletion vector handoff for native Delta scans" on Apr 26, 2026
@malinjawi marked this pull request as ready for review on April 26, 2026 22:15
Labels: CORE, DATA_LAKE, VELOX
