Skip to content

feat(scan): apply V3 deletion vectors on read (stacked on #2678)#2681

Open
raghav-reglobe wants to merge 22 commits into
apache:mainfrom
raghav-reglobe:dv-scan
Open

feat(scan): apply V3 deletion vectors on read (stacked on #2678)#2681
raghav-reglobe wants to merge 22 commits into
apache:mainfrom
raghav-reglobe:dv-scan

Conversation

@raghav-reglobe

Copy link
Copy Markdown

What

Make iceberg-rust apply V3 deletion vectors when scanning. Today the scan
applies V2 Parquet positional/equality deletes but ignores V3 deletion vectors,
so rows marked deleted by a DV are incorrectly returned. This is the read
complement to the DV-write work in #2678.

End to end:

  • DeleteFileIndex recognizes a DV (a PositionDeletes entry stored as Puffin
    with a referenced_data_file) and indexes it by the data file it applies to,
    superseding positional-delete files for that data file (per spec). Mirrors
    apache/iceberg-go buildDVIndex / matchDVToData.
  • The caching delete-file loader reads the deletion-vector-v1 blob located by
    the manifest entry's content_offset / content_size_in_bytes — the spec's
    "direct access" path — and parses it with DeleteVector::from_serialized_bytes.
  • This works whether the DV is wrapped in a full Puffin container or written as a
    standalone blob (e.g. by DuckDB), and falls back to the Puffin footer to honor a
    compression codec if a (non-conforming) writer ever compresses a DV blob.

Stacked on #2678

This branch contains #2678's commits (the DV-write side) plus this PR's read
commits
, both rebased on current main. Please review #2678 first; this PR's
own changes are the commits from feat(scan): carry file_format + referenced_data_file
onward (6 commits touching delete_file_index.rs, arrow/caching_delete_file_loader.rs,
scan/task.rs, delete_vector.rs). It can be reduced to just those once #2678 lands.

Validation

  • Round-trip unit test (test_load_deletes_applies_v3_deletion_vector), ported from
    iceberg-go TestFilterByDeletionVector / TestReadAllDeletionVectors.
  • Real-data cross-engine check: iceberg-rust scans tables whose DVs were written
    by DuckDB and returns the correct post-delete row counts (900K/1M and 9M/10M),
    proving it reads a foreign-written spec DV — not just its own.
  • Full cargo test -p iceberg --lib: 1375 passed, 0 failed; clippy clean.

wirybeaver and others added 14 commits June 20, 2026 14:20
…fications

This commit implements the core transaction infrastructure for MERGE INTO,
UPDATE, and DELETE operations in Apache Iceberg-Rust. Based on the official
Iceberg Java implementation (RowDelta API).

**New file: `crates/iceberg/src/transaction/row_delta.rs`**
- RowDeltaAction: Transaction action supporting both data file additions
  and deletions in a single snapshot
- add_data_files(): Add new data files (inserts/rewrites in COW mode)
- remove_data_files(): Mark data files as deleted (COW mode)
- add_delete_files(): Reserved for future Merge-on-Read (MOR) support
- validate_from_snapshot(): Conflict detection for concurrent modifications
- RowDeltaOperation: Implements SnapshotProduceOperation trait
  - Determines operation type (Append/Delete/Overwrite) based on changes
  - Generates DELETED manifest entries for removed files
  - Carries forward existing manifests for unchanged data

**Modified: `crates/iceberg/src/transaction/mod.rs`**
- Add row_delta() method to Transaction API
- Export row_delta module

**Modified: `crates/iceberg/src/transaction/snapshot.rs`**
- Add write_delete_manifest() to write DELETED manifest entries
- Update manifest_file() to process delete entries from SnapshotProduceOperation
- Update validation to allow delete-only operations

Comprehensive unit tests with ~85% coverage:
- test_row_delta_add_only: Pure append operation
- test_row_delta_remove_only: Delete-only operation
- test_row_delta_add_and_remove: COW update (remove old, add new)
- test_row_delta_with_snapshot_properties: Custom snapshot properties
- test_row_delta_validate_from_snapshot: Snapshot validation logic
- test_row_delta_empty_action: Empty operation error handling
- test_row_delta_incompatible_partition_value: Partition validation

All existing tests pass (1135 passed; 0 failed).

Copy-on-Write (COW) Strategy:
- For row-level modifications: read target files, apply changes,
  write new files, mark old files deleted
- For inserts: write new data files
- Merge-on-Read (MOR) with delete files is reserved for future optimization

References:
- Java implementation: org.apache.iceberg.RowDelta, BaseRowDelta
- Based on implementation plan for MERGE INTO support
- Fix test_row_delta_validate_from_snapshot to assert ErrorKind::DataInvalid
  directly rather than matching against error message strings
- Correct operation() doc comment: remove inaccurate "Only adds delete files → Delete"
  bullet; add explicit note that Operation::Delete is deferred until MoR is wired up
- Add comment explaining why removed_data_files are not validated (already-committed
  files, matches Java MergingSnapshotProducer behavior)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
unwrap_err() requires T: Debug on the Ok type (ActionCommit), which is
not derived. Use a match instead to extract and assert the error kind.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…eview

Blocking issues fixed:
- existing_manifest() now rewrites manifests that contain deleted files instead
  of dropping them entirely. Surviving files get EXISTING entries (original
  sequence numbers preserved), removed files get DELETED entries (snapshot_id
  updated to current, sequence numbers preserved). Matches Java
  ManifestFilterManager.filterManifestWithDeletedFiles behavior.
- DELETED manifest entries now carry original sequence numbers (copied from the
  loaded manifest entry), fixing the spec violation where 0 was used as a
  placeholder.
- Snapshot summary now tracks removed files via
  SnapshotSummaryCollector.remove_file(), populating deleted-data-files,
  deleted-records, and removed-files-size metrics.
- add_delete_files() now returns ErrorKind::FeatureUnsupported immediately on
  commit instead of silently dropping the files.

Design improvements:
- Renamed write_delete_manifest → write_manifest_with_deleted_entries to
  distinguish data manifests with DELETED-status entries from Iceberg delete
  manifests (content=Deletes, used for MoR delete files).
- SnapshotProduceOperation::existing_manifest now takes &mut SnapshotProducer
  so implementations can call new_manifest_writer() for rewrites.
- Added removed_data_files() default method to the trait for summary tracking.
- Removed added_delete_files from RowDeltaOperation (only needed for the
  fail-fast check in RowDeltaAction::commit).
- Trimmed struct/method doc comments to match codebase convention.

Tests:
- Added test_row_delta_cow_manifest_rewrite: FastAppend 2 files, RowDelta
  remove one + add one, then verify DELETED/EXISTING/ADDED entries and
  sequence numbers in the resulting manifests.
- Added test_row_delta_add_delete_files_errors for the fail-fast path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Implements DeleteVector::to_puffin_blob / from_puffin_blob for the Iceberg deletion-vector-v1 Puffin blob format: 4-byte BE length, magic 0xD1D33964, portable 64-bit RoaringTreemap, 4-byte BE CRC32. Adds the crc32fast dependency.

Tests: self serialize<->deserialize round-trip; pyiceberg-portable layout validation (Spark/Iceberg-Java byte compatibility); full Puffin-file round-trip via PuffinWriter/PuffinReader.

Serializer design referenced from risingwavelabs/iceberg-rust apache#113. Toward finishing RowDelta merge-on-read DV-write (PR apache#2203).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…es manifest)

SnapshotProducer now carries added_delete_files and writes a separate content=Deletes manifest (build_v2_deletes / build_v3_deletes). RowDeltaAction.add_delete_files commits MoR delete files (position/equality, incl. V3 deletion vectors) instead of returning FeatureUnsupported; operation() returns Delete (only deletes) or Overwrite (with data files) per Java BaseRowDelta semantics.

Replaces the stub-error test with test_row_delta_add_delete_files_mor (commits a position-delete file; asserts Operation::Delete + a PositionDeletes manifest entry). All 45 transaction tests pass (fast_append / overwrite / CoW unregressed).

Toward finishing RowDelta merge-on-read DV-write (PR apache#2203).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
Adds PuffinWriter::close_with_metadata() + PuffinWriteResult (file size + per-blob offsets/lengths). DeleteVector::write_to_puffin_file() writes a deletion-vector-v1 Puffin file and returns the V3 DataFile{content=PositionDeletes, referenced_data_file, content_offset, content_size_in_bytes} ready to feed RowDeltaAction::add_delete_files.

Test test_dv_write_to_puffin_file (DV -> Puffin file -> DataFile, read back). Completes the end-to-end Rust MoR DV-write path: DeleteVector -> Puffin file -> DataFile -> RowDelta content=Deletes manifest commit. cf. RW deletion_vector_writer.rs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…ation example

Exposes 'pub mod delete_vector' (+ rustdoc on the public surface to satisfy #![deny(missing_docs)]) so downstream crates can use DeleteVector + write_to_puffin_file.

Adds crates/catalog/rest/examples/pulse_dv_realdata.rs: connects to a REST catalog (Polaris), loads a real V3 table, writes a deletion vector via write_to_puffin_file, commits via RowDelta (content=Deletes manifest) — for cross-engine (Doris/Spark) validation before opening PR apache#2203. Compiles clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…mple

RestCatalog requires explicit StorageFactory injection (S3 storage moved to the iceberg-storage-opendal crate, apache#2207). Wire OpenDalResolvingStorageFactory into the pulse_dv_realdata example and dev-dep the crate.

Verified end-to-end against a real V3 Iceberg table on S3 (Polaris REST catalog): the Rust harness wrote a deletion-vector-v1 and committed via RowDelta; Doris (independent engine) read COUNT(*) 10 -> 7 with positions 0,1,2 (ids 1,2,3) deleted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…ayload

Adds testdata/puffin/deletion-vector-v1-payload.bin (Java-produced deletion-vector-v1 payload for positions {1,3,5,7,9}, from apache/iceberg test resources via apache/iceberg-go's fixture) and a test asserting DeleteVector::to_puffin_blob produces byte-identical output. Proves our roaring serialization + length/magic/CRC framing match the Iceberg-Java reference exactly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…rg-go)

Mirrors apache/iceberg-go TestSerializeDVEmpty + TestSerializeDVLargePositions: empty DV round-trips; positions straddling the 2^31 (Java-signed) and 2^32 (roaring bucket) boundaries round-trip. 12 delete_vector tests now pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…mall+large)

From apache/iceberg core test resources: empty-position-index.bin + small-and-large-values-position-index.bin. Now 3 Java byte-identical DV-payload vectors (empty; alternating 1/3/5/7/9; small+large across two 16-bit roaring containers). 14 delete_vector tests pass. all-container-types fixture intentionally NOT byte-pinned: Java run-optimizes containers, roaring-rs does not — spec-valid but different bytes (round-trip still covered).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
Apache iceberg-rust CI gates on rustfmt + clippy; both now clean on the changed crates (iceberg, iceberg-catalog-rest).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
raghav-reglobe and others added 8 commits June 20, 2026 17:27
- adapt pulse_dv_realdata example to manifest_list_reader (load_manifest_list
  was removed from Snapshot in current main)
- allowlist 'mor' (merge-on-read) in .typos.toml
- regenerate crates/iceberg/public-api.txt and DEPENDENCIES.rust.tsv files

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eleteFile

Groundwork for V3 deletion-vector-applied scan: the delete-file scan task now carries the manifest entry's file format (Puffin marks a DV) and referenced_data_file, so the delete-file loader can distinguish a Puffin DV from a Parquet positional/equality delete and know which data file it applies to. Populated in From<&DeleteFileContext>; test constructors default to Parquet/None.

Part 1 of DV-applied scan (complements the DV-write in apache#2678). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Raghvendra Singh <raghav@cashify.in>
DeleteFileIndex now recognizes V3 deletion vectors (PositionDeletes content + Puffin format + referenced_data_file) and indexes them by the data file path they apply to, separate from partition-scoped Parquet positional deletes. get_deletes_for_data_file returns the matching DV (sequence-number rule) and, per spec, has it supersede positional-delete files for that data file. Mirrors apache/iceberg-go buildDVIndex + matchDVToData.

Part 2 of DV-applied scan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Raghvendra Singh <raghav@cashify.in>
The caching delete-file loader now recognizes a deletion vector (PositionDeletes task whose file_format is Puffin), reads the deletion-vector-v1 blob via PuffinReader, and parses it with DeleteVector::from_puffin_blob into the delete filter — completing the V3 DV read path (the DelVecs -> upsert_delete_vector output side was already present). With this, iceberg-rust correctly excludes rows marked by a V3 deletion vector on read; previously such rows were returned. Read complement to the DV-write in apache#2678.

Part 3 of DV-applied scan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Raghvendra Singh <raghav@cashify.in>
Adds a round-trip test (ported from apache/iceberg-go TestFilterByDeletionVector / TestReadAllDeletionVectors): write a deletion vector marking positions {1,3} to a Puffin file, load it through the caching delete-file loader, and assert those positions surface as the data file's delete vector. Also refreshes the load_deletes doc comment now that deletion-vector loading is implemented (was 'Not yet Implemented').

Part 4 of DV-applied scan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Raghvendra Singh <raghav@cashify.in>
…size

The deletion vector read path located the DV blob by parsing a Puffin footer (PuffinReader::file_metadata). That only works when the DV is wrapped in a full Puffin container — but the Iceberg spec addresses a DV blob by the manifest entry's content_offset + content_size_in_bytes, and writers such as DuckDB store the deletion-vector-v1 blob standalone (no PFA1 framing). Reading those via the footer fails with 'Bad magic value ... should be PFA1'.

Now FileScanTaskDeleteFile carries content_offset + content_size_in_bytes, and the loader reads exactly those bytes and parses them with the new DeleteVector::from_serialized_bytes (factored out of from_puffin_blob). This works whether the DV is standalone or inside a Puffin container. Validated cross-engine on real data: iceberg-rust scans tables with DuckDB-written DVs and returns the correct post-delete row counts (900K/1M, 9M/10M).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
The Iceberg spec requires a deletion-vector-v1 blob to be stored uncompressed ('Omit compression-codec; deletion-vector-v1 is not compressed'), so reading the bytes at content_offset/content_size and parsing them directly is correct for all compliant deletion vectors (the fast path). A non-conforming writer could, however, store a compressed deletion vector blob; in that case the bytes at content_offset are compressed and the direct parse fails. Fall back to reading the blob through the Puffin footer, which records the per-blob compression-codec and decompresses accordingly (matching the blob by content_offset). Container-less blob files (no footer) stay on the fast path.

The fast path is covered by the existing round-trip test and real-data validation; the compressed fallback is defensive (the spec forbids compressed DVs, so no fixture is produced by conforming writers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
FileScanTaskDeleteFile gains file_format / referenced_data_file /
content_offset / content_size_in_bytes (+ their with_ builder setters)
and DeleteVector::from_serialized_bytes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@raghav-reglobe raghav-reglobe marked this pull request as ready for review June 20, 2026 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants