feat: Implement PositionDeleteWriter for position delete files #582

Merged
wgtmac merged 4 commits into apache:main from shangxinli:implement-position-delete-writer
Mar 15, 2026
@shangxinli
Contributor

Implement the PositionDeleteWriter following the same PIMPL pattern as DataWriter. The writer supports both buffered WriteDelete(file_path, pos) calls and direct Write(ArrowArray*) for pre-formed batches. Metadata reports content=kPositionDeletes with sort_order_id=nullopt per spec, and tracks referenced_data_file when all deletes target a single file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
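
The single-file tracking described above can be illustrated with a small sketch. This is not the iceberg-cpp implementation: `PositionDeleteBufferSketch` and its members are hypothetical names, and the sketch only models the buffered `WriteDelete(file_path, pos)` path, assuming the writer keeps a set of referenced paths and reports `referenced_data_file` only when that set has exactly one entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in, not the iceberg-cpp class: buffers (file_path, pos)
// pairs and remembers which data files they reference.
class PositionDeleteBufferSketch {
 public:
  void WriteDelete(std::string file_path, int64_t pos) {
    assert(pos >= 0);  // row positions are non-negative
    referenced_paths_.insert(file_path);
    buffer_.emplace_back(std::move(file_path), pos);
  }

  // Has a value only when every buffered delete targets the same data file.
  std::optional<std::string> referenced_data_file() const {
    if (referenced_paths_.size() == 1) return *referenced_paths_.begin();
    return std::nullopt;
  }

  std::size_t buffered() const { return buffer_.size(); }

 private:
  std::vector<std::pair<std::string, int64_t>> buffer_;
  std::set<std::string> referenced_paths_;
};
```

In this sketch, two deletes against the same path leave `referenced_data_file()` set, and a delete against a second path clears it.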
Contributor

@evindj left a comment

Thanks for working on implementing this feature. The change looks good; my only comment concerns the threshold for flushing data.

Make kFlushThreshold configurable via PositionDeleteWriterOptions with
a default of 1000. Add AutoFlushOnThreshold test that uses a small
threshold to verify the automatic flush logic.
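
A minimal sketch of a configurable flush threshold, under stated assumptions: the names `DeleteWriterOptionsSketch`, `ThresholdBufferSketch`, and `flush_threshold` are hypothetical, only the default of 1000 comes from the commit message, and flushing when the buffer size reaches the threshold (rather than exceeds it) is an assumption about the comparison used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical options struct; only the default value is taken from the PR.
struct DeleteWriterOptionsSketch {
  int64_t flush_threshold = 1000;
};

// Hypothetical buffer that flushes itself once it holds flush_threshold rows.
class ThresholdBufferSketch {
 public:
  explicit ThresholdBufferSketch(DeleteWriterOptionsSketch options)
      : options_(options) {
    assert(options_.flush_threshold > 0);
  }

  void WriteDelete(std::string file_path, int64_t pos) {
    buffer_.emplace_back(std::move(file_path), pos);
    // Assumption: flush when the buffer reaches the threshold.
    if (static_cast<int64_t>(buffer_.size()) >= options_.flush_threshold) {
      Flush();
    }
  }

  // Stands in for converting the buffer to a batch and writing it out.
  void Flush() {
    if (buffer_.empty()) return;
    rows_flushed_ += buffer_.size();
    ++flush_count_;
    buffer_.clear();
  }

  std::size_t buffered() const { return buffer_.size(); }
  std::size_t rows_flushed() const { return rows_flushed_; }
  int flush_count() const { return flush_count_; }

 private:
  DeleteWriterOptionsSketch options_;
  std::vector<std::pair<std::string, int64_t>> buffer_;
  std::size_t rows_flushed_ = 0;
  int flush_count_ = 0;
};
```

A test in the spirit of AutoFlushOnThreshold would set a small threshold, write more rows than that, and check that flushes happened without an explicit call.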
Member

@wgtmac left a comment

Review Summary

I have reviewed the changes against the Java implementation for strict parity, C++ styling, and logic issues. There are a few logic and parity concerns that need to be addressed, mainly regarding RAII memory safety in FlushBuffer, metrics filtering for delete columns, and tracking referenced_paths_ during batch writes.

Note: This review was generated by Gemini.

- Use ArrowSchemaGuard/ArrowArrayGuard in FlushBuffer for memory safety
  on early returns, fixing potential leaks when nanoarrow macros fail
- Fix guards to handle already-consumed arrays (null release check)
- Filter out value_counts/null_value_counts/nan_value_counts for
  delete metadata columns (file_path, pos) to match Java parity;
  also drop bounds when referencing multiple data files
- Add TODO for extracting paths from ArrowArray in Write() to update
  referenced_paths_ for batch writes
- Add TODO for row_schema support in position deletes (V2 spec)
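
The guard fix in the first two bullets can be sketched as follows. This is an illustration, not the project's ArrowArrayGuard: `ArrayLike` is a simplified stand-in for the Arrow C data interface `ArrowArray` struct (only the `release` and `private_data` members are modeled), and `ArrayGuardSketch` and `CountingRelease` are hypothetical names. The sketch relies on the C data interface convention that a release callback sets `release` to null, so a null `release` marks an array that was already consumed.

```cpp
#include <cassert>

// Simplified stand-in for ArrowArray: only the members the guard touches.
struct ArrayLike {
  void (*release)(ArrayLike*) = nullptr;
  void* private_data = nullptr;
};

// Hypothetical RAII guard: releases the array on scope exit, including
// early returns, unless it has already been released or moved elsewhere.
class ArrayGuardSketch {
 public:
  explicit ArrayGuardSketch(ArrayLike* array) : array_(array) {}
  ~ArrayGuardSketch() {
    // Null release check: skip arrays that were already consumed.
    if (array_ != nullptr && array_->release != nullptr) {
      array_->release(array_);
    }
  }
  ArrayGuardSketch(const ArrayGuardSketch&) = delete;
  ArrayGuardSketch& operator=(const ArrayGuardSketch&) = delete;

 private:
  ArrayLike* array_;
};

// Example release callback: counts calls through private_data and marks
// the array as released, as the C data interface requires.
inline void CountingRelease(ArrayLike* array) {
  assert(array->private_data != nullptr);
  ++*static_cast<int*>(array->private_data);
  array->release = nullptr;
}
```

Without the null check, a guard wrapping an array that a writer had already consumed would call through a null function pointer when it went out of scope.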
Member

@wgtmac left a comment

Thanks @shangxinli for adding this and @evindj for the review!

@wgtmac force-pushed the implement-position-delete-writer branch from 89c5ca7 to 035f171 on March 15, 2026 02:25
@wgtmac merged commit 69cf2d3 into apache:main on Mar 15, 2026
12 checks passed

3 participants