Switch batch scan from Arrow IPC to Arrow C Data Interface#100
Open
robertbuessow wants to merge 1 commit into
Replace the Arrow IPC serialization path (RecordBatch → bytes → Julia
deserializes with Arrow.Table) with the Arrow C Data Interface. Each
RecordBatch is exported as a struct-typed (ArrowSchema, ArrowArray) pair
via arrow-rs to_ffi; Julia imports it zero-copy with Arrow.from_c_data.
**Rust changes**
- `record_batch_to_c_ffi`: StructArray::from(batch) → to_ffi → store FFI
structs in a Box<ArrowBatchInner>; schema/array pointers point into that
box. No serialization, no thread pool, no transmute.
- `iceberg_arrow_batch_free`: drops Box<ArrowBatchInner>, which calls the
arrow-rs Drop impls and frees all buffer Arcs.
- Removed SERIALIZE_POOL (rayon thread pool) and arrow-ipc dependency.
- ArrowBatch FFI struct: data/length → schema/array (C Data Interface ptrs).
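
The ownership pattern described in the Rust changes can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: `FakeSchema` and `FakeArray` are hypothetical stand-ins for the arrow-rs `FFI_ArrowSchema` / `FFI_ArrowArray` structs, and `export_batch` / `free_batch` are illustrative names. It shows one plausible reason for the boxed inner struct: the heap allocation gives both FFI structs a stable address that outlives the exporting function, and a single opaque pointer is enough to drop everything later.

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts Drop calls so the effect of free_batch can be observed.
static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-ins for arrow-rs FFI_ArrowSchema / FFI_ArrowArray.
struct FakeSchema {
    format: &'static str,
}
struct FakeArray {
    length: i64,
}

impl Drop for FakeSchema {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}
impl Drop for FakeArray {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

// Owns both FFI structs; lives on the heap so their addresses stay stable.
struct ArrowBatchInner {
    schema: FakeSchema,
    array: FakeArray,
}

// C-compatible handle: two pointers into the box plus the box itself.
#[repr(C)]
struct ArrowBatch {
    schema: *mut FakeSchema,
    array: *mut FakeArray,
    rust_ptr: *mut c_void, // Box<ArrowBatchInner>
}

fn export_batch(length: i64) -> ArrowBatch {
    let inner = Box::new(ArrowBatchInner {
        schema: FakeSchema { format: "+s" },
        array: FakeArray { length },
    });
    // Leak the box; ownership is now represented by the raw pointer.
    let raw = Box::into_raw(inner);
    // SAFETY: `raw` was just produced by Box::into_raw, so it is valid.
    unsafe {
        ArrowBatch {
            schema: std::ptr::addr_of_mut!((*raw).schema),
            array: std::ptr::addr_of_mut!((*raw).array),
            rust_ptr: raw as *mut c_void,
        }
    }
}

/// Reclaims the box; dropping it runs Drop for both inner structs.
/// Safety: `batch` must come from `export_batch` and be freed only once.
unsafe fn free_batch(batch: ArrowBatch) {
    if !batch.rust_ptr.is_null() {
        unsafe { drop(Box::from_raw(batch.rust_ptr as *mut ArrowBatchInner)) };
    }
}
```

In this sketch, `schema` and `array` are only borrowed views of the box; after `free_batch` runs they must not be dereferenced.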
**arrow-julia changes**
- Removed `finalizer(_release_cdata_handle, handle)` from both from_c_data
overloads. The library no longer auto-frees on GC; the producer (Rust)
manages lifetime via free_batch. Callers needing explicit release can
still call Arrow.release_c_data.
- Fixed DataType[] → Type[] in the "+s" struct branch of _import_arrowvec
(nullable types like Union{Missing,String} are not DataType).
**Julia / test changes**
- ArrowBatch struct: data/length → schema/array (Ptr{Arrow.ArrowSchema/Array}).
- All batch reads: Arrow.Table(unsafe_wrap(...)) → Arrow.from_c_data(schema, array).
- Test patterns that accumulated CImportedArray views before calling
free_batch now push DataFrame(arrow_table) instead (materialise before
free; from_c_data is zero-copy so the view becomes dangling after free).
- Arrow dependency switched to local ../arrow-julia via Pkg.develop.
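
The "materialise before free" rule in the test changes applies to any zero-copy consumer, not only Julia. The sketch below models it in Rust with the standard library only; `BatchInner`, `BatchHandle`, `export_values`, `free_values`, and `materialise_then_free` are hypothetical names used for illustration and do not appear in the PR. The consumer copies the borrowed view into owned memory first, and only then releases the producer's allocation.

```rust
use std::ffi::c_void;

// Hypothetical producer-side allocation that owns the buffer.
struct BatchInner {
    values: Vec<i64>,
}

// Hypothetical handle: a zero-copy view (data, len) plus the owning box.
struct BatchHandle {
    data: *const i64,
    len: usize,
    rust_ptr: *mut c_void, // Box<BatchInner>
}

fn export_values(values: Vec<i64>) -> BatchHandle {
    let raw = Box::into_raw(Box::new(BatchInner { values }));
    // SAFETY: `raw` was just produced by Box::into_raw, so it is valid.
    let inner: &BatchInner = unsafe { &*raw };
    BatchHandle {
        data: inner.values.as_ptr(),
        len: inner.values.len(),
        rust_ptr: raw as *mut c_void,
    }
}

/// Safety: `handle` must come from `export_values` and be freed only once.
unsafe fn free_values(handle: BatchHandle) {
    if !handle.rust_ptr.is_null() {
        unsafe { drop(Box::from_raw(handle.rust_ptr as *mut BatchInner)) };
    }
}

/// Copies the borrowed view into an owned Vec, then frees the producer memory.
/// Keeping the view itself past free_values would leave a dangling pointer.
fn materialise_then_free(handle: BatchHandle) -> Vec<i64> {
    // SAFETY: the handle has not been freed yet, so (data, len) is still valid.
    let owned = unsafe { std::slice::from_raw_parts(handle.data, handle.len).to_vec() };
    // SAFETY: the handle came from export_values and is freed exactly once here.
    unsafe { free_values(handle) };
    owned
}
```

Accumulating the owned copies across batches is then safe, which mirrors pushing `DataFrame(arrow_table)` rather than the raw view in the Julia tests.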
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
robertbuessow added a commit that referenced this pull request May 18, 2026
IO_NETWORK and IO_S3 user messages no longer embed a truncated slice of the technical detail string (which can contain S3 bucket paths, file paths, or endpoint URLs). The `msg` field of IcebergException now always contains a short, generic description; the full context remains available in `detail` for log files and bug reports.
Labels: dismiss-release-notes, build:benchmark
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
hall-alex approved these changes May 20, 2026
Comment on lines +48 to 58
    + /// Arrow batch exported via Arrow C Data Interface.
    + ///
    + /// `schema` and `array` point into the `ArrowBatchInner` owned by `rust_ptr`.
    + /// `iceberg_arrow_batch_free` drops the inner allocation, which calls the
    + /// arrow-rs release callbacks and frees all buffer data.
      #[repr(C)]
      pub struct ArrowBatch {
    -     pub data: *const u8,
    -     pub length: usize,
    -     pub rust_ptr: *mut std::ffi::c_void,
    +     pub schema: *mut FFI_ArrowSchema,
    +     pub array: *mut FFI_ArrowArray,
    +     pub rust_ptr: *mut std::ffi::c_void, // Box<ArrowBatchInner>
      }
Collaborator
Perhaps you could have Claude add more details on why this indirection with ArrowBatchInner is needed. I'm assuming it has to do with handling (and eventually dropping) the allocation somehow? It would be nice to have a brief explanation here for "Rust dummies" like myself.
hall-alex (Collaborator) left a comment
We need to bump the version somehow, right? Otherwise we may run into issues, since this is an incompatible change. Or am I misunderstanding something?
**Summary**
- Each RecordBatch is exported as a struct-typed `(ArrowSchema, ArrowArray)` pair via `arrow-rs to_ffi` and imported zero-copy in Julia with `Arrow.from_c_data`.
- Removed the `arrow-ipc` Rust crate dependency and the `SERIALIZE_POOL` rayon thread pool; export is now O(columns), inline on the async task.
- Arrow dependency switched to local `../arrow-julia` (same UUID, adds `ArrowSchema`/`ArrowArray` exports and C Data Interface support).
- `from_c_data` finalizer removed from arrow-julia so the producer (Rust) manages lifetime explicitly via `free_batch`.
- `child_types = DataType[]` → `Type[]` in the `+s` struct branch: `Union{Missing,String}` is not a `DataType`.

**Ownership model**
`ArrowBatch` now contains `schema`/`array` pointers into a `Box<ArrowBatchInner>` held by `rust_ptr`. `free_batch` drops that box, which calls the arrow-rs `Drop` impls on the FFI structs and frees all buffer Arcs. No GC finalizer is registered on the Julia side, so the caller must not access the `CImportedArray` view after `free_batch`.

**Test plan**
- `cargo check` in `iceberg_rust_ffi/`: no arrow-ipc references
- `make run-containers && make test-dev`: 27728 tests pass, 0 fail

🤖 Generated with Claude Code