Switch batch scan from Arrow IPC to Arrow C Data Interface by robertbuessow · Pull Request #100 · RelationalAI/RustyIceberg.jl

robertbuessow · 2026-05-15T16:24:16Z

Summary

Replace Arrow IPC serialization (RecordBatch → bytes → Julia deserializes) with the Arrow C Data Interface: each batch is exported as a struct-typed (ArrowSchema, ArrowArray) pair via arrow-rs to_ffi and imported zero-copy in Julia with Arrow.from_c_data.
Remove the arrow-ipc Rust crate dependency and the SERIALIZE_POOL rayon thread pool — export is now O(columns), inline on the async task.
Arrow dependency switched to local ../arrow-julia (same UUID, adds ArrowSchema/ArrowArray exports and C Data Interface support). from_c_data finalizer removed from arrow-julia so the producer (Rust) manages lifetime explicitly via free_batch.
Fix a bug in arrow-julia: child_types = DataType[] → Type[] in the +s struct branch — Union{Missing,String} is not a DataType.

Ownership model

ArrowBatch now contains schema / array pointers into a Box<ArrowBatchInner> held by rust_ptr. free_batch drops that box, which calls the arrow-rs Drop impls on the FFI structs and frees all buffer Arcs. No GC finalizer is registered on the Julia side — the caller must not access the CImportedArray view after free_batch.

Test plan

cargo check in iceberg_rust_ffi/ — no arrow-ipc references
make run-containers && make test-dev — 27728 tests pass, 0 fail
Verify scan data is correct end-to-end (nations table, incremental scan, builder API)

🤖 Generated with Claude Code

Replace the Arrow IPC serialization path (RecordBatch → bytes → Julia deserializes with Arrow.Table) with the Arrow C Data Interface. Each RecordBatch is exported as a struct-typed (ArrowSchema, ArrowArray) pair via arrow-rs to_ffi; Julia imports it zero-copy with Arrow.from_c_data. **Rust changes** - `record_batch_to_c_ffi`: StructArray::from(batch) → to_ffi → store FFI structs in a Box<ArrowBatchInner>; schema/array pointers point into that box. No serialization, no thread pool, no transmute. - `iceberg_arrow_batch_free`: drops Box<ArrowBatchInner>, which calls the arrow-rs Drop impls and frees all buffer Arcs. - Removed SERIALIZE_POOL (rayon thread pool) and arrow-ipc dependency. - ArrowBatch FFI struct: data/length → schema/array (C Data Interface ptrs). **arrow-julia changes** - Removed `finalizer(_release_cdata_handle, handle)` from both from_c_data overloads. The library no longer auto-frees on GC; the producer (Rust) manages lifetime via free_batch. Callers needing explicit release can still call Arrow.release_c_data. - Fixed DataType[] → Type[] in the "+s" struct branch of _import_arrowvec (nullable types like Union{Missing,String} are not DataType). **Julia / test changes** - ArrowBatch struct: data/length → schema/array (Ptr{Arrow.ArrowSchema/Array}). - All batch reads: Arrow.Table(unsafe_wrap(...)) → Arrow.from_c_data(schema, array). - Test patterns that accumulated CImportedArray views before calling free_batch now push DataFrame(arrow_table) instead (materialise before free; from_c_data is zero-copy so the view becomes dangling after free). - Arrow dependency switched to local ../arrow-julia via Pkg.develop. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

IO_NETWORK and IO_S3 user messages no longer embed a truncated slice of the technical detail string (which can contain S3 bucket paths, file paths, or endpoint URLs). The `msg` field of IcebergException now always contains a short, generic description; the full context remains available in `detail` for log files and bug reports. Labels: dismiss-release-notes, build:benchmark Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

hall-alex

Nice!

hall-alex · 2026-05-20T07:50:51Z

+/// Arrow batch exported via Arrow C Data Interface.
+///
+/// `schema` and `array` point into the `ArrowBatchInner` owned by `rust_ptr`.
+/// `iceberg_arrow_batch_free` drops the inner allocation, which calls the
+/// arrow-rs release callbacks and frees all buffer data.
 #[repr(C)]
 pub struct ArrowBatch {
-    pub data: *const u8,
-    pub length: usize,
-    pub rust_ptr: *mut std::ffi::c_void,
+    pub schema: *mut FFI_ArrowSchema,
+    pub array: *mut FFI_ArrowArray,
+    pub rust_ptr: *mut std::ffi::c_void, // Box<ArrowBatchInner>
 }


Perhaps you could have Claude add more details on why this indirection with ArrowBatchInner is needed. I'm assuming it has to do with handling (and eventually dropping) the allocation somehow? Would be nice to have a brief explanation her for "Rust dummies" like myself.

hall-alex

We need to bump the version somehow, right? Otherwise, we may run into issues, since this is an incompatible change or am I misunderstanding something?

hall-alex approved these changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch batch scan from Arrow IPC to Arrow C Data Interface#100

Switch batch scan from Arrow IPC to Arrow C Data Interface#100
robertbuessow wants to merge 1 commit into
mainfrom
rb-arrow-c-interface-1

robertbuessow commented May 15, 2026

Uh oh!

hall-alex left a comment

Uh oh!

hall-alex May 20, 2026

Uh oh!

hall-alex left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

robertbuessow commented May 15, 2026

Summary

Ownership model

Test plan

Uh oh!

hall-alex left a comment

Choose a reason for hiding this comment

Uh oh!

hall-alex May 20, 2026

Choose a reason for hiding this comment

Uh oh!

hall-alex left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants