Skip to content

Switch batch scan from Arrow IPC to Arrow C Data Interface#100

Open
robertbuessow wants to merge 1 commit into
mainfrom
rb-arrow-c-interface-1
Open

Switch batch scan from Arrow IPC to Arrow C Data Interface#100
robertbuessow wants to merge 1 commit into
mainfrom
rb-arrow-c-interface-1

Conversation

@robertbuessow
Copy link
Copy Markdown
Contributor

Summary

  • Replace Arrow IPC serialization (RecordBatch → bytes → Julia deserializes) with the Arrow C Data Interface: each batch is exported as a struct-typed (ArrowSchema, ArrowArray) pair via arrow-rs to_ffi and imported zero-copy in Julia with Arrow.from_c_data.
  • Remove the arrow-ipc Rust crate dependency and the SERIALIZE_POOL rayon thread pool — export is now O(columns), inline on the async task.
  • Arrow dependency switched to local ../arrow-julia (same UUID, adds ArrowSchema/ArrowArray exports and C Data Interface support). from_c_data finalizer removed from arrow-julia so the producer (Rust) manages lifetime explicitly via free_batch.
  • Fix a bug in arrow-julia: child_types = DataType[]Type[] in the +s struct branch — Union{Missing,String} is not a DataType.

Ownership model

ArrowBatch now contains schema / array pointers into a Box<ArrowBatchInner> held by rust_ptr. free_batch drops that box, which calls the arrow-rs Drop impls on the FFI structs and frees all buffer Arcs. No GC finalizer is registered on the Julia side — the caller must not access the CImportedArray view after free_batch.

Test plan

  • cargo check in iceberg_rust_ffi/ — no arrow-ipc references
  • make run-containers && make test-dev — 27728 tests pass, 0 fail
  • Verify scan data is correct end-to-end (nations table, incremental scan, builder API)

🤖 Generated with Claude Code

Replace the Arrow IPC serialization path (RecordBatch → bytes → Julia
deserializes with Arrow.Table) with the Arrow C Data Interface. Each
RecordBatch is exported as a struct-typed (ArrowSchema, ArrowArray) pair
via arrow-rs to_ffi; Julia imports it zero-copy with Arrow.from_c_data.

**Rust changes**
- `record_batch_to_c_ffi`: StructArray::from(batch) → to_ffi → store FFI
  structs in a Box<ArrowBatchInner>; schema/array pointers point into that
  box.  No serialization, no thread pool, no transmute.
- `iceberg_arrow_batch_free`: drops Box<ArrowBatchInner>, which calls the
  arrow-rs Drop impls and frees all buffer Arcs.
- Removed SERIALIZE_POOL (rayon thread pool) and arrow-ipc dependency.
- ArrowBatch FFI struct: data/length → schema/array (C Data Interface ptrs).

**arrow-julia changes**
- Removed `finalizer(_release_cdata_handle, handle)` from both from_c_data
  overloads. The library no longer auto-frees on GC; the producer (Rust)
  manages lifetime via free_batch.  Callers needing explicit release can
  still call Arrow.release_c_data.
- Fixed DataType[] → Type[] in the "+s" struct branch of _import_arrowvec
  (nullable types like Union{Missing,String} are not DataType).

**Julia / test changes**
- ArrowBatch struct: data/length → schema/array (Ptr{Arrow.ArrowSchema/Array}).
- All batch reads: Arrow.Table(unsafe_wrap(...)) → Arrow.from_c_data(schema, array).
- Test patterns that accumulated CImportedArray views before calling
  free_batch now push DataFrame(arrow_table) instead (materialise before
  free; from_c_data is zero-copy so the view becomes dangling after free).
- Arrow dependency switched to local ../arrow-julia via Pkg.develop.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
robertbuessow added a commit that referenced this pull request May 18, 2026
IO_NETWORK and IO_S3 user messages no longer embed a truncated slice of
the technical detail string (which can contain S3 bucket paths, file
paths, or endpoint URLs). The `msg` field of IcebergException now always
contains a short, generic description; the full context remains available
in `detail` for log files and bug reports.

Labels: dismiss-release-notes, build:benchmark

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hall-alex hall-alex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +48 to 58
/// Arrow batch exported via Arrow C Data Interface.
///
/// `schema` and `array` point into the `ArrowBatchInner` owned by `rust_ptr`.
/// `iceberg_arrow_batch_free` drops the inner allocation, which calls the
/// arrow-rs release callbacks and frees all buffer data.
#[repr(C)]
pub struct ArrowBatch {
pub data: *const u8,
pub length: usize,
pub rust_ptr: *mut std::ffi::c_void,
pub schema: *mut FFI_ArrowSchema,
pub array: *mut FFI_ArrowArray,
pub rust_ptr: *mut std::ffi::c_void, // Box<ArrowBatchInner>
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could have Claude add more details on why this indirection with ArrowBatchInner is needed. I'm assuming it has to do with handling (and eventually dropping) the allocation somehow? Would be nice to have a brief explanation her for "Rust dummies" like myself.

Copy link
Copy Markdown
Collaborator

@hall-alex hall-alex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to bump the version somehow, right? Otherwise, we may run into issues, since this is an incompatible change or am I misunderstanding something?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants