Add Arrow C Data Interface support #1
Implements both directions of the Arrow C Data Interface spec (https://arrow.apache.org/docs/format/CDataInterface.html):
- `Arrow.from_c_data(schema_ptr, array_ptr)`: imports an Arrow array from C-owned memory. Zero-copy via `unsafe_wrap`; the `CDataHandle` finalizer calls the C `release` callbacks automatically.
- `Arrow.to_c_data(col)`: exports an `ArrowVector` or `Arrow.Table` to C. GC roots are kept alive via a token-keyed global dict; `@cfunction` release callbacks (initialised in `__init__`) delete the roots when the consumer calls `release`.
New public API: the types `ArrowSchema`, `ArrowArray`, `CImportedArray`, and `CImportedTable`, plus the function `release_c_data`.
Supports all Arrow column types: primitives, Bool, String/binary, List (generic, large, fixed-size), Struct, Map, DenseUnion, SparseUnion, DictEncoded, Null, and all time/date/duration types. Handles nullable columns, non-zero array offsets, and custom metadata.
68 new tests in `test/cdatainterface.jl` covering format strings, buffer layout, validity bitmaps, round-trips, release semantics, non-zero offsets, and multi-column table import.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
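The export-side lifetime scheme described above (a token-keyed global dict plus a `@cfunction` release callback) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `ROOTS`, `ROOTS_LOCK`, `NEXT_TOKEN`, `root!`, and `release_cb` are hypothetical names, and the callback here receives the token directly instead of an `ArrowArray*` whose `private_data` field would carry it.

```julia
# Hypothetical names throughout; a sketch of the rooting pattern only.
const ROOTS = Dict{UInt,Any}()        # token => Julia object kept alive for C
const ROOTS_LOCK = ReentrantLock()
const NEXT_TOKEN = Ref{UInt}(0)

# Register `obj` so the GC cannot collect it while C holds a pointer into it.
function root!(obj)
    lock(ROOTS_LOCK) do
        token = (NEXT_TOKEN[] += 1)
        ROOTS[token] = obj
        return token
    end
end

# C-callable release callback: drops the root identified by the token that
# was smuggled through a `void*` (standing in for `private_data`).
function release_cb(private_data::Ptr{Cvoid})::Cvoid
    token = UInt(private_data)
    lock(ROOTS_LOCK) do
        delete!(ROOTS, token)
    end
    return nothing
end

data = Int32[1, 2, 3]
token = root!(data)
private_data = Ptr{Cvoid}(token)
was_rooted = haskey(ROOTS, token)

# In a package this pointer would be created in `__init__`, because raw
# function pointers do not survive precompilation.
release_ptr = @cfunction(release_cb, Cvoid, (Ptr{Cvoid},))

# Simulate the consumer invoking the release callback through the pointer.
ccall(release_ptr, Cvoid, (Ptr{Cvoid},), private_data)
still_rooted = haskey(ROOTS, token)
```

Keying the dict by an integer token rather than by the object itself means the value that crosses the C boundary is a plain pointer-sized integer, so no Julia reference has to be reconstructed on the way back.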
Remove duplicate finalizer registrations in `from_c_data`: the
`CDataHandle` was getting a finalizer attached but `CImportedArray`
already owns and manages the handle's lifetime, causing a potential
double-free when the GC collected the handle.
Also widen `child_types` from `DataType[]` to `Type[]` so that
abstract element types (e.g. Union{...}) are accepted without a
type assertion error when building struct arrays.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
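The double-free risk fixed in this commit comes from two owners both trying to release the same handle. A minimal sketch of the single-finalizer, idempotent-release pattern, assuming a hypothetical `HandleSketch` type (not the PR's `CDataHandle`) and a counter that stands in for the producer's C `release` callback:

```julia
# Hypothetical stand-in for an imported-array handle.
mutable struct HandleSketch
    released::Bool
    release_calls::Base.RefValue{Int}   # counts simulated C `release` invocations
end

# Idempotent release: the flag guarantees the callback runs at most once,
# whether it is triggered explicitly or by the garbage collector.
function release_sketch!(h::HandleSketch)
    h.released && return nothing
    h.release_calls[] += 1   # a real handle would call the producer's release pointer here
    h.released = true
    return nothing
end

function make_handle(counter::Base.RefValue{Int})
    h = HandleSketch(false, counter)
    finalizer(release_sketch!, h)   # exactly one finalizer, on the owning object
    return h
end

counter = Ref(0)
h = make_handle(counter)
release_sketch!(h)   # explicit release
release_sketch!(h)   # second call is a no-op
finalize(h)          # runs the finalizer immediately; the flag makes it a no-op too
```

With both a single finalizer and the `released` flag, either safeguard alone would prevent the second release; keeping both makes the explicit-release path safe as well.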
CategoricalRefPool uses 0-based indices (0:n) with pool[0] as a missing sentinel. When passed to arrowvector, ToArrow wraps it with 1-based iteration (1..length(pool)). Since length(pool) == n+1, the last iteration calls pool[n+1], which is out of bounds.
Fix: when firstindex(pool) != 1, skip the sentinel to give arrowvector a standard 1-based view (pool[1:end]). The existing inds adjustment (inds .-= firstindex(refa)) already produces correct Arrow dict indices (-1 for missing, 0..n-1 for valid values).
Also add Table(::NamedTuple) constructor for the Arrow C Data path, and add Arrow as an explicit dep in test/Project.toml so that `julia --project=test test/runtests.jl` works in a local dev setup.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
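The off-by-one described in this commit can be reproduced without CategoricalArrays. The sketch below uses a hypothetical `ZeroBasedPool` type that only mimics the indexing behaviour the commit message describes (index 0 is the missing sentinel, `length` counts the sentinel); it is an assumption-laden illustration, not the real `CategoricalRefPool` or the actual patch.

```julia
# Hypothetical 0-based pool: index 0 is the `missing` sentinel, 1..n are values.
struct ZeroBasedPool
    values::Vector{String}
end
Base.firstindex(::ZeroBasedPool) = 0
Base.lastindex(p::ZeroBasedPool) = length(p.values)
Base.length(p::ZeroBasedPool) = length(p.values) + 1   # sentinel + n values
Base.getindex(p::ZeroBasedPool, i::Integer) = i == 0 ? missing : p.values[i]

pool = ZeroBasedPool(["a", "b", "c"])

# Naive 1-based iteration over 1:length(pool) reads pool[n+1] and overruns.
naive_overruns = try
    [pool[i] for i in 1:length(pool)]
    false
catch err
    err isa BoundsError
end

# Skipping the sentinel yields a standard 1-based view of the n real values.
one_based_view = [pool[i] for i in 1:lastindex(pool)]

# Refs into the pool (0 = missing) shifted to Arrow dictionary indices:
# -1 for missing, 0..n-1 for valid values.
refs = [0, 2, 1, 3]
dict_indices = refs .- 1
```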
Extend test/cdatainterface.jl with 13 new testsets covering all
previously untested code paths in src/cdatainterface.jl:
- Generic lists (+l): Int32, missing, String
- Fixed-size lists (+w:N): Float32 and Int64 tuples
- Maps (+m): Dict{String,Int32}
- Dense unions (+ud:) and sparse unions (+us:)
- All four Duration units (tDs/tDm/tDu/tDn)
- Time nanoseconds (ttn)
- Timestamp with UTC timezone
- Interval year-month (tiM) and day-time (tiD)
- Decimal{10,2,Int128} (d:10,2,128)
- Arrow.Table export round-trip via to_c_data
- Bool import with non-byte-aligned bit offset
- release_c_data idempotency (double-release is a no-op)
Writing the tests uncovered three bugs, all fixed in src/cdatainterface.jl:
1. Dense and sparse union import used `child_types = DataType[]`, which
rejects abstract element types such as `Union{Missing, Int32}`.
Fixed to `child_types = Type[]` (same fix already applied to the
struct path in a prior commit).
2. Decimal precision and scale were parsed as Int32 in
`_fmt_to_storage_type`, producing `Decimal{Int32(10),...}` instead
of `Decimal{Int64(10),...}`. Since Julia type parameters carry their
integer type, the two Decimal types compared unequal even with
identical values. Fixed by using `parse(Int, ...)`.
3. `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which
crashes because `Table` is a type, not a module. Fixed to
`Tables.columnnames(tbl)`.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
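Bugs 1 and 2 above both follow from how Julia treats types as values. A small sketch, assuming a hypothetical `DecimalSketch` struct in place of Arrow's `Decimal` and an ad hoc parse of the `d:10,2,128` format string (the real `_fmt_to_storage_type` is not reproduced here):

```julia
# Bug 1: a Union is a Type but not a DataType, so a DataType[] cannot hold it.
elty = Union{Missing,Int32}
is_datatype = elty isa DataType
is_type = elty isa Type

narrow_rejects = try
    push!(DataType[], elty)
    false
catch
    true
end
wide = push!(Type[], Int32, elty)   # Type[] accepts both concrete and Union types

# Bug 2: integer type parameters carry their own integer type.
# `DecimalSketch` is a hypothetical stand-in with value parameters P and S.
struct DecimalSketch{P,S,T}
    value::T
end

parts = split("d:10,2,128"[3:end], ',')          # ["10", "2", "128"]
p32, s32 = parse(Int32, parts[1]), parse(Int32, parts[2])
p, s = parse(Int, parts[1]), parse(Int, parts[2])

# On 64-bit Julia the literal 10 is an Int64, so Int32 parameters produce a
# different type, while parameters parsed as Int match the literal form.
same_with_int32 = DecimalSketch{p32,s32,Int128} === DecimalSketch{10,2,Int128}
same_with_int = DecimalSketch{p,s,Int128} === DecimalSketch{10,2,Int128}
```

On a 64-bit build `same_with_int32` is `false` and `same_with_int` is `true`, which is the mismatch the round-trip test tripped over.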
hall-alex left a comment
Very cool 💥💪🚀
For test coverage: do we also check empty, large, and huge tables / vectors? I would have the agent double-check coverage once more: is every code path really covered?
> Mirrors `struct ArrowSchema` from the Arrow C Data Interface specification.
> Layout must be ABI-compatible with the C struct (9 pointer-sized fields).
For this and for below: do we have dedicated tests that check this ABI compatibility? What happens on little-endian / big-endian machines? Do we somehow protect against this, or are the C and Julia versions always of "compatible endianness"?
> the C `release` callbacks will be called when the returned `CImportedArray` is GC'd
> or when `Arrow.release_c_data` is called on it.
I don't see where the GC is hooked up to call Arrow.release_c_data -- shouldn't there be a finalizer for that somewhere? Sorry if I missed that.
Do we have tests that check that this GC clean-up works?
Alternatively, one could also let the finalizer fail badly if the thing wasn't released (if we want to avoid accidental, dangling vectors / tables).
Summary
Implements the Arrow C Data Interface specification, enabling zero-copy interop between Julia and any other Arrow-compatible runtime (Python/PyArrow, C++, R, etc.) within the same process.
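Because producer and consumer share one process, they also share the machine's pointer size and byte order, so the property that has to hold is field-by-field layout agreement with the C struct. A minimal sketch of such a layout check, assuming a hypothetical `SchemaSketch` type whose fields follow `struct ArrowSchema` from the specification (the PR's own `ArrowSchema` definition and assertions are not reproduced here):

```julia
# Hypothetical mirror of the C `struct ArrowSchema`; field order follows the spec.
mutable struct SchemaSketch
    format::Cstring              # const char*
    name::Cstring                # const char*
    metadata::Ptr{UInt8}         # const char*
    flags::Int64                 # int64_t
    n_children::Int64            # int64_t
    children::Ptr{Ptr{Cvoid}}    # struct ArrowSchema**
    dictionary::Ptr{Cvoid}       # struct ArrowSchema*
    release::Ptr{Cvoid}          # void (*release)(struct ArrowSchema*)
    private_data::Ptr{Cvoid}     # void*
end

nfields = fieldcount(SchemaSketch)
offsets = [Int(fieldoffset(SchemaSketch, i)) for i in 1:nfields]

# On a 64-bit platform every field is 8 bytes wide, giving 9 * 8 = 72 bytes
# with fields at offsets 0, 8, ..., 64.
if Sys.WORD_SIZE == 64
    @assert sizeof(SchemaSketch) == 72
    @assert offsets == collect(0:8:64)
end

# Byte order of the running process, shared by both sides of the interface.
little_endian = ENDIAN_BOM == 0x04030201
```

Checking `fieldoffset` for each field is stricter than a size assertion alone, since two fields swapped in order would leave the total size unchanged.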
What's added
Core implementation (`src/cdatainterface.jl`, ~1,200 lines)
- C struct bindings: ABI-compatible Julia `mutable struct` definitions for `ArrowSchema` and `ArrowArray`, with size assertions to catch layout drift.
- Import path (C → Julia): `from_c_data(schema_ptr, array_ptr)` and `from_c_data(schema_ptrs, array_ptrs)` import single arrays or entire tables. Supported Arrow formats:
  - Primitives, Bool, and Null (`c C s S i I l L e f g b n`)
  - Strings and binary (`u U z Z`)
  - Fixed-size binary (`w:N`)
  - Lists (`+l` / `+L`)
  - Fixed-size lists (`+w:N`)
  - Structs (`+s`)
  - Maps (`+m`)
  - Dense and sparse unions (`+ud:` / `+us:`)
- Export path (Julia → C): `to_c_data(col)` and `to_c_data(tbl::Arrow.Table)` export any Arrow vector or table. Uses a global roots dictionary to keep Julia objects alive while C holds pointers, and a `@cfunction`-based release callback (registered in `__init__`) to free them when the consumer calls `release`.
- Lifetime management: `CDataHandle` wraps the C pointers and calls the producer's `release` callbacks when GC'd or when `release_c_data` is called explicitly. A `released` flag prevents double-free.
- Public API: `ArrowSchema`, `ArrowArray`, `CImportedArray`, `CImportedTable`, `from_c_data`, `to_c_data`, and `release_c_data` are all exported.
Supporting changes
- `src/Arrow.jl`: export the public API symbols, `include("cdatainterface.jl")`, register `@cfunction` release callbacks in `__init__`.
- `src/table.jl`: add `Table(::NamedTuple)` constructor used by the import path when building a `CImportedTable`.
Tests (`test/cdatainterface.jl`, 108 assertions)
Full round-trip coverage for every supported format, plus:
- `release_c_data` idempotency (double-release is a no-op)
Bug fixes (found while implementing / testing)
| File | Bug | Fix |
| --- | --- | --- |
| `src/arraytypes/dictencoding.jl` | `child_types = DataType[]` in struct import rejects abstract types like `Union{Missing, Int32}` | `Type[]` |
| `src/cdatainterface.jl` | `DataType[]` issue in dense and sparse union import | `Type[]` |
| `src/cdatainterface.jl` | Decimal precision and scale parsed as `Int32`; creates a distinct parametric type from the `Int64`-parameterized original, so equality always fails | `parse(Int, ...)` |
| `src/cdatainterface.jl` | `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which crashes because `Table` is a type, not a module | `Tables.columnnames(tbl)` |
| `src/cdatainterface.jl` | `from_c_data` attached a finalizer to `CDataHandle` inside `CImportedArray`, causing potential double-free | Remove the duplicate finalizer registration |
Test plan
🤖 Generated with Claude Code