Add Arrow C Data Interface support by robertbuessow · Pull Request #1 · RelationalAI/arrow-julia

robertbuessow · 2026-05-12T15:25:47Z

Summary

Implements the Arrow C Data Interface specification, enabling zero-copy interop between Julia and any other Arrow-compatible runtime (Python/PyArrow, C++, R, etc.) within the same process.

What's added

Core implementation (`src/cdatainterface.jl`, ~1 200 lines)

C struct bindings — ABI-compatible Julia mutable struct definitions for ArrowSchema and ArrowArray, with size assertions to catch layout drift.
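
As a rough illustration of the size-assertion idea, the sketch below mirrors the `ArrowSchema` field order from the Arrow C Data Interface specification. The name `SketchArrowSchema` is hypothetical and the PR's actual definition may differ. The 72-byte figure assumes a 64-bit platform, where all nine fields are 8 bytes wide.

```julia
# Hypothetical mirror of the C `struct ArrowSchema`; not the PR's definition.
mutable struct SketchArrowSchema
    format::Cstring            # const char*
    name::Cstring              # const char*
    metadata::Ptr{UInt8}       # const char* (binary-encoded key/value data)
    flags::Int64               # int64_t
    n_children::Int64          # int64_t
    children::Ptr{Cvoid}       # struct ArrowSchema**
    dictionary::Ptr{Cvoid}     # struct ArrowSchema*
    release::Ptr{Cvoid}        # void (*)(struct ArrowSchema*)
    private_data::Ptr{Cvoid}   # void*
end

# Byte offset of every field, in declaration order.
schema_offsets = [Int(fieldoffset(SketchArrowSchema, i))
                  for i in 1:fieldcount(SketchArrowSchema)]

if Sys.WORD_SIZE == 64
    # Nine 8-byte fields with no padding between them.
    @assert sizeof(SketchArrowSchema) == 72
    @assert schema_offsets == collect(0:8:64)
end
```

Note that `flags` and `n_children` are `int64_t` in the specification, so "pointer-sized" only describes them on 64-bit targets. Since the interface exchanges raw pointers inside one process, producer and consumer always share the same byte order.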

Import path (C → Julia) — from_c_data(schema_ptr, array_ptr) and from_c_data(schema_ptrs, array_ptrs) import single arrays or entire tables. Supported Arrow formats:

- All primitives (c C s S i I l L e f g b n)
- Strings and binary (u U z Z)
- Fixed-size binary (w:N)
- Generic lists (+l / +L)
- Fixed-size lists (+w:N)
- Structs (+s)
- Maps (+m)
- Dense and sparse unions (+ud: / +us:)
- Dictionary-encoded arrays (all signed index types)
- Dates, times, timestamps (with and without timezone), durations, intervals, decimals
- Non-zero offsets and validity bitmaps (including non-byte-aligned boolean offsets)
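
To make the format strings above concrete, here is a minimal sketch of how a few of them could be classified. The helper `sketch_classify` and the table `SKETCH_PRIMITIVES` are hypothetical names for illustration only. The character-to-type mapping follows the Arrow C Data Interface specification, not necessarily the code in `src/cdatainterface.jl`.

```julia
# Single-character primitive format strings from the Arrow C Data Interface spec.
const SKETCH_PRIMITIVES = Dict{String, DataType}(
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
    "b" => Bool,
)

# Hypothetical helper: classify a format string into a (kind, detail) tuple.
function sketch_classify(fmt::AbstractString)
    haskey(SKETCH_PRIMITIVES, fmt) && return (:primitive, SKETCH_PRIMITIVES[fmt])
    fmt == "n" && return (:null, nothing)
    # "+w:N" must be tested before "w:N"; both carry an integer after the colon.
    startswith(fmt, "+w:") && return (:fixed_size_list, parse(Int, fmt[4:end]))
    startswith(fmt, "w:") && return (:fixed_size_binary, parse(Int, fmt[3:end]))
    return (:unhandled, fmt)   # everything else is out of scope for this sketch
end

sketch_classify("i")      # (:primitive, Int32)
sketch_classify("+w:4")   # (:fixed_size_list, 4)
```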

Export path (Julia → C) — to_c_data(col) and to_c_data(tbl::Arrow.Table) export any Arrow vector or table. Uses a global roots dictionary to keep Julia objects alive while C holds pointers, and a @cfunction-based release callback (registered in __init__) to free them when the consumer calls release.
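
The rooting scheme described above might look roughly like the following sketch. All names (`SKETCH_ROOTS`, `sketch_root!`, `sketch_unroot`) are hypothetical. For brevity the callback receives the token directly as its pointer argument, whereas a real release callback receives a `struct ArrowArray*` and would read the token from its `private_data` field.

```julia
# Token-keyed registry that keeps exported Julia objects reachable for the GC.
const SKETCH_ROOTS = Dict{UInt, Any}()
const SKETCH_ROOTS_LOCK = ReentrantLock()
const SKETCH_NEXT_TOKEN = Ref{UInt}(0)

# Store `obj` in the registry and return the token that identifies it.
function sketch_root!(obj)
    lock(SKETCH_ROOTS_LOCK) do
        token = (SKETCH_NEXT_TOKEN[] += 1)
        SKETCH_ROOTS[token] = obj
        return token
    end
end

# C-callable release: drop the root so the GC may reclaim the object.
function sketch_unroot(private_data::Ptr{Cvoid})::Cvoid
    lock(SKETCH_ROOTS_LOCK) do
        delete!(SKETCH_ROOTS, UInt(private_data))
    end
    return nothing
end

data = Int32[1, 2, 3]
token = sketch_root!(data)

# In a precompiled package this pointer has to be created in `__init__`.
release_ptr = @cfunction(sketch_unroot, Cvoid, (Ptr{Cvoid},))

rooted_before = haskey(SKETCH_ROOTS, token)
ccall(release_ptr, Cvoid, (Ptr{Cvoid},), Ptr{Cvoid}(token))   # simulate the consumer's release
rooted_after = haskey(SKETCH_ROOTS, token)
```

Keying the registry by an integer token rather than by the object itself keeps the value passed through C a plain pointer-sized number.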

Lifetime management — CDataHandle wraps the C pointers and calls the producer's release callbacks when GC'd or when release_c_data is called explicitly. A released flag prevents double-free.
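
A minimal sketch of the released-flag pattern, assuming a Julia closure as a stand-in for the producer's C release callback. `SketchHandle`, `sketch_handle`, and `sketch_release!` are hypothetical names and do not reflect the actual `CDataHandle` layout.

```julia
mutable struct SketchHandle
    release_cb::Function   # stand-in for the producer's C release callback
    released::Bool         # guards against running the callback twice
end

# Idempotent release: the callback runs at most once per handle.
function sketch_release!(h::SketchHandle)
    h.released && return nothing
    h.released = true
    h.release_cb()
    return nothing
end

# Construct a handle and register exactly one finalizer for it.
function sketch_handle(cb::Function)
    h = SketchHandle(cb, false)
    finalizer(sketch_release!, h)
    return h
end

release_calls = Ref(0)
h = sketch_handle(() -> (release_calls[] += 1; nothing))

sketch_release!(h)   # explicit release runs the callback
sketch_release!(h)   # second call is a no-op; so is the finalizer later on
```

Registering the finalizer in a single place, here the constructor helper, is what keeps the explicit and GC-driven paths from racing to release the same handle.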

Public API — ArrowSchema, ArrowArray, CImportedArray, CImportedTable, from_c_data, to_c_data, release_c_data are all exported.

Supporting changes

- `src/Arrow.jl`: export the public API symbols, `include("cdatainterface.jl")`, register `@cfunction` release callbacks in `__init__`.
- `src/table.jl`: add `Table(::NamedTuple)` constructor used by the import path when building a `CImportedTable`.

Tests (`test/cdatainterface.jl`, 108 assertions)

Full round-trip coverage for every supported format, plus:

- Non-byte-aligned boolean bit offset
- `release_c_data` idempotency (double-release is a no-op)
- Table export/import round-trip
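
For readers unfamiliar with the non-byte-aligned case, the sketch below shows how a logical element is located in an LSB-first Arrow bitmap when the array begins partway through a byte. `sketch_getbit` is a hypothetical helper, not a function from the PR.

```julia
# Read logical element `i` (1-based) from an LSB-first Arrow bitmap whose
# first logical element sits `offset` bits into the buffer.
function sketch_getbit(bytes::AbstractVector{UInt8}, offset::Integer, i::Integer)
    bit = offset + i - 1              # 0-based absolute bit position
    byte = bytes[(bit >> 3) + 1]      # byte that contains the bit
    return ((byte >> (bit & 7)) & 0x01) == 0x01
end

bitmap = UInt8[0b10110100, 0b00000001]

# With offset 3, elements 1..6 map to absolute bits 3..8, crossing a byte boundary.
decoded = [sketch_getbit(bitmap, 3, i) for i in 1:6]
# decoded == [false, true, true, false, true, true]
```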

Bug fixes (found while implementing / testing)

| Location | Bug | Fix |
|---|---|---|
| `src/arraytypes/dictencoding.jl` | `child_types = DataType[]` in struct import rejects element types that are not a `DataType`, such as `Union{Missing, Int32}` | Changed to `Type[]` |
| `src/cdatainterface.jl` | Same `DataType[]` issue in dense and sparse union import | Changed to `Type[]` |
| `src/cdatainterface.jl` | Decimal precision/scale parsed as `Int32`, which creates a parametric type distinct from the original with `Int64` parameters, so type equality always fails | Use `parse(Int, ...)` |
| `src/cdatainterface.jl` | `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which crashes because `Table` is a type, not a module | Use `Tables.columnnames(tbl)` |
| `src/cdatainterface.jl` | `from_c_data` attached a finalizer to `CDataHandle` inside `CImportedArray`, causing a potential double-free | Removed duplicate finalizer |
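
The decimal bug is easy to reproduce in isolation. The sketch below uses a hypothetical `SketchDecimal{P, S}` type and a hypothetical parser `sketch_parse_decimal` for the `d:precision,scale[,bitwidth]` format string; neither is the PR's code. It shows that integer type parameters carry their own integer type, so `Int32(10)` and `Int64(10)` produce different types.

```julia
# Hypothetical stand-in for a type parameterised by precision and scale.
struct SketchDecimal{P, S} end

# Parse "d:precision,scale[,bitwidth]"; the spec defaults the bit width to 128.
function sketch_parse_decimal(fmt::AbstractString)
    startswith(fmt, "d:") || throw(ArgumentError("not a decimal format: $fmt"))
    parts = split(fmt[3:end], ',')
    prec = parse(Int, parts[1])       # Int, matching how literals like `10` are typed
    scale = parse(Int, parts[2])
    bitwidth = length(parts) >= 3 ? parse(Int, parts[3]) : 128
    return (prec, scale, bitwidth)
end

p, s, bits = sketch_parse_decimal("d:10,2,128")

# Parameters parsed as Int match a type written with plain integer literals.
same = SketchDecimal{p, s} === SketchDecimal{10, 2}

# Parameters of different integer types never match, even with equal values.
mixed = SketchDecimal{Int32(10), Int32(2)} === SketchDecimal{Int64(10), Int64(2)}
```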

Test plan

# Full suite (~66 000 tests)
julia --project=. -e 'using Pkg; Pkg.develop(path="src/ArrowTypes"); Pkg.test()'

# C Data Interface only (fast, ~30 s)
julia --project=test -e '
  using Test, Arrow, Dates, TimeZones, DataAPI, Tables, PooledArrays
  include("test/cdatainterface.jl")'

🤖 Generated with Claude Code

Implements both directions of the Arrow C Data Interface spec (https://arrow.apache.org/docs/format/CDataInterface.html):

- `Arrow.from_c_data(schema_ptr, array_ptr)`: import an Arrow array from C-owned memory; zero-copy via `unsafe_wrap`; `CDataHandle` finalizer calls the C `release` callbacks automatically.
- `Arrow.to_c_data(col)`: export an `ArrowVector` or `Arrow.Table` to C; GC roots kept alive via a token-keyed global dict; `@cfunction` release callbacks (initialised in `__init__`) delete roots on consumer release.

New public types: `ArrowSchema`, `ArrowArray`, `CImportedArray`, `CImportedTable`, and `release_c_data`.

Supports all Arrow column types: primitives, Bool, String/binary, List (generic, large, fixed-size), Struct, Map, DenseUnion, SparseUnion, DictEncoded, Null, and all time/date/duration types. Handles nullable columns, non-zero array offsets, and custom metadata.

68 new tests in `test/cdatainterface.jl` covering format strings, buffer layout, validity bitmaps, round-trips, release semantics, non-zero offsets, and multi-column table import.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
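
The zero-copy import via `unsafe_wrap` mentioned in this commit message can be illustrated roughly as follows. The buffer is allocated with `Libc.malloc` purely as a stand-in for producer-owned memory, and `Libc.free` stands in for the producer's release callback. This is a sketch of the general technique, not the PR's import code.

```julia
n, offset = 6, 2

# Stand-in for a data buffer owned by a C producer.
buf = Ptr{Int32}(Libc.malloc(n * sizeof(Int32)))
for i in 1:n
    unsafe_store!(buf, Int32(10i), i)
end

# Wrap without copying; `own = false` leaves freeing to the producer.
whole = unsafe_wrap(Array, buf, n; own = false)

# Apply the array offset as a view, again without copying.
view_after_offset = @view whole[(offset + 1):end]

# A producer-side write is visible through the view, which shows the memory is shared.
unsafe_store!(buf, Int32(-1), offset + 1)
first_visible = view_after_offset[1]

# Copy the data out before the memory is released.
snapshot = collect(view_after_offset)

Libc.free(buf)   # after this, `whole` and the view must no longer be used
```

The last line is the reason lifetime management matters: any Julia array wrapping the buffer becomes dangling the moment the producer's release runs.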

Remove duplicate finalizer registrations in `from_c_data`: the `CDataHandle` was getting a finalizer attached but `CImportedArray` already owns and manages the handle's lifetime, causing a potential double-free when the GC collected the handle.

Also widen `child_types` from `DataType[]` to `Type[]` so that abstract element types (e.g. Union{...}) are accepted without a type assertion error when building struct arrays.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

CategoricalRefPool uses 0-based indices (0:n) with pool[0] as a missing sentinel. When passed to arrowvector, ToArrow wraps it with 1-based iteration (1..length(pool)). Since length(pool) == n+1, the last iteration calls pool[n+1], which is out of bounds.

Fix: when firstindex(pool) != 1, skip the sentinel to give arrowvector a standard 1-based view (pool[1:end]). The existing inds adjustment (inds .-= firstindex(refa)) already produces correct Arrow dict indices (-1 for missing, 0..n-1 for valid values).

Also add Table(::NamedTuple) constructor for the Arrow C Data path, and add Arrow as an explicit dep in test/Project.toml so that `julia --project=test test/runtests.jl` works in a local dev setup.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Extend test/cdatainterface.jl with 13 new testsets covering all previously untested code paths in src/cdatainterface.jl:

- Generic lists (+l): Int32, missing, String
- Fixed-size lists (+w:N): Float32 and Int64 tuples
- Maps (+m): Dict{String,Int32}
- Dense unions (+ud:) and sparse unions (+us:)
- All four Duration units (tDs/tDm/tDu/tDn)
- Time nanoseconds (ttn)
- Timestamp with UTC timezone
- Interval year-month (tiM) and day-time (tiD)
- Decimal{10,2,Int128} (d:10,2,128)
- Arrow.Table export round-trip via to_c_data
- Bool import with non-byte-aligned bit offset
- release_c_data idempotency (double-release is a no-op)

Writing the tests uncovered three bugs, all fixed in src/cdatainterface.jl:

1. Dense and sparse union import used `child_types = DataType[]`, which rejects abstract element types such as `Union{Missing, Int32}`. Fixed to `child_types = Type[]` (same fix already applied to the struct path in a prior commit).
2. Decimal precision and scale were parsed as Int32 in `_fmt_to_storage_type`, producing `Decimal{Int32(10),...}` instead of `Decimal{Int64(10),...}`. Since Julia type parameters carry their integer type, the two Decimal types compared unequal even with identical values. Fixed by using `parse(Int, ...)`.
3. `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which crashes because `Table` is a type, not a module. Fixed to `Tables.columnnames(tbl)`.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

hall-alex

Very cool 💥💪🚀

For test coverage: do we also check empty, large, and huge tables / vectors? I would have the agent double-check coverage once more, along the lines of "is every code path really covered?"

hall-alex · 2026-05-18T16:52:37Z

+Mirrors `struct ArrowSchema` from the Arrow C Data Interface specification.
+Layout must be ABI-compatible with the C struct (9 pointer-sized fields).


For this and for below: do we have dedicated tests that check this ABI compatibility? What happens on little-endian / big-endian machines? Do we somehow protect against this, or are the C and Julia sides always of compatible endianness?

hall-alex · 2026-05-18T17:00:54Z

+the C `release` callbacks will be called when the returned `CImportedArray` is GC'd
+or when `Arrow.release_c_data` is called on it.


I don't see where the GC is hooked up to call Arrow.release_c_data -- shouldn't there be a finalizer for that somewhere? Sorry if I missed that.

Do we have tests that check that this GC clean-up works?

Alternatively, one could also let the finalizer fail badly if the thing wasn't released (if we want to avoid accidental, dangling vectors / tables).

robertbuessow and others added 6 commits April 8, 2026 09:47

Merge branch 'main' of github.com:apache/arrow-julia

7f7b72d

cleanup

ad641dd

robertbuessow changed the title ~~Fix BoundsError when dict-encoding CategoricalArrays with missing values~~ Add Arrow C Data Interface support May 15, 2026

hall-alex approved these changes May 18, 2026

robertbuessow wants to merge 6 commits into main from rb-arrow-c-interface
