Add Arrow C Data Interface support #1

Open
robertbuessow wants to merge 6 commits into main from rb-arrow-c-interface

robertbuessow (Collaborator) commented May 12, 2026

Summary

Implements the Arrow C Data Interface specification, enabling zero-copy interop between Julia and any other Arrow-compatible runtime (Python/PyArrow, C++, R, etc.) within the same process.

What's added

Core implementation (src/cdatainterface.jl, ~1 200 lines)

C struct bindings — ABI-compatible Julia mutable struct definitions for ArrowSchema and ArrowArray, with size assertions to catch layout drift.
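As an illustration of what such bindings and size assertions can look like, here is a minimal sketch. The field order follows the `ArrowSchema` and `ArrowArray` definitions in the Arrow C Data Interface specification; the `Sketch*` names and `sketch_check_layout` are hypothetical and are not the definitions in src/cdatainterface.jl. The expected sizes (72 and 80 bytes) are only asserted on 64-bit platforms.

```julia
# Hypothetical mirror of `struct ArrowSchema` (names prefixed with Sketch on purpose).
mutable struct SketchArrowSchema
    format::Cstring                          # const char* format
    name::Cstring                            # const char* name
    metadata::Ptr{UInt8}                     # const char* metadata (binary, may contain NUL bytes)
    flags::Int64                             # int64_t flags
    n_children::Int64                        # int64_t n_children
    children::Ptr{Ptr{SketchArrowSchema}}    # struct ArrowSchema** children
    dictionary::Ptr{SketchArrowSchema}       # struct ArrowSchema* dictionary
    release::Ptr{Cvoid}                      # void (*release)(struct ArrowSchema*)
    private_data::Ptr{Cvoid}                 # void* private_data
end

# Hypothetical mirror of `struct ArrowArray`.
mutable struct SketchArrowArray
    length::Int64
    null_count::Int64
    offset::Int64
    n_buffers::Int64
    n_children::Int64
    buffers::Ptr{Ptr{Cvoid}}                 # const void** buffers
    children::Ptr{Ptr{SketchArrowArray}}     # struct ArrowArray** children
    dictionary::Ptr{SketchArrowArray}        # struct ArrowArray* dictionary
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end

# Layout check: fails loudly if a field is added, removed, or retyped.
function sketch_check_layout()
    if sizeof(Ptr{Cvoid}) == 8
        @assert sizeof(SketchArrowSchema) == 72    # 9 fields of 8 bytes each
        @assert sizeof(SketchArrowArray) == 80     # 10 fields of 8 bytes each
        @assert fieldoffset(SketchArrowSchema, 4) == 24   # flags follows 3 pointers
        @assert fieldoffset(SketchArrowArray, 6) == 40    # buffers follows 5 int64 fields
    end
    return true
end

sketch_check_layout()
```

Because the interface is used within a single process, both sides read these fields in the host's native byte order.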

Import path (C → Julia): from_c_data(schema_ptr, array_ptr) and from_c_data(schema_ptrs, array_ptrs) import single arrays or entire tables. Supported Arrow formats:

  • All primitives (c C s S i I l L e f g b n)
  • Strings and binary (u U z Z)
  • Fixed-size binary (w:N)
  • Generic lists (+l / +L)
  • Fixed-size lists (+w:N)
  • Structs (+s)
  • Maps (+m)
  • Dense and sparse unions (+ud: / +us:)
  • Dictionary-encoded arrays (all signed index types)
  • Dates, times, timestamps (with and without timezone), durations, intervals, decimals
  • Non-zero offsets and validity bitmaps (including non-byte-aligned boolean offsets)
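To make the format strings in the list above concrete, the following sketch decodes a few of them. `sketch_parse_format` and `SKETCH_PRIMITIVES` are hypothetical names and cover only a subset; this is not the PR's `_fmt_to_storage_type`. Mapping `n` to `Missing` is an assumption made for illustration.

```julia
# Single-character primitive formats from the Arrow C Data Interface spec.
const SKETCH_PRIMITIVES = Dict{String,Type}(
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
    "b" => Bool,
    "n" => Missing,   # null type; the Julia-side choice here is an assumption
)

# Returns a (kind, parameters) tuple for a handful of format strings.
function sketch_parse_format(fmt::AbstractString)
    if haskey(SKETCH_PRIMITIVES, fmt)
        return (:primitive, SKETCH_PRIMITIVES[fmt])
    elseif startswith(fmt, "w:")                 # fixed-size binary, e.g. "w:16"
        return (:fixed_size_binary, parse(Int, fmt[3:end]))
    elseif startswith(fmt, "+w:")                # fixed-size list, e.g. "+w:3"
        return (:fixed_size_list, parse(Int, fmt[4:end]))
    elseif startswith(fmt, "d:")                 # decimal, e.g. "d:10,2" or "d:10,2,128"
        parts = split(fmt[3:end], ',')
        precision = parse(Int, parts[1])
        scale = parse(Int, parts[2])
        bitwidth = length(parts) >= 3 ? parse(Int, parts[3]) : 128  # spec default is 128
        return (:decimal, (precision, scale, bitwidth))
    else
        return (:unsupported, String(fmt))
    end
end

sketch_parse_format("i")            # (:primitive, Int32)
sketch_parse_format("d:10,2,128")   # (:decimal, (10, 2, 128))
```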

Export path (Julia → C): to_c_data(col) and to_c_data(tbl::Arrow.Table) export any Arrow vector or table. Uses a global roots dictionary to keep Julia objects alive while C holds pointers, and a @cfunction-based release callback (registered in __init__) to free them when the consumer calls release.
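The rooting pattern described here can be sketched as follows. All names (`SKETCH_ROOTS`, `sketch_root!`, `sketch_release`) are hypothetical, and the callback signature is simplified: the real release callback receives a pointer to the `ArrowArray` or `ArrowSchema` struct and reads `private_data` from it, whereas this sketch passes the token directly.

```julia
const SKETCH_ROOTS = Dict{UInt,Any}()        # token => Julia object kept alive for C
const SKETCH_ROOTS_LOCK = ReentrantLock()
const SKETCH_NEXT_TOKEN = Ref{UInt}(0)

# Store `obj` under a fresh token so the GC cannot collect it.
function sketch_root!(obj)
    lock(SKETCH_ROOTS_LOCK) do
        SKETCH_NEXT_TOKEN[] += 1
        token = SKETCH_NEXT_TOKEN[]
        SKETCH_ROOTS[token] = obj
        return token
    end
end

# C-callable release: drops the root identified by the token in `private_data`.
function sketch_release(private_data::Ptr{Cvoid})::Cvoid
    lock(SKETCH_ROOTS_LOCK) do
        delete!(SKETCH_ROOTS, UInt(private_data))
    end
    return nothing
end

# In a package this pointer would be created in __init__, since C function
# pointers are not valid across precompilation.
release_ptr = @cfunction(sketch_release, Cvoid, (Ptr{Cvoid},))

buffer = Int32[1, 2, 3, 4]
token = sketch_root!(buffer)
rooted_before = haskey(SKETCH_ROOTS, token)

# Simulate the consumer calling release through the C function pointer.
ccall(release_ptr, Cvoid, (Ptr{Cvoid},), Ptr{Cvoid}(token))
rooted_after = haskey(SKETCH_ROOTS, token)
```

Keying the dictionary by a token rather than by the object itself lets the token travel through C as an opaque `void*`.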

Lifetime management: CDataHandle wraps the C pointers and calls the producer's release callbacks when GC'd or when release_c_data is called explicitly. A released flag prevents double-free.
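A minimal sketch of this pattern, assuming a single finalizer and a boolean flag, is shown below. `SketchHandle` and `sketch_release!` are hypothetical names, and a Julia closure stands in for the producer's C release callback.

```julia
mutable struct SketchHandle
    release::Function    # stand-in for the producer's C release callback
    released::Bool
    function SketchHandle(release)
        h = new(release, false)
        finalizer(sketch_release!, h)   # exactly one finalizer per handle
        return h
    end
end

# Idempotent: the flag makes a second call (explicit or from the GC) a no-op.
function sketch_release!(h::SketchHandle)
    h.released && return nothing
    h.released = true
    h.release()
    return nothing
end

calls = Ref(0)
handle = SketchHandle(() -> (calls[] += 1))
sketch_release!(handle)   # explicit release
sketch_release!(handle)   # double release does nothing
calls[]                   # the callback ran once
```

Setting the flag before invoking the callback means that even if the finalizer later runs on the same handle, the producer's callback is not invoked a second time.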

Public API: ArrowSchema, ArrowArray, CImportedArray, CImportedTable, from_c_data, to_c_data, release_c_data are all exported.

Supporting changes

  • src/Arrow.jl: export the public API symbols, include("cdatainterface.jl"), register @cfunction release callbacks in __init__.
  • src/table.jl: add Table(::NamedTuple) constructor used by the import path when building a CImportedTable.

Tests (test/cdatainterface.jl, 108 assertions)

Full round-trip coverage for every supported format, plus:

  • Non-byte-aligned boolean bit offset
  • release_c_data idempotency (double-release is a no-op)
  • Table export/import round-trip
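For the non-byte-aligned boolean case, the bit arithmetic can be sketched as below. Arrow bitmaps are packed least-significant bit first; `sketch_unpack_bits` is a hypothetical helper that copies into a `Vector{Bool}` for clarity and is not the PR's import code.

```julia
# Read `len` bits starting at bit `offset` (0-based) from an LSB-first bitmap.
function sketch_unpack_bits(bytes::AbstractVector{UInt8}, offset::Integer, len::Integer)
    out = Vector{Bool}(undef, len)
    for i in 0:len-1
        j = offset + i                       # absolute bit position
        byte = bytes[(j >> 3) + 1]           # j ÷ 8, shifted to 1-based indexing
        out[i + 1] = ((byte >> (j & 7)) & 0x01) == 0x01
    end
    return out
end

bitmap = UInt8[0b10110100, 0b00000001]
# Offset 2 is not a multiple of 8, so the slice starts mid-byte and spills
# into the second byte.
sketch_unpack_bits(bitmap, 2, 7)   # [true, false, true, true, false, true, true]
```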

Bug fixes (found while implementing / testing)

| Location | Bug | Fix |
| --- | --- | --- |
| src/arraytypes/dictencoding.jl | child_types = DataType[] in struct import rejects Union types like Union{Missing, Int32}, which are not DataType instances | Changed to Type[] |
| src/cdatainterface.jl | Same DataType[] issue in dense and sparse union import | Changed to Type[] |
| src/cdatainterface.jl | Decimal precision/scale parsed as Int32, producing a parametric type distinct from the Int64-parameterized original, so type equality always fails | Use parse(Int, ...) |
| src/cdatainterface.jl | to_c_data(::Arrow.Table) called Arrow.Table.names(tbl), which crashes because Table is a type, not a module | Use Tables.columnnames(tbl) |
| src/cdatainterface.jl | from_c_data attached a finalizer to the CDataHandle even though CImportedArray already manages its lifetime, risking a double-free | Removed duplicate finalizer |
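Two of these bugs come from Julia type-system details that are easy to reproduce in isolation. The sketch below uses a hypothetical `SketchDecimal{P,S}` type as a stand-in for the package's Decimal type and assumes a 64-bit Julia, where `Int` is `Int64`.

```julia
struct SketchDecimal{P,S} end   # hypothetical stand-in with value type parameters

# Integer type parameters keep their integer type, so Int32(10) and 10 differ.
from_int32 = SketchDecimal{Int32(10),Int32(2)}
from_int   = SketchDecimal{parse(Int, "10"),parse(Int, "2")}
expected   = SketchDecimal{10,2}

same_with_int32 = from_int32 == expected   # false on 64-bit Julia
same_with_int   = from_int == expected     # true

# A Union is a Type but not a DataType, so a Vector{DataType} cannot hold it.
# Abstract types such as Real, by contrast, are DataTypes.
u = Union{Missing,Int32}
union_is_datatype = u isa DataType         # false
union_is_type     = u isa Type             # true
real_is_datatype  = Real isa DataType      # true

push_rejected = try
    push!(DataType[], u)
    false
catch
    true
end
push_accepted = length(push!(Type[], u)) == 1
```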

Test plan

# Full suite (~66 000 tests)
julia --project=. -e 'using Pkg; Pkg.develop(path="src/ArrowTypes"); Pkg.test()'

# C Data Interface only (fast, ~30 s)
julia --project=test -e '
  using Test, Arrow, Dates, TimeZones, DataAPI, Tables, PooledArrays
  include("test/cdatainterface.jl")'

🤖 Generated with Claude Code

robertbuessow and others added 6 commits April 8, 2026 09:47
Implements both directions of the Arrow C Data Interface spec
(https://arrow.apache.org/docs/format/CDataInterface.html):

- `Arrow.from_c_data(schema_ptr, array_ptr)` — import an Arrow array
  from C-owned memory; zero-copy via `unsafe_wrap`; `CDataHandle`
  finalizer calls the C `release` callbacks automatically.
- `Arrow.to_c_data(col)` — export an `ArrowVector` or `Arrow.Table`
  to C; GC roots kept alive via a token-keyed global dict; `@cfunction`
  release callbacks (initialised in `__init__`) delete roots on consumer
  release.
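The zero-copy import idea can be sketched with `unsafe_wrap` on a buffer that Julia itself owns, which keeps the example runnable without a C producer. The names are illustrative only; in the real import path the memory belongs to the producer and stays valid until its `release` callback is called.

```julia
owner = Int32[10, 20, 30, 40]   # stands in for a C-owned data buffer
offset = 1                      # Arrow array offset, in elements
len = 3

result = GC.@preserve owner begin
    base = pointer(owner)
    # Pointer arithmetic on Ptr is in bytes, so scale the element offset.
    wrapped = unsafe_wrap(Array, base + offset * sizeof(Int32), len; own = false)
    owner[2] = 99               # write through the original buffer
    (wrapped[1], length(wrapped))   # the wrapper sees the write: no copy was made
end
```

With `own = false`, Julia never frees the wrapped memory, which is why the imported array must keep a handle that eventually calls the producer's release callback.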

New public names: `ArrowSchema`, `ArrowArray`, `CImportedArray`,
`CImportedTable` (types), and `release_c_data` (function).

Supports all Arrow column types: primitives, Bool, String/binary,
List (generic, large, fixed-size), Struct, Map, DenseUnion,
SparseUnion, DictEncoded, Null, and all time/date/duration types.
Handles nullable columns, non-zero array offsets, and custom metadata.

68 new tests in `test/cdatainterface.jl` covering format strings,
buffer layout, validity bitmaps, round-trips, release semantics,
non-zero offsets, and multi-column table import.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Remove duplicate finalizer registrations in `from_c_data`: the
`CDataHandle` was getting a finalizer attached but `CImportedArray`
already owns and manages the handle's lifetime, causing a potential
double-free when the GC collected the handle.

Also widen `child_types` from `DataType[]` to `Type[]` so that
`Union` element types (e.g. Union{...}), which are not `DataType`
instances, are accepted without an error when building struct arrays.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

CategoricalRefPool uses 0-based indices (0:n) with pool[0] as a missing
sentinel. When passed to arrowvector, ToArrow wraps it with 1-based
iteration (1..length(pool)). Since length(pool) == n+1, the last
iteration calls pool[n+1], which is out of bounds.

Fix: when firstindex(pool) != 1, skip the sentinel to give arrowvector
a standard 1-based view (pool[1:end]). The existing inds adjustment
(inds .-= firstindex(refa)) already produces correct Arrow dict indices
(-1 for missing, 0..n-1 for valid values).
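The out-of-bounds access and the fix can be reproduced with a small stand-in type. `SketchRefPool` is hypothetical and only mimics the indexing behaviour described above (index 0 is the missing sentinel, indices 1..n are levels); it is not CategoricalArrays' actual pool type.

```julia
struct SketchRefPool
    levels::Vector{String}
end
Base.length(p::SketchRefPool) = length(p.levels) + 1     # sentinel plus n levels
Base.firstindex(::SketchRefPool) = 0
Base.lastindex(p::SketchRefPool) = length(p.levels)
function Base.getindex(p::SketchRefPool, i::Integer)
    0 <= i <= length(p.levels) || throw(BoundsError(p, i))
    return i == 0 ? missing : p.levels[i]
end

pool = SketchRefPool(["a", "b", "c"])

# Naive 1-based iteration runs to length(pool) == n + 1 and overshoots.
naive_failed = try
    [pool[i] for i in 1:length(pool)]
    false
catch err
    err isa BoundsError
end

# Fix: when the pool is not 1-based, iterate 1:lastindex(pool) to skip the sentinel.
fixed = firstindex(pool) != 1 ?
    [pool[i] for i in 1:lastindex(pool)] :
    [pool[i] for i in 1:length(pool)]
```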

Also add Table(::NamedTuple) constructor for the Arrow C Data path,
and add Arrow as an explicit dep in test/Project.toml so that
`julia --project=test test/runtests.jl` works in a local dev setup.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Extend test/cdatainterface.jl with 13 new testsets covering all
previously untested code paths in src/cdatainterface.jl:
- Generic lists (+l): Int32, missing, String
- Fixed-size lists (+w:N): Float32 and Int64 tuples
- Maps (+m): Dict{String,Int32}
- Dense unions (+ud:) and sparse unions (+us:)
- All four Duration units (tDs/tDm/tDu/tDn)
- Time nanoseconds (ttn)
- Timestamp with UTC timezone
- Interval year-month (tiM) and day-time (tiD)
- Decimal{10,2,Int128} (d:10,2,128)
- Arrow.Table export round-trip via to_c_data
- Bool import with non-byte-aligned bit offset
- release_c_data idempotency (double-release is a no-op)

Writing the tests uncovered three bugs, all fixed in src/cdatainterface.jl:

1. Dense and sparse union import used `child_types = DataType[]`, which
   rejects `Union` element types such as `Union{Missing, Int32}` because
   they are not `DataType` instances. Fixed to `child_types = Type[]`
   (same fix already applied to the struct path in a prior commit).

2. Decimal precision and scale were parsed as Int32 in
   `_fmt_to_storage_type`, producing `Decimal{Int32(10),...}` instead
   of `Decimal{Int64(10),...}`. Since Julia type parameters carry their
   integer type, the two Decimal types compared unequal even with
   identical values. Fixed by using `parse(Int, ...)`.

3. `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which
   crashes because `Table` is a type, not a module. Fixed to
   `Tables.columnnames(tbl)`.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@robertbuessow robertbuessow changed the title Fix BoundsError when dict-encoding CategoricalArrays with missing values Add Arrow C Data Interface support May 15, 2026

@hall-alex hall-alex left a comment

Very cool 💥💪🚀

On test coverage: do we also check empty, large, and huge tables and vectors? I would have the agent double-check coverage once more: is every code path really covered?

Comment thread src/cdatainterface.jl
Comment on lines +29 to +30
Mirrors `struct ArrowSchema` from the Arrow C Data Interface specification.
Layout must be ABI-compatible with the C struct (9 pointer-sized fields).

For this and for the struct below: do we have dedicated tests that check this ABI compatibility? What happens on little-endian versus big-endian machines? Do we protect against a mismatch somehow, or are the C and Julia sides always of compatible endianness?

Comment thread src/cdatainterface.jl
Comment on lines +514 to +515
the C `release` callbacks will be called when the returned `CImportedArray` is GC'd
or when `Arrow.release_c_data` is called on it.

I don't see where the GC is hooked up to call Arrow.release_c_data -- shouldn't there be a finalizer for that somewhere? Sorry if I missed that.

Do we have tests that check that this GC clean-up works?

Alternatively, the finalizer could fail loudly if the object was never released (if we want to avoid accidentally dangling vectors or tables).
