Skip to content

[Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++ #33997

@jorisvandenbossche

Description

@jorisvandenbossche

When wrapping a type (or array) in a pyarrow object, we need to define which Python class to use. Currently, for extension types, this logic lives here in pyarrow_wrap_data_type:

elif type.get().id() == _Type_EXTENSION:
ext_type = <const CExtensionType*> type.get()
cpy_ext_type = dynamic_cast[_CPyExtensionTypePtr](ext_type)
if cpy_ext_type != nullptr:
return cpy_ext_type.GetInstance()
else:
out = BaseExtensionType.__new__(BaseExtensionType)

So there are currently two options:

  • The ExtensionType is implemented in Python, by subclassing pyarrow.(Py)ExtensionType, and which links to the C++ arrow::py::PyExtensionType (a subclass of arrow::ExtensionType). In this case, we store the python type instance on the C++ instance, and return this as python object in pyarrow_wrap_data_type.
  • The ExtensionType is implemented in C++, and then we currently always fall back to wrap this in the pyarrow.BaseExtenstionType base class (there is currently a bug in this, but that is getting fixed in GH-33802).

However, that means that for such extension types implemented in C++, there is currently no way to have a "richer" python Type object (or Array object, since that is determined by the Type, and for a BaseExtensionType, that will always use the base ExtensionArray). While for an extension type, you might want to add type-specific attributes or methods.

For canonical extension types that are implemented in Arrow C++ itself (for example, the currently discussed Tensor extension type in #8510, or a previous effort to add complex type as extension type in #10565), I think it will work today to create a custom subclass of pyarrow.BaseExtensionType for the specific canonical type, and then we could add a special case to pyarrow_wrap_data_type checking the name of the extension type, and if it is a canonical one we implement ourselves, use the python subclass we implemented ourselves.

But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions