EPIC: complete GroupValuesColumn type coverage (nested types + remaining primitives) #22715

@zhuqi-lucas

Background

GroupValuesColumn (the column-wise multi-column GROUP BY storage) provides type-specific specializations under multi_group_by/ so a wide GROUP BY can use the column-native + short-circuit fast path instead of falling back to the byte-encoded GroupValuesRows path. Today the type allow-list is partial: any column outside the supported set drags the entire grouping onto the slow path, even when every other column would have qualified for the fast one.
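The all-or-nothing behavior described above can be sketched as follows. This is a minimal illustration, not the DataFusion code: `ColType`, `GroupPath`, and `choose_path` are hypothetical stand-ins, and the `supported_type` here is a simplified model of the allow-list that covers only a few of the types named in this issue.

```rust
/// Hypothetical, simplified stand-in for Arrow's DataType (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int64,
    Utf8,
    Timestamp,
    Float16,
    Decimal256,
}

/// Which storage the GROUP BY ends up using (hypothetical names).
#[derive(Debug, PartialEq)]
enum GroupPath {
    ColumnWise, // models GroupValuesColumn
    RowEncoded, // models GroupValuesRows
}

/// Simplified model of the allow-list: true only for types that have
/// a column-wise specialization today.
fn supported_type(t: &ColType) -> bool {
    matches!(t, ColType::Int64 | ColType::Utf8 | ColType::Timestamp)
}

/// A single unsupported column forces the whole grouping onto the
/// row-encoded path, regardless of how many other columns qualify.
fn choose_path(group_by: &[ColType]) -> GroupPath {
    if group_by.iter().all(supported_type) {
        GroupPath::ColumnWise
    } else {
        GroupPath::RowEncoded
    }
}
```

Under this model, `[Int64, Utf8]` takes the column-wise path, while `[Int64, Utf8, Float16]` falls back entirely, which is the gap this EPIC aims to close.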

This EPIC tracks completing the GroupValuesColumn type coverage so that the row-encoded fallback is needed only as an explicit opt-in, not as a forced fallback for missing specializations.

Already supported (today on main)

Int8..Int64, UInt8..UInt64, Float32, Float64, Decimal128, Utf8, LargeUtf8, Utf8View, Binary, LargeBinary, BinaryView, Boolean, Date32, Date64, Time32(Second/Millisecond), Time64(Microsecond/Nanosecond)†, Timestamp(*).

† Time64 alignment between supported_type and the dispatcher is being fixed as part of PR #22706.

Tracking

Nested types

Remaining primitives

Each item below is blocked until the make_group_column factory and the recursive supported_type land in PR 1 of the #22706 sequence. After that, each item can ship as an independently mergeable PR.

  • FixedSizeBinary. Fixed-width bytes per row. Closest in shape to PrimitiveGroupValueBuilder but with a runtime-known fixed byte width. Likely the smallest new builder.
  • Float16. Arrow already has the primitive type; need explicit NaN handling in is_eq (match the Float32 / Float64 behavior in PrimitiveGroupValueBuilder).
  • Duration(TimeUnit). Same shape as Timestamp (four TimeUnit arms in the dispatcher), four DurationXxxType slot-ins.
  • Interval(IntervalUnit). Three variants (YearMonth = 4 bytes, DayTime = 8 bytes, MonthDayNano = 16 bytes), three separate dispatcher arms and three native widths.
  • Decimal256. arrow::array::types::Decimal256Type has Native = arrow_buffer::i256, a 32-byte struct rather than a Copy-cheap native scalar. Either relax the T: Copy requirement in PrimitiveGroupValueBuilder or add a sibling builder specialized to wide native types.
  • Dictionary<K, V>. Most involved. Need to decide:
    • Option A: hash / compare on the dictionary's decoded logical value. Conceptually clean, behaves like Utf8 / Binary. Costlier in memory because each unique decoded value is materialized at intern time.
    • Option B: hash / compare on the encoded key under a fixed-dictionary contract (i.e. the same K -> V mapping is asserted across batches). Cheaper but only safe if the dictionary is shared / known-stable, which is not guaranteed by Arrow at the schema level.
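To make the Dictionary trade-off concrete, here is a small sketch of why Option B depends on a stable dictionary. It assumes a toy `DictBatch` type (hypothetical, not an Arrow array) and uses a plain `HashMap` for interning; `intern_decoded`, `intern_keys`, and `demo` are illustrative names only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one dictionary-encoded batch:
/// `keys[i]` is an index into `values`.
struct DictBatch {
    values: Vec<&'static str>,
    keys: Vec<usize>,
}

/// Option A: intern on the decoded logical value.
/// Each distinct value is materialized once as an owned String.
fn intern_decoded(batches: &[DictBatch]) -> Vec<usize> {
    let mut groups: HashMap<String, usize> = HashMap::new();
    let mut out = Vec::new();
    for batch in batches {
        for &k in &batch.keys {
            let next = groups.len();
            let id = *groups.entry(batch.values[k].to_string()).or_insert(next);
            out.push(id);
        }
    }
    out
}

/// Option B: intern on the raw key, ignoring the dictionary contents.
/// Only correct if every batch shares the same key -> value mapping.
fn intern_keys(batches: &[DictBatch]) -> Vec<usize> {
    let mut groups: HashMap<usize, usize> = HashMap::new();
    let mut out = Vec::new();
    for batch in batches {
        for &k in &batch.keys {
            let next = groups.len();
            let id = *groups.entry(k).or_insert(next);
            out.push(id);
        }
    }
    out
}

/// Two batches whose dictionaries list the same values in a different order.
fn demo() -> (Vec<usize>, Vec<usize>) {
    let batches = vec![
        DictBatch { values: vec!["a", "b"], keys: vec![0, 1, 0] }, // a, b, a
        DictBatch { values: vec!["b", "a"], keys: vec![0, 1] },    // b, a
    ];
    (intern_decoded(&batches), intern_keys(&batches))
}
```

In `demo`, the logical rows are a, b, a, b, a. Interning on decoded values assigns group ids `[0, 1, 0, 1, 0]`, while interning on raw keys assigns `[0, 1, 0, 0, 1]`, which merges "b" from the second batch into the group for "a". That mismatch is the failure mode Option B has to rule out through a fixed-dictionary contract.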

Related strategic direction (not blocked by this EPIC)

#22701 proposes a generic FallbackGroupColumn so any Arrow type can go through GroupValuesColumn with a type-erased Arrow comparator. If that lands, the items in this EPIC become opt-in fast-path specializations on top of the generic fallback rather than prerequisites for the column-wise path. The two directions are complementary.
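One way the two directions could compose is sketched below. Everything here is an assumption made for illustration: the `GroupColumn` trait is reduced to two operations, `Value` and `LogicalType` are toy stand-ins for Arrow data, and `make_column` is a hypothetical factory, not the `make_group_column` from #22706 or the `FallbackGroupColumn` from #22701.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified logical types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Int64,
    Interval,
}

/// Hypothetical cell value: a typed integer or opaque bytes.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Opaque(Vec<u8>),
}

/// Reduced model of a per-column group builder.
trait GroupColumn {
    fn append(&mut self, v: &Value);
    fn equal_to(&self, row: usize, v: &Value) -> bool;
    fn kind(&self) -> &'static str;
}

/// Fast path: stores native values and compares them directly.
struct Int64Column {
    vals: Vec<i64>,
}

impl GroupColumn for Int64Column {
    fn append(&mut self, v: &Value) {
        match v {
            Value::Int(i) => self.vals.push(*i),
            _ => panic!("type mismatch"),
        }
    }
    fn equal_to(&self, row: usize, v: &Value) -> bool {
        matches!(v, Value::Int(i) if *i == self.vals[row])
    }
    fn kind(&self) -> &'static str {
        "specialized"
    }
}

/// Generic path: stores values as-is and defers to a type-erased comparator.
struct FallbackColumn {
    vals: Vec<Value>,
    cmp: fn(&Value, &Value) -> Ordering,
}

impl GroupColumn for FallbackColumn {
    fn append(&mut self, v: &Value) {
        self.vals.push(v.clone());
    }
    fn equal_to(&self, row: usize, v: &Value) -> bool {
        (self.cmp)(&self.vals[row], v) == Ordering::Equal
    }
    fn kind(&self) -> &'static str {
        "fallback"
    }
}

/// Toy comparator standing in for a type-erased Arrow comparator.
fn generic_cmp(a: &Value, b: &Value) -> Ordering {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => x.cmp(y),
        (Value::Opaque(x), Value::Opaque(y)) => x.cmp(y),
        (Value::Int(_), Value::Opaque(_)) => Ordering::Less,
        (Value::Opaque(_), Value::Int(_)) => Ordering::Greater,
    }
}

/// Hypothetical factory: use a specialization when one exists,
/// otherwise stay column-wise through the generic fallback.
fn make_column(t: LogicalType) -> Box<dyn GroupColumn> {
    match t {
        LogicalType::Int64 => Box::new(Int64Column { vals: Vec::new() }),
        _ => Box::new(FallbackColumn { vals: Vec::new(), cmp: generic_cmp }),
    }
}
```

In this shape, adding a specialization for a new type only means adding a match arm in the factory; types without one still avoid the row-encoded path.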

Cross-cutting requirements

Every new builder added under this EPIC should follow the testing structure established by PR #22706.
