Skip to content

[Feature] Support for VARIANT data type#2875

Open
XuQianJin-Stars wants to merge 1 commit intoapache:mainfrom
XuQianJin-Stars:feature/variant-data-type
Open

[Feature] Support for VARIANT data type#2875
XuQianJin-Stars wants to merge 1 commit intoapache:mainfrom
XuQianJin-Stars:feature/variant-data-type

Conversation

@XuQianJin-Stars
Copy link
Contributor

Purpose

Linked issue: close #2873

Introduce a first-class Variant data type throughout Fluss's row infrastructure, replacing the previous opaque byte[] representation. This aligns Fluss with the Variant Binary Encoding spec and follows the same design pattern as Apache Paimon's Variant implementation, providing structured value() and metadata() access for semi-structured data.

Brief change log

1. Core Types (fluss-common)

  • New Variant interface (fluss-common/.../row/Variant.java): Defines value(), metadata(), sizeInBytes(), copy() accessors, plus static helpers bytesToVariant(byte[]) / variantToBytes(Variant) for backward-compatible wire format conversion.
  • New GenericVariant class (fluss-common/.../row/GenericVariant.java): Default implementation storing two byte[] fields (value + metadata) with proper equals(), hashCode(), toString().
  • New VariantType (fluss-common/.../types/VariantType.java): Adds VARIANT to Fluss's type system with DataTypeRoot.VARIANT enum entry, parser support, JSON serde, and visitor pattern integration.

2. Row Infrastructure (fluss-common)

  • DataGetters: Added Variant getVariant(int pos) abstract method.
  • All InternalRow implementations: GenericRow, BinaryRow (via AlignedRow), CompactedRow, IndexedRow, ProjectedRow, PaddingRow, ColumnarRow — all implement getVariant().
  • All InternalArray implementations: GenericArray, BinaryArray, ColumnarArray — all implement getVariant().
  • Binary writers: BinaryWriter, AbstractBinaryWriter, BinaryArrayWriter, SequentialBinaryWriter — added writeVariant(int pos, Variant).
  • Compacted/Indexed readers & writers: CompactedRowReader/Writer, IndexedRowReader/Writer — added readVariant() / writeVariant().
  • Arrow integration: New ArrowVariantColumnVector (reader) and ArrowVariantWriter (writer) for Arrow-based columnar storage; ArrowUtils updated.

3. Client Converters (fluss-client)

  • PojoToRowConverter: Added VARIANT handling, supporting both byte[] and Variant inputs.
  • RowToPojoConverter: Added VARIANT → byte[] conversion.

4. Flink Bridge (fluss-flink)

  • FlussTypeToFlinkType: Maps VARIANT → Flink BYTES (with TODO for future native VariantType support).
  • FlussRowToFlinkRowConverter: Converts Variantbyte[] via variantToBytes().
  • FlussRowToJsonConverters: Serializes Variant as binary JSON node.
  • FlinkAsFlussRow / FlinkAsFlussArray: Implements getVariant() by deserializing from Flink's byte[].
  • PojoToRowConverter (Flink): Added VARIANT handling.
  • RowDataSerializationSchema: Added VARIANT case.

5. Spark Bridge (fluss-spark)

  • FlussToSparkTypeVisitor: Maps VARIANT → Spark BinaryType (with TODO for Spark 4 native VariantType via shim mechanism).
  • SparkAsFlussRow / SparkAsFlussArray: Implements getVariant() by deserializing from Spark's byte[].

6. Lake Connectors (fluss-lake)

  • Paimon: FlussDataTypeToPaimonDataType maps VARIANT; PaimonRowAsFlussRow / PaimonArrayAsFlussArray implement getVariant() (with TODO for native Paimon Variant once dependency is upgraded).
  • Iceberg: FlussDataTypeToIcebergDataType maps VARIANT; IcebergRecordAsFlussRow / IcebergArrayAsFlussArray implement getVariant().
  • Lance: LanceArrowUtils, ArrowDataConverter, ShadedArrowBatchWriter updated with VARIANT support.

7. Utilities

  • InternalRowUtils, TypeUtils, PartitionUtils — added VARIANT handling.
  • DataTypeJsonSerde — VARIANT serialization/deserialization.

Tests

  • DataTypeVisitorTest: Updated to cover VariantType in the type visitor test.
  • InternalArrayAssert: Added getVariant() assertion support for test utilities.
  • Existing tests pass with mvn compile -DskipTests across all modules (fluss-common, fluss-flink, fluss-spark, fluss-lake-paimon, fluss-lake-iceberg, fluss-lake-lance).

API and Format

  • New public API: Variant interface and GenericVariant class in fluss-common.
  • New data type: VARIANT added to DataTypeRoot and DataTypes.
  • Storage format: No change. The on-wire binary format remains [4-byte value length (big-endian)][value bytes][metadata bytes]. The Variant.variantToBytes() / Variant.bytesToVariant() methods handle the conversion, ensuring backward compatibility.
  • Engine mappings: VARIANT is mapped to BYTES / BinaryType in Flink and Spark respectively (degraded mapping due to lack of native engine support), and to the corresponding binary types in lake formats.

Documentation

  • website/docs/_configs/_partial_config.mdx: Auto-generated config documentation updated (unrelated config changes included in the regeneration).
  • No new user-facing documentation page is needed at this stage. VARIANT type documentation will be added in a follow-up when end-to-end SQL support is available.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/variant-data-type branch from 84e0b17 to 905563e Compare March 15, 2026 05:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support for VARIANT data type

1 participant