
feat(vector): Support writing VECTOR to parquet and avro formats using Spark #18328

Open
rahil-c wants to merge 18 commits into apache:master from rahil-c:rahil/vector-schema-spark-converters-parquet

Conversation


@rahil-c rahil-c commented Mar 17, 2026

Describe the issue this Pull Request addresses

Builds on #18146 (VECTOR type in HoodieSchema) and #18190 (Spark↔HoodieSchema converters) to complete the full read/write pipeline for vector columns in Apache Hudi backed by Parquet.

Vectors are stored as Parquet FIXED_LEN_BYTE_ARRAY (little-endian, IEEE-754) rather than repeated groups.

Summary and Changelog

Write path

  • HoodieRowParquetWriteSupport: detects ArrayType columns annotated with hudi_type=VECTOR(dim, elementType) metadata and serialises them as FIXED_LEN_BYTE_ARRAY instead of a Parquet list. Dimension mismatch at write time throws HoodieException to prevent silent data corruption.
  • Handles the FLOAT32, FLOAT64, and INT8 element types.
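The fixed-length layout behind the bullets above can be sketched as follows. This is an illustrative standalone helper, not the actual HoodieRowParquetWriteSupport code; `VectorPackSketch` and `packFloats` are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the FIXED_LEN_BYTE_ARRAY layout: each FLOAT32 element is
// written as 4 little-endian IEEE-754 bytes, so a vector of dimension d
// occupies exactly d * 4 bytes.
public class VectorPackSketch {
  public static byte[] packFloats(float[] values, int declaredDim) {
    if (values.length != declaredDim) {
      // Fail fast on a dimension mismatch instead of corrupting data,
      // mirroring the HoodieException thrown on the PR's write path.
      throw new IllegalArgumentException(
          "Expected " + declaredDim + " elements, got " + values.length);
    }
    ByteBuffer buf = ByteBuffer.allocate(declaredDim * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) {
      buf.putFloat(v);
    }
    return buf.array();
  }
}
```

For FLOAT64 and INT8 the same scheme applies with 8-byte and 1-byte elements respectively, which is what `byteSize()` on `Types.VectorType` computes for the writer.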

Read path

  • HoodieSparkParquetReader and SparkFileFormatInternalRowReaderContext: detect FIXED_LEN_BYTE_ARRAY columns carrying hudi_type metadata and deserialise them back to Spark ArrayData.
  • HoodieFileGroupReaderBasedFileFormat: propagates vector column metadata through the file-group reader so schema is not lost during Spark's internal schema resolution.
  • VectorConversionUtils (new): shared utility extracted to eliminate duplicated byte-buffer decode logic across the two reader paths.
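The shared decode logic mentioned above is, conceptually, the inverse of the write path. A minimal sketch of the idea, assuming the little-endian layout described earlier — `VectorUnpackSketch`/`unpackFloats` are hypothetical names, not the actual VectorConversionUtils API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the read-side decode: wrap the FIXED_LEN_BYTE_ARRAY payload
// little-endian and read back a float[] of the declared dimension.
public class VectorUnpackSketch {
  public static float[] unpackFloats(byte[] raw, int dim) {
    if (raw.length != dim * Float.BYTES) {
      // Payload size must match dimension * element size exactly.
      throw new IllegalArgumentException("Unexpected payload size: " + raw.length);
    }
    ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getFloat();
    }
    return out;
  }
}
```

In the PR the decoded primitive array is then wrapped into Spark `ArrayData` before being handed back to the row.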

Schema / compatibility

  • InternalSchemaConverter: maps VectorType to/from Avro bytes with hudi_type prop, preserving dimension and element-type metadata through the Avro layer.
  • HoodieSchemaCompatibilityChecker: rejects illegal vector evolution (e.g. dimension change) rather than silently coercing.
  • HoodieSchemaComparatorForSchemaEvolution: treats vector columns as incompatible when dimension or element type differs.
  • HoodieTableMetadataUtil: skips column statistics for vector columns (min/max on raw bytes is meaningless).
  • AvroSchemaConverterWithTimestampNTZ: passes through hudi_type property on bytes fields so vector metadata survives Avro↔Spark schema round-trips.
  • Types.VectorType: adds byteSize() helper used by the write path to compute FIXED_LEN_BYTE_ARRAY length.

Tests

  • TestVectorDataSource (808 lines): end-to-end Spark functional tests covering FLOAT32, FLOAT64, INT8 across COPY_ON_WRITE and MERGE_ON_READ table types; includes column projection, schema evolution rejection, and multi-batch upsert round-trips.
  • TestHoodieSchemaCompatibility, TestHoodieSchemaComparatorForSchemaEvolution, TestHoodieTableMetadataUtil: unit tests for schema-layer changes.

Impact

  • New feature — no existing behaviour is changed for non-vector columns.
  • Parquet files written with this change store vector columns as FIXED_LEN_BYTE_ARRAY. Reading those files with an older Hudi version will surface raw bytes rather than a float array; users should upgrade readers alongside writers.
  • No public Java/Scala API changes; vector behaviour is opt-in via schema metadata.

Risk Level

Low. All changes are gated behind hudi_type=VECTOR(...) metadata presence. Tables that do not use vector columns are unaffected. New paths are covered by functional tests across both table types.

Documentation Update

A follow-up website doc page covering vector column usage (schema annotation, supported element types, Parquet layout) will be raised separately. Config changes: none.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 17, 2026
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 79398b2 to 8adeccb Compare March 17, 2026 17:53
@rahil-c rahil-c requested review from yihua March 17, 2026 18:41

rahil-c commented Mar 17, 2026

@rahil-c to update pr overview

rahil-c and others added 10 commits March 18, 2026 16:16
…tion test

- Write path (HoodieRowParquetWriteSupport.makeWriter) now switches on
  VectorElementType (FLOAT/DOUBLE/INT8) instead of hardcoding float,
  matching the read paths
- Remove redundant detectVectorColumns call in readBaseFile by reusing
  vectorCols from requiredSchema for requestedSchema
- Add testColumnProjectionWithVector covering 3 scenarios: exclude vector,
  vector-only, and all columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use VectorLogicalType.VECTOR_BYTE_ORDER instead of hardcoded
  ByteOrder.LITTLE_ENDIAN in all 4 locations (write support, reader,
  Scala reader context, file group format)
- Add Math.multiplyExact overflow guard for dimension * elementSize
  in HoodieRowParquetWriteSupport
- Remove unnecessary array clone in HoodieSparkParquetReader
- Add clarifying comment on non-vector column else branch
- Fix misleading "float arrays" comment to "typed arrays"
- Move inline JavaConverters import to top-level in
  SparkFileFormatInternalRowReaderContext
- Import Metadata at top level instead of fully-qualified reference
- Consolidate duplicate detectVectorColumns, replaceVectorColumnsWithBinary,
  and convertBinaryToVectorArray into SparkFileFormatInternalRowReaderContext
  companion object; HoodieFileGroupReaderBasedFileFormat now delegates
- Add Javadoc on VectorType explaining it's needed for InternalSchema
  type hierarchy (cannot reuse HoodieSchema.Vector)
- Clean up unused imports (ByteOrder, ByteBuffer, GenericArrayData,
  StructField, BinaryType, HoodieSchemaType)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e types

New tests added to TestVectorDataSource:

- testDoubleVectorRoundTrip: DOUBLE element type end-to-end (64-dim)
- testInt8VectorRoundTrip: INT8/byte element type end-to-end (256-dim)
- testMultipleVectorColumns: two vector columns (float + double) in
  same schema with selective nulls and per-column projection
- testMorTableWithVectors: MOR table type with insert + upsert,
  verifying merged view returns correct vectors
- testCowUpsertWithVectors: COW upsert (update existing + insert new)
  verifying vector values after merge
- testLargeDimensionVector: 1536-dim float vectors (OpenAI embedding
  size) to exercise large buffer allocation
- testSmallDimensionVector: 2-dim vectors with edge values
  (Float.MaxValue) to verify boundary handling
- testVectorWithNonVectorArrayColumn: vector column alongside a
  regular ArrayType(StringType) to ensure non-vector arrays are
  not incorrectly treated as vectors
- testMorWithMultipleUpserts: MOR with 3 successive upsert batches
  of DOUBLE vectors, verifying the latest value wins per key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix hot-path allocation

- Create shared VectorConversionUtils utility class to eliminate duplicated
  vector conversion logic across HoodieSparkParquetReader,
  SparkFileFormatInternalRowReaderContext, and HoodieFileGroupReaderBasedFileFormat
- Add explicit dimension validation in HoodieRowParquetWriteSupport to prevent
  silent data corruption when array length doesn't match declared vector dimension
- Reuse GenericInternalRow in HoodieSparkParquetReader's vector post-processing
  loop to reduce GC pressure on large scans
…eSchema.Vector] to fix Scala 2.12 type inference error
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 52f6db8 to 959bcd8 Compare March 18, 2026 23:17
@rahil-c rahil-c changed the title Rahil/vector schema spark converters parquet feat(vector): Support writing VECTOR to parquet and avro formats using Spark Mar 18, 2026
@rahil-c rahil-c requested review from bvaradar and voonhous March 18, 2026 23:28
@rahil-c rahil-c force-pushed the rahil/vector-schema-spark-converters-parquet branch from 3f7e2d0 to f8ce228 Compare March 18, 2026 23:31
@rahil-c rahil-c marked this pull request as ready for review March 18, 2026 23:32
@rahil-c rahil-c requested a review from vinothchandar March 18, 2026 23:32
rahil-c and others added 2 commits March 18, 2026 17:21
- Move VectorConversionUtils import into hudi group (was misplaced in 3rdParty)
- Add blank line between hudi and 3rdParty import groups
- Add blank line between java and scala import groups

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rahil-c commented Mar 19, 2026

@yihua @voonhous @balaji-varadarajan-ai will need a review from one of you guys if possible

@balaji-varadarajan-ai (Contributor) left a comment

Still reviewing the PR. here are the initial comments

StructType structSchema = HoodieInternalRowUtils.getCachedSchema(nonNullSchema);

// Detect vector columns: ordinal → Vector schema
Map<Integer, HoodieSchema.Vector> vectorColumnInfo = VectorConversionUtils.detectVectorColumns(nonNullSchema);
Contributor:

Seeing the pattern:

  1. Detecting vector columns.
  2. Replacing the schema.
  3. Post-processing rows.

in HoodieSparkParquetReader, SparkFileFormatInternalRowReaderContext, and HoodieFileGroupReaderBasedFileFormat. Wondering if you can bring them under one common method with a specific callback.

Collaborator Author:

can look into this

Collaborator Author:

updated

* @param schema a HoodieSchema of type RECORD (or null)
* @return map from field index to Vector schema; empty map if schema is null or has no vectors
*/
public static Map<Integer, HoodieSchema.Vector> detectVectorColumns(HoodieSchema schema) {
Contributor:

Just checking: as we are using the integer ordinal position in the schema, can you verify that things work end to end with projections and schema evolution?

Collaborator Author:

I believe I have tests for this in the PR but will check.

Collaborator Author:

added more tests

HoodieSchema.Vector vectorSchema = (HoodieSchema.Vector) resolvedSchema;
int fixedSize = vectorSchema.getDimension()
* vectorSchema.getVectorElementType().getElementSize();
return Types.primitive(FIXED_LEN_BYTE_ARRAY, repetition)
Contributor:

The vectors are stored as bare FIXED_LEN_BYTE_ARRAY in Parquet with no logical type annotation or key-value metadata on the Parquet column. I think it would be useful to track this. Any thoughts?

Collaborator Author:

@balaji-varadarajan-ai so you mean we want to keep track of the Hudi type info around VECTOR within Parquet itself? If so, I think I can look into this.

Collaborator Author:

@balaji-varadarajan-ai my question is what benefit do we get from keeping this info in the file footer or as a column annotation, since no other reader would be able to interpret this?

Contributor:

My intuition is generally, keeping this metadata (to disambiguate) would be helpful in the data path for future scenarios.


rahil-c commented Mar 23, 2026

@voonhous @yihua can I get a review for this today/tomorrow?

}
return new GenericArrayData(doubles);
case INT8:
byte[] int8s = new byte[dim];
Contributor:

FYI: GenericArrayData(byte[]) is kinda inefficient — it actually boxes every byte into a Byte and stores it as an Object[]. So for a 1536-dim INT8 vector, that’s 1536 tiny allocations per row 😬. FLOAT/DOUBLE don’t have this issue since they use optimized primitive array constructors. If this becomes a bottleneck, consider using UnsafeArrayData.fromPrimitiveArray(int8s) to avoid all that boxing.
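The boxing cost the reviewer describes can be seen outside Spark too. A minimal standalone sketch (the `BoxingSketch`/`box` names are hypothetical) of what handing a `byte[]` to an `Object[]`-backed container implies per element:

```java
// GenericArrayData is backed by an Object[], so a primitive byte[] must be
// converted element by element into boxed Byte values -- an 8-byte object
// reference per 1-byte element, plus the per-element boxing conversion.
// UnsafeArrayData.fromPrimitiveArray(byte[]) skips this entirely.
public class BoxingSketch {
  public static Byte[] box(byte[] raw) {
    Byte[] boxed = new Byte[raw.length];
    for (int i = 0; i < raw.length; i++) {
      boxed[i] = raw[i]; // autoboxing via Byte.valueOf per element
    }
    return boxed;
  }
}
```

For a 1536-dim INT8 vector that is 1536 boxed references per row versus one contiguous 1536-byte array.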

default:
throw new UnsupportedOperationException("Unsupported vector element type: " + elemType);
}
recordConsumer.addBinary(Binary.fromReusedByteArray(buffer.array()));
Contributor:

Note: We reuse the same buffer for every row without copying. Make sure ColumnWriteStoreV2 doesn't hold references between writes, or consecutive rows will overwrite each other!
The decimal path does this too so it's probably safe, but keep in mind the vector buffer is much larger (~6KB vs ~16B).
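The hazard described above can be demonstrated in isolation; this sketch (hypothetical `BufferReuseSketch` class, not the PR's writer) shows why a consumer that holds on to the shared backing array sees it overwritten by the next row:

```java
import java.nio.ByteBuffer;

// Sketch of the buffer-reuse hazard: every call returns the SAME backing
// array, so the result is only valid until the next row is written. This
// is safe only if the consumer (here, the Parquet column writer) copies
// or fully consumes the bytes before the next call.
public class BufferReuseSketch {
  private final ByteBuffer shared = ByteBuffer.allocate(4);

  public byte[] writeRowNoCopy(int value) {
    shared.clear();
    shared.putInt(value);
    return shared.array(); // same array instance every call
  }
}
```

`Binary.fromReusedByteArray` makes exactly this contract explicit, which is why the comment above asks whether ColumnWriteStoreV2 holds references between writes.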

&& resolvedSchema != null
&& resolvedSchema.getType() == HoodieSchemaType.VECTOR) {
HoodieSchema.Vector vectorSchema = (HoodieSchema.Vector) resolvedSchema;
int fixedSize = vectorSchema.getDimension()
Contributor:

Use Math.multiplyExact() here to prevent silent overflow with massive vectors!
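The suggested guard in context — a minimal sketch under the reviewer's suggestion, not the PR's actual code:

```java
// Math.multiplyExact throws ArithmeticException on int overflow instead of
// silently wrapping to a negative or truncated size, e.g. for a declared
// dimension of ~600 million with an 8-byte element type.
public class FixedSizeSketch {
  public static int fixedSize(int dimension, int elementSize) {
    return Math.multiplyExact(dimension, elementSize);
  }
}
```

A plain `dimension * elementSize` would wrap silently; the exact variant turns the bad schema into an immediate, diagnosable failure.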

StructField[] newFields = new StructField[fields.length];
for (int i = 0; i < fields.length; i++) {
if (vectorColumns.containsKey(i)) {
newFields[i] = new StructField(fields[i].name(), BinaryType$.MODULE$, fields[i].nullable(), Metadata.empty());
Contributor:

Is Metadata.empty() correct here?

result.update(i, convertBinaryToVectorArray(row.getBinary(i), vectorColumns.get(i)));
} else {
// Non-vector column: copy value as-is using the read schema's data type
result.update(i, row.get(i, readSchema.apply(i).dataType()));
Contributor:

instead of copying every field through GenericInternalRow, consider a byte-level copy of the UnsafeRow with surgical replacement of only the vector column offsets. It might be faster. You can separately micro benchmark this and test.

@voonhous (Member) left a comment

Added some minor nit comments.

Comment on lines +547 to +552
case VECTOR: {
Types.VectorType vector = (Types.VectorType) primitive;
return HoodieSchema.createVector(
vector.getDimension(),
HoodieSchema.Vector.VectorElementType.fromString(vector.getElementType()));
}
Member:

The StorageBacking is lost in the InternalSchema round-trip here, IIUC. Types.VectorType stores storageBacking and VectorType.get() accepts it, but this conversion back to HoodieSchema doesn't pass it through.

This is fine for now since only PARQUET_FIXED_LEN_BYTE_ARRAY exists, but it'll silently lose data when new backing types are added. Maybe we should pass it through or add a comment noting the assumption?

Collaborator Author:

nice catch thanks voon

Comment on lines 448 to 451
private def readBaseFile(file: PartitionedFile, parquetFileReader: SparkColumnarFileReader, requestedSchema: StructType,
remainingPartitionSchema: StructType, fixedPartitionIndexes: Set[Int], requiredSchema: StructType,
partitionSchema: StructType, outputSchema: StructType, filters: Seq[Filter],
storageConf: StorageConfiguration[Configuration]): Iterator[InternalRow] = {
Member:

Possible to reduce the boilerplate in this function to reduce its complexity?

There are 3 separate detectVectorColumns + replaceVectorFieldsWithBinary calls. We could add:

  private def withVectorRewrite(schema: StructType): (StructType, Map[Int, HoodieSchema.Vector]) = {
    val vecs = detectVectorColumns(schema)
    if (vecs.nonEmpty) (replaceVectorFieldsWithBinary(schema, vecs), vecs) else (schema, vecs)
  }

  LOCAL_TIMESTAMP_MILLIS(Long.class),
- LOCAL_TIMESTAMP_MICROS(Long.class);
+ LOCAL_TIMESTAMP_MICROS(Long.class),
+ VECTOR(ByteBuffer.class);
Member:

Don't quite understand this: vectors aren't accessed as ByteBuffer through the InternalSchema API; we are using byte[].class, right?

Possible to add a comment here explaining the choice of ByteBuffer?

Collaborator Author:

Will leave a comment for this explaining why it's ByteBuffer.


rahil-c commented Mar 25, 2026

@balaji-varadarajan-ai @voonhous could you take a look again?

@codecov-commenter

Codecov Report

❌ Patch coverage is 68.97810% with 85 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.35%. Comparing base (14a549f) to head (b590036).
⚠️ Report is 20 commits behind head on master.

Files with missing lines Patch % Lines
...i/io/storage/row/HoodieRowParquetWriteSupport.java 9.09% 28 Missing and 2 partials ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 75.47% 3 Missing and 10 partials ⚠️
...ache/hudi/io/storage/HoodieSparkParquetReader.java 30.76% 7 Missing and 2 partials ⚠️
.../apache/hudi/io/storage/VectorConversionUtils.java 84.74% 3 Missing and 6 partials ⚠️
...va/org/apache/hudi/common/schema/HoodieSchema.java 74.19% 3 Missing and 5 partials ⚠️
...in/java/org/apache/hudi/internal/schema/Types.java 68.18% 3 Missing and 4 partials ⚠️
...hudi/SparkFileFormatInternalRowReaderContext.scala 75.00% 0 Missing and 5 partials ⚠️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 33.33% 1 Missing and 1 partial ⚠️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 83.33% 0 Missing and 1 partial ⚠️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 92.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18328      +/-   ##
============================================
- Coverage     68.48%   68.35%   -0.13%     
- Complexity    27362    27620     +258     
============================================
  Files          2420     2434      +14     
  Lines        132127   133490    +1363     
  Branches      15909    16094     +185     
============================================
+ Hits          90491    91251     +760     
- Misses        34627    35144     +517     
- Partials       7009     7095      +86     
Flag Coverage Δ
common-and-other-modules 44.32% <30.29%> (-0.05%) ⬇️
hadoop-mr-java-client 45.09% <10.41%> (-0.02%) ⬇️
spark-client-hadoop-common 48.43% <4.97%> (+0.10%) ⬆️
spark-java-tests 48.74% <68.97%> (-0.18%) ⬇️
spark-scala-tests 45.32% <23.35%> (+0.20%) ⬆️
utilities 38.47% <19.70%> (-0.23%) ⬇️


Files with missing lines Coverage Δ
...ain/java/org/apache/hudi/internal/schema/Type.java 80.32% <100.00%> (+0.32%) ⬆️
...ternal/schema/convert/InternalSchemaConverter.java 89.27% <100.00%> (+0.48%) ⬆️
...a/org/apache/hudi/avro/HoodieAvroWriteSupport.java 100.00% <100.00%> (ø)
...quet/avro/AvroSchemaConverterWithTimestampNTZ.java 76.02% <100.00%> (+0.45%) ⬆️
...hema/HoodieSchemaComparatorForSchemaEvolution.java 88.76% <83.33%> (-0.40%) ⬇️
...ommon/schema/HoodieSchemaCompatibilityChecker.java 65.85% <92.30%> (+1.47%) ⬆️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.28% <33.33%> (-0.06%) ⬇️
...hudi/SparkFileFormatInternalRowReaderContext.scala 79.38% <75.00%> (-0.26%) ⬇️
...in/java/org/apache/hudi/internal/schema/Types.java 77.74% <68.18%> (-0.76%) ⬇️
...va/org/apache/hudi/common/schema/HoodieSchema.java 81.43% <74.19%> (+0.06%) ⬆️
... and 4 more

... and 88 files with indirect coverage changes


@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants