[lake/hudi] Introduce Hudi LakeCatalog to create table#3395

Open
fhan688 wants to merge 11 commits into apache:main from fhan688:Introduce-hudi-LakeCatalog-to-create-table

fhan688 commented May 28, 2026

Purpose

Linked issue: #3275

This PR introduces the Hudi LakeCatalog implementation, enabling Fluss to create tables in Hudi data lake storage. This aligns with the existing Paimon and Iceberg lake catalog support, completing the trio of supported lake formats for table creation.

Brief change log

New modules & classes (fluss-lake/fluss-lake-hudi):

  • HudiLakeCatalog: Implements LakeCatalog interface, supporting both HMS (Hive Metastore) and DFS (filesystem) catalog modes. Handles table creation with schema compatibility check for crash-recovery idempotency, automatic database creation, and system column (__bucket, __offset, __timestamp) appending.

  • FlussDataTypeToHudiDataType: Implements DataTypeVisitor to convert Fluss data types to Flink types (Hudi's type system). Handles LocalZonedTimestampType specially: maps to BIGINT under HMS mode, TIMESTAMP_WITH_LOCAL_TIME_ZONE under DFS mode.

  • HudiConversions: Core conversion utility. Converts Fluss TablePath → Hudi ObjectPath and TableDescriptor → ResolvedSchema / Hudi table properties. Validates HUDI_UNSETTABLE_OPTIONS (6 protected options that Fluss auto-manages), checks system column name conflicts, and handles property prefix rewriting (hudi.xxx → xxx, others → fluss.xxx).

  • HudiCatalogUtils: Factory for creating Hudi Catalog instances (HoodieHiveCatalog for HMS, HoodieCatalog for DFS). Uses copied Configuration to avoid mutating the original.
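The conversion rules described in the list above can be sketched in a few lines of plain Java. This is a minimal illustration, not the PR's code: the class name `HudiConversionSketch`, its method names, and the placeholder option key `hudi.example.protected.option` are all hypothetical, and standard JDK exceptions stand in for `InvalidConfigException` and `InvalidTableException`. Whether the real protected options are matched with or without the `hudi.` prefix is also an assumption here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch; class, method, and option names are assumptions, not the PR's code. */
final class HudiConversionSketch {
    private static final String HUDI_PREFIX = "hudi.";
    private static final String FLUSS_PREFIX = "fluss.";

    // System columns named in the PR description.
    static final Set<String> SYSTEM_COLUMNS =
            new HashSet<>(Arrays.asList("__bucket", "__offset", "__timestamp"));

    // Hypothetical placeholder: the description does not list the six protected options.
    static final Set<String> UNSETTABLE_OPTIONS =
            new HashSet<>(Arrays.asList("hudi.example.protected.option"));

    /** Strips the "hudi." prefix; every other key gets a "fluss." prefix. */
    static Map<String, String> rewriteProperties(Map<String, String> properties) {
        Map<String, String> rewritten = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String key = entry.getKey();
            if (UNSETTABLE_OPTIONS.contains(key)) {
                // The PR reports this case as InvalidConfigException.
                throw new IllegalArgumentException("Option is managed by Fluss: " + key);
            }
            if (key.startsWith(HUDI_PREFIX)) {
                rewritten.put(key.substring(HUDI_PREFIX.length()), entry.getValue());
            } else {
                rewritten.put(FLUSS_PREFIX + key, entry.getValue());
            }
        }
        return rewritten;
    }

    /** Rejects user columns that collide with the appended system columns. */
    static void checkNoSystemColumnConflict(List<String> columnNames) {
        for (String name : columnNames) {
            if (SYSTEM_COLUMNS.contains(name)) {
                // The PR reports this case as InvalidTableException.
                throw new IllegalArgumentException("Column name is reserved: " + name);
            }
        }
    }

    /** Chooses the target type for LocalZonedTimestampType by catalog mode. */
    static String localZonedTimestampTypeFor(String catalogMode) {
        switch (catalogMode) {
            case "hms":
                return "BIGINT";
            case "dfs":
                return "TIMESTAMP_WITH_LOCAL_TIME_ZONE";
            default:
                throw new IllegalArgumentException("Unknown catalog mode: " + catalogMode);
        }
    }
}
```

With this sketch, a property map containing `hudi.some.key` and `other.key` would come back with the keys `some.key` and `fluss.other.key`, and a column list containing `__offset` would be rejected.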

Modifications to existing modules (fluss-flink/fluss-flink-common):

  • LakeFlinkCatalog: Adds HUDI branch in getLakeCatalog() with a new HudiCatalogFactory inner class that uses reflection to instantiate Hudi catalog (mirroring the Iceberg pattern to avoid compile-time dependency on hudi-flink-bundle).

  • LakeTableFactory: Adds HUDI branch in getLakeTableFactory() with getHudiFactory() that reflectively loads HoodieTableFactory.

  • HudiLakeStorage: Replaces the UnsupportedOperationException in createLakeCatalog() with new HudiLakeCatalog(hudiConfig) to wire the SPI path.
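The reflective loading mentioned for LakeFlinkCatalog and LakeTableFactory follows a common pattern: look the class up by name at runtime so the calling module never references it at compile time. The sketch below shows that pattern using only JDK reflection. The class name `ReflectiveFactorySketch`, its method, and the error message are hypothetical; the PR's actual factory code may differ, for example in how it selects the classloader or passes constructor arguments.

```java
/** Illustrative sketch of reflective instantiation; not the PR's code. */
final class ReflectiveFactorySketch {

    /**
     * Loads {@code className} through {@code loader} and calls its no-arg constructor.
     * The caller needs no compile-time dependency on the loaded class.
     */
    static Object instantiate(String className, ClassLoader loader) {
        try {
            Class<?> clazz = Class.forName(className, true, loader);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing or inaccessible constructor,
            // and exceptions thrown by the constructor itself.
            throw new IllegalStateException(
                    "Could not load " + className + "; is the bundle on the classpath?", e);
        }
    }
}
```

A caller would pass the fully qualified name of the factory class as a string and cast the result to an interface it already depends on, so a missing bundle surfaces as a clear runtime error instead of a build failure.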

Key design decisions:

| Aspect | Decision |
| --- | --- |
| Table type mapping | PK table → MERGE_ON_READ, Log table → COPY_ON_WRITE |
| Index strategy | BUCKET index type, aligned with Fluss's bucketing model |
| Dependency isolation | Hudi bundle loaded via reflection/plugin classloader (no compile-time dependency in fluss-flink-common) |
| Catalog mode | Supports hms (Hive Metastore) and dfs (filesystem) |
| Property rewriting | hudi. prefix stripped; non-hudi properties prefixed with fluss. |
| Idempotency | Schema-compatible duplicate creation is treated as success for crash recovery |
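Two of these decisions, the table type mapping and idempotent creation, are easy to illustrate with an in-memory stand-in for the catalog. This is a sketch under stated assumptions, not the PR's implementation: `IdempotentCreateSketch` and its methods are hypothetical, a schema is reduced to an ordered list of "name TYPE" strings, compatibility is approximated as exact list equality, and `IllegalStateException` stands in for `TableAlreadyExistException`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory sketch; not the PR's HudiLakeCatalog. */
final class IdempotentCreateSketch {
    enum TableType { MERGE_ON_READ, COPY_ON_WRITE }

    // Table path -> ordered column definitions such as "id BIGINT".
    private final Map<String, List<String>> tables = new HashMap<>();

    /** PK tables map to MERGE_ON_READ, log tables to COPY_ON_WRITE. */
    static TableType tableTypeFor(boolean hasPrimaryKey) {
        return hasPrimaryKey ? TableType.MERGE_ON_READ : TableType.COPY_ON_WRITE;
    }

    /**
     * Creates the table, or accepts a retry whose schema matches the existing one.
     *
     * @return true if the table was newly created, false if it already existed
     *     with the same columns (for example, a retry after a crash)
     */
    boolean createTable(String tablePath, List<String> columns) {
        List<String> existing = tables.putIfAbsent(tablePath, new ArrayList<>(columns));
        if (existing == null) {
            return true;
        }
        // List.equals is order-sensitive, so count, names, and types must all match.
        if (existing.equals(columns)) {
            return false;
        }
        throw new IllegalStateException(
                "Table already exists with a different schema: " + tablePath);
    }
}
```

Treating the compatible duplicate as success lets a coordinator that crashed after creating the lake table, but before recording that fact, simply repeat the call on recovery.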

Tests

HudiLakeCatalogTest (15 test cases):

  • testPropertyPrefixRewriting — verifies hudi.xxx → xxx and non-hudi → fluss.xxx prefix rewriting

  • testCreatePrimaryKeyTable — PK table (MOR) creation with system columns and primary key

  • testCreateLogTable — Log table (COW) creation with record key from customProperties

  • testIsHudiSchemaCompatibleWithSameSchema — compatible schemas return true

  • testIsHudiSchemaCompatibleWithDifferentColumnCount — different column count returns false

  • testIsHudiSchemaCompatibleWithDifferentColumnName — different column name returns false

  • testIsHudiSchemaCompatibleWithDifferentColumnType — different column type returns false

  • testCreateDuplicateTableWithCompatibleSchema — duplicate creation with compatible schema is idempotent

  • testCreateDuplicateTableWithIncompatibleSchema — duplicate creation with incompatible schema throws TableAlreadyExistException

  • testUnsettableOptionInPropertiesThrowsException — protected option in properties throws InvalidConfigException

  • testUnsettableOptionInCustomPropertiesThrowsException — protected option in customProperties throws InvalidConfigException

  • testNonProtectedHudiOptionPassesValidation — non-protected option passes validation

  • testSystemColumnBucketConflictThrowsException — __bucket conflict throws InvalidTableException

  • testSystemColumnOffsetConflictThrowsException — __offset conflict throws InvalidTableException

  • testSystemColumnTimestampConflictThrowsException — __timestamp conflict throws InvalidTableException

API and Format

No API or storage format changes. This PR only adds new implementations for the existing LakeCatalog and LakeStorage SPI interfaces.

Documentation

This is a new feature: Hudi lake catalog support for table creation. The Hudi integration guide will need to be updated to document it.

fhan688 commented May 28, 2026

All tests pass. Please help review, thanks! @XuQianJin-Stars
