feat: support LookupMergeTreeCompactRewriter #186

Open
lszskye wants to merge 9 commits into alibaba:main from lszskye:lookup_rewriter

lszskye commented Mar 19, 2026

Purpose

Add LookupMergeTreeCompactRewriter to support the compaction rewrite process with lookup.
Linked issue: #93

Tests

src/paimon/core/mergetree/compact/lookup_merge_tree_compact_rewriter_test.cpp

lucasfang requested a review from Copilot on March 19, 2026 09:00
Copilot AI left a comment

Pull request overview

This PR adds lookup-based merge-tree compaction rewriting (via LookupMergeTreeCompactRewriter) and introduces per-level file-format selection, updating compaction/read/write paths and expanding unit tests accordingly.

Changes:

  • Introduces LookupMergeTreeCompactRewriter + ChangelogMergeTreeRewriter to support lookup-driven rewrite/upgrade flows (including DV-aware paths).
  • Adds file.format.per.level support in CoreOptions and updates call sites to use GetFileFormat() / GetWriteFileFormat(level).
  • Adds FileStorePathFactoryCache and new tests for lookup rewrite behavior and wrapper logic.
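
To make the per-level format selection concrete, here is a minimal sketch of how a `file.format.per.level` value could be parsed and consulted. It assumes a comma-separated `level:format` syntax such as `0:avro,3:parquet`; `ParsePerLevel` and `PerLevelFormats` are hypothetical names for illustration and are not the `CoreOptions` API from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: parses "0:avro,3:parquet" into {0: "avro", 3: "parquet"}.
// Returns std::nullopt for malformed entries. Whitespace is not trimmed.
std::optional<std::map<int, std::string>> ParsePerLevel(const std::string& spec) {
  std::map<int, std::string> result;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    const std::size_t colon = entry.find(':');
    // Require a non-empty level (at most 9 digits) and a non-empty format.
    if (colon == std::string::npos || colon == 0 || colon > 9 ||
        colon + 1 == entry.size()) {
      return std::nullopt;
    }
    int level = 0;
    for (std::size_t i = 0; i < colon; ++i) {
      if (entry[i] < '0' || entry[i] > '9') return std::nullopt;
      level = level * 10 + (entry[i] - '0');
    }
    result[level] = entry.substr(colon + 1);
  }
  return result;
}

// Hypothetical stand-in for the lookup behind GetWriteFileFormat(level):
// use the per-level override if present, otherwise the table-wide default.
class PerLevelFormats {
 public:
  PerLevelFormats(std::string default_format, std::map<int, std::string> per_level)
      : default_format_(std::move(default_format)), per_level_(std::move(per_level)) {}

  const std::string& WriteFormatFor(int level) const {
    const auto it = per_level_.find(level);
    return it != per_level_.end() ? it->second : default_format_;
  }

 private:
  std::string default_format_;
  std::map<int, std::string> per_level_;
};
```

With a default of `orc` and the spec `0:avro,3:parquet`, this sketch would write level 0 as avro, level 3 as parquet, and every other level as orc.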

Reviewed changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 14 comments.

File Description
src/paimon/core/utils/file_store_path_factory_test.cpp Updates tests to use GetFileFormat() for path factory creation.
src/paimon/core/utils/file_store_path_factory_cache_test.cpp Adds unit test for the new path-factory cache.
src/paimon/core/utils/file_store_path_factory_cache.h Introduces a cache to reuse FileStorePathFactory by format identifier.
src/paimon/core/table/source/table_scan.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/table/source/table_read.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/postpone/postpone_bucket_writer.cpp Updates write-format selection to GetWriteFileFormat(level).
src/paimon/core/postpone/postpone_bucket_file_store_write.h Updates write-format selection to GetWriteFileFormat(level).
src/paimon/core/options/lookup_strategy.h Adds LookupStrategy struct encapsulating lookup decision inputs.
src/paimon/core/operation/raw_file_split_read_test.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/orphan_files_cleaner.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/merge_file_split_read_test.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/merge_file_split_read.h Adds API to inject a merge-function wrapper and refactors wrapper retrieval.
src/paimon/core/operation/merge_file_split_read.cpp Implements SetMergeFunctionWrapper.
src/paimon/core/operation/manifest_file_merger_test.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/key_value_file_store_scan_test.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/file_store_write.cpp Uses GetWriteFileFormat(level) for write paths.
src/paimon/core/operation/file_store_commit.cpp Uses GetFileFormat() and updates assertions accordingly.
src/paimon/core/operation/expire_snapshots_test.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/operation/append_only_file_store_write.cpp Uses GetFileFormat() for writer creation.
src/paimon/core/migrate/file_meta_utils.cpp Uses GetFileFormat() for migration commit message generation.
src/paimon/core/mergetree/merge_tree_writer.cpp Updates write-format selection to GetWriteFileFormat(level).
src/paimon/core/mergetree/lookup_levels_test.cpp Adds coverage for closing and tmp-dir cleanup behavior.
src/paimon/core/mergetree/lookup_levels.h Adds Close() to clear lookup cache.
src/paimon/core/mergetree/compact/reducer_merge_function_wrapper.h Changes GetResult() to reset wrapper state after producing a result.
src/paimon/core/mergetree/compact/merge_tree_compact_rewriter_test.cpp Updates rewriter creation to use FileStorePathFactoryCache.
src/paimon/core/mergetree/compact/merge_tree_compact_rewriter.h Refactors rewriter to use path-factory cache; adds wrapper factory injection points.
src/paimon/core/mergetree/compact/merge_tree_compact_rewriter.cpp Implements per-level format writing and wrapper injection during merge-read.
src/paimon/core/mergetree/compact/lookup_merge_tree_compact_rewriter_test.cpp Adds comprehensive tests for lookup-based rewrite/upgrade and DV behavior.
src/paimon/core/mergetree/compact/lookup_merge_tree_compact_rewriter.h Introduces lookup-based rewriter interface and wrapper factories.
src/paimon/core/mergetree/compact/lookup_merge_tree_compact_rewriter.cpp Implements lookup-based rewrite/upgrade decisions and DV updates.
src/paimon/core/mergetree/compact/lookup_merge_function_test.cpp Adds tests for high-level selection and insertion ordering.
src/paimon/core/mergetree/compact/lookup_merge_function.h Enhances merge function to track key, level-0 presence, and pick high-level candidate.
src/paimon/core/mergetree/compact/lookup_changelog_merge_function_wrapper_test.cpp Adds tests for lookup-changelog wrapper behavior including DV.
src/paimon/core/mergetree/compact/lookup_changelog_merge_function_wrapper.h Introduces wrapper that performs lookup/DV marking and merges candidates.
src/paimon/core/mergetree/compact/first_row_merge_function_wrapper_test.cpp Adds tests for first-row lookup wrapper behavior.
src/paimon/core/mergetree/compact/first_row_merge_function_wrapper.h Adds wrapper for first-row lookup behavior.
src/paimon/core/mergetree/compact/first_row_merge_function.h Exposes ContainsHighLevel() for wrapper logic.
src/paimon/core/mergetree/compact/compact_rewriter.h Changes Upgrade() to be non-const.
src/paimon/core/mergetree/compact/changelog_merge_tree_rewriter.h Adds a base rewriter that can rewrite and/or produce changelog per strategy.
src/paimon/core/mergetree/compact/changelog_merge_tree_rewriter.cpp Implements rewrite/upgrade routing based on strategy.
src/paimon/core/manifest/manifest_entry_writer_test.cpp Updates write-format selection to GetWriteFileFormat(level).
src/paimon/core/key_value.h Makes KeyValue copyable and changes value ownership to shared_ptr.
src/paimon/core/io/single_file_writer_test.cpp Updates write-format selection to GetWriteFileFormat(level).
src/paimon/core/global_index/global_index_write_task.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/global_index/global_index_scan_impl.cpp Switches path factory creation to use GetFileFormat().
src/paimon/core/core_options_test.cpp Adds coverage for per-level formats and lookup strategy; adds invalid-format tests.
src/paimon/core/core_options.h Adds APIs: GetFileFormat(), GetWriteFileFormat(level), GetLookupStrategy().
src/paimon/core/core_options.cpp Implements per-level format parsing and lookup strategy computation.
src/paimon/core/append/append_only_writer.cpp Uses GetFileFormat() for writer creation.
src/paimon/common/sst/sst_file_reader.cpp Adds null checks for cached block reads and returns errors on failure.
src/paimon/common/io/cache/cache_manager.cpp Handles null returns from cache get path.
src/paimon/common/defs.cpp Adds new option key file.format.per.level.
src/paimon/common/data/generic_row.h Changes data holder storage from unique_ptr to shared_ptr.
src/paimon/CMakeLists.txt Registers new rewriter sources and new unit tests.
include/paimon/defs.h Documents new file.format.per.level option.
Comments suppressed due to low confidence (6)

src/paimon/core/utils/file_store_path_factory_cache_test.cpp:1

  • This test accesses FileStorePathFactoryCache::format_to_path_factory_ and FileStorePathFactory::format_identifier_ directly. Those members are private in the new cache header (and likely private in FileStorePathFactory), so this will not compile. Prefer asserting the cache behavior via public surface: add a Size() accessor (or similar) on FileStorePathFactoryCache and validate format via a public getter on FileStorePathFactory (or by observing generated paths/extensions), or declare the test as a friend if you intentionally want to white-box test internals.
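
One way to follow the suggestion above is sketched here, under the assumption that the cache maps a format identifier to a shared factory. `PathFactory`, `PathFactoryCache`, `GetOrCreate`, `Size()`, and `FormatIdentifier()` are hypothetical stand-ins, not the classes or signatures from this PR; the point is only that a test can assert on public accessors instead of private members.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for FileStorePathFactory with a public format getter.
class PathFactory {
 public:
  explicit PathFactory(std::string format_identifier)
      : format_identifier_(std::move(format_identifier)) {}
  const std::string& FormatIdentifier() const { return format_identifier_; }

 private:
  std::string format_identifier_;
};

// Hypothetical stand-in for FileStorePathFactoryCache: one factory per format.
class PathFactoryCache {
 public:
  std::shared_ptr<PathFactory> GetOrCreate(const std::string& format) {
    const auto it = format_to_path_factory_.find(format);
    if (it != format_to_path_factory_.end()) {
      return it->second;  // Reuse the factory created earlier for this format.
    }
    auto factory = std::make_shared<PathFactory>(format);
    format_to_path_factory_.emplace(format, factory);
    return factory;
  }

  // Public accessor so tests do not need to read the private map.
  std::size_t Size() const { return format_to_path_factory_.size(); }

 private:
  std::map<std::string, std::shared_ptr<PathFactory>> format_to_path_factory_;
};
```

A test written against this shape would request the same format twice, check that both calls return the same pointer, and check `Size()` after requesting a second format.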
src/paimon/core/mergetree/compact/reducer_merge_function_wrapper.h:1
  • Reset() is only called on the success path. If merge_function_->GetResult() returns an error, PAIMON_ASSIGN_OR_RAISE will return early and skip resetting state, which can leave the wrapper in a partially-initialized state for subsequent use. Consider using a scope guard to ensure Reset() runs regardless of success/failure (or explicitly reset before returning the error).
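
The scope-guard idea can be sketched as follows. This is an illustration only: `ScopeGuard` and `Wrapper` are hypothetical, and `std::optional` stands in for the project's result type and the `PAIMON_ASSIGN_OR_RAISE` early return, which are not reproduced here. It requires C++17.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Minimal scope guard: runs the callable when the guard leaves scope,
// on both normal and early-return paths.
template <typename F>
class ScopeGuard {
 public:
  explicit ScopeGuard(F fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() { fn_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  F fn_;
};

// Hypothetical wrapper that accumulates values and must be reset after
// GetResult(), whether or not producing the result succeeds.
class Wrapper {
 public:
  void Add(int value) {
    sum_ += value;
    has_state_ = true;
  }
  void FailNext() { fail_next_ = true; }  // Simulates an inner GetResult() error.
  bool HasState() const { return has_state_; }

  std::optional<int> GetResult() {
    ScopeGuard guard([this] { Reset(); });
    if (fail_next_) {
      return std::nullopt;  // Early return; the guard still calls Reset().
    }
    return sum_;  // The return value is built before the guard runs Reset().
  }

 private:
  void Reset() {
    sum_ = 0;
    has_state_ = false;
    fail_next_ = false;
  }

  int sum_ = 0;
  bool has_state_ = false;
  bool fail_next_ = false;
};
```

Explicitly calling the reset before returning the error achieves the same effect; the guard mainly helps when a function has several early-return points.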
src/paimon/core/mergetree/lookup_levels.h:1
  • Correct the typo in the comment: 'TODDO' should be 'TODO'.
