Search before asking
Motivation
Compaction is essential for maintaining high performance and storage efficiency in modern data systems. Key benefits include:
- For Append Tables: Reduces small files by merging existing data files, improving scan performance and metadata scalability.
- For Primary Key (PK) Tables: Minimizes the number of segments that must be merged at read time (merge-on-read), significantly speeding up queries.
- For PK+DV Tables: Enables writing DV (Delete Vector) files to mark outdated rows, allowing efficient read performance.
Currently, the lack of a dedicated compaction mechanism limits our ability to optimize storage layout and query latency.
Solution
The compaction framework should support the following capabilities:
- Support for both append tables and primary key (PK) tables, with appropriate strategies for each;
- Execution via background tasks or manual triggers, allowing flexibility in operation;
- Built-in basic compaction policies aligned with Java Paimon;
- Generation of Delete Vector (DV) files during/after compaction to track stale rows;
- Design support for data-evolution scenarios, including both vertical compaction (merging small files) and horizontal compaction (consolidating partial-column files);
- Output data format fully compatible with Java Paimon.
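As a rough illustration of the "built-in basic compaction policies" point, the sketch below shows a minimal small-file trigger for vertical compaction. All names, thresholds, and the trigger condition are assumptions for illustration, loosely inspired by size/file-count triggers in Java Paimon; they are not the proposed API.

```rust
// Illustrative sketch only: `DataFile`, the thresholds, and the trigger
// condition are hypothetical, not the actual framework design.

#[derive(Debug, Clone, Copy)]
struct DataFile {
    size_bytes: u64,
}

/// Trigger compaction once at least `max_small_files` files fall
/// below `small_file_threshold` bytes.
fn should_compact(files: &[DataFile], small_file_threshold: u64, max_small_files: usize) -> bool {
    files
        .iter()
        .filter(|f| f.size_bytes < small_file_threshold)
        .count()
        >= max_small_files
}

/// Naive candidate selection for vertical compaction: merge every small file.
fn pick_candidates(files: &[DataFile], small_file_threshold: u64) -> Vec<DataFile> {
    files
        .iter()
        .copied()
        .filter(|f| f.size_bytes < small_file_threshold)
        .collect()
}

fn main() {
    let files = vec![
        DataFile { size_bytes: 1 << 20 },   // 1 MiB
        DataFile { size_bytes: 2 << 20 },   // 2 MiB
        DataFile { size_bytes: 512 << 20 }, // 512 MiB, already large
        DataFile { size_bytes: 3 << 20 },   // 3 MiB
    ];
    let threshold = 128 << 20; // 128 MiB
    if should_compact(&files, threshold, 3) {
        let picked = pick_candidates(&files, threshold);
        println!("compacting {} small files", picked.len());
    }
}
```

A real policy would also bound the total bytes rewritten per run and, for PK tables, respect sorted-run/level structure; this sketch only shows where such a pluggable trigger would sit.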
Anything else?
No response
Are you willing to submit a PR?