diff --git a/docs/table-design/data-partitioning/basic-concepts.mdx b/docs/table-design/data-partitioning/basic-concepts.mdx index 1c11a1f64ee46..5923953a0ec6c 100644 --- a/docs/table-design/data-partitioning/basic-concepts.mdx +++ b/docs/table-design/data-partitioning/basic-concepts.mdx @@ -1,18 +1,17 @@ --- { - "title": "Basic Concepts", + "title": "How Partitioning and Bucketing Work", + "sidebar_label": "How It Works", "language": "en", - "description": "A progressive introduction to Doris partitioning and bucketing: from core concepts and the first CREATE TABLE example to auto/dynamic partitioning, auto-bucketing, Colocate, and other advanced capabilities, along with design recommendations and operational guidance for partitions and buckets." + "description": "The data-distribution model behind Doris partitioning and bucketing: partitions, buckets, tablets, and nodes, plus advanced partition and bucket modes, design recommendations, and operational guidance." } --- -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; {/* Knowledge type: Concept introduction / Procedure */} {/* Applicable scenarios: Table design / Data organization and management */} -This document introduces the partitioning (Partition) and bucketing (Bucket) mechanisms of Doris, helping you design table structures reasonably to improve query performance and data management efficiency. New users are recommended to read the sections in order: Sections 1-3 cover core concepts and the first CREATE TABLE example, Sections 4-6 cover advanced features and design recommendations, and Section 7 covers the methods for viewing and modifying partitions needed for daily operations. +This page explains how Doris distributes data across partitions, buckets, and tablets. It also covers the advanced partition and bucket modes, design recommendations, and operational commands. 
For a recommended starting configuration and a decision guide, start with [Partitioning and Bucketing](./overview), then read this page when you want to understand the underlying model. ## 1. Overview @@ -148,152 +147,13 @@ Besides manually declaring partitions at table creation time, Doris also support | Dynamic partition | Automatically created/recycled by the system based on time scheduling rules | Time-series data, where you want to automatically maintain rolling partitions for the past N days/weeks/months | | Auto partition | Created on demand when data is written | Partition values are unpredictable (such as multi-tenant or sparse time), where pre-creation should be avoided | -The following shows CREATE TABLE examples for common combinations: - - - - -[Auto Partition](./auto-partitioning) supports automatically creating corresponding partitions according to user-defined rules during data ingestion, making it more convenient to use. The basic example rewritten as Auto Range partition: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- Use month as the partition granularity -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1" -); -``` - -With this CREATE TABLE statement, when data is loaded, Doris automatically creates corresponding partitions for the `date` column at the month 
level. For example, `2018-12-01` and `2018-12-31` fall into the same partition, while `2018-11-12` falls into another partition. Auto Partition also supports List partitioning. For more usage, see the Auto Partition documentation. - - - - - -[Dynamic Partition](./dynamic-partitioning) is a management approach that automatically creates and recycles partitions based on real time. The basic example rewritten as Dynamic Partition: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -PARTITION BY RANGE(`date`) -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "WEEK", --- Partition granularity is week - "dynamic_partition.start" = "-2", --- Retain the past two weeks - "dynamic_partition.end" = "2", --- Pre-create the next two weeks - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -Dynamic Partition supports tiered storage, custom replica counts, and more. See the Dynamic Partition documentation for details. - - - - - -Auto Partition and Dynamic Partition each have their own advantages. 
Combining the two enables flexible on-demand creation and automatic recycling of partitions: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- Use month as the partition granularity -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "month", --- The two granularities must be the same - "dynamic_partition.start" = "-2", --- Dynamic Partition automatically cleans up historical partitions older than two weeks - "dynamic_partition.end" = "0", --- Dynamic Partition does not create future partitions; this is fully delegated to Auto Partition - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -For details about this feature, see [Using Auto Partition with Dynamic Partition](./auto-partitioning#lifecycle-management). - - - - +For ready-to-use CREATE TABLE examples of each mode, including combining auto with dynamic partitioning, see [Auto Partitioning](./auto-partitioning), [Dynamic Partitioning](./dynamic-partitioning), and [Manual Partitioning](./manual-partitioning). ## 5. Advanced: Bucketing ### 5.1 Auto Bucketing -When you are not sure about a reasonable number of buckets, you can use Auto Bucketing to let Doris perform the estimation. 
You only need to provide the estimated table data size: - -```sql -CREATE TABLE IF NOT EXISTS example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -PARTITION BY RANGE(`date`) -( - PARTITION `p201701` VALUES LESS THAN ("2017-02-01"), - PARTITION `p201702` VALUES LESS THAN ("2017-03-01"), - PARTITION `p201703` VALUES LESS THAN ("2017-04-01"), - PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01")) -) -DISTRIBUTED BY HASH(`user_id`) BUCKETS AUTO -PROPERTIES -( - "replication_num" = "1", - "estimate_partition_size" = "2G" --- Estimated data volume for one partition; defaults to 10G if not provided -); -``` - -Note that this approach is not suitable for scenarios with extremely large table data volumes. +When you are unsure how many buckets to use, set `BUCKETS AUTO` and let Doris size them from an estimated data volume (`estimate_partition_size`). This is not suitable for extremely large tables. For details, see [Data Bucketing](./data-bucketing). 
### 5.2 Colocate diff --git a/docs/table-design/data-partitioning/dynamic-partitioning.md b/docs/table-design/data-partitioning/dynamic-partitioning.md index daff66ad6d763..440dae1687b85 100644 --- a/docs/table-design/data-partitioning/dynamic-partitioning.md +++ b/docs/table-design/data-partitioning/dynamic-partitioning.md @@ -1,13 +1,14 @@ --- { "title": "Dynamic Partitioning", + "sidebar_label": "Dynamic Partitioning (Legacy)", "language": "en", "description": "Dynamic partitioning rolls partitions forward by creating and dropping them on a schedule, providing partition lifecycle management (TTL) for tables. It applies to scenarios such as logs and time-series data that need automatic cleanup of expired data." } --- -:::info Tip -[Auto Partitioning](./auto-partitioning) is the recommended approach for automatic partition management. It is the successor to dynamic partitioning. +:::info Legacy +Dynamic partitioning is superseded by [auto partitioning](./auto-partitioning), its successor for automatic partition management. Use auto partitioning for new tables; this page is kept for existing dynamic-partition tables. ::: diff --git a/docs/table-design/data-partitioning/overview.md b/docs/table-design/data-partitioning/overview.md new file mode 100644 index 0000000000000..5e8d6b8c854bf --- /dev/null +++ b/docs/table-design/data-partitioning/overview.md @@ -0,0 +1,78 @@ +--- +{ + "title": "Partitioning and Bucketing", + "language": "en", + "description": "The recommended partitioning and bucketing for a Doris table, and when to customize: auto, dynamic, and manual partitioning, bucketing method, and bucket count." +} +--- + +Doris organizes a table in two tiers: partitions split rows by column value, and buckets split each partition into shards for parallel processing. This page gives the recommended starting point and shows when to customize. 
+ +## Recommended Starting Point + +For most tables, partition by time and let Doris manage partition creation and bucket sizing automatically: + +```sql +CREATE TABLE sales ( + sale_time DATETIME NOT NULL, + order_id BIGINT NOT NULL, + amount DECIMAL(10, 2) +) +DUPLICATE KEY(sale_time, order_id) +AUTO PARTITION BY RANGE (date_trunc(sale_time, 'day')) () +DISTRIBUTED BY HASH(order_id) BUCKETS AUTO; +``` + +- **Auto partitioning** creates a partition as data arrives, so you never pre-define or backfill partition ranges. +- **`BUCKETS AUTO`** lets Doris size the number of shards from the data. +- Partition pruning on `sale_time` and parallel scans across buckets keep queries fast. + +If the table has no time column or stays small (under about 1 GB), use a single partition with a fixed bucket count: + +```sql +DISTRIBUTED BY HASH(order_id) BUCKETS 10 +``` + +## Choose Your Design + +Customize only when the default does not fit: + +| Decision | Recommended default | Change it when | +| --- | --- | --- | +| How to partition | [Auto partitioning](./auto-partitioning) | Use [manual partitioning](./manual-partitioning) for schemes auto cannot express: custom or irregular ranges, ranges on a numeric column, or grouped LIST values. [Dynamic partitioning](./dynamic-partitioning) is superseded by auto. | +| Bucketing method | Hash on a high-cardinality column | If data skews, or you filter on arbitrary dimensions, use random bucketing ([Data Bucketing](./data-bucketing)) | +| Number of buckets | `BUCKETS AUTO` | If you know your data size and want fixed control, set a count ([Data Bucketing](./data-bucketing)) | + +## Expire Old Partitions + +To drop old data automatically, set a retention policy. 
Both modes keep the most recent partitions and drop older ones; they differ in how you express the limit: + +| Partition mode | Property | Retention limit | +| --- | --- | --- | +| [Dynamic partitioning](./dynamic-partitioning) | `dynamic_partition.start` (for example, `-7`) | A time window: keep partitions within the last N time units of now | +| [Auto partitioning](./auto-partitioning) (RANGE) | `partition.retention_count` (for example, `3`) | A partition count: keep the newest N historical partitions | + +With regular time partitions (such as one per day), the two are effectively equivalent: "last 7 days" matches "newest 7 daily partitions." They diverge when partitions are irregular or data is stale: a time window can drop every partition once the data is older than the window, whereas a count always keeps the newest N. + +Combining auto and dynamic partitioning for retention is no longer recommended; use `partition.retention_count` for auto-range tables. + +Retention **drops** data. To move cold data to cheaper storage instead of dropping it, use [tiered storage](../tiered-storage/overview). + +## How It Works + +Doris maps data in two tiers: + +```text +Table ──► Partition (by column value) ──► Bucket (hash or random) ──► Tablet (shard on a BE node) +``` + +Partitions let Doris skip data that can't match a query, and make it easy to archive or drop data by time. Buckets spread each partition across tablets for parallel reads and writes. For the full data-distribution model, including tablets, replicas, and how they map to nodes, see [How Partitioning and Bucketing Work](./basic-concepts). + +## Next Steps + +- [Auto Partitioning](./auto-partitioning): the default, with no manual range maintenance. +- [Dynamic Partitioning](./dynamic-partitioning): rolling time windows with retention. +- [Manual Partitioning](./manual-partitioning): explicit ranges and list partitions. +- [Data Bucketing](./data-bucketing): choose the method, key, and bucket count. 
+- [How Partitioning and Bucketing Work](./basic-concepts): the underlying data-distribution model. +- [Common Issues](./common-issues): troubleshooting partition and bucket design. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx index a280f7cfc1efc..7d51e12be4f21 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx @@ -1,18 +1,17 @@ --- { - "title": "基本概念", + "title": "分区与分桶原理", + "sidebar_label": "工作原理", "language": "zh-CN", "description": "由浅入深介绍 Doris 的分区与分桶机制:从核心概念、第一个建表示例到自动/动态分区、自动分桶、Colocate 等进阶能力,并给出分区分桶的设计建议与运维方法。" } --- -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; {/* 知识类型: 概念介绍 / 操作步骤 */} {/* 适用场景: 建表设计 / 数据组织与管理 */} -本文介绍 Doris 的分区(Partition)与分桶(Bucket)机制,帮助用户合理设计表结构以提升查询性能与数据管理效率。建议新手按章节顺序阅读:第 1–3 节涵盖核心概念与第一个建表示例,第 4–6 节为进阶特性与设计建议,第 7 节为日常运维所需的查看与修改方法。 +本文介绍 Doris 如何将数据分布到分区、分桶与 Tablet,并涵盖进阶的分区与分桶模式、设计建议与运维命令。若需要推荐的起步配置与选型指南,请先阅读 [分区与分桶](./overview);当你希望理解底层模型时再阅读本文。 ## 1. 
概述 @@ -148,152 +147,13 @@ PROPERTIES | 动态分区 | 系统按时间调度规则自动创建/回收 | 时间序列数据,希望自动滚动维护近 N 天/周/月分区 | | 自动分区 | 数据写入时按需创建 | 分区取值不可预知(如多租户、稀疏时间),希望避免预创建 | -下面给出常见组合的建表示例: - - - - -[自动分区](./auto-partitioning) 支持在数据导入时根据用户定义的规则自动创建对应分区,使用更为便捷。将基础示例改写为自动 Range 分区: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- 使用月作为分区粒度 -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1" -); -``` - -如上建表,当数据导入时,Doris 将自动按月级别为 `date` 列创建对应分区。例如 `2018-12-01` 与 `2018-12-31` 会落入同一个分区,而 `2018-11-12` 会落入另一个分区。自动分区还支持 List 分区,更多用法请查看自动分区文档。 - - - - - -[动态分区](./dynamic-partitioning) 是根据现实时间进行自动分区创建与回收的管理方式。将基础示例改写为动态分区: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -PARTITION BY RANGE(`date`) -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "WEEK", --- 分区粒度为周 - "dynamic_partition.start" = "-2", --- 向前保留两周 
- "dynamic_partition.end" = "2", --- 提前创建后两周 - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -动态分区支持分层存储、自定义副本数等功能,详见动态分区文档。 - - - - - -自动分区与动态分区各有优势,二者结合可实现分区的灵活按需创建与自动回收: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- 使用月作为分区粒度 -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "month", --- 二者粒度必须相同 - "dynamic_partition.start" = "-2", --- 动态分区自动清理超过两周的历史分区 - "dynamic_partition.end" = "0", --- 动态分区不创建未来分区,完全交给自动分区 - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -关于该功能的细节说明,详见 [自动分区与动态分区联用](./auto-partitioning#与动态分区联用)。 - - - - +各模式开箱即用的建表示例,以及自动分区与动态分区的组合用法,见[自动分区](./auto-partitioning)、[动态分区](./dynamic-partitioning)与[手动分区](./manual-partitioning)。 ## 5. 
进阶:分桶进阶 ### 5.1 自动分桶 -当用户不确定合理的分桶数时,可以使用自动分桶让 Doris 完成估计,用户仅需提供预估的表数据量: - -```sql -CREATE TABLE IF NOT EXISTS example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -PARTITION BY RANGE(`date`) -( - PARTITION `p201701` VALUES LESS THAN ("2017-02-01"), - PARTITION `p201702` VALUES LESS THAN ("2017-03-01"), - PARTITION `p201703` VALUES LESS THAN ("2017-04-01"), - PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01")) -) -DISTRIBUTED BY HASH(`user_id`) BUCKETS AUTO -PROPERTIES -( - "replication_num" = "1", - "estimate_partition_size" = "2G" --- 用户估计一个分区将有的数据量,不提供则默认为 10G -); -``` - -需要注意的是,该方式不适用于表数据量特别大的场景。 +当不确定合理的分桶数时,可设置 `BUCKETS AUTO`,由 Doris 根据预估数据量(`estimate_partition_size`)自动确定分桶数。该方式不适用于数据量极大的表。详见[数据分桶](./data-bucketing)。 ### 5.2 Colocate(同分布) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/dynamic-partitioning.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/dynamic-partitioning.md index 093088b0ed49f..101da7c7a3bba 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/dynamic-partitioning.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/dynamic-partitioning.md @@ -1,13 +1,14 @@ --- { "title": "动态分区", + "sidebar_label": "动态分区(旧版)", "language": "zh-CN", "description": "动态分区按规则滚动创建和删除分区,实现表分区生命周期管理(TTL)。适用于日志、时序数据等需要自动清理过期数据的场景。" } --- -:::info 提示 -更推荐使用[自动分区](./auto-partitioning)实现分区自动管理,它是动态分区的上位替代。 +:::info 旧版 
+动态分区已被[自动分区](./auto-partitioning)取代,后者是其在分区自动管理上的上位替代。新表请使用自动分区;本文用于维护已有的动态分区表。 ::: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/overview.md new file mode 100644 index 0000000000000..ec54326a1eaff --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/overview.md @@ -0,0 +1,78 @@ +--- +{ + "title": "分区与分桶", + "language": "zh-CN", + "description": "为 Doris 表选择推荐的分区与分桶方式,以及何时自定义:自动、动态、手动分区,分桶方式与分桶数。" +} +--- + +Doris 将一张表分为两层组织:分区按列值拆分数据行,分桶将每个分区切分为多个分片以实现并行。本文给出推荐的起步配置,并说明何时需要自定义。 + +## 推荐起步配置 + +大多数表建议按时间分区,并让 Doris 自动创建分区、自动确定分桶数: + +```sql +CREATE TABLE sales ( + sale_time DATETIME NOT NULL, + order_id BIGINT NOT NULL, + amount DECIMAL(10, 2) +) +DUPLICATE KEY(sale_time, order_id) +AUTO PARTITION BY RANGE (date_trunc(sale_time, 'day')) () +DISTRIBUTED BY HASH(order_id) BUCKETS AUTO; +``` + +- **自动分区(Auto Partition)**:数据写入时按需创建分区,无需预先定义或回填分区范围。 +- **`BUCKETS AUTO`**:由 Doris 根据数据量自动确定分片数量。 +- 基于 `sale_time` 的分区裁剪与跨分桶的并行扫描可保证查询性能。 + +如果表没有时间列,或数据量较小(约 1 GB 以内),使用单分区加固定分桶数即可: + +```sql +DISTRIBUTED BY HASH(order_id) BUCKETS 10 +``` + +## 选择你的设计 + +仅在默认方式不适用时才自定义: + +| 决策项 | 推荐默认 | 何时调整 | +| --- | --- | --- | +| 如何分区 | [自动分区](./auto-partitioning) | 对于自动分区无法表达的方案,使用[手动分区](./manual-partitioning):自定义或不规则范围、数值列范围,或将多个值归入同一分区的 LIST。[动态分区](./dynamic-partitioning)已被自动分区取代。 | +| 分桶方式 | 按高基数列做 Hash 分桶 | 数据倾斜或需按任意维度过滤时,用 Random 分桶([数据分桶](./data-bucketing)) | +| 分桶数量 | `BUCKETS AUTO` | 已知数据量并希望固定控制时,手动设置分桶数([数据分桶](./data-bucketing)) | + +## 让旧分区过期 + +如需自动删除旧数据,可设置保留策略。两种模式都保留最近的分区、删除更早的分区,区别在于保留上限的表达方式: + +| 分区模式 | 属性 | 保留上限 | +| --- | --- | --- | +| [动态分区](./dynamic-partitioning) | `dynamic_partition.start`(例如 `-7`) | 时间窗口:保留相对当前时间最近 N 个时间单位内的分区 | +| [自动分区](./auto-partitioning)(RANGE) | `partition.retention_count`(例如 `3`) | 分区数量:保留最新的 N 个历史分区 | + +对于规则的时间分区(如每天一个),两者基本等价:“最近 7 天”等于“最新的 7 
个按天分区”。当分区不规则或数据陈旧时二者会出现差异:一旦数据比时间窗口更旧,按时间窗口可能删除全部分区,而按数量始终保留最新的 N 个。 + +不再推荐将自动分区与动态分区组合用于数据保留;自动 RANGE 分区表请使用 `partition.retention_count`。 + +数据保留是**删除**数据。如果希望将冷数据迁移到更廉价的存储而非删除,请改用[分层存储](../tiered-storage/overview)。 + +## 工作原理 + +Doris 将数据按两层映射: + +```text +表 ──► 分区(按列值)──► 分桶(Hash 或 Random)──► Tablet(BE 节点上的分片) +``` + +分区用于数据裁剪与生命周期管理(如按时间归档或删除),分桶将每个分区分散到多个 Tablet 以实现读写并行。完整的数据分布模型(包括 Tablet、副本及其与节点的映射),见[分区与分桶原理](./basic-concepts)。 + +## 后续步骤 + +- [自动分区](./auto-partitioning):默认方式,无需手动维护分区范围。 +- [动态分区](./dynamic-partitioning):按时间滚动并保留窗口。 +- [手动分区](./manual-partitioning):显式声明 Range 与 List 分区。 +- [数据分桶](./data-bucketing):选择分桶方式、分桶键与分桶数。 +- [分区与分桶原理](./basic-concepts):底层数据分布模型。 +- [常见问题](./common-issues):分区与分桶设计的排查方法。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx index a280f7cfc1efc..7d51e12be4f21 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx @@ -1,18 +1,17 @@ --- { - "title": "基本概念", + "title": "分区与分桶原理", + "sidebar_label": "工作原理", "language": "zh-CN", "description": "由浅入深介绍 Doris 的分区与分桶机制:从核心概念、第一个建表示例到自动/动态分区、自动分桶、Colocate 等进阶能力,并给出分区分桶的设计建议与运维方法。" } --- -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; {/* 知识类型: 概念介绍 / 操作步骤 */} {/* 适用场景: 建表设计 / 数据组织与管理 */} -本文介绍 Doris 的分区(Partition)与分桶(Bucket)机制,帮助用户合理设计表结构以提升查询性能与数据管理效率。建议新手按章节顺序阅读:第 1–3 节涵盖核心概念与第一个建表示例,第 4–6 节为进阶特性与设计建议,第 7 节为日常运维所需的查看与修改方法。 +本文介绍 Doris 如何将数据分布到分区、分桶与 Tablet,并涵盖进阶的分区与分桶模式、设计建议与运维命令。若需要推荐的起步配置与选型指南,请先阅读 [分区与分桶](./overview);当你希望理解底层模型时再阅读本文。 ## 1. 
概述 @@ -148,152 +147,13 @@ PROPERTIES | 动态分区 | 系统按时间调度规则自动创建/回收 | 时间序列数据,希望自动滚动维护近 N 天/周/月分区 | | 自动分区 | 数据写入时按需创建 | 分区取值不可预知(如多租户、稀疏时间),希望避免预创建 | -下面给出常见组合的建表示例: - - - - -[自动分区](./auto-partitioning) 支持在数据导入时根据用户定义的规则自动创建对应分区,使用更为便捷。将基础示例改写为自动 Range 分区: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- 使用月作为分区粒度 -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1" -); -``` - -如上建表,当数据导入时,Doris 将自动按月级别为 `date` 列创建对应分区。例如 `2018-12-01` 与 `2018-12-31` 会落入同一个分区,而 `2018-11-12` 会落入另一个分区。自动分区还支持 List 分区,更多用法请查看自动分区文档。 - - - - - -[动态分区](./dynamic-partitioning) 是根据现实时间进行自动分区创建与回收的管理方式。将基础示例改写为动态分区: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -PARTITION BY RANGE(`date`) -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "WEEK", --- 分区粒度为周 - "dynamic_partition.start" = "-2", --- 向前保留两周 
- "dynamic_partition.end" = "2", --- 提前创建后两周 - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -动态分区支持分层存储、自定义副本数等功能,详见动态分区文档。 - - - - - -自动分区与动态分区各有优势,二者结合可实现分区的灵活按需创建与自动回收: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- 使用月作为分区粒度 -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "month", --- 二者粒度必须相同 - "dynamic_partition.start" = "-2", --- 动态分区自动清理超过两周的历史分区 - "dynamic_partition.end" = "0", --- 动态分区不创建未来分区,完全交给自动分区 - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -关于该功能的细节说明,详见 [自动分区与动态分区联用](./auto-partitioning#与动态分区联用)。 - - - - +各模式开箱即用的建表示例,以及自动分区与动态分区的组合用法,见[自动分区](./auto-partitioning)、[动态分区](./dynamic-partitioning)与[手动分区](./manual-partitioning)。 ## 5. 
进阶:分桶进阶 ### 5.1 自动分桶 -当用户不确定合理的分桶数时,可以使用自动分桶让 Doris 完成估计,用户仅需提供预估的表数据量: - -```sql -CREATE TABLE IF NOT EXISTS example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "用户 id", - `date` DATE NOT NULL COMMENT "数据灌入日期时间", - `timestamp` DATETIME NOT NULL COMMENT "数据灌入的时间戳", - `city` VARCHAR(20) COMMENT "用户所在城市", - `age` SMALLINT COMMENT "用户年龄", - `sex` TINYINT COMMENT "用户性别", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", - `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" -) -PARTITION BY RANGE(`date`) -( - PARTITION `p201701` VALUES LESS THAN ("2017-02-01"), - PARTITION `p201702` VALUES LESS THAN ("2017-03-01"), - PARTITION `p201703` VALUES LESS THAN ("2017-04-01"), - PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01")) -) -DISTRIBUTED BY HASH(`user_id`) BUCKETS AUTO -PROPERTIES -( - "replication_num" = "1", - "estimate_partition_size" = "2G" --- 用户估计一个分区将有的数据量,不提供则默认为 10G -); -``` - -需要注意的是,该方式不适用于表数据量特别大的场景。 +当不确定合理的分桶数时,可设置 `BUCKETS AUTO`,由 Doris 根据预估数据量(`estimate_partition_size`)自动确定分桶数。该方式不适用于数据量极大的表。详见[数据分桶](./data-bucketing)。 ### 5.2 Colocate(同分布) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md index 093088b0ed49f..101da7c7a3bba 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md @@ -1,13 +1,14 @@ --- { "title": "动态分区", + "sidebar_label": "动态分区(旧版)", "language": "zh-CN", "description": "动态分区按规则滚动创建和删除分区,实现表分区生命周期管理(TTL)。适用于日志、时序数据等需要自动清理过期数据的场景。" } --- -:::info 提示 -更推荐使用[自动分区](./auto-partitioning)实现分区自动管理,它是动态分区的上位替代。 +:::info 旧版 
+动态分区已被[自动分区](./auto-partitioning)取代,后者是其在分区自动管理上的上位替代。新表请使用自动分区;本文用于维护已有的动态分区表。 ::: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/overview.md new file mode 100644 index 0000000000000..ec54326a1eaff --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/overview.md @@ -0,0 +1,78 @@ +--- +{ + "title": "分区与分桶", + "language": "zh-CN", + "description": "为 Doris 表选择推荐的分区与分桶方式,以及何时自定义:自动、动态、手动分区,分桶方式与分桶数。" +} +--- + +Doris 将一张表分为两层组织:分区按列值拆分数据行,分桶将每个分区切分为多个分片以实现并行。本文给出推荐的起步配置,并说明何时需要自定义。 + +## 推荐起步配置 + +大多数表建议按时间分区,并让 Doris 自动创建分区、自动确定分桶数: + +```sql +CREATE TABLE sales ( + sale_time DATETIME NOT NULL, + order_id BIGINT NOT NULL, + amount DECIMAL(10, 2) +) +DUPLICATE KEY(sale_time, order_id) +AUTO PARTITION BY RANGE (date_trunc(sale_time, 'day')) () +DISTRIBUTED BY HASH(order_id) BUCKETS AUTO; +``` + +- **自动分区(Auto Partition)**:数据写入时按需创建分区,无需预先定义或回填分区范围。 +- **`BUCKETS AUTO`**:由 Doris 根据数据量自动确定分片数量。 +- 基于 `sale_time` 的分区裁剪与跨分桶的并行扫描可保证查询性能。 + +如果表没有时间列,或数据量较小(约 1 GB 以内),使用单分区加固定分桶数即可: + +```sql +DISTRIBUTED BY HASH(order_id) BUCKETS 10 +``` + +## 选择你的设计 + +仅在默认方式不适用时才自定义: + +| 决策项 | 推荐默认 | 何时调整 | +| --- | --- | --- | +| 如何分区 | [自动分区](./auto-partitioning) | 对于自动分区无法表达的方案,使用[手动分区](./manual-partitioning):自定义或不规则范围、数值列范围,或将多个值归入同一分区的 LIST。[动态分区](./dynamic-partitioning)已被自动分区取代。 | +| 分桶方式 | 按高基数列做 Hash 分桶 | 数据倾斜或需按任意维度过滤时,用 Random 分桶([数据分桶](./data-bucketing)) | +| 分桶数量 | `BUCKETS AUTO` | 已知数据量并希望固定控制时,手动设置分桶数([数据分桶](./data-bucketing)) | + +## 让旧分区过期 + +如需自动删除旧数据,可设置保留策略。两种模式都保留最近的分区、删除更早的分区,区别在于保留上限的表达方式: + +| 分区模式 | 属性 | 保留上限 | +| --- | --- | --- | +| [动态分区](./dynamic-partitioning) | `dynamic_partition.start`(例如 `-7`) | 时间窗口:保留相对当前时间最近 N 个时间单位内的分区 | +| [自动分区](./auto-partitioning)(RANGE) | `partition.retention_count`(例如 `3`) | 分区数量:保留最新的 N 个历史分区 | + +对于规则的时间分区(如每天一个),两者基本等价:“最近 7 
天”等于“最新的 7 个按天分区”。当分区不规则或数据陈旧时二者会出现差异:一旦数据比时间窗口更旧,按时间窗口可能删除全部分区,而按数量始终保留最新的 N 个。 + +不再推荐将自动分区与动态分区组合用于数据保留;自动 RANGE 分区表请使用 `partition.retention_count`。 + +数据保留是**删除**数据。如果希望将冷数据迁移到更廉价的存储而非删除,请改用[分层存储](../tiered-storage/overview)。 + +## 工作原理 + +Doris 将数据按两层映射: + +```text +表 ──► 分区(按列值)──► 分桶(Hash 或 Random)──► Tablet(BE 节点上的分片) +``` + +分区用于数据裁剪与生命周期管理(如按时间归档或删除),分桶将每个分区分散到多个 Tablet 以实现读写并行。完整的数据分布模型(包括 Tablet、副本及其与节点的映射),见[分区与分桶原理](./basic-concepts)。 + +## 后续步骤 + +- [自动分区](./auto-partitioning):默认方式,无需手动维护分区范围。 +- [动态分区](./dynamic-partitioning):按时间滚动并保留窗口。 +- [手动分区](./manual-partitioning):显式声明 Range 与 List 分区。 +- [数据分桶](./data-bucketing):选择分桶方式、分桶键与分桶数。 +- [分区与分桶原理](./basic-concepts):底层数据分布模型。 +- [常见问题](./common-issues):分区与分桶设计的排查方法。 diff --git a/sidebars.ts b/sidebars.ts index 34b9d609ff1b7..864acc50b437c 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -208,12 +208,13 @@ const sidebars: SidebarsConfig = { { type: 'category', label: 'Partitioning & Bucketing', - link: {type: 'doc', id: 'table-design/data-partitioning/basic-concepts'}, + link: {type: 'doc', id: 'table-design/data-partitioning/overview'}, items: [ + 'table-design/data-partitioning/auto-partitioning', 'table-design/data-partitioning/manual-partitioning', 'table-design/data-partitioning/dynamic-partitioning', - 'table-design/data-partitioning/auto-partitioning', 'table-design/data-partitioning/data-bucketing', + 'table-design/data-partitioning/basic-concepts', 'table-design/data-partitioning/common-issues', ], }, diff --git a/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx b/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx index 1c11a1f64ee46..5923953a0ec6c 100644 --- a/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx +++ b/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx @@ -1,18 +1,17 @@ --- { - "title": "Basic Concepts", + "title": "How Partitioning and Bucketing Work", + "sidebar_label": "How 
It Works", "language": "en", - "description": "A progressive introduction to Doris partitioning and bucketing: from core concepts and the first CREATE TABLE example to auto/dynamic partitioning, auto-bucketing, Colocate, and other advanced capabilities, along with design recommendations and operational guidance for partitions and buckets." + "description": "The data-distribution model behind Doris partitioning and bucketing: partitions, buckets, tablets, and nodes, plus advanced partition and bucket modes, design recommendations, and operational guidance." } --- -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; {/* Knowledge type: Concept introduction / Procedure */} {/* Applicable scenarios: Table design / Data organization and management */} -This document introduces the partitioning (Partition) and bucketing (Bucket) mechanisms of Doris, helping you design table structures reasonably to improve query performance and data management efficiency. New users are recommended to read the sections in order: Sections 1-3 cover core concepts and the first CREATE TABLE example, Sections 4-6 cover advanced features and design recommendations, and Section 7 covers the methods for viewing and modifying partitions needed for daily operations. +This page explains how Doris distributes data across partitions, buckets, and tablets. It also covers the advanced partition and bucket modes, design recommendations, and operational commands. For a recommended starting configuration and a decision guide, start with [Partitioning and Bucketing](./overview), then read this page when you want to understand the underlying model. ## 1. 
Overview @@ -148,152 +147,13 @@ Besides manually declaring partitions at table creation time, Doris also support | Dynamic partition | Automatically created/recycled by the system based on time scheduling rules | Time-series data, where you want to automatically maintain rolling partitions for the past N days/weeks/months | | Auto partition | Created on demand when data is written | Partition values are unpredictable (such as multi-tenant or sparse time), where pre-creation should be avoided | -The following shows CREATE TABLE examples for common combinations: - - - - -[Auto Partition](./auto-partitioning) supports automatically creating corresponding partitions according to user-defined rules during data ingestion, making it more convenient to use. The basic example rewritten as Auto Range partition: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- Use month as the partition granularity -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1" -); -``` - -With this CREATE TABLE statement, when data is loaded, Doris automatically creates corresponding partitions for the `date` column at the month level. For example, `2018-12-01` and `2018-12-31` fall into the same partition, while `2018-11-12` falls into another partition. Auto Partition also supports List partitioning. 
For more usage, see the Auto Partition documentation. - - - - - -[Dynamic Partition](./dynamic-partitioning) is a management approach that automatically creates and recycles partitions based on real time. The basic example rewritten as Dynamic Partition: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -PARTITION BY RANGE(`date`) -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "WEEK", --- Partition granularity is week - "dynamic_partition.start" = "-2", --- Retain the past two weeks - "dynamic_partition.end" = "2", --- Pre-create the next two weeks - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -Dynamic Partition supports tiered storage, custom replica counts, and more. See the Dynamic Partition documentation for details. - - - - - -Auto Partition and Dynamic Partition each have their own advantages. 
Combining the two enables flexible on-demand creation and automatic recycling of partitions: - -```sql -CREATE TABLE example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -AUTO PARTITION BY RANGE(date_trunc(`date`, 'month')) --- Use month as the partition granularity -() -DISTRIBUTED BY HASH(`user_id`) BUCKETS 16 -PROPERTIES -( - "replication_num" = "1", - "dynamic_partition.enable" = "true", - "dynamic_partition.time_unit" = "month", --- The two granularities must be the same - "dynamic_partition.start" = "-2", --- Dynamic Partition automatically cleans up historical partitions older than two weeks - "dynamic_partition.end" = "0", --- Dynamic Partition does not create future partitions; this is fully delegated to Auto Partition - "dynamic_partition.prefix" = "p", - "dynamic_partition.buckets" = "8" -); -``` - -For details about this feature, see [Using Auto Partition with Dynamic Partition](./auto-partitioning#lifecycle-management). - - - - +For ready-to-use CREATE TABLE examples of each mode, including combining auto with dynamic partitioning, see [Auto Partitioning](./auto-partitioning), [Dynamic Partitioning](./dynamic-partitioning), and [Manual Partitioning](./manual-partitioning). ## 5. Advanced: Bucketing ### 5.1 Auto Bucketing -When you are not sure about a reasonable number of buckets, you can use Auto Bucketing to let Doris perform the estimation. 
You only need to provide the estimated table data size: - -```sql -CREATE TABLE IF NOT EXISTS example_range_tbl -( - `user_id` LARGEINT NOT NULL COMMENT "User ID", - `date` DATE NOT NULL COMMENT "Data ingestion date", - `timestamp` DATETIME NOT NULL COMMENT "Data ingestion timestamp", - `city` VARCHAR(20) COMMENT "User's city", - `age` SMALLINT COMMENT "User age", - `sex` TINYINT COMMENT "User gender", - `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time", - `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user spending", - `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum user dwell time", - `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum user dwell time" -) -PARTITION BY RANGE(`date`) -( - PARTITION `p201701` VALUES LESS THAN ("2017-02-01"), - PARTITION `p201702` VALUES LESS THAN ("2017-03-01"), - PARTITION `p201703` VALUES LESS THAN ("2017-04-01"), - PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01")) -) -DISTRIBUTED BY HASH(`user_id`) BUCKETS AUTO -PROPERTIES -( - "replication_num" = "1", - "estimate_partition_size" = "2G" --- Estimated data volume for one partition; defaults to 10G if not provided -); -``` - -Note that this approach is not suitable for scenarios with extremely large table data volumes. +When you are unsure how many buckets to use, set `BUCKETS AUTO` and let Doris size them from an estimated data volume (`estimate_partition_size`). This is not suitable for extremely large tables. For details, see [Data Bucketing](./data-bucketing). 
### 5.2 Colocate diff --git a/versioned_docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md b/versioned_docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md index daff66ad6d763..440dae1687b85 100644 --- a/versioned_docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md +++ b/versioned_docs/version-4.x/table-design/data-partitioning/dynamic-partitioning.md @@ -1,13 +1,14 @@ --- { "title": "Dynamic Partitioning", + "sidebar_label": "Dynamic Partitioning (Legacy)", "language": "en", "description": "Dynamic partitioning rolls partitions forward by creating and dropping them on a schedule, providing partition lifecycle management (TTL) for tables. It applies to scenarios such as logs and time-series data that need automatic cleanup of expired data." } --- -:::info Tip -[Auto Partitioning](./auto-partitioning) is the recommended approach for automatic partition management. It is the successor to dynamic partitioning. +:::info Legacy +Dynamic partitioning is superseded by [auto partitioning](./auto-partitioning), its successor for automatic partition management. Use auto partitioning for new tables; this page is kept for existing dynamic-partition tables. ::: diff --git a/versioned_docs/version-4.x/table-design/data-partitioning/overview.md b/versioned_docs/version-4.x/table-design/data-partitioning/overview.md new file mode 100644 index 0000000000000..5e8d6b8c854bf --- /dev/null +++ b/versioned_docs/version-4.x/table-design/data-partitioning/overview.md @@ -0,0 +1,78 @@ +--- +{ + "title": "Partitioning and Bucketing", + "language": "en", + "description": "The recommended partitioning and bucketing for a Doris table, and when to customize: auto, dynamic, and manual partitioning, bucketing method, and bucket count." +} +--- + +Doris organizes a table in two tiers: partitions split rows by column value, and buckets split each partition into shards for parallel processing. 
This page gives the recommended starting point and shows when to customize. + +## Recommended Starting Point + +For most tables, partition by time and let Doris manage partition creation and bucket sizing automatically: + +```sql +CREATE TABLE sales ( + sale_time DATETIME NOT NULL, + order_id BIGINT NOT NULL, + amount DECIMAL(10, 2) +) +DUPLICATE KEY(sale_time, order_id) +AUTO PARTITION BY RANGE (date_trunc(sale_time, 'day')) () +DISTRIBUTED BY HASH(order_id) BUCKETS AUTO; +``` + +- **Auto partitioning** creates a partition as data arrives, so you never pre-define or backfill partition ranges. +- **`BUCKETS AUTO`** lets Doris size the number of shards from the data. +- Partition pruning on `sale_time` and parallel scans across buckets keep queries fast. + +If the table has no time column or stays small (under about 1 GB), use a single partition with a fixed bucket count: + +```sql +DISTRIBUTED BY HASH(order_id) BUCKETS 10 +``` + +## Choose Your Design + +Customize only when the default does not fit: + +| Decision | Recommended default | Change it when | +| --- | --- | --- | +| How to partition | [Auto partitioning](./auto-partitioning) | Use [manual partitioning](./manual-partitioning) for schemes auto cannot express: custom or irregular ranges, ranges on a numeric column, or grouped LIST values. [Dynamic partitioning](./dynamic-partitioning) is superseded by auto. | +| Bucketing method | Hash on a high-cardinality column | If data skews, or you filter on arbitrary dimensions, use random bucketing ([Data Bucketing](./data-bucketing)) | +| Number of buckets | `BUCKETS AUTO` | If you know your data size and want fixed control, set a count ([Data Bucketing](./data-bucketing)) | + +## Expire Old Partitions + +To drop old data automatically, set a retention policy. 
+Both modes keep the most recent partitions and drop older ones; they differ in how you express the limit: + +| Partition mode | Property | Retention limit | +| --- | --- | --- | +| [Dynamic partitioning](./dynamic-partitioning) | `dynamic_partition.start` (for example, `-7`) | A time window: keep partitions within the last N time units of now | +| [Auto partitioning](./auto-partitioning) (RANGE) | `partition.retention_count` (for example, `3`) | A partition count: keep the newest N historical partitions | + +With regular time partitions (such as one per day), the two are effectively equivalent: "last 7 days" matches "newest 7 daily partitions." They diverge when partitions are irregular or data is stale: a time window can drop every partition once the data is older than the window, whereas a count always keeps the newest N. + +Combining auto and dynamic partitioning for retention is no longer recommended; use `partition.retention_count` for auto-range tables. + +Retention **drops** data. To move cold data to cheaper storage rather than dropping it, use [tiered storage](../tiered-storage/overview). + +## How It Works + +Doris maps data in two tiers: + +```text +Table ──► Partition (by column value) ──► Bucket (hash or random) ──► Tablet (shard on a BE node) +``` + +Partitions let Doris skip data that can't match a query, and make it easy to archive or drop data by time. Buckets spread each partition across tablets for parallel reads and writes. For the full data-distribution model, including tablets, replicas, and how they map to nodes, see [How Partitioning and Bucketing Work](./basic-concepts). + +## Next Steps + +- [Auto Partitioning](./auto-partitioning): the default, with no manual range maintenance. +- [Dynamic Partitioning](./dynamic-partitioning): rolling time windows with retention. +- [Manual Partitioning](./manual-partitioning): explicit ranges and list partitions. +- [Data Bucketing](./data-bucketing): choose the method, key, and bucket count. 
+- [How Partitioning and Bucketing Work](./basic-concepts): the underlying data-distribution model. +- [Common Issues](./common-issues): troubleshooting partition and bucket design. diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json index 1963c0873aace..fe81aa4dbb87c 100644 --- a/versioned_sidebars/version-4.x-sidebars.json +++ b/versioned_sidebars/version-4.x-sidebars.json @@ -246,13 +246,14 @@ "label": "Partitioning & Bucketing", "link": { "type": "doc", - "id": "table-design/data-partitioning/basic-concepts" + "id": "table-design/data-partitioning/overview" }, "items": [ + "table-design/data-partitioning/auto-partitioning", "table-design/data-partitioning/manual-partitioning", "table-design/data-partitioning/dynamic-partitioning", - "table-design/data-partitioning/auto-partitioning", "table-design/data-partitioning/data-bucketing", + "table-design/data-partitioning/basic-concepts", "table-design/data-partitioning/common-issues" ] },