diff --git a/docs/table-design/data-partitioning/basic-concepts.mdx b/docs/table-design/data-partitioning/basic-concepts.mdx
index 1c11a1f64ee46..d6a0a1941e787 100644
--- a/docs/table-design/data-partitioning/basic-concepts.mdx
+++ b/docs/table-design/data-partitioning/basic-concepts.mdx
@@ -16,7 +16,7 @@ This document introduces the partitioning (Partition) and bucketing (Bucket) mec
 
 ## 1. Overview
 
-Doris uses a two-tier data partitioning approach of **Partition + Bucket** to organize the data of a table in an orderly manner across the nodes of the cluster:
+Doris distributes a table's data across the cluster in two tiers, **partitions** and **buckets**:
 
 - **Partition**: horizontally divides the table into smaller subsets by column values (such as time or region), making it easier to perform query pruning and data lifecycle management.
 - **Bucket**: further evenly distributes data within each partition into multiple data shards (Tablets), fully utilizing the parallelism of the cluster.
@@ -29,7 +29,7 @@ The data flow can be summarized as:
 Table ──► Partition ──► Bucket ──► Tablet (data shard, stored on BE nodes)
 ```
 
-A reasonable partitioning and bucketing design brings the following benefits at the same time: **faster queries** (partition pruning, parallel scanning), **more flexible management** (archiving/cleanup by time), and **more even writes** (avoiding hotspots).
+Good partitioning and bucketing give you three things at once: **faster queries** (partition pruning and parallel scans), **easier management** (archive or clean up by time), and **more even writes** (no hotspots).
 
 ## 2. Core Concepts
 
@@ -62,27 +62,27 @@ Doris supports two **partition types**:
 
 If no partition is specified at table creation time, Doris generates a default partition that is transparent to the user, containing all the data in the table.
 
-A reasonable partition design brings the following benefits:
+Good partition design provides:
 
-- **Improved query performance**: through partition pruning, the system can filter out irrelevant partitions based on query conditions, reducing the amount of data scanned and significantly lowering the I/O burden, which is especially suitable for large-scale datasets.
-- **Management flexibility**: data can be split along logical dimensions such as time or region, making archiving, cleanup, and backup easier. For example, partitioning by time enables efficient management of historical and incremental data, supporting time-based data maintenance strategies.
+- **Faster queries**: partition pruning skips partitions that can't match the query, so Doris scans less data and does less I/O. This matters most on large datasets.
+- **Easier management**: splitting by time or region makes archiving, cleanup, and backup simpler. For example, partitioning by time lets you manage historical and incoming data separately.
 
 ### 2.3 Bucket
 
 Bucketing further divides the data within a partition into smaller, mutually disjoint data units according to certain rules. Each row of data belongs to exactly one specific bucket.
 
-Unlike partitions that divide by ranges of column values, the goal of bucketing is to **evenly distribute** data across predefined buckets, thereby reducing data skew and improving query execution performance through better data locality.
+Partitions divide data by ranges or lists of column values. Bucketing instead spreads data **evenly** across the buckets in a partition, which reduces skew and improves query performance through better data locality.
 
 Doris supports two **bucketing methods**:
 
 - **Hash bucketing**: computes the `crc32` hash of the bucketing column values and takes the modulo with the number of buckets to evenly distribute the data.
 - **Random bucketing**: randomly assigns data to buckets. When using Random bucketing, you can combine the `load_to_single_tablet` parameter to optimize fast writes for small-scale data.
 
-A reasonable bucketing design brings the following benefits:
+Good bucketing provides:
 
-- **Even data distribution**: reduces the risk of data concentration or skew, and avoids overloading some nodes or storage devices.
-- **Reduced hotspots**: prevents some nodes or partitions from being overloaded, improving system stability and processing capability.
-- **Improved concurrent performance**: when multiple query requests need to access different data within the same partition, bucketing allows the system to process multiple requests in parallel effectively, thereby improving throughput.
+- **Even data distribution**: less risk of skew, and no single node or disk gets overloaded.
+- **Fewer hotspots**: no node or partition gets overloaded, which keeps the system stable.
+- **Better concurrency**: Doris reads different buckets in the same partition in parallel, which improves throughput.
 
 ### 2.4 Tablet and Node Architecture
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx
index a280f7cfc1efc..7666f3eb6b432 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/data-partitioning/basic-concepts.mdx
@@ -64,14 +64,14 @@ Doris 支持两种**分区类型**:
 
 合理分区可以带来以下收益:
 
-- **查询性能提升**:通过分区裁剪,系统可以根据查询条件过滤掉不相关的分区,减少数据扫描量,显著降低 I/O 负担,特别适合大规模数据集;
-- **管理灵活性**:可按时间、地域等逻辑维度对数据进行分割,便于归档、清理和备份。例如按时间分区可高效管理历史数据与新增数据,支持基于时间的数据维护策略。
+- **查询更快**:分区裁剪会跳过无法匹配查询的分区,从而减少扫描的数据量和 I/O;数据集越大,收益越明显。
+- **管理更简单**:按时间或地域切分,便于归档、清理和备份。例如按时间分区,可分别管理历史数据与新增数据。
 
 ### 2.3 分桶(Bucket)
 
 分桶是指将一个分区中的数据,按照某种规则进一步划分为更小的、互不相交的数据单元。每一行数据属于且仅属于一个特定的分桶。
 
-与按列值范围划分的分区不同,分桶的目标是将数据**均匀分布**到预定义的桶中,从而减少数据倾斜,并通过提高数据局部性来提升查询执行性能。
+分区按列值的范围或枚举来划分数据;分桶则在分区内将数据**均匀分布**到各个桶中,从而减少数据倾斜,并通过更好的数据局部性提升查询性能。
 
 Doris 支持两种**分桶方式**:
 
@@ -82,7 +82,7 @@ Doris 支持两种**分桶方式**:
 
 - **数据均匀分布**:减少数据集中或倾斜的风险,避免部分节点或存储设备资源过载;
 - **减少热点**:避免某些节点或分区过度负载,提升系统稳定性和处理能力;
-- **提高并发性能**:当多个查询请求需要访问同一分区中的不同数据时,分桶可使系统有效地并行处理多个请求,从而提升吞吐量。
+- **提高并发性能**:Doris 可以并行读取同一分区中的不同分桶,从而提升吞吐量。
 
 ### 2.4 数据分片(Tablet)与节点架构
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
index a280f7cfc1efc..7666f3eb6b432 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
@@ -64,14 +64,14 @@ Doris 支持两种**分区类型**:
 
 合理分区可以带来以下收益:
 
-- **查询性能提升**:通过分区裁剪,系统可以根据查询条件过滤掉不相关的分区,减少数据扫描量,显著降低 I/O 负担,特别适合大规模数据集;
-- **管理灵活性**:可按时间、地域等逻辑维度对数据进行分割,便于归档、清理和备份。例如按时间分区可高效管理历史数据与新增数据,支持基于时间的数据维护策略。
+- **查询更快**:分区裁剪会跳过无法匹配查询的分区,从而减少扫描的数据量和 I/O;数据集越大,收益越明显。
+- **管理更简单**:按时间或地域切分,便于归档、清理和备份。例如按时间分区,可分别管理历史数据与新增数据。
 
 ### 2.3 分桶(Bucket)
 
 分桶是指将一个分区中的数据,按照某种规则进一步划分为更小的、互不相交的数据单元。每一行数据属于且仅属于一个特定的分桶。
 
-与按列值范围划分的分区不同,分桶的目标是将数据**均匀分布**到预定义的桶中,从而减少数据倾斜,并通过提高数据局部性来提升查询执行性能。
+分区按列值的范围或枚举来划分数据;分桶则在分区内将数据**均匀分布**到各个桶中,从而减少数据倾斜,并通过更好的数据局部性提升查询性能。
 
 Doris 支持两种**分桶方式**:
 
@@ -82,7 +82,7 @@ Doris 支持两种**分桶方式**:
 
 - **数据均匀分布**:减少数据集中或倾斜的风险,避免部分节点或存储设备资源过载;
 - **减少热点**:避免某些节点或分区过度负载,提升系统稳定性和处理能力;
-- **提高并发性能**:当多个查询请求需要访问同一分区中的不同数据时,分桶可使系统有效地并行处理多个请求,从而提升吞吐量。
+- **提高并发性能**:Doris 可以并行读取同一分区中的不同分桶,从而提升吞吐量。
 
 ### 2.4 数据分片(Tablet)与节点架构
 
diff --git a/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx b/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
index 1c11a1f64ee46..d6a0a1941e787 100644
--- a/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
+++ b/versioned_docs/version-4.x/table-design/data-partitioning/basic-concepts.mdx
@@ -16,7 +16,7 @@ This document introduces the partitioning (Partition) and bucketing (Bucket) mec
 
 ## 1. Overview
 
-Doris uses a two-tier data partitioning approach of **Partition + Bucket** to organize the data of a table in an orderly manner across the nodes of the cluster:
+Doris distributes a table's data across the cluster in two tiers, **partitions** and **buckets**:
 
 - **Partition**: horizontally divides the table into smaller subsets by column values (such as time or region), making it easier to perform query pruning and data lifecycle management.
 - **Bucket**: further evenly distributes data within each partition into multiple data shards (Tablets), fully utilizing the parallelism of the cluster.
@@ -29,7 +29,7 @@ The data flow can be summarized as:
 Table ──► Partition ──► Bucket ──► Tablet (data shard, stored on BE nodes)
 ```
 
-A reasonable partitioning and bucketing design brings the following benefits at the same time: **faster queries** (partition pruning, parallel scanning), **more flexible management** (archiving/cleanup by time), and **more even writes** (avoiding hotspots).
+Good partitioning and bucketing give you three things at once: **faster queries** (partition pruning and parallel scans), **easier management** (archive or clean up by time), and **more even writes** (no hotspots).
 
 ## 2. Core Concepts
 
@@ -62,27 +62,27 @@ Doris supports two **partition types**:
 
 If no partition is specified at table creation time, Doris generates a default partition that is transparent to the user, containing all the data in the table.
 
-A reasonable partition design brings the following benefits:
+Good partition design provides:
 
-- **Improved query performance**: through partition pruning, the system can filter out irrelevant partitions based on query conditions, reducing the amount of data scanned and significantly lowering the I/O burden, which is especially suitable for large-scale datasets.
-- **Management flexibility**: data can be split along logical dimensions such as time or region, making archiving, cleanup, and backup easier. For example, partitioning by time enables efficient management of historical and incremental data, supporting time-based data maintenance strategies.
+- **Faster queries**: partition pruning skips partitions that can't match the query, so Doris scans less data and does less I/O. This matters most on large datasets.
+- **Easier management**: splitting by time or region makes archiving, cleanup, and backup simpler. For example, partitioning by time lets you manage historical and incoming data separately.
 
 ### 2.3 Bucket
 
 Bucketing further divides the data within a partition into smaller, mutually disjoint data units according to certain rules. Each row of data belongs to exactly one specific bucket.
 
-Unlike partitions that divide by ranges of column values, the goal of bucketing is to **evenly distribute** data across predefined buckets, thereby reducing data skew and improving query execution performance through better data locality.
+Partitions divide data by ranges or lists of column values. Bucketing instead spreads data **evenly** across the buckets in a partition, which reduces skew and improves query performance through better data locality.
 
 Doris supports two **bucketing methods**:
 
 - **Hash bucketing**: computes the `crc32` hash of the bucketing column values and takes the modulo with the number of buckets to evenly distribute the data.
 - **Random bucketing**: randomly assigns data to buckets. When using Random bucketing, you can combine the `load_to_single_tablet` parameter to optimize fast writes for small-scale data.
 
-A reasonable bucketing design brings the following benefits:
+Good bucketing provides:
 
-- **Even data distribution**: reduces the risk of data concentration or skew, and avoids overloading some nodes or storage devices.
-- **Reduced hotspots**: prevents some nodes or partitions from being overloaded, improving system stability and processing capability.
-- **Improved concurrent performance**: when multiple query requests need to access different data within the same partition, bucketing allows the system to process multiple requests in parallel effectively, thereby improving throughput.
+- **Even data distribution**: less risk of skew, and no single node or disk gets overloaded.
+- **Fewer hotspots**: no node or partition gets overloaded, which keeps the system stable.
+- **Better concurrency**: Doris reads different buckets in the same partition in parallel, which improves throughput.
 
 ### 2.4 Tablet and Node Architecture