Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
139 changes: 94 additions & 45 deletions tidb-cloud/tidb-cloud-htap-quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,24 +6,39 @@ aliases: ['/tidbcloud/use-htap-cluster']

# TiDB Cloud HTAP Quick Start

[HTAP](https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing) means Hybrid Transactional and Analytical Processing. The HTAP architecture in TiDB Cloud is composed of [TiKV](https://tikv.org), a row-based storage engine designed for transactional processing, and [TiFlash](https://docs.pingcap.com/tidb/stable/tiflash-overview), a columnar storage designed for analytical processing. Your application data is first stored in TiKV and then replicated to TiFlash via the Raft consensus algorithm. So it is a real-time replication from the row-based storage to the columnar storage.
[HTAP](https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing) stands for Hybrid Transactional and Analytical Processing. The HTAP architecture in TiDB Cloud is composed of [TiKV](https://tikv.org), a row-based storage engine designed for transactional processing, and [TiFlash](https://docs.pingcap.com/tidb/stable/tiflash-overview), a columnar storage engine designed for analytical processing. Your application data is first stored in TiKV and then replicated to TiFlash via the Raft consensus algorithm. So it is a real-time replication from the row-based storage to the columnar storage.

This tutorial guides you through an easy way to experience the Hybrid Transactional and Analytical Processing (HTAP) feature of TiDB Cloud. The content includes how to replicate tables to TiFlash, how to run queries with TiFlash, and how to experience the performance boost.

## Before you begin

Before experiencing the HTAP feature, follow [TiDB Cloud Quick Start](/tidb-cloud/tidb-cloud-quickstart.md) to create a {{{ .starter }}} instance and import the **Steam Game Stats** sample dataset to the instance.
Before trying the HTAP feature, follow [TiDB Cloud Quick Start](/tidb-cloud/tidb-cloud-quickstart.md) to create a {{{ .starter }}} instance, and then import your data (this document uses the **Steam Games Dataset 2021-2025** as an example) into the instance as follows:

1. Download the [Steam Games Dataset 2021-2025](https://www.kaggle.com/datasets/jypenpen54534/steam-games-dataset-2021-2025-65k) from Kaggle.
2. Open the **Import** page for your {{{ .starter }}} instance.

1. In the [TiDB Cloud console](https://tidbcloud.com/), navigate to the [**My TiDB**](https://tidbcloud.com/tidbs) page, and then click the name of your target {{{ .starter }}} instance to go to its overview page.
2. On the overview page, click **Data** > **Import** in the left navigation pane.

3. Import the downloaded CSV file into your {{{ .starter }}} instance.

1. On the **Import** page, click **Upload a local file**, and then select and upload the downloaded CSV file.
2. In the **Destination** section, enter `steam` in the **Database** field and `games` in the **Table** field.
3. Click **Define Table**, change the data type of the column `categories` to `TEXT`, and then change the data type of the column `developer` to `TEXT`.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason: This step is to avoid the following import errors.

CastValueError: when encoding 12712-th data row in file a_steam_data_2021_2025.csv: Value conversion failed for column 'developer'. Expected type: varchar(512) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, received value: ?. Reason: [types:1406]Data Too Long, field len 512, data len 584. Please correct the source data and retry.

CastValueError: when encoding 55532-th data row in file a_steam_data_2021_2025.csv: Value conversion failed for column 'categories'. Expected type: varchar(512) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, received value: ?. Reason: [types:1406]Data Too Long, field len 512, data len 522. Please correct the source data and retry.

4. Click **Start Import**.

You can view the import progress on the **Import Task Detail** page.

## Steps

### Step 1. Replicate the sample data to the columnar storage engine

After a {{{ .starter }}} instance is created, TiKV does not replicate data to TiFlash by default. You need to execute DDL statements in a MySQL client of TiDB to specify the tables to be replicated. After that, TiDB will create the specified table replicas in TiFlash accordingly.
After you create a {{{ .starter }}} instance, TiDB does not replicate data from TiKV to TiFlash by default. To replicate a target table to TiFlash, use a MySQL client to connect to your {{{ .starter }}} instance and execute a DDL statement. TiDB then creates the specified table replicas in TiFlash.

For example, to replicate the `games` table (in the **Steam Game Stats** sample dataset) to TiFlash, execute the following statements:
For example, to replicate the `games` table (from the **Steam Games Dataset 2021-2025**) to TiFlash, execute the following statements:

```sql
USE game;
USE steam;
```

```sql
Expand All @@ -33,16 +48,16 @@ ALTER TABLE games SET TIFLASH REPLICA 2;
To check the replication progress, execute the following statement:

```sql
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ID, REPLICA_COUNT, LOCATION_LABELS, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'game' and TABLE_NAME = 'games';
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ID, REPLICA_COUNT, LOCATION_LABELS, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'steam' AND TABLE_NAME = 'games';
```

```sql
+--------------+------------+----------+---------------+-----------------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ID | REPLICA_COUNT | LOCATION_LABELS | AVAILABLE | PROGRESS |
+--------------+------------+----------+---------------+-----------------+-----------+----------+
| game | games | 88 | 2 | | 1 | 1 |
| steam | games | 227 | 2 | | 1 | 1 |
+--------------+------------+----------+---------------+-----------------+-----------+----------+
1 row in set (0.20 sec)
1 row in set (0.24 sec)
```

In the result of the preceding statement:
Expand All @@ -52,18 +67,20 @@ In the result of the preceding statement:

### Step 2. Query data using HTAP

When the process of replication is completed, you can start to run some queries.
After the replication completes, you can run queries.

For example, you can check the number of games released every year, as well as the average price and average playtime:
For example, you can check the number of games released every year, as well as the average price and average number of recommendations:

```sql
SELECT
YEAR(`release_date`) AS `release_year`,
`release_year`,
COUNT(*) AS `games_released`,
AVG(`price`) AS `average_price`,
AVG(`average_playtime_forever`) AS `average_playtime`
AVG(`recommendations`) AS `average_recommendations`
FROM
`games`
WHERE
`release_year` IS NOT NULL
GROUP BY
`release_year`
ORDER BY
Expand All @@ -78,16 +95,30 @@ In this step, you can compare the execution statistics between TiKV (row-based s

```sql
EXPLAIN ANALYZE SELECT /*+ READ_FROM_STORAGE(TIKV[games]) */
YEAR(`release_date`) AS `release_year`,
`release_year`,
`genres`,
`developer`,
COUNT(*) AS `games_released`,
SUM(`recommendations`) AS `total_recommendations`,
AVG(`recommendations`) AS `average_recommendations`,
AVG(`price`) AS `average_price`,
AVG(`average_playtime_forever`) AS `average_playtime`
MAX(`recommendations`) AS `max_recommendations`
FROM
`games`
WHERE
`release_year` IS NOT NULL
AND `genres` IS NOT NULL
AND `developer` IS NOT NULL
GROUP BY
`release_year`
`release_year`,
`genres`,
`developer`
HAVING
COUNT(*) >= 2
ORDER BY
`release_year` DESC;
`total_recommendations` DESC,
`games_released` DESC
LIMIT 20;
```

For tables with TiFlash replicas, the TiDB optimizer automatically determines whether to use either TiKV or TiFlash replicas based on the cost estimation. In the preceding `EXPLAIN ANALYZE` statement, the `/*+ READ_FROM_STORAGE(TIKV[games]) */` hint is used to force the optimizer to choose TiKV so you can check the execution statistics of TiKV.
Comment thread
qiancai marked this conversation as resolved.
Expand All @@ -99,55 +130,73 @@ In this step, you can compare the execution statistics between TiKV (row-based s
In the output, you can get the execution time from the `execution info` column.

```sql
id | estRows | actRows | task | access object | execution info | operator info | memory | disk
---------------------------+----------+---------+-----------+---------------+--------------------------------------------+-----------------------------------------------+---------+---------
Sort_5 | 4019.00 | 28 | root | | time:672.7ms, loops:2, RU:1159.679690 | Column#36:desc | 18.0 KB | 0 Bytes
└─Projection_7 | 4019.00 | 28 | root | | time:672.7ms, loops:6, Concurrency:5 | year(game.games.release_date)->Column#36, ... | 35.5 KB | N/A
└─HashAgg_15 | 4019.00 | 28 | root | | time:672.6ms, loops:6, partial_worker:... | group by:Column#38, funcs:count(Column#39)... | 56.7 KB | N/A
└─TableReader_16 | 4019.00 | 28 | root | | time:672.4ms, loops:2, cop_task: {num:... | data:HashAgg_9 | 3.60 KB | N/A
└─HashAgg_9 | 4019.00 | 28 | cop[tikv] | | tikv_task:{proc max:300ms, min:0s, avg... | group by:year(game.games.release_date), ... | N/A | N/A
└─TableFullScan_14 | 68223.00 | 68223 | cop[tikv] | table:games | tikv_task:{proc max:290ms, min:0s, avg... | keep order:false | N/A | N/A
(6 rows)
id | estRows | actRows | task | access object | execution info | operator info | memory | disk
----------------------------------+----------+---------+-----------+---------------+---------------------------------------------+--------------------------------------------+----------+---------
Projection_10 | 20.00 | 20 | root | | time:234.4ms, loops:2, RU:241.66, ... | steam.games.release_year, ... | 6.03 KB | N/A
└─TopN_13 | 20.00 | 20 | root | | time:234.4ms, loops:2 | Column#13:desc, Column#12:desc, ... | 12.6 KB | 0 Bytes
└─Selection_18 | 36774.40 | 2458 | root | | time:233.9ms, loops:5 | ge(Column#12, ?) | 187.0 KB | N/A
└─HashAgg_22 | 45968.00 | 59883 | root | | time:228.4ms, loops:62, partial_worker:... | group by:Column#39, Column#40, ... | 31.7 MB | 0 Bytes
└─Projection_38 | 65521.00 | 65521 | root | | time:142.9ms, loops:66, Concurrency:5 | cast(steam.games.recommendations, ... | 1.16 MB | N/A
└─TableReader_32 | 65521.00 | 65521 | root | | time:49.5ms, loops:66, cop_task:{num:9... | data:Selection_31 | 3.26 MB | N/A
└─Selection_31 | 65521.00 | 65521 | cop[tikv] | | tikv_task:{proc max:20ms, min:0s, ... | not(isnull(steam.games.developer)), ... | N/A | N/A
└─TableFullScan_30 | 65521.00 | 65521 | cop[tikv] | table:games | tikv_task:{proc max:10ms, min:0s, ... | keep order:false | N/A | N/A
(8 rows)
```

- To get the execution statistics of this query using TiFlash, execute the following statement:
- To get the execution statistics of this query using TiFlash, execute the same statement without the `/*+ READ_FROM_STORAGE(TIKV[games]) */` hint:

```sql
EXPLAIN ANALYZE SELECT
YEAR(`release_date`) AS `release_year`,
`release_year`,
`genres`,
`developer`,
COUNT(*) AS `games_released`,
SUM(`recommendations`) AS `total_recommendations`,
AVG(`recommendations`) AS `average_recommendations`,
AVG(`price`) AS `average_price`,
AVG(`average_playtime_forever`) AS `average_playtime`
MAX(`recommendations`) AS `max_recommendations`
FROM
`games`
WHERE
`release_year` IS NOT NULL
AND `genres` IS NOT NULL
AND `developer` IS NOT NULL
GROUP BY
`release_year`
`release_year`,
`genres`,
`developer`
HAVING
COUNT(*) >= 2
ORDER BY
`release_year` DESC;
`total_recommendations` DESC,
`games_released` DESC
LIMIT 20;
```

In the output, you can get the execution time from the `execution info` column.

```sql
id | estRows | actRows | task | access object | execution info | operator info | memory | disk
-------------------------------------+----------+---------+--------------+---------------+-------------------------------------------------------+--------------------------------------------+---------+---------
Sort_5 | 4019.00 | 28 | root | | time:222.2ms, loops:2, RU:25.609675 | Column#36:desc | 3.77 KB | 0 Bytes
└─TableReader_39 | 4019.00 | 28 | root | | time:222.1ms, loops:2, cop_task: {num: 2, max: 0s,... | MppVersion: 1, data:ExchangeSender_38 | 4.64 KB | N/A
└─ExchangeSender_38 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:214.8ms, loops:1, threads:1} | ExchangeType: PassThrough | N/A | N/A
└─Projection_8 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:214.8ms, loops:1, threads:1} | year(game.games.release_date)->Column#3... | N/A | N/A
└─Projection_34 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:214.8ms, loops:1, threads:1} | Column#33, div(Column#34, cast(case(eq(... | N/A | N/A
└─HashAgg_35 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:214.8ms, loops:1, threads:1} | group by:Column#63, funcs:sum(Column#64... | N/A | N/A
└─ExchangeReceiver_37 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:214.8ms, loops:1, threads:8} | | N/A | N/A
└─ExchangeSender_36 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:210.6ms, loops:1, threads:1} | ExchangeType: HashPartition, Compressio... | N/A | N/A
└─HashAgg_33 | 4019.00 | 28 | mpp[tiflash] | | tiflash_task:{time:210.6ms, loops:1, threads:1} | group by:Column#75, funcs:count(1)->Col... | N/A | N/A
└─Projection_40 | 68223.00 | 68223 | mpp[tiflash] | | tiflash_task:{time:210.6ms, loops:2, threads:8} | game.games.price, game.games.price, gam... | N/A | N/A
└─TableFullScan_23 | 68223.00 | 68223 | mpp[tiflash] | table:games | tiflash_task:{time:210.6ms, loops:2, threads:8}, ... | keep order:false | N/A | N/A
(11 rows)
id | estRows | actRows | task | access object | execution info | operator info | memory | disk
----------------------------------------+----------+---------+--------------+---------------+---------------------------------------------+--------------------------------------------+---------+---------
Projection_10 | 20.00 | 20 | root | | time:92.5ms, loops:2, RU:120.42, ... | steam.games.release_year, ... | 6.03 KB | N/A
└─TopN_14 | 20.00 | 20 | root | | time:92.4ms, loops:2 | Column#13:desc, Column#12:desc, ... | 4.32 KB | 0 Bytes
└─TableReader_68 | 20.00 | 20 | root | | time:92.4ms, loops:2, cop_task:{num:2... | MppVersion: 2, data:ExchangeSender_67 | 7.99 KB | N/A
└─ExchangeSender_67 | 20.00 | 20 | mpp[tiflash] | | tiflash_task:{time:91ms, loops:1, ... | ExchangeType: PassThrough | N/A | N/A
└─TopN_66 | 20.00 | 20 | mpp[tiflash] | | tiflash_task:{time:91ms, loops:1, ... | Column#13:desc, Column#12:desc, ... | N/A | N/A
└─Selection_65 | 36774.40 | 2458 | mpp[tiflash] | | tiflash_task:{time:91ms, loops:1, ... | ge(Column#12, ?) | N/A | N/A
└─Projection_58 | 45968.00 | 59883 | mpp[tiflash] | | tiflash_task:{time:91ms, loops:1, ... | Column#12, Column#13, div(Column#14, ... | N/A | N/A
└─HashAgg_56 | 45968.00 | 59883 | mpp[tiflash] | | tiflash_task:{time:71ms, loops:1, ... | group by:Column#79, Column#80, ... | N/A | N/A
└─Projection_71 | 65521.00 | 65521 | mpp[tiflash] | | tiflash_task:{time:31ms, loops:7, ... | cast(steam.games.recommendations, ... | N/A | N/A
└─ExchangeReceiver_41 | 65521.00 | 65521 | mpp[tiflash] | | tiflash_task:{time:31ms, loops:7, ... | | N/A | N/A
└─ExchangeSender_40 | 65521.00 | 65521 | mpp[tiflash] | | tiflash_task:{time:31.2ms, loops:7, ... | ExchangeType: HashPartition, ... | N/A | N/A
└─Selection_39 | 65521.00 | 65521 | mpp[tiflash] | | tiflash_task:{time:21.2ms, loops:7, ... | not(isnull(steam.games.developer)), ... | N/A | N/A
└─TableFullScan_38| 65521.00 | 65521 | mpp[tiflash] | table:games | tiflash_task:{time:21.2ms, loops:7, ... | pushed down filter:empty, keep order:false | N/A | N/A
(13 rows)
```

> **Note:**
>
> Because the size of sample data is small and the query in this document is very simple, if you have already forced the optimizer to choose TiKV for this query and run the same query again, TiKV will reuse its cache, so the query might be much faster. If the data is updated frequently, the cache will be missed.
> Because the sample dataset is small and the query in this document is relatively simple, if you force the optimizer to choose TiKV for this query and then run the same query again, TiKV reuses its cache, which makes the query much faster. If the data is updated frequently, the cache is invalidated.

## Learn more

Expand Down
Loading
Loading