diff --git a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md index 43752c3cdad9c..60d60b2f77ef2 100644 --- a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md +++ b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md @@ -57,6 +57,12 @@ FROM ${table_name} WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY), 'Doris'); ``` +In VARIANT queries, JSON Path can be expressed in the following forms; any other form is undefined: + +1. `v['properties']['title']` +2. `v['properties.title']` +3. `v.properties.title` + ## Primitive types VARIANT infers subcolumn types automatically. Supported types include: @@ -168,7 +174,7 @@ Schema only guides the persisted storage type. During query execution, the effec SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']); ``` -Wildcard matching and order: +### Wildcard matching and order ```sql CREATE TABLE test_var_schema ( @@ -194,6 +200,76 @@ v1 VARIANT< Matched subpaths are materialized as columns by default. If too many paths match and generate excessive columns, consider enabling `variant_enable_typed_paths_to_sparse` (see “Configuration”). +### Wildcard syntax + +The Schema Template pattern-matching algorithm supports **only a restricted subset of glob syntax**. + +#### Supported glob syntax + +In SQL strings, we write `\\` to express a literal `\` in glob patterns. + +All examples below are matching examples. 
+ +| Syntax | Meaning | Example (pattern → JSON Path) | SQL literal | +|------|---------|------------------------------|-------------| +| `*` | Any-length string | `num_*` → `num_latency` | `'num_*'` | +| `?` | Any single character | `a?b` → `acb` | `'a?b'` | +| `[abc]` | Character class | `a[bc]d` → `abd` | `'a[bc]d'` | +| `[a-z]` | Character range | `int_[0-9]` → `int_3` | `'int_[0-9]'` | +| `[!abc]` | Negated character class | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` | +| `[^abc]` | Negated character class | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` | +| `\` | Escape the next character | `a\*b` → `a*b`
`a\?b` → `a?b`
`a\[b` → `a[b`
`\` → `\` | `'a\\*b'`
`'a\\?b'`
`'a\\[b'`
`'\\'` | + +#### Unsupported syntax + +The following are treated as ordinary characters or cause matching to fail; avoid them whenever possible: + +| Syntax | Semantics in some glob implementations | Current behavior | +|------|----------------------------------------|------------------| +| `{a,b}` | Brace expansion | **Not supported** (treated as literal `{` `}`) | +| `**` | Recursive directory match | **No special semantics** (equivalent to `*` `*`) | + +- Empty character patterns like `[]`, `[!]`, `[^]`, and `a[]b` are invalid and match no JSON Path. +- Unterminated character patterns like `int_[0-9` are invalid and match no JSON Path. + +#### Typical examples + +1. Normal match +- Pattern: `num_*` + - √ `num_a` + - √ `num_1` + - × `number_a` + +- Pattern: `a\*b` + - SQL: `'a\\*b'` + - √ `a*b` + - × `axxb` + +- Pattern: `\*` + - SQL: `'\\*'` + - √ `*` + - × `a*` + +- Pattern: `\` + - SQL: `'\\'` + - √ `\` + - × `\\` + +- Pattern: `int_[0-9]` + - √ `int_1` + - × `int_a` + +2. Full match (not “contains” semantics) +- Pattern: `a*b` + - √ `ab` + - √ `axxxb` + - × `xxaxxxbxx` + +3. `.` and `/` are not special; they are ordinary characters +- Pattern: `int_*` + - √ `int_nested.level1` + - √ `int_nested/level1` + ## Type conflicts and promotion rules When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss: @@ -391,6 +467,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris'; | `VARCHAR` | ✔ | ✔ | | `JSON` | ✔ | ✔ | +### Schema Template based auto CAST + +When a VARIANT column defines a Schema Template and `enable_variant_schema_auto_cast` is set to true, the analyzer automatically inserts CASTs to the declared types for subpaths that match the Schema Template, so you do not need to write CASTs manually. + +- Applies to SELECT, WHERE, ORDER BY, GROUP BY, HAVING, JOIN keys, and aggregate arguments. 
+- To disable this behavior, set `enable_variant_schema_auto_cast` to false. + +Example: +```sql +CREATE TABLE t ( + id BIGINT, + data VARIANT<'num_*': BIGINT, 'str_*': STRING> +); + +-- 1) FILTER + ORDER +SELECT id +FROM t +WHERE data['num_a'] > 10 +ORDER BY data['num_a']; + +-- 2) GROUP + AGGREGATE + ALIAS +SELECT data['str_name'] AS username, SUM(data['num_a']) AS total +FROM t +GROUP BY username +HAVING SUM(data['num_a']) > 100; + +-- 3) JOIN ON (t1 and t2 are assumed to share the schema of t) +SELECT * +FROM t1 JOIN t2 +ON t1.data['num_id'] = t2.data['num_id']; +``` + +**Note**: Auto CAST cannot determine whether a path is a leaf; it simply casts every path that matches the Schema Template. + +In cases like the following, set `enable_variant_schema_auto_cast` to false and add CASTs manually to get correct results. + +```sql +-- Schema Template: treat all int_* as INT +CREATE TABLE t ( + id INT, + data VARIANT<'int_*': INT> +); + +INSERT INTO t VALUES +(1, '{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}'); + +-- Auto CAST enabled +SET enable_variant_schema_auto_cast = true; + +-- int_nested matches int_* and is CAST to INT even though it is an object, so the query returns NULL +SELECT + data['int_nested'] +FROM t; + +-- Auto CAST disabled +SET enable_variant_schema_auto_cast = false; + +-- The query returns the correct result +SELECT + data['int_nested'] +FROM t; +``` + ## Wide columns When ingested data contains many distinct JSON keys, VARIANT materialized subcolumns can grow rapidly; at scale this may cause metadata bloat, higher write/merge cost, and query slowdowns. To address “wide columns” (too many subcolumns), VARIANT provides two mechanisms: **Sparse columns** and **DOC encoding**. @@ -577,4 +716,3 @@ ClickBench (43 queries): 2. Why doesn’t my query/index work? - Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths. 
- diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md index a187bb63ae528..ce013bbf97c7f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md @@ -57,6 +57,12 @@ FROM ${table_name} WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY), 'Doris'); ``` +VARIANT 查询中, JSON Path 的表示有如下几种类型,除此之外的表示均为未定义行为: + +1. `v['properties']['title']` +2. `v['properties.title']` +3. `v.properties.title` + ## 基本类型 VARIANT 自动推断的子列基础类型包括: @@ -168,7 +174,7 @@ Schema 仅指导“存储层”的持久化类型,计算逻辑仍以实际数 SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']); ``` -通配符与匹配顺序: +### 通配符与匹配顺序 ```sql CREATE TABLE test_var_schema ( @@ -194,6 +200,76 @@ v1 VARIANT< 匹配成功的子路径默认会展开为独立列。若匹配子列过多导致列数暴增,建议开启 `variant_enable_typed_paths_to_sparse`(见“配置”)。 +### 通配符语法 + +Schema Template 模式匹配算法**只支持受限 glob 语法子集**。 + +#### 支持的 glob 语法 + +SQL 字符串需要写成 `\\` 才能表达 glob 中的 `\`。 + +以下示例均为可匹配示例。 + +| 语法 | 含义 | 示例(模式 → JSON Path) | SQL 字面量写法 | +|------|------|-------------------|----------------| +| `*` | 任意长度字符串 | `num_*` → `num_latency` | `'num_*'` | +| `?` | 任意单字符 | `a?b` → `acb` | `'a?b'` | +| `[abc]` | 字符类 | `a[bc]d` → `abd` | `'a[bc]d'` | +| `[a-z]` | 字符范围 | `int_[0-9]` → `int_3` | `'int_[0-9]'` | +| `[!abc]` | 取反字符类 | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` | +| `[^abc]` | 取反字符类 | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` | +| `\` | 转义下一个字符 | `a\*b` → `a*b`
`a\?b` → `a?b`
`a\[b` → `a[b`
`\` → `\` | `'a\\*b'`
`'a\\?b'`
`'a\\[b'`
`'\\'` | + +#### 不支持的语法 + +以下语法会被当成普通字符处理,或导致匹配失败,请尽可能避免: + +| 语法 | 在某些 glob 实现中的语义 | 当前行为 | +|------|--------------------------|----------| +| `{a,b}` | 花括号展开 | **不支持**(当作字面量 `{` `}`) | +| `**` | 递归目录匹配 | **不支持特殊语义**(等价于 `*` `*` 连用) | + +- 类似于 `[]`、`[!]`、`[^]`、`a[]b` 的空字符模式无效,不匹配任何 JSON Path +- 类似于 `int_[0-9` 的未闭合字符模式无效,不匹配任何 JSON Path + +#### 典型示例 + +1. 正常匹配 +- 模式:`num_*` + - √ `num_a` + - √ `num_1` + - × `number_a` + +- 模式:`a\*b` + - SQL:`'a\\*b'` + - √ `a*b` + - × `axxb` + +- 模式:`\*` + - SQL:`'\\*'` + - √ `*` + - × `a*` + +- 模式:`\` + - SQL:`'\\'` + - √ `\` + - × `\\` + +- 模式:`int_[0-9]` + - √ `int_1` + - × `int_a` + +2. 全量匹配(不是“包含”的语义) +- 模式:`a*b` + - √ `ab` + - √ `axxxb` + - × `xxaxxxbxx` + +3. `.` 与 `/` 不特殊,为普通字符 +- 模式:`int_*` + - √ `int_nested.level1` + - √ `int_nested/level1` + ## 类型冲突与提升规则 当同一路径出现不兼容类型(如同一字段既出现整数又出现字符串)时,将提升为 JSONB 类型以避免信息丢失: @@ -391,6 +467,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris'; | `VARCHAR` | ✔ | ✔ | | `JSON` | ✔ | ✔ | +### 基于 Schema Template 自动 CAST + +当 VARIANT 列定义了 Schema Template,且 `enable_variant_schema_auto_cast` 设为 true 时,语义分析阶段会为命中 Schema Template 的子列自动插入对应类型的 CAST,无需自行手写。 + +- 覆盖 SELECT、WHERE、ORDER BY、GROUP BY、HAVING、JOIN KEY 以及聚合参数等场景。 +- 若需关闭此行为,将 `enable_variant_schema_auto_cast` 设为 false。 + +示例: +```sql +CREATE TABLE t ( + id BIGINT, + data VARIANT<'num_*': BIGINT, 'str_*': STRING> +); + +-- 1) 过滤 + 排序 +SELECT id +FROM t +WHERE data['num_a'] > 10 +ORDER BY data['num_a']; + +-- 2) 分组 + 聚合 + Alias +SELECT data['str_name'] AS username, SUM(data['num_a']) AS total +FROM t +GROUP BY username +HAVING SUM(data['num_a']) > 100; + +-- 3) JOIN ON(假设 t1、t2 与 t 的表结构相同) +SELECT * +FROM t1 JOIN t2 +ON t1.data['num_id'] = t2.data['num_id']; +``` + +**注意**:自动 CAST 功能无法感知给定的 Path 是否为叶子,它只是对所有符合 Schema Template 规则的 Path 都加对应的 CAST。 + +因此,对于下述这种情况需要额外注意,为保证结果正确,请将 `enable_variant_schema_auto_cast` 设为 false,并手动添加 CAST。 + +```sql +-- Schema Template:所有 int_* 视为 INT +CREATE TABLE t ( + id INT, + data VARIANT<'int_*': INT> +); + +INSERT INTO t VALUES +(1, 
'{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}'); + +-- 自动 CAST 开启 +SET enable_variant_schema_auto_cast = true; + +-- int_nested 匹配 int_*,错误自动 CAST 为 INT,查询结果返回 NULL +SELECT + data['int_nested'] +FROM t; + +-- 自动 CAST 关闭 +SET enable_variant_schema_auto_cast = false; + +-- 查询结果正确返回 +SELECT + data['int_nested'] +FROM t; +``` + ## 宽列 当导入数据包含大量不同的 JSON key 时,VARIANT 的子列会迅速增多;当规模达到一定程度,可能出现元数据膨胀、写入/合并开销增大、查询性能下降等问题。为应对“宽列”(子列过多),VARIANT 提供两种机制:**稀疏列** 与 **DOC 编码**。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md index eb65b641eb983..c2becf32b8388 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md @@ -57,6 +57,12 @@ FROM ${table_name} WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY), 'Doris'); ``` +VARIANT 查询中, JSON Path 的表示有如下几种类型,除此之外的表示均为未定义行为: + +1. `v['properties']['title']` +2. `v['properties.title']` +3. 
`v.properties.title` + ## 基本类型 VARIANT 自动推断的子列基础类型包括: @@ -165,7 +171,7 @@ Schema 仅指导“存储层”的持久化类型,计算逻辑仍以实际数 SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']); ``` -通配符与匹配顺序: +### 通配符与匹配顺序 ```sql CREATE TABLE test_var_schema ( @@ -191,6 +197,76 @@ v1 VARIANT< 匹配成功的子路径默认会展开为独立列。若匹配子列过多导致列数暴增,建议开启 `variant_enable_typed_paths_to_sparse`(见“配置”)。 +### 通配符语法 + +Schema Template 模式匹配算法**只支持受限 glob 语法子集**。 + +#### 支持的 glob 语法 + +SQL 字符串需要写成 `\\` 才能表达 glob 中的 `\`。 + +以下示例均为可匹配示例。 + +| 语法 | 含义 | 示例(模式 → JSON Path) | SQL 字面量写法 | +|------|------|-------------------|----------------| +| `*` | 任意长度字符串 | `num_*` → `num_latency` | `'num_*'` | +| `?` | 任意单字符 | `a?b` → `acb` | `'a?b'` | +| `[abc]` | 字符类 | `a[bc]d` → `abd` | `'a[bc]d'` | +| `[a-z]` | 字符范围 | `int_[0-9]` → `int_3` | `'int_[0-9]'` | +| `[!abc]` | 取反字符类 | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` | +| `[^abc]` | 取反字符类 | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` | +| `\` | 转义下一个字符 | `a\*b` → `a*b`
`a\?b` → `a?b`
`a\[b` → `a[b`
`\` → `\` | `'a\\*b'`
`'a\\?b'`
`'a\\[b'`
`'\\'` | + +#### 不支持的语法 + +以下语法会被当成普通字符处理,或导致匹配失败,请尽可能避免: + +| 语法 | 在某些 glob 实现中的语义 | 当前行为 | +|------|--------------------------|----------| +| `{a,b}` | 花括号展开 | **不支持**(当作字面量 `{` `}`) | +| `**` | 递归目录匹配 | **不支持特殊语义**(等价于 `*` `*` 连用) | + +- 类似于 `[]`、`[!]`、`[^]`、`a[]b` 的空字符模式无效,不匹配任何 JSON Path +- 类似于 `int_[0-9` 的未闭合字符模式无效,不匹配任何 JSON Path + +#### 典型示例 + +1. 正常匹配 +- 模式:`num_*` + - √ `num_a` + - √ `num_1` + - × `number_a` + +- 模式:`a\*b` + - SQL:`'a\\*b'` + - √ `a*b` + - × `axxb` + +- 模式:`\*` + - SQL:`'\\*'` + - √ `*` + - × `a*` + +- 模式:`\` + - SQL:`'\\'` + - √ `\` + - × `\\` + +- 模式:`int_[0-9]` + - √ `int_1` + - × `int_a` + +2. 全量匹配(不是“包含”的语义) +- 模式:`a*b` + - √ `ab` + - √ `axxxb` + - × `xxaxxxbxx` + +3. `.` 与 `/` 不特殊,为普通字符 +- 模式:`int_*` + - √ `int_nested.level1` + - √ `int_nested/level1` + ## 类型冲突与提升规则 当同一路径出现不兼容类型(如同一字段既出现整数又出现字符串)时,将提升为 JSONB 类型以避免信息丢失: @@ -388,6 +464,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris'; | `VARCHAR` | ✔ | ✔ | | `JSON` | ✔ | ✔ | +### 基于 Schema Template 自动 CAST + +当 VARIANT 列定义了 Schema Template,且 `enable_variant_schema_auto_cast` 设为 true 时,语义分析阶段会为命中 Schema Template 的子列自动插入对应类型的 CAST,无需自行手写。 + +- 覆盖 SELECT、WHERE、ORDER BY、GROUP BY、HAVING、JOIN KEY 以及聚合参数等场景。 +- 若需关闭此行为,将 `enable_variant_schema_auto_cast` 设为 false。 + +示例: +```sql +CREATE TABLE t ( + id BIGINT, + data VARIANT<'num_*': BIGINT, 'str_*': STRING> +); + +-- 1) 过滤 + 排序 +SELECT id +FROM t +WHERE data['num_a'] > 10 +ORDER BY data['num_a']; + +-- 2) 分组 + 聚合 + Alias +SELECT data['str_name'] AS username, SUM(data['num_a']) AS total +FROM t +GROUP BY username +HAVING SUM(data['num_a']) > 100; + +-- 3) JOIN ON(假设 t1、t2 与 t 的表结构相同) +SELECT * +FROM t1 JOIN t2 +ON t1.data['num_id'] = t2.data['num_id']; +``` + +**注意**:自动 CAST 功能无法感知给定的 Path 是否为叶子,它只是对所有符合 Schema Template 规则的 Path 都加对应的 CAST。 + +因此,对于下述这种情况需要额外注意,为保证结果正确,请将 `enable_variant_schema_auto_cast` 设为 false,并手动添加 CAST。 + +```sql +-- Schema Template:所有 int_* 视为 INT +CREATE TABLE t ( + id INT, + data VARIANT<'int_*': INT> +); + +INSERT INTO t VALUES +(1, 
'{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}'); + +-- 自动 CAST 开启 +SET enable_variant_schema_auto_cast = true; + +-- int_nested 匹配 int_*,错误自动 CAST 为 INT,查询结果返回 NULL +SELECT + data['int_nested'] +FROM t; + +-- 自动 CAST 关闭 +SET enable_variant_schema_auto_cast = false; + +-- 查询结果正确返回 +SELECT + data['int_nested'] +FROM t; +``` + ## 限制 - `variant_max_subcolumns_count`:默认 0(不限制 Path 物化列数)。建议在生产设置为 2048(Tablet 级别)以控制列数。超过阈值后,低频/稀疏路径会被收敛到共享数据结构,从该结构查询可能带来性能下降(详见“配置”)。 @@ -501,4 +640,4 @@ DESCRIBE ${table_name} PARTITION ($partition_name); 1. VARIANT 中的 `null` 与 SQL `NULL` 有区别吗? - 没有区别,两者等价。 2. 为什么我的查询/索引没有生效? - - 请检查是否对路径做了正确的 CAST、是否因为类型冲突被提升为 JSONB、或是否误以为给 VARIANT“整体”建的索引可用于子列。 \ No newline at end of file + - 请检查是否对路径做了正确的 CAST、是否因为类型冲突被提升为 JSONB、或是否误以为给 VARIANT“整体”建的索引可用于子列。 diff --git a/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md b/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md index b5116669ea318..b05698659ae72 100644 --- a/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md +++ b/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md @@ -57,6 +57,12 @@ FROM ${table_name} WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY), 'Doris'); ``` +In VARIANT queries, JSON Path can be expressed in the following forms; any other form is undefined: + +1. `v['properties']['title']` +2. `v['properties.title']` +3. `v.properties.title` + ## Primitive types VARIANT infers subcolumn types automatically. Supported types include: @@ -165,7 +171,7 @@ Schema only guides the persisted storage type. 
During query execution, the effec SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']); ``` -Wildcard matching and order: +### Wildcard matching and order ```sql CREATE TABLE test_var_schema ( @@ -191,6 +197,76 @@ v1 VARIANT< Matched subpaths are materialized as columns by default. If too many paths match and generate excessive columns, consider enabling `variant_enable_typed_paths_to_sparse` (see “Configuration”). +### Wildcard syntax + +The Schema Template pattern-matching algorithm supports **only a restricted subset of glob syntax**. + +#### Supported glob syntax + +In SQL strings, we should write `\\` to express a literal `\` in glob patterns. + +All examples below are matching examples. + +| Syntax | Meaning | Example (pattern → JSON Path) | SQL literal | +|------|---------|------------------------------|-------------| +| `*` | Any-length string | `num_*` → `num_latency` | `'num_*'` | +| `?` | Any single character | `a?b` → `acb` | `'a?b'` | +| `[abc]` | Character class | `a[bc]d` → `abd` | `'a[bc]d'` | +| `[a-z]` | Character range | `int_[0-9]` → `int_3` | `'int_[0-9]'` | +| `[!abc]` | Negated character class | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` | +| `[^abc]` | Negated character class | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` | +| `\` | Escape the next character | `a\*b` → `a*b`
`a\?b` → `a?b`
`a\[b` → `a[b`
`\` → `\` | `'a\\*b'`
`'a\\?b'`
`'a\\[b'`
`'\\'` | + +#### Unsupported syntax + +The following are treated as ordinary characters or cause matching to fail; avoid them whenever possible: + +| Syntax | Semantics in some glob implementations | Current behavior | +|------|----------------------------------------|------------------| +| `{a,b}` | Brace expansion | **Not supported** (treated as literal `{` `}`) | +| `**` | Recursive directory match | **No special semantics** (equivalent to `*` `*`) | + +- Empty character patterns like `[]`, `[!]`, `[^]`, and `a[]b` are invalid and match no JSON Path. +- Unterminated character patterns like `int_[0-9` are invalid and match no JSON Path. + +#### Typical examples + +1. Normal match +- Pattern: `num_*` + - √ `num_a` + - √ `num_1` + - × `number_a` + +- Pattern: `a\*b` + - SQL: `'a\\*b'` + - √ `a*b` + - × `axxb` + +- Pattern: `\*` + - SQL: `'\\*'` + - √ `*` + - × `a*` + +- Pattern: `\` + - SQL: `'\\'` + - √ `\` + - × `\\` + +- Pattern: `int_[0-9]` + - √ `int_1` + - × `int_a` + +2. Full match (not “contains” semantics) +- Pattern: `a*b` + - √ `ab` + - √ `axxxb` + - × `xxaxxxbxx` + +3. `.` and `/` are not special; they are ordinary characters +- Pattern: `int_*` + - √ `int_nested.level1` + - √ `int_nested/level1` + ## Type conflicts and promotion rules When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss: @@ -388,6 +464,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris'; | `VARCHAR` | ✔ | ✔ | | `JSON` | ✔ | ✔ | +### Schema Template based auto CAST + +When a VARIANT column defines a Schema Template and `enable_variant_schema_auto_cast` is set to true, the analyzer automatically inserts CASTs to the declared types for subpaths that match the Schema Template, so you do not need to write CASTs manually. + +- Applies to SELECT, WHERE, ORDER BY, GROUP BY, HAVING, JOIN keys, and aggregate arguments. 
+- To disable this behavior, set `enable_variant_schema_auto_cast` to false. + +Example: +```sql +CREATE TABLE t ( + id BIGINT, + data VARIANT<'num_*': BIGINT, 'str_*': STRING> +); + +-- 1) FILTER + ORDER +SELECT id +FROM t +WHERE data['num_a'] > 10 +ORDER BY data['num_a']; + +-- 2) GROUP + AGGREGATE + ALIAS +SELECT data['str_name'] AS username, SUM(data['num_a']) AS total +FROM t +GROUP BY username +HAVING SUM(data['num_a']) > 100; + +-- 3) JOIN ON (t1 and t2 are assumed to share the schema of t) +SELECT * +FROM t1 JOIN t2 +ON t1.data['num_id'] = t2.data['num_id']; +``` + +**Note**: Auto CAST cannot determine whether a path is a leaf; it simply casts every path that matches the Schema Template. + +In cases like the following, set `enable_variant_schema_auto_cast` to false and add CASTs manually to get correct results. + +```sql +-- Schema Template: treat all int_* as INT +CREATE TABLE t ( + id INT, + data VARIANT<'int_*': INT> +); + +INSERT INTO t VALUES +(1, '{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}'); + +-- Auto CAST enabled +SET enable_variant_schema_auto_cast = true; + +-- int_nested matches int_* and is CAST to INT even though it is an object, so the query returns NULL +SELECT + data['int_nested'] +FROM t; + +-- Auto CAST disabled +SET enable_variant_schema_auto_cast = false; + +-- The query returns the correct result +SELECT + data['int_nested'] +FROM t; +``` + ## Limitations - `variant_max_subcolumns_count`: default 0 (no limit). In production, set to 2048 (tablet level) to control the number of materialized paths. Above the threshold, low-frequency/sparse paths are moved to a shared data structure; reading from it may be slower (see “Configuration”). @@ -503,4 +642,3 @@ ClickBench (43 queries): 2. Why doesn’t my query/index work? - Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths. -
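
Reviewer note: the wildcard rules documented in this change (restricted glob subset, full match rather than "contains", `\` as escape, invalid empty or unterminated character classes) can be checked outside Doris with a small sketch like the one below. It is not the engine's implementation. The function name `template_glob_match` is hypothetical, and the handling of special characters inside a character class (treated as literals here) is an assumption the documentation does not spell out.

```python
import re


def template_glob_match(pattern: str, path: str) -> bool:
    """Approximate the documented Schema Template glob rules (hypothetical helper)."""
    parts = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\":
            if i + 1 < n:
                # Backslash escapes the next character, which becomes a literal.
                parts.append(re.escape(pattern[i + 1]))
                i += 2
            else:
                # A trailing lone backslash matches a literal backslash.
                parts.append(re.escape("\\"))
                i += 1
        elif c == "*":
            parts.append(".*")  # any-length string, including empty
            i += 1
        elif c == "?":
            parts.append(".")  # exactly one character
            i += 1
        elif c == "[":
            j = i + 1
            negate = j < n and pattern[j] in "!^"
            if negate:
                j += 1
            end = pattern.find("]", j)
            if end == -1 or end == j:
                # Unterminated or empty class: invalid, matches no path.
                return False
            # Assumption: backslash, '^' and '[' inside a class are literals.
            body = pattern[j:end]
            body = body.replace("\\", "\\\\").replace("^", "\\^").replace("[", "\\[")
            parts.append("[" + ("^" if negate else "") + body + "]")
            i = end + 1
        else:
            # Everything else, including '.', '/', '{' and '}', is literal.
            parts.append(re.escape(c))
            i += 1
    try:
        regex = re.compile("".join(parts), re.DOTALL)
    except re.error:
        return False  # e.g. a reversed range such as [z-a]
    # fullmatch enforces whole-path matching, not substring matching.
    return regex.fullmatch(path) is not None


print(template_glob_match("num_*", "num_a"))        # True
print(template_glob_match("num_*", "number_a"))     # False
print(template_glob_match(r"a\*b", "axxb"))         # False
print(template_glob_match("a*b", "xxaxxxbxx"))      # False
print(template_glob_match("int_[0-9", "int_1"))     # False
```

Translating to a regular expression and using `re.fullmatch` keeps the sketch short; a hand-written matcher would behave the same for the documented cases.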