Original file line number Diff line number Diff line change
@@ -57,6 +57,12 @@ FROM ${table_name}
WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');
```

In VARIANT queries, a JSON Path can be written in any of the following forms; any other form results in undefined behavior:

1. `v['properties']['title']`
2. `v['properties.title']`
3. `v.properties.title`
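
As a rough illustration of the three notations, the Python sketch below normalizes each one into a list of path segments. It assumes, as the shared example suggests, that all three forms denote the same nested path; the function name `path_segments` and the regular expression are hypothetical and are not part of Doris. Any other notation is rejected, mirroring the "undefined" rule above.

```python
import re

# Hypothetical helper, not part of Doris: split a VARIANT path expression
# written in one of the three documented notations into its segments.
_TOKEN = re.compile(r"\['([^']*)'\]|\.([A-Za-z_][A-Za-z0-9_]*)")


def path_segments(expr: str, column: str = "v") -> list[str]:
    if not expr.startswith(column):
        raise ValueError(f"expression does not start with column {column!r}")
    rest = expr[len(column):]
    segments: list[str] = []
    pos = 0
    while pos < len(rest):
        m = _TOKEN.match(rest, pos)
        if m is None:
            # Anything outside the three documented forms is rejected here.
            raise ValueError(f"unsupported path form: {expr!r}")
        if m.group(1) is not None:
            # Bracket form; assumption: a dot inside the key separates segments.
            segments.extend(m.group(1).split("."))
        else:
            # Dot form: one identifier per segment.
            segments.append(m.group(2))
        pos = m.end()
    return segments


forms = ["v['properties']['title']", "v['properties.title']", "v.properties.title"]
results = [path_segments(f) for f in forms]
```

Under these assumptions, every entry of `results` is the same two-segment path.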

## Primitive types

VARIANT infers subcolumn types automatically. Supported types include:
@@ -168,7 +174,7 @@ Schema only guides the persisted storage type. During query execution, the effec
SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);
```

Wildcard matching and order:
### Wildcard matching and order

```sql
CREATE TABLE test_var_schema (
@@ -194,6 +200,76 @@ v1 VARIANT<

Matched subpaths are materialized as columns by default. If too many paths match and generate excessive columns, consider enabling `variant_enable_typed_paths_to_sparse` (see “Configuration”).

### Wildcard syntax

The Schema Template pattern-matching algorithm supports **only a restricted subset of glob syntax**.

#### Supported glob syntax

In a SQL string literal, write `\\` to produce a single literal `\` in the glob pattern.

Every example in the table below is a successful match.

| Syntax | Meaning | Example (pattern → JSON Path) | SQL literal |
|------|---------|------------------------------|-------------|
| `*` | Any string of any length, including the empty string | `num_*` → `num_latency` | `'num_*'` |
| `?` | Any single character | `a?b` → `acb` | `'a?b'` |
| `[abc]` | Character class | `a[bc]d` → `abd` | `'a[bc]d'` |
| `[a-z]` | Character range | `int_[0-9]` → `int_3` | `'int_[0-9]'` |
| `[!abc]` | Negated character class | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` |
| `[^abc]` | Negated character class | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` |
| `\` | Escape the next character | `a\*b` → `a*b`<br/>`a\?b` → `a?b`<br/>`a\[b` → `a[b`<br/>`\` → `\` | `'a\\*b'`<br/>`'a\\?b'`<br/>`'a\\[b'`<br/>`'\\'` |

#### Unsupported syntax

The following syntax is either treated as ordinary characters or causes matching to fail; avoid it whenever possible:

| Syntax | Semantics in some glob implementations | Current behavior |
|------|----------------------------------------|------------------|
| `{a,b}` | Brace expansion | **Not supported** (treated as literal `{` `}`) |
| `**` | Recursive directory match | **No special semantics** (equivalent to two consecutive `*`) |

- Empty character classes such as `[]`, `[!]`, `[^]`, and `a[]b` are invalid and match no JSON Path.
- Unterminated character classes such as `int_[0-9` are invalid and match no JSON Path.
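
The rules above can be approximated with a small Python sketch that translates a pattern into a regular expression. This is only an illustration of the documented behavior, not the actual Doris implementation; the names `glob_to_regex` and `matches` are hypothetical, and the handling of `\` inside a character class (treated here as a literal character) is an assumption the text does not cover.

```python
import re


def glob_to_regex(pattern: str):
    """Translate the restricted glob subset into a compiled regex.

    Returns None for invalid patterns (empty or unterminated character
    class), which then match nothing.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\":
            if i + 1 < n:
                # Escape: the next character is taken literally.
                out.append(re.escape(pattern[i + 1]))
                i += 2
            else:
                # A lone trailing backslash matches a literal backslash.
                out.append(re.escape("\\"))
                i += 1
        elif c == "*":
            out.append(".*")  # any string, including the empty string
            i += 1
        elif c == "?":
            out.append(".")  # exactly one character
            i += 1
        elif c == "[":
            j = i + 1
            negate = j < n and pattern[j] in "!^"
            if negate:
                j += 1
            end = pattern.find("]", j)
            if end == -1 or end == j:
                return None  # unterminated or empty class
            # Assumption: characters inside the class are literal; only
            # regex-special ones are escaped, and '-' keeps its range meaning.
            body = pattern[j:end]
            body = body.replace("\\", "\\\\").replace("^", "\\^").replace("[", "\\[")
            out.append("[" + ("^" if negate else "") + body + "]")
            i = end + 1
        else:
            # Everything else, including '{', '}', '.', and '/', is ordinary.
            out.append(re.escape(c))
            i += 1
    try:
        return re.compile("".join(out), re.DOTALL)
    except re.error:
        return None  # e.g. a reversed range such as [z-a]


def matches(pattern: str, path: str) -> bool:
    """Full match (not 'contains') of a JSON Path against a pattern."""
    rx = glob_to_regex(pattern)
    return rx is not None and rx.fullmatch(path) is not None
```

Note that the Python strings passed to `matches` hold the glob pattern itself, so the glob `a\*b` is written `"a\\*b"` in Python source, much like the SQL literal `'a\\*b'`.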

#### Typical examples

1. Normal match
   - Pattern: `num_*`
     - √ `num_a`
     - √ `num_1`
     - × `number_a`

   - Pattern: `a\*b`
     - SQL: `'a\\*b'`
     - √ `a*b`
     - × `axxb`

   - Pattern: `\*`
     - SQL: `'\\*'`
     - √ `*`
     - × `a*`

   - Pattern: `\`
     - SQL: `'\\'`
     - √ `\`
     - × `\\`

   - Pattern: `int_[0-9]`
     - √ `int_1`
     - × `int_a`

2. Full match (not “contains” semantics)
   - Pattern: `a*b`
     - √ `ab`
     - √ `axxxb`
     - × `xxaxxxbxx`

3. `.` and `/` are not special; they are ordinary characters
   - Pattern: `int_*`
     - √ `int_nested.level1`
     - √ `int_nested/level1`

## Type conflicts and promotion rules

When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss:
@@ -391,6 +467,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
| `VARCHAR` | ✔ | ✔ |
| `JSON` | ✔ | ✔ |

### Schema Template based auto CAST

When a VARIANT column defines a Schema Template and `enable_variant_schema_auto_cast` is set to true, the analyzer automatically inserts CASTs to the declared types for subpaths that match the Schema Template, so you do not need to write CASTs manually.

- Applies to SELECT, WHERE, ORDER BY, GROUP BY, HAVING, JOIN keys, and aggregate arguments.
- To disable this behavior, set `enable_variant_schema_auto_cast` to false.

Example:
```sql
CREATE TABLE t (
    id BIGINT,
    data VARIANT<'num_*': BIGINT, 'str_*': STRING>
);

-- 1) FILTER + ORDER
SELECT id
FROM t
WHERE data['num_a'] > 10
ORDER BY data['num_a'];

-- 2) GROUP + AGGREGATE + ALIAS
SELECT data['str_name'] AS username, SUM(data['num_a']) AS total
FROM t
GROUP BY username
HAVING data['num_a'] > 100;

-- 3) JOIN ON
SELECT *
FROM t1 JOIN t2
    ON t1.data['num_id'] = t2.data['num_id'];
```

**Note**: Auto CAST cannot determine whether a path is a leaf; it simply casts all paths that match the Schema Template.

In cases like the following, set `enable_variant_schema_auto_cast` to false and add CASTs manually to get correct results.

```sql
-- Schema Template: treat all int_* as INT
CREATE TABLE t (
    id INT,
    data VARIANT<'int_*': INT>
);

INSERT INTO t VALUES
    (1, '{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}');

-- Auto CAST enabled
SET enable_variant_schema_auto_cast = true;

-- int_nested matches int_*, is incorrectly CAST to INT, and the query returns NULL
SELECT
    data['int_nested']
FROM t;

-- Auto CAST disabled
SET enable_variant_schema_auto_cast = false;

-- The query returns the correct result
SELECT
    data['int_nested']
FROM t;
```
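
To make the leaf problem concrete, here is a small Python model of the behavior described above. It is a sketch under stated assumptions, not Doris code: the standard-library `fnmatch.fnmatchcase` stands in for the Schema Template matcher, nested paths are joined with `.`, a failed cast is modeled as `None` (NULL), and all function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Schema Template from the example above: every int_* path is declared INT.
TEMPLATE = [("int_*", "INT")]

ROW = {"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}


def declared_type(path):
    """Return the declared type of the first matching pattern, else None."""
    for pattern, type_name in TEMPLATE:
        if fnmatchcase(path, pattern):
            return type_name
    return None


def cast_to_int(value):
    # Assumption: casting a non-integer value (such as an object) yields NULL.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def lookup(doc, path):
    """Walk a dot-separated path through nested dictionaries."""
    current = doc
    for key in path.split("."):
        current = current[key]
    return current


def select(doc, path, auto_cast):
    value = lookup(doc, path)
    # The cast is applied to every matching path, leaf or not.
    if auto_cast and declared_type(path) == "INT":
        return cast_to_int(value)
    return value


with_cast = select(ROW, "int_nested", auto_cast=True)      # object cast to INT
without_cast = select(ROW, "int_nested", auto_cast=False)  # object returned as is
leaf = select(ROW, "int_1", auto_cast=True)                # genuine INT leaf
```

In this model the intermediate node `int_nested` is lost when auto CAST is on, while the leaf `int_1` is unaffected, which is why disabling the setting and casting leaves manually is the safer choice for templates whose patterns also match non-leaf paths.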

## Wide columns

When ingested data contains many distinct JSON keys, VARIANT materialized subcolumns can grow rapidly; at scale this may cause metadata bloat, higher write/merge cost, and query slowdowns. To address “wide columns” (too many subcolumns), VARIANT provides two mechanisms: **Sparse columns** and **DOC encoding**.
@@ -577,4 +716,3 @@ ClickBench (43 queries):
2. Why doesn’t my query/index work?
- Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths.


Original file line number Diff line number Diff line change
@@ -57,6 +57,12 @@ FROM ${table_name}
WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');
```

In VARIANT queries, a JSON Path can be written in any of the following forms; any other form results in undefined behavior:

1. `v['properties']['title']`
2. `v['properties.title']`
3. `v.properties.title`

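## Primitive types

The primitive subcolumn types that VARIANT infers automatically include:
@@ -168,7 +174,7 @@ Schema only guides the persisted type at the “storage layer”; computation logic still follows the actual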
SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);
```

Wildcards and matching order
### Wildcards and matching order

```sql
CREATE TABLE test_var_schema (
@@ -194,6 +200,76 @@ v1 VARIANT<

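Successfully matched subpaths are expanded into separate columns by default. If too many subcolumns match and the column count explodes, consider enabling `variant_enable_typed_paths_to_sparse` (see “Configuration”).

### Wildcard syntax

The Schema Template pattern-matching algorithm supports **only a restricted subset of glob syntax**.

#### Supported glob syntax

In a SQL string literal, write `\\` to produce a single literal `\` in the glob pattern.

Every example in the table below is a successful match.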

| Syntax | Meaning | Example (pattern → JSON Path) | SQL literal |
|------|------|-------------------|----------------|
| `*` | Any string of any length | `num_*` → `num_latency` | `'num_*'` |
| `?` | Any single character | `a?b` → `acb` | `'a?b'` |
| `[abc]` | Character class | `a[bc]d` → `abd` | `'a[bc]d'` |
| `[a-z]` | Character range | `int_[0-9]` → `int_3` | `'int_[0-9]'` |
| `[!abc]` | Negated character class | `int_[!0-9]` → `int_a` | `'int_[!0-9]'` |
| `[^abc]` | Negated character class | `int_[^0-9]` → `int_a` | `'int_[^0-9]'` |
| `\` | Escape the next character | `a\*b` → `a*b`<br/>`a\?b` → `a?b`<br/>`a\[b` → `a[b`<br/>`\` → `\` | `'a\\*b'`<br/>`'a\\?b'`<br/>`'a\\[b'`<br/>`'\\'` |

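#### Unsupported syntax

The following syntax is either treated as ordinary characters or causes matching to fail; avoid it whenever possible:

| Syntax | Semantics in some glob implementations | Current behavior |
|------|--------------------------|----------|
| `{a,b}` | Brace expansion | **Not supported** (treated as literal `{` `}`) |
| `**` | Recursive directory match | **No special semantics** (equivalent to two consecutive `*`) |

- Empty character classes such as `[]`, `[!]`, `[^]`, and `a[]b` are invalid and match no JSON Path.
- Unterminated character classes such as `int_[0-9` are invalid and match no JSON Path.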

#### Typical examples

1. Normal match
   - Pattern: `num_*`
     - √ `num_a`
     - √ `num_1`
     - × `number_a`

   - Pattern: `a\*b`
     - SQL: `'a\\*b'`
     - √ `a*b`
     - × `axxb`

   - Pattern: `\*`
     - SQL: `'\\*'`
     - √ `*`
     - × `a*`

   - Pattern: `\`
     - SQL: `'\\'`
     - √ `\`
     - × `\\`

   - Pattern: `int_[0-9]`
     - √ `int_1`
     - × `int_a`

2. Full match (not “contains” semantics)
   - Pattern: `a*b`
     - √ `ab`
     - √ `axxxb`
     - × `xxaxxxbxx`

3. `.` and `/` are not special; they are ordinary characters
   - Pattern: `int_*`
     - √ `int_nested.level1`
     - √ `int_nested/level1`

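## Type conflicts and promotion rules

When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss:
@@ -391,6 +467,69 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';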
| `VARCHAR` | ✔ | ✔ |
| `JSON` | ✔ | ✔ |

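### Schema Template based auto CAST

When a VARIANT column defines a Schema Template and `enable_variant_schema_auto_cast` is set to true, the semantic analysis phase automatically inserts CASTs to the declared types for subcolumns that match the Schema Template, so you do not need to write them by hand.

- Covers SELECT, WHERE, ORDER BY, GROUP BY, HAVING, JOIN keys, and aggregate arguments.
- To disable this behavior, set `enable_variant_schema_auto_cast` to false.

Example: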
```sql
CREATE TABLE t (
    id BIGINT,
    data VARIANT<'num_*': BIGINT, 'str_*': STRING>
);

-- 1) FILTER + ORDER
SELECT id
FROM t
WHERE data['num_a'] > 10
ORDER BY data['num_a'];

-- 2) GROUP + AGGREGATE + ALIAS
SELECT data['str_name'] AS username, SUM(data['num_a']) AS total
FROM t
GROUP BY username
HAVING data['num_a'] > 100;

-- 3) JOIN ON
SELECT *
FROM t1 JOIN t2
    ON t1.data['num_id'] = t2.data['num_id'];
```

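**Note**: Auto CAST cannot tell whether a given path is a leaf; it simply adds the corresponding CAST to every path that matches the Schema Template rules.

Take extra care in cases like the following: to ensure correct results, set `enable_variant_schema_auto_cast` to false and add CASTs manually.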

```sql
-- Schema Template: treat all int_* as INT
CREATE TABLE t (
    id INT,
    data VARIANT<'int_*': INT>
);

INSERT INTO t VALUES
    (1, '{"int_1": 1, "int_nested": {"level1_num_1": 1011111, "level1_num_2": 102}}');

-- Auto CAST enabled
SET enable_variant_schema_auto_cast = true;

-- int_nested matches int_*, is incorrectly CAST to INT, and the query returns NULL
SELECT
    data['int_nested']
FROM t;

-- Auto CAST disabled
SET enable_variant_schema_auto_cast = false;

-- The query returns the correct result
SELECT
    data['int_nested']
FROM t;
```

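## Wide columns

When ingested data contains many distinct JSON keys, the number of VARIANT subcolumns grows rapidly; beyond a certain scale this can cause metadata bloat, higher write/merge cost, and slower queries. To address “wide columns” (too many subcolumns), VARIANT provides two mechanisms: **Sparse columns** and **DOC encoding**.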