Feat(starrocks)!: improve some StarRocks SQL generation #6737
jaogoy wants to merge 6805 commits into tobymao:main
Conversation
jaogoy
commented
Jan 14, 2026
- Added StarRocks DDL support for three partition methods.
- Enabled ALTER TABLE … RENAME for StarRocks.
- Generated ORDER BY in place of CLUSTER BY in StarRocks output (see the sketch below this list).
- Added MV (REFRESH) properties handling for StarRocks materialized views.
- Tests updated/added for the new StarRocks behaviors.
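For context, here is a minimal sketch (not part of this PR's diff) of what the CLUSTER BY change means in practice, using sqlglot's expression API; the column names are placeholders:

from sqlglot import exp

# Build the node that the new cluster_sql override handles...
cluster = exp.Cluster(expressions=[exp.column("id"), exp.column("name")])

# ...and render it with the StarRocks dialect. With this PR's override it should
# come out as ORDER BY rather than CLUSTER BY.
print(cluster.sql(dialect="starrocks"))  # expected: ORDER BY (id, name)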
VaggelisD
left a comment
Hey @jaogoy, thank you for the PR! As a heads up, there are many different changes clustered together here; it'd be appreciated if we separated these into different PRs.
Leaving a few comments from an initial pass regardless:
def cluster_sql(self, expression: exp.Cluster) -> str:
    """Generate StarRocks ORDER BY clause for clustering.

    StarRocks uses ORDER BY instead of CLUSTER BY for table clustering.
    This override ensures exp.Cluster generates the correct syntax.

    Example:
        exp.Cluster(expressions=[id, name]) → "ORDER BY (id, name)"
    """
    expressions = self.expressions(expression, flat=True)
    return f"ORDER BY ({expressions})" if expressions else ""
Given that StarRocks does not support CLUSTER BY, this is about transpilation. Are we sure that the semantics of the input dialect are always equivalent to SR's ORDER BY?
For instance, coming from Spark it looks like CLUSTER BY = DISTRIBUTE BY + ORDER BY in MapReduce, whereas for Snowflake I get the feeling that it resembles StarRocks's PARTITION BY (?)
Clustering is a way of organizing data storage:
cluster = partition + distribution + order by.
If there is no partition and no distribution, clustering is the same as ORDER BY.
In Snowflake, clustering covers similar concepts to partitioning/distribution via micro-partitions.
So for StarRocks there is no difference between clustering and ORDER BY; StarRocks won't introduce a separate clustering concept distinct from ORDER BY.
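As a hedged illustration of the transpilation question above (not taken from the PR; the table and column names are placeholders), Spark's query-level CLUSTER BY parses into the same exp.Cluster node that the cluster_sql override above renders as ORDER BY:

import sqlglot

# In Spark, CLUSTER BY on a query is shorthand for DISTRIBUTE BY + SORT BY;
# sqlglot parses it into an exp.Cluster node stored under the "cluster" arg.
select = sqlglot.parse_one("SELECT * FROM t CLUSTER BY c1", read="spark")
print(select.args.get("cluster"))  # CLUSTER BY c1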
Also moved the MV test cases into TestStarRocks' test_ddl.
any_func_expr = any(isinstance(e, (exp.Func, exp.Anonymous)) for e in node.expressions) \
    if isinstance(node, exp.Tuple) else False
Is this necessary or is the VIEW lookup sufficient to know whether to have parentheses? If yes, we can remove any_func_expr logic completely
No. Only RANGE(c1, c2) and LIST(c1, c2) can appear without parentheses; the others need parentheses:
- for MV partitioning;
- for tables, but with column names only, such as (c1, c2).
any_func_expr = any(isinstance(e, (exp.Func, exp.Anonymous)) for e in node.expressions) \
    if isinstance(node, exp.Tuple) else False
create = expression.find_ancestor(exp.Create)
# SR needs `(...)` for MVs, with parens, and columns only
Yes. But VIEWs don't have partitions, so for simplicity, it doesn't matter.
if create and create.kind == "VIEW" or not any_func_expr:
    return f"PARTITION BY ({partition_columns_str})"
else:
    # SR doesn't support `(func(...), col2)` with parens for normal tables
    return f"PARTITION BY {partition_columns_str}"
Suggested change: replace
    if create and create.kind == "VIEW" or not any_func_expr:
        return f"PARTITION BY ({partition_columns_str})"
    else:
        # SR doesn't support `(func(...), col2)` with parens for normal tables
        return f"PARTITION BY {partition_columns_str}"
with
    if create and create.kind == "VIEW" or not any_func_expr:
        partition_columns_str = f"({partition_columns_str})"
    return f"PARTITION BY {partition_columns_str}"
# LIST partitioning
list_partition = exp.PartitionByListProperty(
    partition_expressions=[exp.column("c1")],
    create_expressions=[
        exp.Var(this="PARTITION p1 VALUES IN (1, 2)"),
        exp.Var(this="PARTITION p2 VALUES IN ('US', 'CN')"),
    ],
)
Let's not test RANGE/LIST partitioning by manually creating expressions; we can test it through SQL by adding these cases to the ddl_sqls list, like you did with the other new entries.
Do you mean using self.validate_identity(...)? But:
- PartitionByListProperty is not supported in the Parser for StarRocks yet.
- create_expressions can be defined either with raw exp.Var() nodes or with structured expressions such as exp.Partition(...). With self.validate_identity(...) we can only test one of those forms via SQL statements, even once the Parser supports it.
Right?
Also, do I need to move the corresponding code from Doris into the base Parser/Generator?
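For reference, a hypothetical sketch of what the SQL-driven variant of the LIST case above could look like once parsing support exists; the DDL string is illustrative and has not been verified against the parser:

# Only meaningful once StarRocks LIST-partition parsing lands, which the
# comment above says is not yet the case.
LIST_PARTITION_DDL = (
    "CREATE TABLE t (c1 INT) "
    "PARTITION BY LIST (c1) ("
    "PARTITION p1 VALUES IN ('1', '2'), "
    "PARTITION p2 VALUES IN ('US', 'CN'))"
)
# e.g. append it to ddl_sqls or call self.validate_identity(LIST_PARTITION_DDL)
# inside TestStarRocks.test_ddl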
# MV: Refresh trigger property
manual_refresh = exp.RefreshTriggerProperty(kind=exp.var("MANUAL"))
self.assertEqual(manual_refresh.sql(dialect="starrocks"), "REFRESH MANUAL")

async_refresh = exp.RefreshTriggerProperty(
    method=exp.var("IMMEDIATE"),
    kind=exp.var("ASYNC"),
    starts=exp.Literal.string("2025-01-01 00:00:00"),
    every=exp.Literal.number(5),
    unit=exp.var("MINUTE"),
)
self.assertEqual(
    async_refresh.sql(dialect="starrocks"),
    "REFRESH IMMEDIATE ASYNC START ('2025-01-01 00:00:00') EVERY (INTERVAL 5 MINUTE)",
)
Ditto, let's not use manually created expressions; let's test DDLs that contain these clauses through self.validate_identity instead.
But RefreshTriggerProperty is not supported in the Parser yet. How can we use validate_identity to test it?
I tried self.validate_identity("CREATE MATERIALIZED VIEW mv_name REFRESH MANUAL AS SELECT * FROM t;") to test it, and it doesn't work.
Do I need to move the corresponding code from Doris into the base Parser/Generator? There are a few differences between Doris and StarRocks, though.
@VaggelisD Thank you for your meticulous review. Do I need to submit another PR, since this one was closed by tobymao? Or is there anything else I need to do?
Hey @jaogoy, the PR was closed by mistake, due to a git operation that was used to clean up old caches. Can you please open a new PR after you rebase off of main? Thanks!
@georgesittas @VaggelisD I've created another PR (#6827). It depends on #6804, which also adds partitioning parsing and generation and was submitted by @petrikoro a day ago. Please have a review. Thanks.