Skip to content

Feat(starrocks)!: improve some StarRocks sql generation#6737

Closed
jaogoy wants to merge 6805 commits intotobymao:mainfrom
jaogoy:improv.sr
Closed

Feat(starrocks)!: improve some StarRocks sql generation#6737
jaogoy wants to merge 6805 commits intotobymao:mainfrom
jaogoy:improv.sr

Conversation

@jaogoy
Copy link
Contributor

@jaogoy jaogoy commented Jan 14, 2026

  • Added StarRocks DDL support for three partition methods.
  • Enabled ALTER TABLE … RENAME for StarRocks.
  • Emitted ORDER BY via CLUSTER BY for StarRocks outputs.
  • Added MV (REFRESH) properties handling for StarRocks materialized views.
  • Tests updated/added for the new StarRocks behaviors.

fivetran-kwoodbeck and others added 30 commits December 3, 2025 10:14
* support ?::

* include base COLUMN_OPERATORS
…6447)

* annotate type for MODE function snowflake

* support multiple semantics

* address test comment
…ymao#6470)

* annotate type for PERCENTILE_CONT snowflake

* fix format

* fix test
Co-authored-by: Michael Lee <michael.lee@michael.lee-FMF6J19R7N>
…eturn type VARCHAR (tobymao#6475)

# Conflicts:
#	sqlglot/typing/snowflake.py
#	tests/fixtures/optimizer/annotate_functions.sql
…o#6474)

* chore(optimizer)!: annotate type for snowflake func TO_BINARY

* remove unnecessary function

* add test to test_dialect

---------

Co-authored-by: Michael Lee <michael.lee@michael.lee-FMF6J19R7N>
…TANCE, VECTOR_COSINE_SIMILARITY functions (tobymao#6468)

* Annotate type for vector distances (VECTOR_L1_DISTANCE, VECTOR_L2_DISTANCE, VECTOR_COSINE_SIMILARITY)

* Added VECTOR_L1_DISTANCE to TRANSFORMS

* Removed redundant _sql_names
* Type annotation for REGR_* functions

* removed unrequired change

* added tests for other databases and made all REGR classes inherit from AggFunc

* removed unsupported databases
…o#6481)

* chore(optimizer)!: annotate type for snowflake func TO_BOOLEAN

* formatting

* remove _sql_names from ToBoolean expression

* remove TO_BOOLEAN from oracle function parsing

---------

Co-authored-by: Michael Lee <michael.lee@michael.lee-FMF6J19R7N>
* wrap bq -> duckdb REGEXP_EXTRACT SUBSTRING() call in NULLIF

* Use dialect-specific constant for position overflow semantics
* Feat(BigQuery)!: Add support for the NET.HOST function

* PR feedback
* support dcolonqmark

* add testcases
… for BITNOT (tobymao#6490)

* feat(duckdb): Add transpilation support for the negative integer args for BITNOT

* Update sqlglot/dialects/duckdb.py

---------

Co-authored-by: Vaggelis Danias <daniasevangelos@gmail.com>
VaggelisD and others added 11 commits January 13, 2026 16:58
* annotate the TANH

* feat(optimizer): Refactor Tanh and Atan2 annotations for consistency

* feat(optimizer): Add Tanh annotation for Hive

* feat(optimizer): Update TANH dialect support to include Hive
…om snowflake to duckdb (tobymao#6690)

* Fix TO_TIME/TRY_TO_TIME transpilation from Snowflake to DuckDB for compact formats

* switched to expressions instead of strings, added tests

* added tests, removed generator.py changes

* added missing test

* minor tweak, removed invalid test
* Annotate Type for Kurtosis

* missing test

* made sure type annotation was correct for decfloat, double, number

* Small cleanup

---------

Co-authored-by: Jo <46752250+georgesittas@users.noreply.github.com>
Signed-off-by: jaogoy <jaogoy@gmail.com>
Signed-off-by: jaogoy <jaogoy@gmail.com>
Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jaogoy, thank you for the PR! As a heads up, there are many different changes clustered together here, it'd be appreciated if we separated these in different PRs.

Leaving a few comments from an initial pass regardless:

Signed-off-by: jaogoy <jaogoy@gmail.com>
Comment on lines 403 to 413
def cluster_sql(self, expression: exp.Cluster) -> str:
"""Generate StarRocks ORDER BY clause for clustering.

StarRocks uses ORDER BY instead of CLUSTER BY for table clustering.
This override ensures exp.Cluster generates the correct syntax.

Example:
exp.Cluster(expressions=[id, name]) → "ORDER BY (id, name)"
"""
expressions = self.expressions(expression, flat=True)
return f"ORDER BY ({expressions})" if expressions else ""
Copy link
Collaborator

@VaggelisD VaggelisD Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that Starrocks does not support CLUSTER BY then this is regarding transpilation. Are we sure that the semantics of the input dialect are always equivalent to SR's ORDER BY?

For instance, coming from Spark it looks like CLUSTER BY = DISTRIBUTE BY + ORDER BY in MapReduce, whereas for Snowflake I get the feeling that it resembles Starrocks's PARTITION BY (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cluster is a kind of data storage.
cluster = partition + distribution + order by.
If there is no partition and distribution, it's the same as order by.
In snowflake, the cluster contains some similar concept with partitions/distribution by using micro partitions.
So for StarRocks, there is not differency between cluster and order by, that is StarRocks won't use another cluster different with order by.

Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jaogoy, sending a few more comments your way, appreciate your cooperation thus far!

also, mv test cases into TestStarocks' test_ddl

Signed-off-by: jaogoy <jaogoy@gmail.com>
Signed-off-by: jaogoy <jaogoy@gmail.com>
Signed-off-by: jaogoy <jaogoy@gmail.com>
Comment on lines 377 to 378
any_func_expr = any(isinstance(e, (exp.Func, exp.Anonymous)) for e in node.expressions) \
if isinstance(node, exp.Tuple) else False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary or is the VIEW lookup sufficient to know whether to have parentheses? If yes, we can remove any_func_expr logic completely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, only RANGE(c1, c2) and LIST(c1, c2) can be without parentheses, others need parentheses:

  1. for MV partitioning.
  2. for tables, but column names only, such as (c1 ,c2).

any_func_expr = any(isinstance(e, (exp.Func, exp.Anonymous)) for e in node.expressions) \
if isinstance(node, exp.Tuple) else False
create = expression.find_ancestor(exp.Create)
# SR needs `(...)` for MVs, with parens, and columns only
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment relevant? #6737 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But VIEWs don't have partitions, so for simiplicity, it doesn't matter.

Comment on lines 381 to 385
if create and create.kind == "VIEW" or not any_func_expr:
return f"PARTITION BY ({partition_columns_str})"
else:
# SR doesn't support `(func(...), col2)` with parens for normal tables
return f"PARTITION BY {partition_columns_str}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if create and create.kind == "VIEW" or not any_func_expr:
return f"PARTITION BY ({partition_columns_str})"
else:
# SR doesn't support `(func(...), col2)` with parens for normal tables
return f"PARTITION BY {partition_columns_str}"
if create and create.kind == "VIEW" or not any_func_expr:
partition_columns_str = f"({partition_columns_str})"
return f"PARTITION BY {partition_columns_str}"

Comment on lines 88 to 95
# LIST partitioning
list_partition = exp.PartitionByListProperty(
partition_expressions=[exp.column("c1")],
create_expressions=[
exp.Var(this="PARTITION p1 VALUES IN (1, 2)"),
exp.Var(this="PARTITION p2 VALUES IN ('US', 'CN')"),
],
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not test RANGE/LIST partitioning by manually creating expressions, we can test them through SQL by adding them to the ddl_sqls list like you did with the other new entries

Copy link
Contributor Author

@jaogoy jaogoy Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean using self.validate_identity(...)? But:

  1. PartitionByListProperty is not supported in Parser for StarRocks now.
  2. It can support to define create_expressions with raw exp.Var() or structured expresion, such as exp.Partition(...). While using self.validate_identity(...), we can only test one way by using SQL statements, even it's supported in Parser.
    Right?

And, do I need to move the corresponding code from Doris to base Parser/Generator.

Comment on lines 144 to 158
# MV : Refresh trigger property
manual_refresh = exp.RefreshTriggerProperty(kind=exp.var("MANUAL"))
self.assertEqual(manual_refresh.sql(dialect="starrocks"), "REFRESH MANUAL")

async_refresh = exp.RefreshTriggerProperty(
method=exp.var("IMMEDIATE"),
kind=exp.var("ASYNC"),
starts=exp.Literal.string("2025-01-01 00:00:00"),
every=exp.Literal.number(5),
unit=exp.var("MINUTE"),
)
self.assertEqual(
async_refresh.sql(dialect="starrocks"),
"REFRESH IMMEDIATE ASYNC START ('2025-01-01 00:00:00') EVERY (INTERVAL 5 MINUTE)",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, let's not use manually created expressions, lets prefer to test DDLs that have these clauses through self.validate_identity

Copy link
Contributor Author

@jaogoy jaogoy Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, RefreshTriggerProperty is not supported in Parser now. How can we use validate_identity to test it?
I've tried use self.validate_identity("CREATE MATERIALIZED VIEW mv_name REFRESH MANUAL AS SELECT * FROM t;) to tested it, it's not correct.
Do I need to move the corresponding code from Doris to base Parser/Generator, but there are a few differences between Doris and StarRocks.

@jaogoy
Copy link
Contributor Author

jaogoy commented Jan 20, 2026

@VaggelisD Thank you for your meticulous review.
I have fixed most of them according to you opinion.

Do I need to submit another PR for this PR is closed by Tobymao. Or do I need to do any more operations?

1 similar comment
@jaogoy
Copy link
Contributor Author

jaogoy commented Jan 20, 2026

@VaggelisD Thank you for your meticulous review.
I have fixed most of them according to you opinion.

Do I need to submit another PR for this PR is closed by Tobymao. Or do I need to do any more operations?

@georgesittas
Copy link
Collaborator

Hey @jaogoy, the PR was closed by mistake, due to a git operation that was used to clean up old caches.

Can you please open a new PR after you rebase off of main? Thanks!

@jaogoy
Copy link
Contributor Author

jaogoy commented Jan 21, 2026

@georgesittas @VaggelisD I've created another PR(#6827), and it depends on #6804, which also supports partitioning parse and generation, submitted by @petrikoro a day ago.

Please have a review. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comments