Snowflake Unparser dialect and UNNEST support#21593

Open
yonatan-sevenai wants to merge 20 commits into apache:main from yonatan-sevenai:feature/snowflake_unparser

Conversation

@yonatan-sevenai
Contributor

Which issue does this PR close?

Rationale for this change

The SQL unparser needs a Snowflake dialect. Basic dialect settings (identifier quoting, NULLS FIRST/NULLS LAST, timestamp types) are straightforward, but UNNEST support required more than configuration.

Snowflake has no UNNEST keyword. Its equivalent, LATERAL FLATTEN(INPUT => expr), is a table function in the FROM clause with output accessed via alias."VALUE". This differs structurally from standard SQL: the unparser must emit a FROM-clause table factor with a CROSS JOIN instead of a SELECT-clause expression. It also must rewrite column references to point at the FLATTEN output, and handle several optimizer-produced plan shapes (intermediate Limit/Sort nodes, SubqueryAlias wrappers, composed expressions wrapping the unnest output, multi-expression projections). None of this can be expressed through CustomDialectBuilder.

What changes are included in this PR?

dialect.rs - New SnowflakeDialect with double-quote identifiers, NULLS FIRST/NULLS LAST, no empty select lists, no column aliases in table aliases, Snowflake timestamp types, and unnest_as_lateral_flatten(). Also wired into CustomDialect/CustomDialectBuilder.

ast.rs - New FlattenRelationBuilder that produces LATERAL FLATTEN(INPUT => expr, OUTER => bool) table factors, parallel to the existing UnnestRelationBuilder.
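To make the target SQL shape concrete, here is a minimal sketch of a builder that renders the table factor described above. It is a string-based stand-in, not the PR's FlattenRelationBuilder (which presumably builds sqlparser AST nodes); the type name `FlattenSketch` and its methods are hypothetical, and identifier quoting is left out.

```rust
/// Hypothetical, string-based stand-in for a FLATTEN table-factor builder.
/// The builder in the PR works on AST nodes; this only shows the SQL shape.
#[derive(Debug, Clone)]
pub struct FlattenSketch {
    input_sql: String,
    outer: bool,
    alias: Option<String>,
}

impl FlattenSketch {
    /// `input_sql` is the already-rendered SQL of the array expression.
    pub fn new(input_sql: &str) -> Self {
        Self {
            input_sql: input_sql.to_string(),
            outer: false,
            alias: None,
        }
    }

    /// OUTER => true keeps rows whose input is empty or NULL.
    pub fn outer(mut self, outer: bool) -> Self {
        self.outer = outer;
        self
    }

    /// Alias used to reach the output column as alias.VALUE.
    pub fn alias(mut self, alias: &str) -> Self {
        self.alias = Some(alias.to_string());
        self
    }

    /// Renders `LATERAL FLATTEN(INPUT => expr, OUTER => bool) [AS alias]`.
    pub fn to_sql(&self) -> String {
        let mut sql = format!(
            "LATERAL FLATTEN(INPUT => {}, OUTER => {})",
            self.input_sql, self.outer
        );
        if let Some(alias) = &self.alias {
            sql.push_str(" AS ");
            sql.push_str(alias);
        }
        sql
    }
}
```

In a full query this factor would sit in the FROM clause, cross joined with the source table, while the SELECT list refers to the alias's VALUE column.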

utils.rs - New unproject_unnest_expr_as_flatten_value transform that rewrites unnest placeholder columns to _unnest.VALUE references.
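The rewrite can be pictured on a toy expression tree. The sketch below is an assumption-laden illustration, not the PR's transform: the `Expr` enum, the function name `rewrite_to_flatten_value`, and the prefix check are all hypothetical, and the real code operates on DataFusion's own expression type.

```rust
/// Toy expression tree, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Qualified { relation: String, name: String },
    Cast { expr: Box<Expr>, to: String },
}

/// Internal name prefix mentioned in the commit messages.
const PLACEHOLDER_PREFIX: &str = "__unnest_placeholder";

/// Replaces every placeholder column with `alias.VALUE`, recursing through
/// wrapping expressions such as CAST. Other expressions pass through as-is.
pub fn rewrite_to_flatten_value(expr: Expr, alias: &str) -> Expr {
    match expr {
        Expr::Column(name) if name.starts_with(PLACEHOLDER_PREFIX) => Expr::Qualified {
            relation: alias.to_string(),
            name: "VALUE".to_string(),
        },
        Expr::Cast { expr, to } => Expr::Cast {
            expr: Box::new(rewrite_to_flatten_value(*expr, alias)),
            to,
        },
        other => other,
    }
}
```

Recursing through wrappers is what lets composed expressions, such as a CAST around the unnest output, still end up pointing at the FLATTEN alias.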

plan.rs - Changes to select_to_sql_recursively:

  • The Projection handler scans all expressions for unnest placeholders (not just single-expression projections), then branches into the FLATTEN path or the existing table-factor path.
  • peel_to_unnest_with_modifiers walks through Limit/Sort nodes between Projection and Unnest, applying their SQL modifiers to the query builder. This handles an optimizer behavior where these nodes are inserted between the two.
  • peel_to_inner_projection walks through SubqueryAlias to find the inner Projection that feeds an Unnest.
  • reconstruct_select_statement gained FLATTEN-aware expression rewriting and a has_internal_unnest_alias predicate to strip internal UNNEST(...) display names.
  • The Unnest handler rejects struct columns for the FLATTEN dialect with a clear error.
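The peeling step in the list above can be sketched on a toy plan type. Everything here is hypothetical (the `Plan` enum, `Modifiers`, and `peel_to_unnest`); the PR's peel_to_unnest_with_modifiers works on DataFusion's LogicalPlan and its query builder, which this sketch does not model.

```rust
/// Toy logical plan, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Unnest { column: String },
    Limit { fetch: usize, input: Box<Plan> },
    Sort { by: String, input: Box<Plan> },
    TableScan { name: String },
}

/// Stand-in for the SQL modifiers a query builder would collect.
#[derive(Debug, Default, PartialEq)]
pub struct Modifiers {
    pub limit: Option<usize>,
    pub order_by: Vec<String>,
}

/// Walks down through Limit/Sort nodes, recording each as a modifier, and
/// returns the Unnest node if one is found directly beneath them.
/// Note: in this sketch the modifiers are recorded even when no Unnest is
/// found, so a caller should discard them when `None` is returned.
pub fn peel_to_unnest<'a>(mut plan: &'a Plan, mods: &mut Modifiers) -> Option<&'a Plan> {
    loop {
        match plan {
            Plan::Limit { fetch, input } => {
                mods.limit = Some(*fetch);
                plan = &**input;
            }
            Plan::Sort { by, input } => {
                mods.order_by.push(by.clone());
                plan = &**input;
            }
            Plan::Unnest { .. } => return Some(plan),
            _ => return None,
        }
    }
}
```

The point of the pattern is that the intermediate nodes are not dropped: each one becomes a LIMIT or ORDER BY on the emitted query while the walk continues toward the Unnest.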

Are these changes tested?

Yes. 18 new tests covering:

  • Simple inline arrays, string arrays, cross joins
  • Implicit FROM (UNNEST in SELECT clause)
  • User aliases, table aliases, literal + unnest
  • Subselect source with filters and limit
  • UDF result as FLATTEN input
  • Limit between Projection and Unnest
  • Sort between Projection and Unnest
  • Limit + SubqueryAlias combined
  • Composed expressions wrapping unnest output (e.g. CAST)
  • Composed expressions with Limit
  • Multi-expression projections
  • Multi-expression projections with Limit
  • SubqueryAlias between Unnest and inner Projection

Are there any user-facing changes?

Yes. New public API surface:

  • SnowflakeDialect struct and its constructor
  • Dialect::unnest_as_lateral_flatten() method (default false)
  • CustomDialectBuilder::with_unnest_as_lateral_flatten()
  • FlattenRelationBuilder and FLATTEN_DEFAULT_ALIAS in the AST module

None of these are breaking changes, and all previous APIs should work.
New trait methods have default implementations to ease migration.
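A minimal sketch of why a defaulted trait method is non-breaking follows. The types below are simplified stand-ins that reuse the names from the list above; they are not the DataFusion definitions, and the `bool` parameter on `with_unnest_as_lateral_flatten` is an assumption.

```rust
/// Simplified stand-in for the unparser's Dialect trait.
pub trait Dialect {
    /// Defaults to false, so existing implementors compile unchanged.
    fn unnest_as_lateral_flatten(&self) -> bool {
        false
    }
}

/// An existing dialect that never heard of the new method.
pub struct DefaultDialect;
impl Dialect for DefaultDialect {}

/// The new dialect opts in by overriding the default.
pub struct SnowflakeDialect;
impl Dialect for SnowflakeDialect {
    fn unnest_as_lateral_flatten(&self) -> bool {
        true
    }
}

/// Stand-in for a dialect configured through a builder.
pub struct CustomDialect {
    unnest_as_lateral_flatten: bool,
}

impl Dialect for CustomDialect {
    fn unnest_as_lateral_flatten(&self) -> bool {
        self.unnest_as_lateral_flatten
    }
}

/// Stand-in builder; the flag is off unless explicitly enabled.
#[derive(Default)]
pub struct CustomDialectBuilder {
    unnest_as_lateral_flatten: bool,
}

impl CustomDialectBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Assumed signature: takes the desired flag value.
    pub fn with_unnest_as_lateral_flatten(mut self, enabled: bool) -> Self {
        self.unnest_as_lateral_flatten = enabled;
        self
    }

    pub fn build(self) -> CustomDialect {
        CustomDialect {
            unnest_as_lateral_flatten: self.unnest_as_lateral_flatten,
        }
    }
}
```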

yonatan-sevenai and others added 19 commits March 22, 2026 00:06
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:
  SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:
  SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would
call subquery_alias_inner_query_and_columns (which only unwraps
Projection children) and unparse_table_scan_pushdown (which only handles
TableScan/SubqueryAlias/Projection). When both returned nothing useful
for an Aggregate child, the code recursed directly into the Aggregate,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a
plan type that builds its own SELECT clauses (Aggregate, Window, Sort,
Limit, Union), emit it as a derived subquery via self.derive() with the
alias always attached, rather than falling through to the recursive
path that would flatten it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
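
The early check described in that commit message can be sketched as follows. The `Plan` enum and both function names are hypothetical, and the real fix calls self.derive() on the unparser rather than formatting strings; this only shows the branching.

```rust
/// Toy plan kinds, assumed for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Plan {
    TableScan,
    Projection,
    Aggregate,
    Window,
    Sort,
    Limit,
    Union,
}

/// True for children that build their own SELECT clauses and therefore
/// must not be flattened into a plain table alias.
pub fn builds_own_select_clauses(child: &Plan) -> bool {
    matches!(
        child,
        Plan::Aggregate | Plan::Window | Plan::Sort | Plan::Limit | Plan::Union
    )
}

/// Renders a SubqueryAlias: clause-building children become a derived
/// subquery with the alias attached; other children keep the flat form.
pub fn render_subquery_alias(child: &Plan, child_sql: &str, alias: &str) -> String {
    if builds_own_select_clauses(child) {
        format!("({child_sql}) AS {alias}")
    } else {
        format!("{child_sql} AS {alias}")
    }
}
```

With the example from the commit message, an Aggregate child rendered as `SELECT max(id) AS m FROM j2` stays wrapped in parentheses instead of collapsing to `j2 AS b`.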
The existing tests pass with broken SQL output — the SELECT list
still uses DataFusion internal names (__unnest_placeholder) instead
of Snowflake's alias.VALUE convention. Update expectations to the
correct Snowflake SQL so these tests will drive the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd Projection

When a table is accessed through a passthrough/virtual table mapping,
DataFusion inserts a SubqueryAlias node between Unnest and its inner
Projection. The FLATTEN rendering code assumed a direct Projection child
and failed with "Unnest input is not a Projection: SubqueryAlias(...)".

Peel through SubqueryAlias in three code paths that inspect unnest.input:
try_unnest_to_lateral_flatten_sql, the inline-vs-table source check, and
the general unnest recursion. Also fix a pre-existing collapsible_if
clippy warning in check_unnest_placeholder_with_outer_ref.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql SQL Planner label Apr 13, 2026
@nuno-faria
Contributor

Thanks @yonatan-sevenai. I think there is another PR that adds support for the Snowflake dialect from @goldmedal (#20648). Maybe you could collaborate together on one of the PRs to avoid duplicate work.

@yonatan-sevenai
Contributor Author

Thanks!
Quite the find :)

I believe the implementation I added covers many more use cases, but we'll see if we can collaborate on a single implementation.
Specifically, there's a lot of complexity when the array to unnest is the result of a UDF, a subquery, or something similar.
I also saw a lot of edge cases where the optimizer might insert limits and sorts between the unnest and the TableScan / SubqueryAlias, plus some complexity when you need to cross join the original table.

Hope we can figure out a single stable implementation!

@goldmedal
Contributor

Thanks @yonatan-sevenai, I'll take a look at this PR

@goldmedal goldmedal self-requested a review April 14, 2026 03:48
Contributor

@goldmedal goldmedal left a comment

@yonatan-sevenai Thanks for picking this up — I haven't had bandwidth to finish #20648 recently, but I'd like to help get this landed.

Before a detailed review, I want to discuss the design. LATERAL FLATTEN(INPUT => expr) is Snowflake-specific syntax, and this PR embeds that logic directly in the core unparser. I'd prefer delegating UNNEST-to-table-factor conversion to the Dialect trait — see my detailed comment below.

What do you think?

Comment thread datafusion/sql/src/unparser/dialect.rs
Comment thread datafusion/sql/tests/cases/plan_to_sql.rs Outdated
Comment thread datafusion/sql/src/unparser/ast.rs Outdated
Comment thread datafusion/sql/src/unparser/plan.rs Outdated
Comment thread datafusion/sql/src/unparser/plan.rs Outdated
Comment thread datafusion/sql/src/unparser/plan.rs Outdated
Comment thread datafusion/sql/src/unparser/dialect.rs
@yonatan-sevenai yonatan-sevenai force-pushed the feature/snowflake_unparser branch from 96642c7 to 70c717c Compare April 17, 2026 16:49
Replace the hardcoded FLATTEN_DEFAULT_ALIAS ("_unnest") with a
per-SelectBuilder counter that generates unique aliases (_unnest_1,
_unnest_2, …). This prevents alias collisions when multiple unnests
appear in the same query.

- Add flatten_alias_counter to SelectBuilder with next/current
  accessor methods, scoped to one SELECT so subqueries get
  independent counters
- Remove FLATTEN_DEFAULT_ALIAS constant, the dead alias_name()
  method, and the default alias from FlattenRelationBuilder
- All three FLATTEN code paths (placeholder projection, display-name
  projection, and Unnest handler) now coordinate through the
  SelectBuilder to ensure SELECT items and FROM clause use the same
  alias
- Use internal_datafusion_err! macro for FLATTEN error handling
- Migrate unnest tests from partial .contains() assertions to
  insta::assert_snapshot! for full SQL verification
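
A minimal sketch of the per-SELECT alias counter described in this commit follows. The struct is a stand-in holding only the counter; the `current_flatten_alias` name and its `Option` return type are assumptions, since the commit message only says next/current accessors exist.

```rust
/// Stand-in for the unparser's SelectBuilder, reduced to the counter.
#[derive(Debug, Default)]
pub struct SelectBuilder {
    flatten_alias_counter: usize,
}

impl SelectBuilder {
    /// Generates a fresh alias (`_unnest_1`, `_unnest_2`, ...).
    pub fn next_flatten_alias(&mut self) -> String {
        self.flatten_alias_counter += 1;
        format!("_unnest_{}", self.flatten_alias_counter)
    }

    /// Returns the most recently generated alias, if any, so the SELECT
    /// list and the FROM clause can agree on the same name.
    pub fn current_flatten_alias(&self) -> Option<String> {
        if self.flatten_alias_counter == 0 {
            None
        } else {
            Some(format!("_unnest_{}", self.flatten_alias_counter))
        }
    }
}
```

Because the counter lives on the builder, a subquery that gets its own builder starts again at `_unnest_1`, which is harmless since the alias is scoped to that SELECT.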
@yonatan-sevenai yonatan-sevenai force-pushed the feature/snowflake_unparser branch from 70c717c to 8e83e17 Compare April 18, 2026 17:47
Contributor

@goldmedal goldmedal left a comment

Thanks @yonatan-sevenai, overall this looks good to me 👍 Just some minor suggestions.

/// (`_unnest_1`, `_unnest_2`, …). Each call returns a fresh name.
pub fn next_flatten_alias(&mut self) -> String {
    self.flatten_alias_counter += 1;
    format!("_unnest_{}", self.flatten_alias_counter)
Contributor

It's better to use a constant to represent the name. Something like

pub const UNNEST_PREFIX: &str = "__unnest_";

Contributor

Could you add a test for the multiple unnest case?

Comment thread datafusion/sql/src/unparser/dialect.rs
Labels

sql SQL Planner

Development

Successfully merging this pull request may close these issues.

Snowflake dialect support for Unparser

3 participants