feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY by fivetran-kwoodbeck · Pull Request #7709 · tobymao/sqlglot

fivetran-kwoodbeck · 2026-06-04T20:04:57Z

Adds Snowflake to DuckDB transpilation for START WITH ... CONNECT BY PRIOR hierarchical queries.

DuckDB rejects CONNECT BY verbatim. This adds connect_by_to_recursive_cte registered as a preprocessor on exp.Select, which rewrites the query into a WITH RECURSIVE CTE at generation time:

Anchors START WITH rows from the source table
Recursive arm: source table aliased _child_row joined to the CTE aliased _parent_row on the PRIOR predicate
Outer SELECT re-projects from _rootcte, applying WHERE post-hierarchy and forwarding ORDER BY, LIMIT, OFFSET

Falls through unchanged for NOCYCLE, multiple PRIORs, or non-equality predicates. CTE name collision is avoided via find_new_name. WHERE columns absent from SELECT are carried through the CTE so the outer filter can see them. LEVEL is injected as a depth counter unless already projected.

Tests in sqlglot-integration-tests cover: SELECT *, explicit column list with ORDER BY, PRIOR on left side, WHERE carry-through, existing CTE preservation, and name collision.

github-actions · 2026-06-04T20:37:48Z

SQLGlot Integration Test Results

✅ All tests passed

Comparing:

this branch (sqlglot:transpile/snowflake-startswith @ sqlglot 121f079)
baseline (main @ sqlglot a76439c)

Overall

main: 192441 total, 153536 passed (pass rate: 79.8%)

sqlglot:transpile/snowflake-startswith: 106860 total, 106860 passed (pass rate: 100.0%)

Transitions:
No change

Dialect pair changes: 0 previous results not found, 2 current results not found

✅ All tests passed

georgesittas · 2026-06-08T10:56:44Z

+    if connect.args.get("nocycle") or not isinstance(connect.args.get("connect"), exp.EQ):
+        return expression


Does Snowflake only allow = comparisons in PRIOR join conditions? And hence, is this why we're only checking for EQ here?

Snowflake does allows non-equality PRIOR predicates, but we're bailing out here to prevent hanged queries. Snowflake has cycle detection whereas DuckDB recursive CTEs don't, so the CTE could be non-terminating and I don't (yet) see a way around it.

CREATE TABLE t (a INT, b INT); INSERT INTO t VALUES (1, 5), (2, 5);

Snowflake query:

SELECT a, b FROM t START WITH a = 1 CONNECT BY PRIOR b > a

The naive DuckDB recursive CTE translation:

WITH RECURSIVE cte AS ( SELECT a, b FROM t WHERE a = 1 -- anchor: {(1, 5)} UNION ALL SELECT t.a, t.b FROM t JOIN cte ON t.a < cte.b -- child.a < parent.b ) SELECT * FROM cte

Trace:

- Level 0: {(1, 5)} - Level 1: rows where a < 5 → {(1, 5), (2, 5)} - Level 2: from (1,5) → {(1,5),(2,5)}; from (2,5) → {(1,5),(2,5)} - Level 3: same thing, four more rows - …never terminates

Snowflake terminates because CONNECT BY has cycle detection. Once a row has been visited, it won't be revisited. DuckDB's UNION ALL would mean the same rows keep being re-added forever.

Do you see any other options here?

I don't think the EQ is special, cycles can arise for it as well depending on the data. For example, take (1,2),(2,1) with parent = PRIOR id. This is a simple equality predicate that passes the guard, but the recursive CTE we emit will keep producing one new row per iteration and never terminate:

memory D WITH RECURSIVE _rootcte AS ( SELECT id, 1 AS level FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1 UNION ALL SELECT _child_row.id, _parent_row.level + 1 FROM (VALUES (1,2),(2,1)) _child_row(id,parent) JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id WHERE _parent_row.level < 8 -- added this manually as an arbitrary cutoff, only to stop the infinite recursion ) SELECT level, id FROM _rootcte ORDER BY level, id; ┌───────┬───────┐ │ level │ id │ │ int32 │ int32 │ ├───────┼───────┤ │ 1 │ 1 │ │ 2 │ 2 │ │ 3 │ 1 │ │ 4 │ 2 │ │ 5 │ 1 │ │ 6 │ 2 │ │ 7 │ 1 │ │ 8 │ 2 │ └───────┴───────┘

Also, regarding:

Once a row has been visited, it won't be revisited

Snowflake doesn't appear to maintain a global visited set. CONNECT BY tracks the path (a row being its own ancestor), so it expands per-path. The inequality example you shared didn't terminate quickly for me either (it ran several minutes on the 2-row input before I cancelled).

You can confirm what Snowflake ultimately does with an actual cycle here:

WITH t(id, parent) AS (SELECT * FROM VALUES (1,2),(2,1)) SELECT id FROM t START WITH id = 1 CONNECT BY parent = PRIOR id;

Do you see any other options here?

What about carrying the ancestor path and exclude rows already on it?

memory D WITH RECURSIVE _rootcte AS ( SELECT id, 1 AS level, [id] AS _path FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1 UNION ALL SELECT _child_row.id, _parent_row.level + 1, list_append(_parent_row._path, _child_row.id) FROM (VALUES (1,2),(2,1)) _child_row(id,parent) JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id WHERE NOT list_contains(_parent_row._path, _child_row.id) ) SELECT level, id FROM _rootcte ORDER BY level, id; ┌───────┬───────┐ │ level │ id │ │ int32 │ int32 │ ├───────┼───────┤ │ 1 │ 1 │ │ 2 │ 2 │ └───────┴───────┘

This way you don't need to add a constraint on the predicates you accept.

You're right, my recursive test was actually timing out (not completing). I tried your suggestion, it works well and wasn't too complicated.

georgesittas · 2026-06-08T11:00:50Z

+    prior = priors[0]
+
+    # Pull everything we need from the original SELECT up front.
+    src = expression.args["from_"].this  # type: ignore[union-attr]


Why do we need to ignore this typing error?

Also: should we explicitly error out if joins is set? Does Snowflake support FROM, JOINS and START WITH ... CONNECT BY ... in a single query?

Will fix the typing error. And yes, we should bail if JOINS is set, good catch.

georgesittas · 2026-06-08T11:05:47Z

+
+    # WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them.
+    select_names = {e.alias_or_name for e in selects}
+    extra_cols = list(


Nit: do we need to materialize a list here again? We could keep the dict and iterate over its keys, for col in extra_cols works as-is and key order is still preserved.

I think this column collection is only consumed for iteration-purposes, not actually stored under an AST node, meaning a list is not strictly necessary.

georgesittas · 2026-06-08T11:11:38Z

+    has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects)
+
+    # Anchor: original projections filtered by START WITH; seeds LEVEL at 1.
+    anchor_projs = [e.copy() for e in selects] + [exp.column(c) for c in extra_cols]


We do quite a few of copying in this transformation. Let's double-check whether all of the copy() calls are needed, dropping the redundant ones. Some builder methods already copy some of their arguments, for example.

georgesittas · 2026-06-08T11:12:16Z

+    # Inject LEVEL as a synthetic depth counter unless the query already projects it.
+    has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects)


I'm not sure if this is right. A couple of examples:

If you have a SELECT * in Snowflake, LEVER is not included in the output schema, but I believe the transpiled DuckDB query will include it

If LEVEL appears in a Snowflake query, it's left as-is, resulting in a DuckDB binding error, since we don't produce a computed column for it in that case

I think that we should always produce a computed LEVEL column in the recursive CTE and only project it in the top-level query, when it's explicitly projected in the source query, too. Otherwise, we can do a SELECT * EXCLUDE (LEVEL) ... or something along those lines.

Let's add some tests for LEVEL usage. A few simple tests will expose these issues, if they exist.

FYI, Snowflake appears to always give precedence to the LEVEL pseudo-column, despite the possible existence of a physical LEVEL column. In other words: the pseudo-column always shadows.

georgesittas · 2026-06-08T11:33:18Z

+    existing_where = expression.args.get("where")
+    existing_with = expression.args.get("with_")
+
+    # WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them.


What if you have SELECT * FROM x WHERE c > 1 START WITH ...? Does the extra_cols logic result in duplicating c in the projection list? That may cause ambiguity issues if that's the case. Let's check this.

georgesittas · 2026-06-08T11:42:24Z

+    # Recursive arm: qualify every column to _child_row (the new row) to disambiguate from _parent_row (the CTE/parent row).
+    # Identify which side of the equality is PRIOR (parent key) and which is the child FK.
+    parent_col = prior.this
+    child_col = connect_pred.expression if connect_pred.this is prior else connect_pred.this


What if the predicate contains arithmetic, e.g., PRIOR x = y - 1. Do we handle cases like this?

will add support for that, and walk the expression to qualify nodes.

georgesittas · 2026-06-08T11:44:25Z

+
+    # Attach the CTE, marking the WITH clause recursive. Preserve any existing CTEs.
+    if existing_with:
+        new_with = existing_with.copy()


Why do we need to copy() here?

Wouldn't we need a copy given that we're modifying it below? new_with.set("recursive", True), etc? I'm still slightly unclear when a copy is needed.

I'll take another close look, I may have misread something. In general, the generator copies for you, so mutations don't reflect in the original AST being generated. If, though, a sub-AST is used in >1 places, then you likely need to copy() to avoid mutation issues internally, within the generating pass. If this is the case, then you can disregard the comments related to copy(); in any case, worth double-checking.

Thanks. I reviewed the other copy() calls and removed them, this is the one I wasn't sure about.

georgesittas · 2026-06-08T11:47:42Z

+        dict.fromkeys(
+            col.name
+            for col in (existing_where.find_all(exp.Column) if existing_where else [])
+            if not col.table and col.name not in select_names


Why are we checking whether table is set? What happens when you have a query like this one?

SELECT a FROM src t WHERE t.b = 1 START WITH t.c = 2 CONNECT BY t.d = PRIOR t.e

Will update to factor in aliases

georgesittas · 2026-06-08T11:50:14Z

+    for arg in ("order", "limit", "offset"):
+        if val := expression.args.get(arg):
+            outer.set(arg, val.copy())


Ditto, re: query modifiers. This needs some attention: we should ensure this list is complete.

fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 90cc2e1 to 9c55e84 Compare June 5, 2026 20:19

georgesittas self-requested a review June 8, 2026 10:16

georgesittas self-assigned this Jun 8, 2026

georgesittas requested changes Jun 8, 2026

View reviewed changes

fivetran-kwoodbeck changed the title ~~feat(duckdb): Add transpilation support for Snowflake STARTS WITH~~ feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY Jun 8, 2026

fivetran-kwoodbeck added 4 commits June 9, 2026 10:03

changes so far

294e5e4

refactored

5f4a821

reverted makefile

a0553bd

many updates, as per PR feedback

bb9744a

fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 9c55e84 to bb9744a Compare June 9, 2026 14:04

		if connect.args.get("nocycle") or not isinstance(connect.args.get("connect"), exp.EQ):
		return expression

		# Inject LEVEL as a synthetic depth counter unless the query already projects it.
		has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects)

Conversation

fivetran-kwoodbeck commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SQLGlot Integration Test Results

✅ All tests passed

Overall

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

georgesittas Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 4, 2026 •

edited

Loading

georgesittas Jun 8, 2026 •

edited

Loading