Skip to content

feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709

Open
fivetran-kwoodbeck wants to merge 4 commits into
mainfrom
transpile/snowflake-startswith
Open

feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709
fivetran-kwoodbeck wants to merge 4 commits into
mainfrom
transpile/snowflake-startswith

Conversation

@fivetran-kwoodbeck

Copy link
Copy Markdown
Collaborator

Adds Snowflake to DuckDB transpilation for START WITH ... CONNECT BY PRIOR hierarchical queries.

DuckDB rejects CONNECT BY verbatim. This adds connect_by_to_recursive_cte registered as a preprocessor on exp.Select, which rewrites the query into a WITH RECURSIVE CTE at generation time:

  • Anchors START WITH rows from the source table
  • Recursive arm: source table aliased _child_row joined to the CTE aliased _parent_row on the PRIOR predicate
  • Outer SELECT re-projects from _rootcte, applying WHERE post-hierarchy and forwarding ORDER BY, LIMIT, OFFSET

Falls through unchanged for NOCYCLE, multiple PRIORs, or non-equality predicates. CTE name collision is avoided via find_new_name. WHERE columns absent from SELECT are carried through the CTE so the outer filter can see them. LEVEL is injected as a depth counter unless already projected.

Tests in sqlglot-integration-tests cover: SELECT *, explicit column list with ORDER BY, PRIOR on left side, WHERE carry-through, existing CTE preservation, and name collision.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

SQLGlot Integration Test Results

✅ All tests passed

Comparing:

  • this branch (sqlglot:transpile/snowflake-startswith @ sqlglot 121f079)
  • baseline (main @ sqlglot a76439c)

Overall

main: 192441 total, 153536 passed (pass rate: 79.8%)

sqlglot:transpile/snowflake-startswith: 106860 total, 106860 passed (pass rate: 100.0%)

Transitions:
No change

Dialect pair changes: 0 previous results not found, 2 current results not found

✅ All tests passed

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 90cc2e1 to 9c55e84 Compare June 5, 2026 20:19
@georgesittas georgesittas self-requested a review June 8, 2026 10:16
@georgesittas georgesittas self-assigned this Jun 8, 2026
Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +719 to +720
if connect.args.get("nocycle") or not isinstance(connect.args.get("connect"), exp.EQ):
return expression

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Snowflake only allow = comparisons in PRIOR join conditions? And hence, is this why we're only checking for EQ here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snowflake does allows non-equality PRIOR predicates, but we're bailing out here to prevent hanged queries. Snowflake has cycle detection whereas DuckDB recursive CTEs don't, so the CTE could be non-terminating and I don't (yet) see a way around it.

CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 5), (2, 5);

Snowflake query:

SELECT a, b FROM t
START WITH a = 1
CONNECT BY PRIOR b > a

The naive DuckDB recursive CTE translation:

WITH RECURSIVE cte AS (
    SELECT a, b FROM t WHERE a = 1          -- anchor: {(1, 5)}
    UNION ALL
    SELECT t.a, t.b FROM t
    JOIN cte ON t.a < cte.b                 -- child.a < parent.b
)
SELECT * FROM cte

Trace:

- Level 0: {(1, 5)}
- Level 1: rows where a < 5 → {(1, 5), (2, 5)}
- Level 2: from (1,5) → {(1,5),(2,5)}; from (2,5) → {(1,5),(2,5)}
- Level 3: same thing, four more rows
- …never terminates

Snowflake terminates because CONNECT BY has cycle detection. Once a row has been visited, it won't be revisited. DuckDB's UNION ALL would mean the same rows keep being re-added forever.

Do you see any other options here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the EQ is special, cycles can arise for it as well depending on the data. For example, take (1,2),(2,1) with parent = PRIOR id. This is a simple equality predicate that passes the guard, but the recursive CTE we emit will keep producing one new row per iteration and never terminate:

memory D WITH RECURSIVE _rootcte AS (
           SELECT id, 1 AS level FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1
           UNION ALL
           SELECT _child_row.id, _parent_row.level + 1
           FROM (VALUES (1,2),(2,1)) _child_row(id,parent)
           JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id
           WHERE _parent_row.level < 8 -- added this manually as an arbitrary cutoff, only to stop the infinite recursion
           ) SELECT level, id FROM _rootcte ORDER BY level, id;
┌───────┬───────┐
│ level │  id   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     1 │
│     2 │     2 │
│     3 │     1 │
│     4 │     2 │
│     5 │     1 │
│     6 │     2 │
│     7 │     1 │
│     8 │     2 │
└───────┴───────┘

Also, regarding:

Once a row has been visited, it won't be revisited

Snowflake doesn't appear to maintain a global visited set. CONNECT BY tracks the path (a row being its own ancestor), so it expands per-path. The inequality example you shared didn't terminate quickly for me either (it ran several minutes on the 2-row input before I cancelled).

You can confirm what Snowflake ultimately does with an actual cycle here:

WITH t(id, parent) AS (SELECT * FROM VALUES (1,2),(2,1))
SELECT id FROM t START WITH id = 1 CONNECT BY parent = PRIOR id;

Do you see any other options here?

What about carrying the ancestor path and exclude rows already on it?

memory D  WITH RECURSIVE _rootcte AS (
            SELECT id, 1 AS level, [id] AS _path
            FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1
            UNION ALL
            SELECT _child_row.id, _parent_row.level + 1,
                   list_append(_parent_row._path, _child_row.id)
            FROM (VALUES (1,2),(2,1)) _child_row(id,parent)
            JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id
            WHERE NOT list_contains(_parent_row._path, _child_row.id)
          ) SELECT level, id FROM _rootcte ORDER BY level, id;
┌───────┬───────┐
│ level │  id   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     1 │
│     2 │     2 │
└───────┴───────┘

This way you don't need to add a constraint on the predicates you accept.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, my recursive test was actually timing out (not completing). I tried your suggestion, it works well and wasn't too complicated.

Comment thread sqlglot/generators/duckdb.py Outdated
prior = priors[0]

# Pull everything we need from the original SELECT up front.
src = expression.args["from_"].this # type: ignore[union-attr]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to ignore this typing error?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: should we explicitly error out if joins is set? Does Snowflake support FROM, JOINS and START WITH ... CONNECT BY ... in a single query?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix the typing error. And yes, we should bail if JOINS is set, good catch.

Comment thread sqlglot/generators/duckdb.py Outdated

# WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them.
select_names = {e.alias_or_name for e in selects}
extra_cols = list(

@georgesittas georgesittas Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: do we need to materialize a list here again? We could keep the dict and iterate over its keys, for col in extra_cols works as-is and key order is still preserved.

I think this column collection is only consumed for iteration-purposes, not actually stored under an AST node, meaning a list is not strictly necessary.

Comment thread sqlglot/generators/duckdb.py Outdated
has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects)

# Anchor: original projections filtered by START WITH; seeds LEVEL at 1.
anchor_projs = [e.copy() for e in selects] + [exp.column(c) for c in extra_cols]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do quite a few of copying in this transformation. Let's double-check whether all of the copy() calls are needed, dropping the redundant ones. Some builder methods already copy some of their arguments, for example.

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +745 to +746
# Inject LEVEL as a synthetic depth counter unless the query already projects it.
has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is right. A couple of examples:

  • If you have a SELECT * in Snowflake, LEVER is not included in the output schema, but I believe the transpiled DuckDB query will include it
  • If LEVEL appears in a Snowflake query, it's left as-is, resulting in a DuckDB binding error, since we don't produce a computed column for it in that case

I think that we should always produce a computed LEVEL column in the recursive CTE and only project it in the top-level query, when it's explicitly projected in the source query, too. Otherwise, we can do a SELECT * EXCLUDE (LEVEL) ... or something along those lines.

Let's add some tests for LEVEL usage. A few simple tests will expose these issues, if they exist.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, Snowflake appears to always give precedence to the LEVEL pseudo-column, despite the possible existence of a physical LEVEL column. In other words: the pseudo-column always shadows.

existing_where = expression.args.get("where")
existing_with = expression.args.get("with_")

# WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you have SELECT * FROM x WHERE c > 1 START WITH ...? Does the extra_cols logic result in duplicating c in the projection list? That may cause ambiguity issues if that's the case. Let's check this.

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +756 to +759
# Recursive arm: qualify every column to _child_row (the new row) to disambiguate from _parent_row (the CTE/parent row).
# Identify which side of the equality is PRIOR (parent key) and which is the child FK.
parent_col = prior.this
child_col = connect_pred.expression if connect_pred.this is prior else connect_pred.this

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the predicate contains arithmetic, e.g., PRIOR x = y - 1. Do we handle cases like this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add support for that, and walk the expression to qualify nodes.

Comment thread sqlglot/generators/duckdb.py Outdated

# Attach the CTE, marking the WITH clause recursive. Preserve any existing CTEs.
if existing_with:
new_with = existing_with.copy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to copy() here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we need a copy given that we're modifying it below? new_with.set("recursive", True), etc? I'm still slightly unclear when a copy is needed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take another close look, I may have misread something. In general, the generator copies for you, so mutations don't reflect in the original AST being generated. If, though, a sub-AST is used in >1 places, then you likely need to copy() to avoid mutation issues internally, within the generating pass. If this is the case, then you can disregard the comments related to copy(); in any case, worth double-checking.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I reviewed the other copy() calls and removed them, this is the one I wasn't sure about.

Comment thread sqlglot/generators/duckdb.py Outdated
dict.fromkeys(
col.name
for col in (existing_where.find_all(exp.Column) if existing_where else [])
if not col.table and col.name not in select_names

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we checking whether table is set? What happens when you have a query like this one?

SELECT a FROM src t WHERE t.b = 1 START WITH t.c = 2 CONNECT BY t.d = PRIOR t.e

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update to factor in aliases

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +809 to +811
for arg in ("order", "limit", "offset"):
if val := expression.args.get(arg):
outer.set(arg, val.copy())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, re: query modifiers. This needs some attention: we should ensure this list is complete.

@fivetran-kwoodbeck fivetran-kwoodbeck changed the title feat(duckdb): Add transpilation support for Snowflake STARTS WITH feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY Jun 8, 2026
@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 9c55e84 to bb9744a Compare June 9, 2026 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants