feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709
feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709fivetran-kwoodbeck wants to merge 4 commits into
Conversation
SQLGlot Integration Test Results✅ All tests passedComparing:
Overallmain: 192441 total, 153536 passed (pass rate: 79.8%) sqlglot:transpile/snowflake-startswith: 106860 total, 106860 passed (pass rate: 100.0%) Transitions: Dialect pair changes: 0 previous results not found, 2 current results not found ✅ All tests passed |
90cc2e1 to
9c55e84
Compare
| if connect.args.get("nocycle") or not isinstance(connect.args.get("connect"), exp.EQ): | ||
| return expression |
There was a problem hiding this comment.
Does Snowflake only allow = comparisons in PRIOR join conditions? And hence, is this why we're only checking for EQ here?
There was a problem hiding this comment.
Snowflake does allows non-equality PRIOR predicates, but we're bailing out here to prevent hanged queries. Snowflake has cycle detection whereas DuckDB recursive CTEs don't, so the CTE could be non-terminating and I don't (yet) see a way around it.
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 5), (2, 5);
Snowflake query:
SELECT a, b FROM t
START WITH a = 1
CONNECT BY PRIOR b > a
The naive DuckDB recursive CTE translation:
WITH RECURSIVE cte AS (
SELECT a, b FROM t WHERE a = 1 -- anchor: {(1, 5)}
UNION ALL
SELECT t.a, t.b FROM t
JOIN cte ON t.a < cte.b -- child.a < parent.b
)
SELECT * FROM cte
Trace:
- Level 0: {(1, 5)}
- Level 1: rows where a < 5 → {(1, 5), (2, 5)}
- Level 2: from (1,5) → {(1,5),(2,5)}; from (2,5) → {(1,5),(2,5)}
- Level 3: same thing, four more rows
- …never terminates
Snowflake terminates because CONNECT BY has cycle detection. Once a row has been visited, it won't be revisited. DuckDB's UNION ALL would mean the same rows keep being re-added forever.
Do you see any other options here?
There was a problem hiding this comment.
I don't think the EQ is special, cycles can arise for it as well depending on the data. For example, take (1,2),(2,1) with parent = PRIOR id. This is a simple equality predicate that passes the guard, but the recursive CTE we emit will keep producing one new row per iteration and never terminate:
memory D WITH RECURSIVE _rootcte AS (
SELECT id, 1 AS level FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1
UNION ALL
SELECT _child_row.id, _parent_row.level + 1
FROM (VALUES (1,2),(2,1)) _child_row(id,parent)
JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id
WHERE _parent_row.level < 8 -- added this manually as an arbitrary cutoff, only to stop the infinite recursion
) SELECT level, id FROM _rootcte ORDER BY level, id;
┌───────┬───────┐
│ level │ id │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 1 │
│ 4 │ 2 │
│ 5 │ 1 │
│ 6 │ 2 │
│ 7 │ 1 │
│ 8 │ 2 │
└───────┴───────┘
Also, regarding:
Once a row has been visited, it won't be revisited
Snowflake doesn't appear to maintain a global visited set. CONNECT BY tracks the path (a row being its own ancestor), so it expands per-path. The inequality example you shared didn't terminate quickly for me either (it ran several minutes on the 2-row input before I cancelled).
You can confirm what Snowflake ultimately does with an actual cycle here:
WITH t(id, parent) AS (SELECT * FROM VALUES (1,2),(2,1))
SELECT id FROM t START WITH id = 1 CONNECT BY parent = PRIOR id;Do you see any other options here?
What about carrying the ancestor path and exclude rows already on it?
memory D WITH RECURSIVE _rootcte AS (
SELECT id, 1 AS level, [id] AS _path
FROM (VALUES (1,2),(2,1)) t(id,parent) WHERE id = 1
UNION ALL
SELECT _child_row.id, _parent_row.level + 1,
list_append(_parent_row._path, _child_row.id)
FROM (VALUES (1,2),(2,1)) _child_row(id,parent)
JOIN _rootcte _parent_row ON _child_row.parent = _parent_row.id
WHERE NOT list_contains(_parent_row._path, _child_row.id)
) SELECT level, id FROM _rootcte ORDER BY level, id;
┌───────┬───────┐
│ level │ id │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
└───────┴───────┘
This way you don't need to add a constraint on the predicates you accept.
There was a problem hiding this comment.
You're right, my recursive test was actually timing out (not completing). I tried your suggestion, it works well and wasn't too complicated.
| prior = priors[0] | ||
|
|
||
| # Pull everything we need from the original SELECT up front. | ||
| src = expression.args["from_"].this # type: ignore[union-attr] |
There was a problem hiding this comment.
Why do we need to ignore this typing error?
There was a problem hiding this comment.
Also: should we explicitly error out if joins is set? Does Snowflake support FROM, JOINS and START WITH ... CONNECT BY ... in a single query?
There was a problem hiding this comment.
Will fix the typing error. And yes, we should bail if JOINS is set, good catch.
|
|
||
| # WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them. | ||
| select_names = {e.alias_or_name for e in selects} | ||
| extra_cols = list( |
There was a problem hiding this comment.
Nit: do we need to materialize a list here again? We could keep the dict and iterate over its keys, for col in extra_cols works as-is and key order is still preserved.
I think this column collection is only consumed for iteration-purposes, not actually stored under an AST node, meaning a list is not strictly necessary.
| has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects) | ||
|
|
||
| # Anchor: original projections filtered by START WITH; seeds LEVEL at 1. | ||
| anchor_projs = [e.copy() for e in selects] + [exp.column(c) for c in extra_cols] |
There was a problem hiding this comment.
We do quite a few of copying in this transformation. Let's double-check whether all of the copy() calls are needed, dropping the redundant ones. Some builder methods already copy some of their arguments, for example.
| # Inject LEVEL as a synthetic depth counter unless the query already projects it. | ||
| has_level = any(e.alias_or_name.upper() == "LEVEL" for e in selects) |
There was a problem hiding this comment.
I'm not sure if this is right. A couple of examples:
- If you have a
SELECT *in Snowflake,LEVERis not included in the output schema, but I believe the transpiled DuckDB query will include it - If
LEVELappears in a Snowflake query, it's left as-is, resulting in a DuckDB binding error, since we don't produce a computed column for it in that case
I think that we should always produce a computed LEVEL column in the recursive CTE and only project it in the top-level query, when it's explicitly projected in the source query, too. Otherwise, we can do a SELECT * EXCLUDE (LEVEL) ... or something along those lines.
Let's add some tests for LEVEL usage. A few simple tests will expose these issues, if they exist.
There was a problem hiding this comment.
FYI, Snowflake appears to always give precedence to the LEVEL pseudo-column, despite the possible existence of a physical LEVEL column. In other words: the pseudo-column always shadows.
| existing_where = expression.args.get("where") | ||
| existing_with = expression.args.get("with_") | ||
|
|
||
| # WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them. |
There was a problem hiding this comment.
What if you have SELECT * FROM x WHERE c > 1 START WITH ...? Does the extra_cols logic result in duplicating c in the projection list? That may cause ambiguity issues if that's the case. Let's check this.
| # Recursive arm: qualify every column to _child_row (the new row) to disambiguate from _parent_row (the CTE/parent row). | ||
| # Identify which side of the equality is PRIOR (parent key) and which is the child FK. | ||
| parent_col = prior.this | ||
| child_col = connect_pred.expression if connect_pred.this is prior else connect_pred.this |
There was a problem hiding this comment.
What if the predicate contains arithmetic, e.g., PRIOR x = y - 1. Do we handle cases like this?
There was a problem hiding this comment.
will add support for that, and walk the expression to qualify nodes.
|
|
||
| # Attach the CTE, marking the WITH clause recursive. Preserve any existing CTEs. | ||
| if existing_with: | ||
| new_with = existing_with.copy() |
There was a problem hiding this comment.
Why do we need to copy() here?
There was a problem hiding this comment.
Wouldn't we need a copy given that we're modifying it below? new_with.set("recursive", True), etc? I'm still slightly unclear when a copy is needed.
There was a problem hiding this comment.
I'll take another close look, I may have misread something. In general, the generator copies for you, so mutations don't reflect in the original AST being generated. If, though, a sub-AST is used in >1 places, then you likely need to copy() to avoid mutation issues internally, within the generating pass. If this is the case, then you can disregard the comments related to copy(); in any case, worth double-checking.
There was a problem hiding this comment.
Thanks. I reviewed the other copy() calls and removed them, this is the one I wasn't sure about.
| dict.fromkeys( | ||
| col.name | ||
| for col in (existing_where.find_all(exp.Column) if existing_where else []) | ||
| if not col.table and col.name not in select_names |
There was a problem hiding this comment.
Why are we checking whether table is set? What happens when you have a query like this one?
SELECT a FROM src t WHERE t.b = 1 START WITH t.c = 2 CONNECT BY t.d = PRIOR t.eThere was a problem hiding this comment.
Will update to factor in aliases
| for arg in ("order", "limit", "offset"): | ||
| if val := expression.args.get(arg): | ||
| outer.set(arg, val.copy()) |
There was a problem hiding this comment.
Ditto, re: query modifiers. This needs some attention: we should ensure this list is complete.
9c55e84 to
bb9744a
Compare
Adds Snowflake to DuckDB transpilation for
START WITH ... CONNECT BY PRIORhierarchical queries.DuckDB rejects
CONNECT BYverbatim. This addsconnect_by_to_recursive_cteregistered as a preprocessor onexp.Select, which rewrites the query into aWITH RECURSIVE CTEat generation time:START WITHrows from the source table_child_rowjoined to theCTEaliased _parent_row on thePRIORpredicateSELECTre-projects from_rootcte, applyingWHEREpost-hierarchy and forwardingORDER BY, LIMIT, OFFSETFalls through unchanged for
NOCYCLE, multiplePRIORs, or non-equality predicates. CTE name collision is avoided viafind_new_name.WHEREcolumns absent fromSELECTare carried through the CTE so the outer filter can see them.LEVELis injected as a depth counter unless already projected.Tests in sqlglot-integration-tests cover: SELECT *, explicit column list with ORDER BY, PRIOR on left side, WHERE carry-through, existing CTE preservation, and name collision.