-
Notifications
You must be signed in to change notification settings - Fork 297
Add ExistenceJoin support to Comet native execution #3881
Description
What is the problem the feature request solves?
Comet does not support ExistenceJoin, causing incorrect results for correlated IN subqueries combined with OR on Spark 4.0. Adding native ExistenceJoin support would allow Comet to handle these plans end-to-end and eliminate the mixed Spark/Comet execution that produces wrong results.
The following query returns incorrect results when Comet is enabled on Spark 4.0:
CREATE TEMPORARY VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE TEMPORARY VIEW t2(c1, c2) AS VALUES (0, 2), (0, 3);
SELECT * FROM t1 WHERE
c1 IN (SELECT count(*) + 1 FROM t2 WHERE t2.c1 = t1.c1) OR
c2 IN (SELECT count(*) - 1 FROM t2 WHERE t2.c1 = t1.c1);
Expected: (0, 1) and (1, 2) (both rows match via different OR branches) Actual with Comet: (1, 2) only (row (0, 1) is dropped)
The same query with a single IN (no OR) produces correct results.
Describe the potential solution
No response
Additional context
This only reproduces on Spark 4.0 because:
in-count-bug.sql test only exists in Spark 4.0
Spark 4.0 uses a new decorrelation path (decorrelateInnerQueryEnabledForExistsIn = true by default) that produces DomainJoin + ExistenceJoin plan structures
Spark 3.5 uses the old decorrelation which has the "count bug" (different expected behavior)