Add ExistenceJoin support to Comet native execution #3881

@Shekharrajak

Description

What is the problem the feature request solves?

Comet does not support ExistenceJoin, causing incorrect results for correlated IN subqueries combined with OR on Spark 4.0. Adding native ExistenceJoin support would allow Comet to handle these plans end-to-end and eliminate the mixed Spark/Comet execution that produces wrong results.
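
For readers unfamiliar with the join type, here is a minimal plain-Python sketch of ExistenceJoin semantics as Spark defines them: every left row is emitted exactly once, extended with a boolean "exists" column, and a later filter can then combine several such flags with OR. The helper `existence_join` and the sample rows are hypothetical and for illustration only; this is not Comet's or Spark's implementation.

```python
# Semantic model only: a hypothetical helper, not Spark or Comet code.
def existence_join(left, right, condition):
    """Emit each left row once, with an extra boolean: does any right row match?"""
    return [l + (any(condition(l, r) for r in right),) for l in left]

left = [(1,), (2,), (3,)]
right_a = [(1,)]
right_b = [(3,), (3,)]  # duplicate matches must not duplicate the left row

# Two existence joins, one per IN branch; each appends one flag column.
step1 = existence_join(left, right_a, lambda l, r: l[0] == r[0])
step2 = existence_join(step1, right_b, lambda l, r: l[0] == r[0])

# The OR is applied afterwards as an ordinary filter over the flags.
kept = [(key,) for (key, exists_a, exists_b) in step2 if exists_a or exists_b]
print(step2)
print(kept)
```

A plain semi join would discard non-matching left rows immediately, which is why it cannot express `a IN (...) OR b IN (...)`: a row that fails the first branch must still survive to be tested against the second.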

The following query returns incorrect results when Comet is enabled on Spark 4.0:

CREATE TEMPORARY VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE TEMPORARY VIEW t2(c1, c2) AS VALUES (0, 2), (0, 3);

SELECT * FROM t1 WHERE
  c1 IN (SELECT count(*) + 1 FROM t2 WHERE t2.c1 = t1.c1) OR
  c2 IN (SELECT count(*) - 1 FROM t2 WHERE t2.c1 = t1.c1);

Expected: (0, 1) and (1, 2) (both rows match via different OR branches)
Actual with Comet: (1, 2) only (row (0, 1) is dropped)

The same query with a single IN (no OR) produces correct results.
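
As a sanity check on the expected result, the query can be evaluated by hand in plain Python, with no Spark or Comet involved. The function names below are hypothetical; the data is taken from the two views above. The point is that `count(*)` over an empty correlated group is 0, not NULL.

```python
# Reference evaluation of the query semantics in plain Python.
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def correlated_count(outer_c1):
    """count(*) over t2 rows where t2.c1 = outer_c1; 0 when nothing matches."""
    return sum(1 for (c1, _c2) in t2 if c1 == outer_c1)

def matching_branches(row):
    """Return the OR branches of the WHERE clause that this t1 row satisfies."""
    c1, c2 = row
    n = correlated_count(c1)
    branches = []
    if c1 == n + 1:
        branches.append("c1 IN (count(*) + 1)")
    if c2 == n - 1:
        branches.append("c2 IN (count(*) - 1)")
    return branches

result = [row for row in t1 if matching_branches(row)]
for row in t1:
    print(row, matching_branches(row))
```

Row (0, 1) sees a count of 2 and matches only through the second branch (1 = 2 - 1). Row (1, 2) sees a count of 0 and matches only through the first branch (1 = 0 + 1). The row that Comet drops is therefore the one that depends on the second IN.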

Describe the potential solution

No response

Additional context

This only reproduces on Spark 4.0 because:

The in-count-bug.sql test only exists in Spark 4.0.
Spark 4.0 uses a new decorrelation path (decorrelateInnerQueryEnabledForExistsIn = true by default) that produces DomainJoin + ExistenceJoin plan structures.
Spark 3.5 uses the old decorrelation path, which has the "count bug" and therefore different expected behavior.
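
For context, the "count bug" is a well-known pitfall when a correlated `count(*)` subquery is rewritten as a grouped aggregate that is then joined back to the outer table: outer rows with no matching inner rows find no group, so they see NULL instead of 0. The sketch below illustrates that general pattern in plain Python on the data from this issue. It is an assumption-laden illustration, not Spark 3.5's actual plan or its expected output.

```python
# General illustration of the count bug; not Spark's decorrelation code.
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def correlated_count(outer_c1):
    """Correct semantics: count(*) per outer row, 0 when nothing matches."""
    return sum(1 for (c1, _c2) in t2 if c1 == outer_c1)

# Naive rewrite: aggregate t2 once, grouped by the correlation key...
grouped = {}
for (c1, _c2) in t2:
    grouped[c1] = grouped.get(c1, 0) + 1

# ...then look the count up per outer row. Keys without a group yield None,
# which stands in for SQL NULL here.
correct = {c1: correlated_count(c1) for (c1, _c2) in t1}
naive = {c1: grouped.get(c1) for (c1, _c2) in t1}

print(correct)
print(naive)
```

For `t1.c1 = 1` the correct count is 0, while the naive rewrite yields no value at all. Any predicate built on `count(*) + 1` or `count(*) - 1` then evaluates differently for that row.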
