Skip to content

[SPARK-57437][SQL] Infer additional constraints by substituting attribute-to-literal bindings#56499

Open
xumingming wants to merge 6 commits into
apache:masterfrom
xumingming:catalyst-infer-constraints-from-literal-bindings
Open

[SPARK-57437][SQL] Infer additional constraints by substituting attribute-to-literal bindings#56499
xumingming wants to merge 6 commits into
apache:masterfrom
xumingming:catalyst-infer-constraints-from-literal-bindings

Conversation

@xumingming

Copy link
Copy Markdown

What changes were proposed in this pull request?

When a predicate binds an attribute to a literal (e.g. a.pt = '20260610') and another predicate references that attribute (e.g. b.pt >= f(a.pt)), Catalyst previously did not exploit the literal binding to derive a simpler, pushable predicate.

   SELECT *
   FROM a
   LEFT JOIN b
     ON a.key = b.key
    AND b.pt >= f(a.pt)
   WHERE a.pt = '20260610';

Here table b is full scaned, thus very bad performance.

This change extends ConstraintHelper.inferAdditionalConstraints with a second pass that:

  1. Collects Attribute = Literal bindings from the constraint set.
  2. Substitutes the literal into non-equality predicates that reference those attributes.
  3. Adds the resulting deterministic expressions as new inferred constraints.

After constant folding, the inferred predicates can be pushed into scans as partition filters, avoiding full-table scans in cases where only a small subset of partitions can match.

Why are the changes needed?

Currently query of the following pattern causes full table scan for table b:

   SELECT *
   FROM a
   LEFT JOIN b
     ON a.key = b.key
    AND b.pt >= f(a.pt)
   WHERE a.pt = '20260610';

With this optimization table b can get very good partition pruning.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

…bute-to-literal bindings

When a predicate binds an attribute to a literal (e.g. a.pt = '20260610') and another predicate references that attribute (e.g. b.pt >= f(a.pt)), Catalyst previously did not exploit the literal binding to derive a simpler, pushable predicate.

This change extends ConstraintHelper.inferAdditionalConstraints with a second pass that:

  1. Collects Attribute = Literal bindings from the constraint set.

  2. Substitutes the literal into non-equality predicates that reference those attributes.

  3. Adds the resulting deterministic expressions as new inferred constraints.

After constant folding, the inferred predicates can be pushed into scans as partition filters, avoiding full-table scans in cases where only a small subset of partitions can match.
@xumingming

Copy link
Copy Markdown
Author

@cloud-fan Can you help taking a look at this PR?

// When a.pt = '20260610' and b.pt >= f(a.pt) are both in the constraint set,
// substituting yields b.pt >= f('20260610'), which ConstantFolding then reduces to a
// literal comparison that can be pushed into the right-side scan as a partition filter.
val attrToLiteral: Map[Attribute, Literal] = predicates.collect {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse the existing machinery, instead of a hand-rolled second pass?

The file already has replaceConstraints (which substitutes via semanticEquals). The natural implementation would be to add cases to the existing pattern match, e.g.

case eq @ EqualTo(l: Attribute, r: Literal) =>
  inferredConstraints ++= replaceConstraints(predicates - eq, l, r)
// (as well as the literal-on-left mirror)

rather than building a parallel attrToLiteral map with manual transform.

This would be simpler, remove the extra `if (attrToLiteral.nonEmpty), and avoid an equality-semantics inconsistency.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 0 non-blocking, 2 nits.
Sound, useful optimization, but it reintroduces a collation-substitution hazard that the peer ConstantPropagation rule already guards against.

Correctness (1)

  • QueryPlanConstraints.scala:95: attr-to-literal substitution has no collation guard, can infer a false constraint for collated string columns and drop valid rows — see inline

Nits: 2 minor items — stale method Scaladoc (QueryPlanConstraints.scala:58-62, doesn't mention the new inequality-substitution inference) and a vacuous-pass guard in the new test (see inline).

Verification

Traced soundness of P(a) -> P[a := lit] given an a = lit constraint: equivalent for the NULL case (a = lit true implies a non-null), type-safe (the EqualTo(attr, Literal) pattern matches only on identical types), and non-determinism is correctly gated by pred.deterministic. The one dimension that is not equivalent and not gated is string collation: under a non-binary-stable collation a = lit does not imply binary(a) = binary(lit), so substituting the literal inside a re-collating comparison changes the result. Idempotence holds — the pass never synthesizes a new attr = literal binding, so on re-entry the unfolded substituted form re-matches an existing constraint and is subtracted.

case _: EqualTo => // already handled above
case pred if pred.deterministic && pred.references.exists(attrToLiteral.contains) =>
inferredConstraints += pred.transform {
case a: Attribute if attrToLiteral.contains(a) => attrToLiteral(a)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This substitution has no collation guard, so it can produce wrong results for collated string columns: a = lit does not imply binary(a) = binary(lit) under a non-binary-stable (case/accent-insensitive) collation, so replacing a inside a comparison that re-collates it changes the result.

Concretely, with a.c STRING COLLATE UTF8_LCASE, WHERE a.c = 'hello', and a join condition b.c COLLATE UTF8_BINARY >= a.c COLLATE UTF8_BINARY, this infers the pushed filter b.c COLLATE UTF8_BINARY >= 'hello' — which drops b.c = 'HZ' even though it legitimately joins with a.c = 'HELLO' ('HZ' >= 'HELLO' is true in binary, but 'HZ' >= 'hello' is false).

ConstantPropagation guards exactly this with isBinaryStable(a.dataType) (expressions.scala:233). Suggest skipping the substitution unless isBinaryStable(attr.dataType), and adding a collated-column test (it should fail today and pass once guarded).


// The optimized right side must carry the inferred range filters.
optimized.foreach {
case Join(_, rightChild, `joinType`, _, _) =>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If optimized contains no Join node, the case _ => swallows every node and this block asserts nothing, so the test passes vacuously. Consider asserting that a Join was actually matched, so a future change that restructures the plan can't silently neuter the test.

@xumingming

Copy link
Copy Markdown
Author

@cloud-fan @uros-b Thanks for the review. The following changes are made:

  • Added isBinaryStable(a.dataType) guard so we only do the substitution for binary stable strings.
  • Optimized the test to check the Join is indeed there.
  • Removed the second pass, reused the replaceConstraints function to do that.

But there are some issues pop up, do you have suggestions how to proceed with these two issues?.

The first one is that one of the test in InferFiltersFromConstraintsSuite.scala now generates predicate like the following:

testRelation2.where(IsNotNull($"b") && ($"b" === 1L) &&
          ($"b".attr.cast(IntegerType) === 1)).subquery("right")

There are two similar predicates for the b == 1.

The second one is the TPCDS q78 plan changes(which caused the CI to fail)

Before:

Scan parquet ... store_sales ...
  SubqueryBroadcast [d_date_sk] #1   <-- producer
Scan parquet ... web_sales ...
  ReusedSubquery [d_date_sk] #1      <-- consumer of #1
Scan parquet ... catalog_sales ...
  ReusedSubquery [d_date_sk] #1      <-- consumer of #1

After:

Scan parquet ... store_sales ...
  SubqueryBroadcast [d_date_sk] #1   <-- producer, unchanged
Scan parquet ... web_sales ...
  SubqueryBroadcast [d_date_sk] #2   <-- NEW separate producer
Scan parquet ... catalog_sales ...
  ReusedSubquery [d_date_sk] #2      <-- consumer of #2

Q78 plan-stability tests fail because the execution plan now builds two date_dim dynamic-pruning subqueries instead of one. store_sales keeps the original EqualTo(d_year, 2000) subquery, but web_sales now creates a duplicate with EqualNullSafe(d_year, 2000). Subquery-reuse detection treats them as different, so catalog_sales reuses the new web_sales subquery and the original broadcast is wasted.

Where is the EqualNullSafe(d_year, 2000) from? For the outer query’s ss LEFT JOIN ws ON ws_sold_year = ss_sold_year ..., the optimizer generates an EqualNullSafe(ws_sold_year, ss_sold_year) constraint (needed for LEFT JOIN null semantics).

The new literal-substitution inference in QueryPlanConstraints.inferAdditionalConstraints substitutes ss_sold_year → 2000 into that EqualNullSafe, producing:

EqualNullSafe(ws_sold_year, 2000)

That inferred predicate is then pushed down through the ws CTE and into its date_dim dynamic-pruning subquery, where it becomes EqualNullSafe(d_year, 2000).

The original code never substituted literals into EqualNullSafe, so it never created this predicate. It only inferred EqualTo(ws_sold_year, 2000) from the EqualTo(ws_sold_year, ss_sold_year) constraint.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 addressed, 0 remaining, 2 new.
Collation guard, machinery reuse, Scaladoc, and the vacuous-test guard from last round are all addressed, thanks. The one remaining blocker is structural, and it's the common root of the two issues you raised.

Design / architecture (1)

  • QueryPlanConstraints.scala:75: the generic literal-substitution shape (folding the literal into every remaining predicate) is what keeps generating the trouble — collation (now patched), the EqualNullSafe(attr, literal) that breaks subquery-reuse and regresses q78, and the redundant cast(b)=1 duplicate. Suggest reshaping as a focused inequality-bound inference; see inline (minimal EqualNullSafe-exclusion fix also noted there to unblock CI).

Nits: 1 minor item (see inline comments).

Verification

Re-confirmed P(a) -> P[a := lit] is row-equivalent on NULL/3VL (an always-true a = lit implies a non-null and = lit), type (EqualTo(attr, Literal) matches identical types only), non-determinism (dropped by the plan-level deterministic filter), and collation (now gated by isBinaryStable — guarding the bound attribute is sufficient). The one dimension that is row-equivalent but plan-degrading is EqualNullSafe: substituting into the LEFT JOIN's EqualNullSafe(ws, ss) produces a <=>-form predicate that subquery-reuse treats as distinct from the EqualTo form, so a duplicate dynamic-pruning subquery is built (rows unchanged, so checkAnswer is blind to it; only the plan regresses). Both the targeted redesign and the minimal exclusion eliminate it.

val candidateConstraints = predicates - eq - EqualNullSafe(l, r)
inferredConstraints ++= replaceConstraints(candidateConstraints, l, r)
inferredConstraints ++= replaceConstraints(candidateConstraints, r, l)
case eq @ EqualTo(l: Attribute, r: Literal) if isBinaryStable(l.dataType) =>

@cloud-fan cloud-fan Jun 21, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two new EqualTo(attr, Literal) cases fold the bound literal into every remaining predicate via replaceConstraints(predicates - eq, l, r). That generality — not the optimization itself — is the common source of the three problems here: collation (now patched with isBinaryStable), the EqualNullSafe(attr, literal) that breaks subquery-reuse and regresses q78, and the redundant cast(b)=1 duplicate (test nit).

The root issue is that two different inferences are being squeezed into the same equality pattern-match. Equality transitivity (the existing EqualTo(attr, attr) case) and folding a constant into a range predicate are distinct concerns and want distinct code. I'd pull the literal-into-inequality work into its own pass rather than adding two cases here — the same way constructIsNotNullConstraints is a separate method invoked alongside inferAdditionalConstraints:

  1. Drop the two EqualTo(l: Attribute, r: Literal) / EqualTo(l: Literal, r: Attribute) cases from the match.
  2. Add a dedicated inferRangeConstraints(predicates) (invoked after the equality loop and unioned into inferredConstraints like the others), which:
    • collects the Attribute -> Literal bindings into a map first, keeping the isBinaryStable guard per bound attribute;
    • walks only the inequality predicates (<, <=, >, >=), and where one side is a pinned attribute — directly or under a numeric Add/Subtract offset — substitutes the literal and emits the folded bound (b.v >= a.k with a.k = 5 => b.v >= 5; b.v <= a.k + 10 => b.v <= 15);
    • keeps a derived bound only when it is actually new/tighter than an existing bound on that attribute (so re-runs don't churn).

Because that pass never rewrites EqualTo/EqualNullSafe, the q78 duplicate-subquery regression and the cast(b)=1 duplicates cannot arise by construction; the collation guard attaches naturally to the bounded attribute; and equality propagation stays where it belongs, in the existing transitivity case. As a bonus this keeps the Once-batch idempotence reasoning local to one place — a range bound is only ever emitted when strictly tighter, so the pass is self-idempotent without needing the EqualNullSafe carve-out the equality case relies on.

If you'd rather keep the current shape for this round, the minimum to unblock q78 is to exclude EqualNullSafe from the targets in both cases:

case eq @ EqualTo(l: Attribute, r: Literal) if isBinaryStable(l.dataType) =>
  inferredConstraints ++=
    replaceConstraints((predicates - eq).filterNot(_.isInstanceOf[EqualNullSafe]), l, r)

That restores the original q78 plan with no golden-file churn (the old code inferred nothing from the LEFT JOIN's EqualNullSafe, since the attr-attr case only matched EqualTo), but leaves the duplicate-equality churn in place. Note the sibling EqualTo(attr, attr) case at lines 70-72 already excludes EqualNullSafe for the same idempotence reason.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation! Made corresponding changes.

right,
testRelation1.where(IsNotNull($"a") && ($"a" === 1)).subquery("left"),
testRelation2.where(IsNotNull($"b") && ($"b" === 1L) &&
($"b".attr.cast(IntegerType) === 1)).subquery("right"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expected plan now carries both ($"b" === 1L) and ($"b".attr.cast(IntegerType) === 1) — two encodings of b = 1 — and the suite's optimizer batch gained ConstantFolding + UnwrapCastInBinaryComparison (lines 41/43) whose comment says they "collapse and deduplicate" these, yet both still appear, so the dedup is incomplete. This is the same over-eager-substitution smell as the design comment above; once the inference is reshaped (or at least narrowed), these batch additions should no longer be needed and the golden expected plan shouldn't bake in both predicate forms.

…bute-to-literal bindings

Reshape the literal-binding substitution into a dedicated
inferConstraintsFromLiteralBindings method to avoid substituting into
EqualTo/EqualNullSafe targets, which caused a duplicate dynamic-pruning
subquery regression and redundant cast-literal predicates in the plan.
Restore the original Scaladoc on inferAdditionalConstraints and add
SimplifyBinaryComparison/BooleanSimplification back to the test
optimizer batch to match the pre-PR baseline.
…onstraintsFromLiteralBindings

The EqualTo exclusion in inferConstraintsFromLiteralBindings was too
broad, blocking substitution into attr=expr forms like b = a + 1. Narrow
it to only attr=attr and cast-equality cases that are already handled by
inferAdditionalConstraints transitivity. Also fix three tests that were
calling inferAdditionalConstraints instead of inferConstraintsFromLiteralBindings.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 addressed, 0 remaining, 2 new.
The redesign cleanly resolves last round's blocking design finding and the nit — a separate inferConstraintsFromLiteralBindings pass with EqualTo/EqualNullSafe/IsNotNull excluded, q78 restored with no golden-file churn, and the redundant-predicate batch changes reverted. Thanks for the rework. One new blocker, test-only.

Correctness (1)

  • ConstraintPropagationSuite.scala:534: the collation regression test calls inferAdditionalConstraints, but literal substitution moved to inferConstraintsFromLiteralBindings — so it returns empty and the test passes vacuously, no longer guarding the isBinaryStable collation fix. One-word fix — see inline.

Nits: 1 minor item (see inline comments).

Verification

Re-verified the redesign: inferConstraintsFromLiteralBindings is row-equivalent on NULL/3VL, type, non-determinism (plan-level deterministic filter), and collation (gated by isBinaryStable on the binding). The EqualNullSafe and equality-form targets are excluded, which restores the q78 plan and removes the duplicate predicates. The reference-less lit = lit produced by substituting into the binding itself is dropped by the references.nonEmpty filter in both inferNewFilter and the constraints lazy val, so it never reaches a filter.

GreaterThanOrEqual(b, a)))

val helper = new ConstraintHelper {}
val inferred = helper.inferAdditionalConstraints(constraints)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This collation regression test is now vacuous. Literal substitution moved out of inferAdditionalConstraints into inferConstraintsFromLiteralBindings, so inferAdditionalConstraints no longer has any EqualTo(attr, Literal) case — for {EqualTo(a, 'hello'), GreaterThanOrEqual(b, a)} it matches nothing and returns empty. The assertion !inferred.exists { b >= 'hello' } therefore passes trivially, and would still pass even if the isBinaryStable guard were deleted from inferConstraintsFromLiteralBindings — so it no longer protects the collation fix it's named for.

The three sibling tests (the range / literal-on-left / EqualTo-expression cases) correctly call inferConstraintsFromLiteralBindings; this one looks like it was missed in the rename. The production guard itself is correct, so this is a test-only fix:

Suggested change
val inferred = helper.inferAdditionalConstraints(constraints)
val inferred = helper.inferConstraintsFromLiteralBindings(constraints)


test("infer range constraints by substituting attr=literal bindings") {
// When a.pt = '20260610' (modeled as a.k = 5) and b.v >= a.k are both present,
// inferAdditionalConstraints should emit b.v >= 5.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale method name in the comment — this test (and the one at line 509) describe inferAdditionalConstraints, but the substitution they exercise is now done by inferConstraintsFromLiteralBindings (this test body even calls it). Worth updating both comments to name the new method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants