Fix volatile scalar subquery deduplication#22736

Open
kumarUjjawal wants to merge 4 commits into
apache:mainfrom
kumarUjjawal:fix/scalar-subquery-volatile-dedup

Conversation

@kumarUjjawal
Contributor

Which issue does this PR close?

Rationale for this change

Identical uncorrelated scalar subqueries are deduplicated, but this is not correct when the subquery contains volatile expressions such as random().

For example, (SELECT random()) - (SELECT random()) must evaluate both scalar subquery occurrences independently.
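
A minimal sketch of why the deduplication is wrong here, using a toy model rather than DataFusion itself. The helper `next_value` is a hypothetical, deterministic stand-in for a volatile function such as `random()`; none of these names come from the PR.

```rust
/// Hypothetical stand-in for a volatile function such as random():
/// every call returns a different value.
fn next_value(state: &mut i64) -> i64 {
    *state += 1;
    *state
}

/// Models the buggy behavior: both subquery occurrences share one
/// evaluation, so the difference is always 0.
fn diff_deduplicated(state: &mut i64) -> i64 {
    let shared = next_value(state);
    shared - shared
}

/// Models the intended behavior: each occurrence is evaluated on its own.
fn diff_independent(state: &mut i64) -> i64 {
    let left = next_value(state);
    let right = next_value(state);
    left - right
}
```

With a real random source, the deduplicated form would always return exactly 0, which is the observable symptom of the bug.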

What changes are included in this PR?

  • Detect volatility inside subquery plans.
  • Avoid CSE deduplication for volatile scalar subqueries.
  • Keep volatile scalar subquery occurrences distinct in physical planning.
  • Preserve deduplication for non-volatile scalar subqueries.
  • Add regression coverage for same-node, cross-node, nested, and shared-expression cases.
  • Update upgrade notes for the public ExecutionProps field type.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. Queries with repeated volatile scalar subqueries now evaluate each occurrence independently instead of incorrectly reusing one result.

There is also a public API type change for ExecutionProps::subquery_indexes, documented in the upgrade notes.

@github-actions github-actions Bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 3, 2026
@kumarUjjawal
Contributor Author

@neilconway would you be available to review this PR?

Comment thread datafusion/core/src/physical_planner.rs Outdated
Comment on lines +487 to +499
let mut all_subqueries = Self::collect_scalar_subqueries(logical_plan);
let freshened = if all_subqueries.iter().any(Subquery::is_volatile) {
    Some(Self::freshen_volatile_subqueries(logical_plan)?)
} else {
    None
};
let logical_plan = match &freshened {
    Some(freshened) => {
        all_subqueries = Self::collect_scalar_subqueries(freshened);
        freshened
    }
    None => logical_plan,
};
Contributor

Calling collect_scalar_subqueries twice is unfortunate, as is the need to traverse the whole tree yet again to freshen the volatile SQs. I'm also not crazy about distinguishing between two volatile-containing subqueries based purely on the value of an Arc pointer.

Stepping back a bit, whether an SQ is volatile or not is a local property of the SQ, so we should be able to determine how a subquery should be handled as part of a single tree traversal. What if we did something like:

  • Do a single pass over the plan
  • When we encounter a subquery, determine if it is volatile or not
  • If it's volatile, give it a fresh index. If non-volatile, look it up (structural equality) and assign/reuse an index as appropriate
  • Write the index back into the logical plan node

That means that two textually equal SQs containing volatile expressions will compare non-equal via structural equality, so I think we'll get the semantics we want? It's a bit gross to write a field back into the logical plan but I can't think of something cleaner at the moment.
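
The single-pass idea above can be sketched roughly as follows. This is a toy model, not DataFusion code: `ToySubquery` and `assign_indexes` are hypothetical names, the plan is flattened into a slice of occurrences, and the step that writes the index back into the plan node is omitted (the indexes are simply returned).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a scalar subquery: normalized text plus a
/// volatility flag. The real `Subquery` wraps a logical plan instead.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ToySubquery {
    sql: String,
    volatile: bool,
}

/// Assigns an index to every subquery occurrence in one pass.
/// Volatile subqueries always get a fresh index; non-volatile ones are
/// looked up by structural equality and reuse an existing index if any.
fn assign_indexes(occurrences: &[ToySubquery]) -> Vec<usize> {
    let mut seen: HashMap<&ToySubquery, usize> = HashMap::new();
    let mut next = 0usize;
    let mut out = Vec::with_capacity(occurrences.len());
    for sq in occurrences {
        // Only non-volatile subqueries are eligible for reuse.
        let reused = if sq.volatile { None } else { seen.get(sq).copied() };
        let idx = match reused {
            Some(i) => i,
            None => {
                let i = next;
                next += 1;
                if !sq.volatile {
                    seen.insert(sq, i);
                }
                i
            }
        };
        out.push(idx);
    }
    out
}
```

For the occurrences `[random, max, random, max]`, where only `random` is volatile, this yields `[0, 1, 2, 1]`: the two volatile occurrences stay distinct while the non-volatile one is shared.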

Contributor Author

Thank you @neilconway for the great feedback, as always. I have tried to address the concerns; please take a look.

Comment on lines 2153 to 2167
     pub fn is_volatile(&self) -> bool {
-        self.exists(|expr| Ok(expr.is_volatile_node()))
-            .expect("exists closure is infallible")
+        self.exists(|expr| {
+            let subquery_is_volatile = match expr {
+                Expr::ScalarSubquery(subquery)
+                | Expr::Exists(Exists { subquery, .. })
+                | Expr::InSubquery(InSubquery { subquery, .. })
+                | Expr::SetComparison(SetComparison { subquery, .. }) => {
+                    subquery.is_volatile()
+                }
+                _ => false,
+            };
+            Ok(expr.is_volatile_node() || subquery_is_volatile)
+        })
+        .expect("exists closure is infallible")
     }
Contributor

$ ag is_volatile | wc -l
      77

We should be careful about changing the semantics of a widely used API like this. It's not clear to me that the change is wrong (or right), but it has both behavioral and performance consequences. We should look at all of those existing call-sites and understand whether recursing into subqueries is the right behavior or not. And if we are going to change this, we should ensure it has unit test coverage.
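
To make the behavioral difference concrete, here is a small sketch on a toy expression type. `ToyExpr` and both functions are hypothetical; they only model the distinction between a check that stops at subquery boundaries and one that recurses into them, as the diff above appears to do.

```rust
/// Hypothetical, simplified expression tree.
#[derive(Debug)]
enum ToyExpr {
    Column(String),
    Call { volatile: bool, args: Vec<ToyExpr> },
    /// The expressions found inside the subquery's plan.
    ScalarSubquery(Vec<ToyExpr>),
}

/// Old-style check: a subquery is treated as opaque.
fn is_volatile_shallow(e: &ToyExpr) -> bool {
    match e {
        ToyExpr::Column(_) => false,
        ToyExpr::Call { volatile, args } => {
            *volatile || args.iter().any(is_volatile_shallow)
        }
        ToyExpr::ScalarSubquery(_) => false,
    }
}

/// New-style check: also looks inside the subquery.
fn is_volatile_deep(e: &ToyExpr) -> bool {
    match e {
        ToyExpr::Column(_) => false,
        ToyExpr::Call { volatile, args } => {
            *volatile || args.iter().any(is_volatile_deep)
        }
        ToyExpr::ScalarSubquery(inner) => inner.iter().any(is_volatile_deep),
    }
}
```

An expression such as `(SELECT random())` is non-volatile under the shallow check and volatile under the deep one, so every existing caller that relies on the result can change behavior. The deep check also costs more, since it walks the subquery on every call.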

@github-actions github-actions Bot added sql SQL Planner substrait Changes to the substrait crate proto Related to proto crate labels Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  96.044s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.035s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  95.638s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.639s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 193.907s] datafusion
    Building datafusion-expr v53.1.0 (current)
       Built [  26.519s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.072s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  26.394s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.076s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.157s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field Subquery.scalar_subquery_index in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:4368
  field Subquery.scalar_subquery_index in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:4368

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  55.290s] datafusion-expr
    Building datafusion-optimizer v53.1.0 (current)
       Built [  27.065s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  26.572s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.172s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  54.735s] datafusion-optimizer
    Building datafusion-physical-expr v53.1.0 (current)
       Built [  28.670s] (current)
     Parsing datafusion-physical-expr v53.1.0 (current)
      Parsed [   0.048s] (current)
    Building datafusion-physical-expr v53.1.0 (baseline)
       Built [  28.535s] (baseline)
     Parsing datafusion-physical-expr v53.1.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-physical-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.384s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  58.611s] datafusion-physical-expr
    Building datafusion-proto v53.1.0 (current)
       Built [  57.757s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  57.451s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.294s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 116.807s] datafusion-proto
    Building datafusion-sql v53.1.0 (current)
       Built [  41.381s] (current)
     Parsing datafusion-sql v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion-sql v53.1.0 (baseline)
       Built [  41.221s] (baseline)
     Parsing datafusion-sql v53.1.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-sql v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.252s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  84.579s] datafusion-sql
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 164.018s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 163.640s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.103s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 330.756s] datafusion-sqllogictest
    Building datafusion-substrait v53.1.0 (current)
       Built [ 343.113s] (current)
     Parsing datafusion-substrait v53.1.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-substrait v53.1.0 (baseline)
       Built [ 346.018s] (baseline)
     Parsing datafusion-substrait v53.1.0 (baseline)
      Parsed [   0.018s] (baseline)
    Checking datafusion-substrait v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.214s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 691.526s] datafusion-substrait
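
For readers unfamiliar with the `constructible_struct_adds_field` lint, the sketch below shows the general Rust pattern it guards against. `ToySubqueryNode` is a hypothetical struct, and the `Option<usize>` type for `scalar_subquery_index` is an assumption; the log above only gives the field name.

```rust
/// Hypothetical public struct whose fields are all public, so downstream
/// crates can build it with a struct literal.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToySubqueryNode {
    pub outer_ref_columns: Vec<String>,
    // Newly added public field (type assumed for illustration). A downstream
    // literal `ToySubqueryNode { outer_ref_columns: vec![] }` no longer
    // compiles once this field exists.
    pub scalar_subquery_index: Option<usize>,
}

/// Downstream code that fills in remaining fields from `Default` keeps
/// compiling when new fields are added.
pub fn build() -> ToySubqueryNode {
    ToySubqueryNode {
        outer_ref_columns: vec!["t.a".to_string()],
        ..Default::default()
    }
}
```

Whether the real struct can offer a `Default` implementation or a constructor is a separate design question. Either way, callers that use a plain struct literal need an entry in the upgrade notes.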

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 4, 2026
}

/// Returns `true` if any expression in `plan` is volatile.
fn plan_contains_volatile(plan: &LogicalPlan) -> bool {
Contributor

@nathanb9 nathanb9 Jun 4, 2026

Small point, but is it possible to make this public?
It would be useful for detecting volatile functions when deciding whether to materialize subplans (#22676).

Contributor Author

Sounds good!

@kumarUjjawal kumarUjjawal requested a review from neilconway June 5, 2026 03:35

Labels

auto detected api change Auto detected API change core Core DataFusion crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

Development

Successfully merging this pull request may close these issues.

Deduplicate uncorrelated scalar subqueries

3 participants