fix: Scale semi/anti-join column stats by estimated row count #22762

Open
neilconway wants to merge 2 commits into apache:main from neilconway:neilc/fix-semijoin-stats-cap
Conversation

Contributor

@neilconway neilconway commented Jun 4, 2026

Which issue does this PR close?

Rationale for this change

This PR makes several related improvements/fixes to the stats code for semi- and anti-joins:

  1. Scale per-column stats using the estimated output row count, rather than just reusing the stats from the preserved side of the join.
  2. Compute total_byte_size for semi/anti-join results by summing the per-column byte_size values, instead of always emitting Absent. We still emit Absent for other join types, and whenever any per-column byte_size is Absent.
  3. Pass in the join's NullEquality semantics, and use those for stats: under NullEqualsNothing, null join keys will never match (so we can return Exact(0)), whereas under NullEqualsNull we consider nulls just like any other value.
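
As a rough illustration of the three items above, here is a minimal, self-contained sketch. It is not the code in this PR: `Precision`, `NullEquality`, and `ColumnStats` are simplified, hypothetical stand-ins for the DataFusion types, the helper names are invented for this example, and proportional scaling with rounding up is an assumption about how "scaling by estimated row count" might look. The null-key shortcut is shown for the semi-join case only.

```rust
// Hypothetical, simplified stand-ins for DataFusion's statistics types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    fn value(self) -> Option<usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum NullEquality {
    NullEqualsNothing,
    NullEqualsNull,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct ColumnStats {
    null_count: Precision,
    byte_size: Precision,
}

/// Scale `value` by output_rows / input_rows, rounding up. The result is
/// always Inexact (or Absent) and never exceeds the original value.
fn scale(value: Precision, input_rows: usize, output_rows: usize) -> Precision {
    match value.value() {
        None => Precision::Absent,
        Some(_) if input_rows == 0 => Precision::Inexact(0),
        Some(v) => {
            let num = v as u128 * output_rows as u128;
            let den = input_rows as u128;
            let scaled = (num + den - 1) / den; // ceiling division
            Precision::Inexact(scaled.min(v as u128) as usize)
        }
    }
}

/// Item 1 and item 3: per-column stats for the preserved side of a semi-join.
fn semi_join_column_stats(
    preserved: &[ColumnStats],
    join_key_indices: &[usize],
    input_rows: usize,
    output_rows: usize,
    null_equality: NullEquality,
) -> Vec<ColumnStats> {
    preserved
        .iter()
        .enumerate()
        .map(|(idx, col)| {
            let is_key = join_key_indices.contains(&idx);
            let null_count = if is_key && null_equality == NullEquality::NullEqualsNothing {
                // A null key never matches, so no output row has a null key.
                Precision::Exact(0)
            } else {
                scale(col.null_count, input_rows, output_rows)
            };
            ColumnStats {
                null_count,
                byte_size: scale(col.byte_size, input_rows, output_rows),
            }
        })
        .collect()
}

/// Item 2: sum per-column byte sizes; Absent if any column's size is Absent.
fn total_byte_size(cols: &[ColumnStats]) -> Precision {
    let mut total = 0usize;
    for col in cols {
        match col.byte_size.value() {
            Some(v) => total = total.saturating_add(v),
            None => return Precision::Absent,
        }
    }
    Precision::Inexact(total)
}
```

With 100 input rows and an estimated 25 output rows, a non-key column with 20 nulls would be reported with roughly 5 nulls in this sketch, while a key column under NullEqualsNothing is reported with exactly 0.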

What changes are included in this PR?

  • Stats improvements described above
  • Some refactoring and cleanup
  • New unit tests
  • Update test expectations where needed

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

Some queries might get different plans.

@github-actions github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels Jun 4, 2026
@neilconway
Contributor Author

FYI @asolimando

Member

@asolimando asolimando left a comment

LGTM. Taking NullEquality into consideration makes the upper bound tighter and the estimate more accurate.

/// "filter selectivity analysis").
/// - Column statistics for inner/outer joins are simply combined from inputs
/// without adjusting for join selectivity (acknowledged in the code as
/// needing "filter selectivity analysis").
Member

Q: Is there an issue filed already? Pending work is easier to discover in GitHub issues than in code comments.

Contributor Author

#22761 covers this, I think.

.enumerate()
.map(|(idx, stats)| {
let mut stats = stats.to_inexact();
stats.null_count = if join_key_indices.contains(&idx) {
Member

Nit: Not a big problem, since there won't be many items, but maybe we can use a HashSet here, as it seems we only use this for a containment test.

Contributor Author

A HashSet would make the set-containment usage clearer, but since the collection will be small in practice, I'm inclined to leave this as-is.
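
For reference, the two options discussed here could look roughly like the sketch below. The function names and the `num_columns` parameter are hypothetical and only serve the illustration; both versions produce the same answer, and for a handful of join keys the linear scan over a slice avoids building a set at all.

```rust
use std::collections::HashSet;

/// Linear scan: for each column index, check membership in the key slice.
/// Hypothetical helper; cost is O(columns * keys), which is fine for few keys.
fn key_flags_slice(join_key_indices: &[usize], num_columns: usize) -> Vec<bool> {
    (0..num_columns)
        .map(|idx| join_key_indices.contains(&idx))
        .collect()
}

/// HashSet variant: build the set once, then do O(1) expected lookups.
/// Hypothetical helper; makes the set-containment intent explicit.
fn key_flags_set(join_key_indices: &[usize], num_columns: usize) -> Vec<bool> {
    let keys: HashSet<usize> = join_key_indices.iter().copied().collect();
    (0..num_columns).map(|idx| keys.contains(&idx)).collect()
}
```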

Comment thread datafusion/physical-plan/src/joins/utils.rs
Labels

core (Core DataFusion crate), physical-plan (Changes to the physical-plan crate)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Semi/anti join column stats not scaled with estimated row count

2 participants