fix: Enable sliding window execution for covar_pop, covar_samp, and corr #22764
base: main
Contributor
datafusion/datafusion/functions-aggregate/src/covariance.rs
Lines 308 to 312 in 71b5da1
When the window holds a single row and that row leaves, the count drops to 0 and the division produces NaN. The internal running values then stay NaN forever, so every later window result silently comes out as NaN instead of the correct number. This is reachable when the input has NULL gaps, e.g. with a 2-row sliding window. My suggestion: when the count is about to reach 0, reset the state to its initial values (count 0, running values 0.0) and skip the division.
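To make the suggestion concrete, here is a minimal sketch of a retract step that resets instead of dividing by zero. `CovarState` and its fields are hypothetical names for illustration and are not taken from covariance.rs; the update formulas are a standard Welford-style running covariance, which may differ from what the accumulator actually does.

```rust
/// Hypothetical running state for population covariance.
/// Illustrative only; not the struct used in covariance.rs.
#[derive(Debug, Default)]
struct CovarState {
    count: u64,
    mean_x: f64,
    mean_y: f64,
    /// Running sum of (x - mean_x) * (y - mean_y).
    co_moment: f64,
}

impl CovarState {
    /// A row enters the window.
    fn add(&mut self, x: f64, y: f64) {
        self.count += 1;
        let n = self.count as f64;
        let dx = x - self.mean_x; // uses the mean before the update
        self.mean_x += dx / n;
        self.mean_y += (y - self.mean_y) / n;
        self.co_moment += dx * (y - self.mean_y); // uses the mean after the update
    }

    /// A row leaves the window.
    fn retract(&mut self, x: f64, y: f64) {
        if self.count <= 1 {
            // The window is about to become empty: reset to the initial
            // state and skip the division, so no NaN leaks into later windows.
            *self = Self::default();
            return;
        }
        let n_old = self.count as f64;
        let n_new = n_old - 1.0;
        let new_mean_x = (self.mean_x * n_old - x) / n_new;
        let new_mean_y = (self.mean_y * n_old - y) / n_new;
        // Exact inverse of the co-moment update performed in `add`.
        self.co_moment -= (x - new_mean_x) * (y - self.mean_y);
        self.mean_x = new_mean_x;
        self.mean_y = new_mean_y;
        self.count -= 1;
    }

    /// None stands in for SQL NULL on an empty window.
    fn covar_pop(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.co_moment / self.count as f64)
        }
    }
}
```

With the reset in place, a window that empties out and then refills produces finite results again rather than carrying NaN forward.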
Contributor
Both tests pass the same column twice (covar_pop(column2, column2)). The covariance of a column with itself is just its variance, and the correlation of a column with itself is always 1, so the "two different columns" math is never really tested. Please use two distinct columns.

Across both tests, a row actually leaves the window only once (first test, last row). In the second test the window only ever grows, so nothing is ever removed.

There are no NULLs, and no case where the window becomes empty in the middle of the data.
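One way to derive expected values for such a test is a naive reference that recomputes each window from scratch. The sketch below is an illustration only: the function names and the sample data are made up, `None` stands in for SQL NULL, and it assumes a row is skipped when either input is NULL.

```rust
/// Naive population covariance over one window.
/// Rows where either value is NULL (None) are skipped; an empty window yields None.
fn covar_pop_naive(rows: &[(Option<f64>, Option<f64>)]) -> Option<f64> {
    let pairs: Vec<(f64, f64)> = rows
        .iter()
        .filter_map(|&(x, y)| Some((x?, y?)))
        .collect();
    if pairs.is_empty() {
        return None;
    }
    let n = pairs.len() as f64;
    let mean_x = pairs.iter().map(|p| p.0).sum::<f64>() / n;
    let mean_y = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let co_moment: f64 = pairs
        .iter()
        .map(|p| (p.0 - mean_x) * (p.1 - mean_y))
        .sum();
    Some(co_moment / n)
}

/// Recomputes the aggregate for a frame of `preceding` rows before the
/// current row plus the current row itself.
fn sliding_covar_pop(
    rows: &[(Option<f64>, Option<f64>)],
    preceding: usize,
) -> Vec<Option<f64>> {
    (0..rows.len())
        .map(|i| {
            let start = i.saturating_sub(preceding);
            covar_pop_naive(&rows[start..=i])
        })
        .collect()
}
```

Feeding it two distinct columns with a run of NULL rows in the middle exercises all three gaps at once: rows leave the window repeatedly, the window empties mid-data, and valid rows follow the empty window.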
datafusion/datafusion/functions-aggregate/src/correlation.rs
Lines 203 to 205 in 71b5da1
Same concern as in covariance.rs. There's a check in the result computation that treats "both internal averages are NaN" as meaning "the input contained NaN values". An emptied sliding window also makes the averages NaN, so this check is falsely triggered and `corr` returns NaN even for an empty window, where it should return NULL.
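A minimal sketch of one possible ordering of the checks, with the empty-window test placed before the NaN heuristic. The function name, its parameters, and the zero-variance handling are assumptions made for illustration; they are not the actual code in correlation.rs.

```rust
/// Hypothetical result computation for corr.
/// None stands in for SQL NULL; all names here are illustrative.
fn corr_result(
    count: u64,
    mean_x: f64,
    mean_y: f64,
    co_moment: f64, // running sum of (x - mean_x) * (y - mean_y)
    m2_x: f64,      // running sum of (x - mean_x)^2
    m2_y: f64,      // running sum of (y - mean_y)^2
) -> Option<f64> {
    // Check for an empty window first, so leftover NaN state from a
    // retracted window is never mistaken for NaN input values.
    if count == 0 {
        return None;
    }
    // Only now treat NaN averages as evidence of NaN in the input.
    if mean_x.is_nan() && mean_y.is_nan() {
        return Some(f64::NAN);
    }
    let denominator = (m2_x * m2_y).sqrt();
    if denominator == 0.0 {
        // Assumption for this sketch: zero variance yields NULL.
        return None;
    }
    Some(co_moment / denominator)
}
```

If the retract path also resets the state when the count reaches 0, the averages never become NaN from emptying the window, and the early `count == 0` return serves as a second line of defense.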