@ilicmarkodb
Contributor

What changes were proposed in this pull request?

create or replace table t1 (c1 string collate utf8_lcase_rtrim);
create or replace table t2 (c1 string collate utf8_lcase_rtrim);
insert into t1 values ('a');
insert into t2 values ('A ');

select * from t1 where c1 not in (select * from t2);
-- should return no data, but it returns one row

When performing a hash join on collated columns, we first wrap the join keys with CollationKey during analysis, because the hash of a CollationKey is collation-aware while the hash of the raw column is not. Under utf8_lcase_rtrim, 'a' and 'A ' compare as equal, so the NOT IN predicate above should filter out the only row in t1. The problem with this query is that no join exists during the analysis phase: the NOT IN subquery is only turned into a join during the optimization phase. As a result, the join operates on raw columns, which are not collation-aware, and the row is returned.

This PR fixes the issue by rewriting the join keys in the HashJoin trait.
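
The effect of hashing raw values versus collation keys can be sketched outside Spark. The snippet below is only an illustration, not Spark's implementation: `collation_key` and `hash_anti_join` are hypothetical helpers, and the key function merely approximates utf8_lcase_rtrim by stripping trailing spaces and lowercasing. NULL handling of NOT IN is deliberately left out.

```python
def collation_key(value: str) -> str:
    """Hypothetical stand-in for a utf8_lcase_rtrim collation key.

    Assumption: the collation is approximated by trimming trailing
    spaces and lowercasing; Spark's real CollationKey is more involved.
    """
    return value.rstrip(" ").lower()


def hash_anti_join(left, right, key=lambda v: v):
    """Return the left rows that have no match on the right.

    The right side is loaded into a hash set keyed by key(row),
    mimicking the build side of a hash join. NULL semantics are ignored.
    """
    build_side = {key(v) for v in right}
    return [v for v in left if key(v) not in build_side]


t1 = ["a"]   # insert into t1 values ('a')
t2 = ["A "]  # insert into t2 values ('A ')

# Join keys hashed as raw strings: 'a' and 'A ' land in different buckets.
raw_result = hash_anti_join(t1, t2)

# Join keys rewritten to collation keys before hashing.
keyed_result = hash_anti_join(t1, t2, key=collation_key)

print(raw_result)
print(keyed_result)
```

With raw keys the row from t1 survives the anti join, which mirrors the incorrect result reported above; with the keys rewritten first, the two values match and nothing is returned.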

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 26, 2025
@ilicmarkodb changed the title from "[SPARK-54852] NOT IN subquery returns incorrect results with a collated table" to "[SPARK-54852][SQL] NOT IN subquery returns incorrect results with a collated table" Dec 26, 2025