ARROW-11239: [Rust] Fixed equality with offsets and nulls #9211
Conversation
Fyi @andygrove @nevi-me @alamb, as this is a relatively important bug it could make sense to merge before shipping.
Codecov Report

```
@@           Coverage Diff           @@
##           master    #9211   +/-   ##
=======================================
  Coverage   81.88%   81.89%
=======================================
  Files         215      215
  Lines       52988    52999    +11
=======================================
+ Hits        43391    43403    +12
+ Misses       9597     9596     -1
```

Continue to review the full report at Codecov.
Thanks @jorgecarleitao for taking the time to fix this bug in time, this is really great!
@jorgecarleitao Changes to the string and binary equals look good to me. I tried a similar test case for list arrays which is still failing, but probably for more complicated reasons; I'll create a separate ticket for that.
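For context, a sketch of what such a list test could look like (hypothetical: `test_equal` is the equality helper used elsewhere in this test module, and the builder-based construction and values are illustrative, not the actual failing case):

```rust
use arrow::array::{Array, Int32Builder, ListBuilder};

#[test]
fn test_list_offset_sketch() {
    // build [Some([1, 2]), None, Some([3])]
    let mut builder = ListBuilder::new(Int32Builder::new(4));
    builder.values().append_value(1).unwrap();
    builder.values().append_value(2).unwrap();
    builder.append(true).unwrap();
    builder.append(false).unwrap(); // null slot
    builder.values().append_value(3).unwrap();
    builder.append(true).unwrap();
    let a = builder.finish().data();

    // comparing a sliced array with itself should hold even past a null
    test_equal(&a.slice(1, 2), &a.slice(1, 2), true);
}
```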
rust/arrow/src/array/equal/mod.rs (outdated)
I may be missing something, but I tried a few more test cases and there still seems to be something wrong. Namely, I would expect this test to pass:

```rust
#[test]
fn test_string_offset_larger() {
    let a = StringArray::from(vec![Some("a"), None, Some("b"), None, Some("c")]).data();
    let b = StringArray::from(vec![None, Some("b"), None, Some("c")]).data();
    test_equal(&a.slice(2, 2), &b.slice(0, 2), false);
    test_equal(&a.slice(2, 2), &b.slice(1, 2), true);
    test_equal(&a.slice(2, 2), &b.slice(2, 2), false);
}
```
But instead it fails:
```
---- array::equal::tests::test_string_offset_larger stdout ----
thread 'array::equal::tests::test_string_offset_larger' panicked at 'assertion failed: `(left == right)`
  left: `false`,
 right: `true`:
ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 2, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90504480, len: 24, data: [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90504500, len: 3, data: [97, 98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90504580, len: 1, data: [21] }, offset: 0 } }) }
ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90604700, len: 20, data: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90604880, len: 2, data: [98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90604780, len: 1, data: [10] }, offset: 0 } }) }', arrow/src/array/equal/mod.rs:457:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
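To illustrate the mechanics behind this kind of failure (a sketch, not the PR's code): both the validity bitmap and the offsets buffer are shared with the unsliced parent, so every per-slot check has to add the array's offset before indexing; using the logical index alone reads the wrong bit and the wrong value range.

```rust
use arrow::util::bit_util;

// Sketch: checking whether logical slot `i` of a sliced Utf8 array is null,
// and locating its value bytes. The array offset must be added in both places.
fn slot_info(null_bytes: &[u8], offsets: &[i32], array_offset: usize, i: usize) -> Option<(i32, i32)> {
    let slot = array_offset + i;
    if bit_util::get_bit(null_bytes, slot) {
        // the value bytes live at offsets[slot]..offsets[slot + 1]
        Some((offsets[slot], offsets[slot + 1]))
    } else {
        None
    }
}
```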
Thanks for working on this @jorgecarleitao, and huge thanks for distilling this problem down for us @mqy. BTW, I think when we fix this issue it would be a potential one to backport to any 3.0 patchset we release (as it is a correctness bug) cc @andygrove
@alamb, that comment led to the discovery of a broader class of bugs in all our equality operators, and I pushed a fix for all of them. I added a test for strings and primitives, but it seems that we are not covering (in tests) this case for the other types.
While reviewing, I got this one also. xD

Could you check if the error is still there? I think I found it while fixing @alamb's one.
Thanks @jorgecarleitao @mqy for picking this up and working on it. Firstly, I apologise for missing this, or not being diligent enough to look at offsets. I was looking at making existing tests pass, and so I didn't check if we were already testing for offsets.

I've been looking at the issues and this PR. Could the challenge of having to bookkeep offsets be because we have an offset field in the `ArrayData`? If we use the below as an example:

```rust
let arr = Int8Array::from(vec![None, Some(2), None, None, Some(5)]);
let arr2 = arr.slice(1, 3);
dbg!(arr2.data());
```

The output is:

```
ArrayData {
    data_type: Int8,
    len: 3,
    null_count: 2,
    offset: 1,
    buffers: [
        Buffer {
            data: Bytes { ptr: 0x14ae09540, len: 5, data: [0, 2, 0, 0, 5] },
            offset: 0,
        },
    ],
    child_data: [],
    null_bitmap: Some(
        Bitmap {
            bits: Buffer {
                data: Bytes { ptr: 0x14ae094c0, len: 1, data: [18] },
                offset: 0,
            },
        },
    ),
}
```

Note that the offset for the buffers remains 0; the slice is recorded only on the `ArrayData`.

Anyways, the above behaviour poses a bigger issue, which is that nested arrays will lose the offset, as it is only stored on the parent `ArrayData`. Imagine an array with a struct type:

```rust
impl From<&StructArray> for RecordBatch {
    fn from(struct_array: &StructArray) -> Self {
        if let DataType::Struct(fields) = struct_array.data_type() {
            // Narrator: the struct's fields rely on the struct's offset, which we do not pass down
            let schema = Schema::new(fields.clone());
            let columns = struct_array.boxed_fields.clone();
            RecordBatch {
                schema: Arc::new(schema),
                columns,
            }
        } else {
            unreachable!()
        }
    }
}
```

Solution: I propose that we explore propagating the offset to child data and buffers. The remaining item, which is more relevant to this PR, is that we can: […]

What are your thoughts?
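As a concrete illustration of that bookkeeping burden, a sketch that reconstructs `arr2`'s logical values from the exact buffers printed above (`get_bit` is `arrow::util::bit_util::get_bit`; the rest is hypothetical glue):

```rust
use arrow::util::bit_util::get_bit;

fn main() {
    let values: &[i8] = &[0, 2, 0, 0, 5]; // the shared, unsliced values buffer
    let null_bits: &[u8] = &[18]; // 0b0001_0010, also unsliced
    let (offset, len) = (1usize, 3usize); // as recorded on the sliced ArrayData

    let logical: Vec<Option<i8>> = (0..len)
        .map(|i| {
            let slot = offset + i; // every consumer must remember this addition
            if get_bit(null_bits, slot) { Some(values[slot]) } else { None }
        })
        .collect();
    assert_eq!(logical, vec![Some(2), None, None]);
}
```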
I saw some performance regression. It looks like this is caused by the `lhs.offset() + lhs_start + i` additions being recomputed on every iteration. For both clarity and performance reasons, perhaps it makes sense to define these base offsets as variables outside the loop.
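A sketch of the suggested hoisting (names hypothetical; the per-slot comparison is abstracted as a closure): the offset sums are loop-invariant, so they can be computed once.

```rust
// Before: lhs.offset() + lhs_start and rhs.offset() + rhs_start were
// recomputed on every iteration of the comparison loop.
// After: hoist the loop-invariant bases out.
fn compare_slots(
    lhs_offset: usize,
    rhs_offset: usize,
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
    slot_eq: impl Fn(usize, usize) -> bool,
) -> bool {
    let lhs_base = lhs_offset + lhs_start; // hoisted
    let rhs_base = rhs_offset + rhs_start; // hoisted
    (0..len).all(|i| slot_eq(lhs_base + i, rhs_base + i))
}
```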
Thanks @nevi-me.

Search results from GitHub with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code show only a handful of projects that are using `bit_util`.

Thanks @mqy
@nevi-me, no need to apologize; no one saw this, and we also had no tests to cover it. Wrt the offsets, I admit that I had a branch where I removed the offset from `ArrayData` entirely […].

One aspect here is that unfortunately we have a lot of units of measurement: logical slots, bytes in buffers, and bits in validity bitmaps. Overall, this adds complexity to how we reason about the code. I am not sure we have an easy way to address this, as they are indeed needed. One idea is to declare each measurement to be a different type and perform explicit casts (which the compiler will optimize out, as they are all represented by `usize`).
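A minimal sketch of that idea (types hypothetical): one wrapper per unit, so slots, bytes, and bits can't be mixed up without an explicit conversion, while compiling down to plain `usize` arithmetic.

```rust
/// A count of logical array slots.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Slots(usize);

/// A position in a validity bitmap, measured in bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bits(usize);

impl Slots {
    /// Each slot owns exactly one validity bit.
    fn to_bits(self) -> Bits {
        Bits(self.0)
    }
}

/// Mixing units now fails to compile; the conversion is explicit and free.
fn validity_index(array_offset: Slots, i: Slots) -> Bits {
    Bits(array_offset.to_bits().0 + i.to_bits().0)
}
```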
@jorgecarleitao @alamb The testcase I had is not yet fixed; for reference it's [ARROW-11267], with the failing list test. I'm currently thinking it's related to the child data.
Thanks for the heads up @nevi-me. I want to hook into this discussion because I don't really know where else I could. In view of the upcoming 3.0 release, I started refactoring to use arrow 3.0 (using git dependencies), and I must say, it is becoming an increasingly painful process. What is Arrow's view on backwards compatibility? Could logic in the public API have a deprecated flag for a release cycle, or could there be pre-releases so that third parties are able to stay in sync?
@jhorstmann, then I think it is related to what @nevi-me was saying about the children. I am not sure we can have a fix ready for 3.0.0, though :(

@rdettai I understand that concern. Could you describe which areas you get the most pain from in backward-incompatible changes? I would be fine with a 1-release deprecation when feasible, at least on the public API. My feeling is that we will need at least 2 more releases to stabilize […]. The two areas that I will be focusing on short-term are […].

@nevi-me, @alamb, @andygrove do we merge this in 3.0, or should we postpone? If yes, we should probably flag this on the mailing list. If not, then no action needed :)
It is mostly about the (builder/buffer related) functionality being removed without an immediately clear substitute, for instance `ArrayBuilder::append_data` and several other API methods/traits. Another one was […]. I think the first case would be less problematic if it were possible to get mutable references to some inner types, as I could then rebuild some of the logic and gracefully update.
@jhorstmann I'll have a look at that one that's still failing.

@ritchie46 I share Jorge's sentiments in that there's still some more changes and stabilisation that we need to do. We'll be more pedantic about the changes that we make, and perhaps start collating them into some CHANGELOG type of document, so that users who rely on using arrow from git (there seem to be many) can at least know what to expect.

@jorgecarleitao it's a serious enough regression to merge with 3.0.0, after we apply the hoisting that @mqy has suggested.
Quoting from https://github.com/apache/arrow#implementation-status: […]

The above words almost admit that it's difficult to release the whole set of Arrow libraries together with the same features enabled, to say nothing of quality. So perhaps it's reasonable to release the most featured library (C++?) as X.0.0 first, and have the other libraries follow up.
rust/arrow/src/array/equal/list.rs (outdated)
There is another lhs/rhs mixup a few lines above (148), GitHub won't let me comment on that line:

```rust
// get a ref of the parent null buffer bytes, to use in testing for nullness
let lhs_null_bytes = rhs_nulls.unwrap().as_slice();
//                   ^
```
Confirmed, at line 147
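Presumably the fix is to read from `lhs_nulls`, matching the variable name:

```rust
// hypothesized correction, mirroring the variable name on the left
let lhs_null_bytes = lhs_nulls.unwrap().as_slice();
```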
Great catch. Pushed a fix for it.
I agree that the Rust Arrow implementation is likely to need several more months of non-trivial changes to get to a point where we want to be more constrained about backwards compatibility.

Given the subtlety of this change: if this bug also exists in the 2.0.0 release of arrow, I don't think we should merge it into 3.0.0 at the last minute; if it was introduced in 3.0.0, I do think we should merge it to avoid releasing a regression.

@nevi-me -- I think that sounds like a good idea, though I suspect it may be hard to accomplish without some non-trivial performance implications.
I think the code looks good to me -- but the test coverage seems a bit lacking (i.e. we made changes to the equality comparisons for structs, strings, fixed-size types, dictionaries, and decimals, yet only added tests for primitive types).

@jorgecarleitao do you plan to add tests? Would you like help doing so?
@jorgecarleitao in order to propagate the offsets to child data and buffers, we could do the below instead of what `ArrayData::slice` currently does:

```rust
ArrayData::new(
    self.data_type.clone(),
    length,
    None,
    self.null_buffer().map(|buf| buf.slice(offset)),
    new_offset,
    self.buffers.iter().map(|buf| buf.slice(offset)).collect(),
    self.child_data
        .iter()
        .map(|data| Arc::new(data.slice(offset, length)))
        .collect(),
)
```

I'm still running benchmarks to see the performance hit, but for `bench::array_slice` I only see a 3% penalty. I tried this change directly on master, so not on top of this PR. I'm also getting failures from memory not being aligned, but I haven't looked at that yet.
Thanks @nevi-me. IMO the idea is good, but I think that, in Rust's terminology, that implementation will be unsound.

In the particular case of […]. In the case of […]. In general, the child data's […].
No worries, I'll fix the code to account for the variations mentioned. I was doing a quick impl to see what the impact on the slice benchmark would look like.

For lists, we might be further constrained by whether the list's child field is a struct, list or primitive. If we don't offset the list's child array, we could have a similar case to the problems with struct. So perhaps an acceptable way of slicing lists could still be to propagate the offset and length down, but use the list's offset buffer to determine such values for the child. Structs are fine, because as long as we implement the correct logic for lists inside them […]. With the data buffers, we could opt not to populate their offsets, and use the […].

I don't know if there's anything still pending in this PR; I'm happy with the changes so far, and can approve it. Then I'll work on the changes above in a separate PR so it doesn't sully and hold up this one. What do you think @jorgecarleitao @alamb @mqy @jhorstmann?
In addition to what @jorgecarleitao mentioned: for null buffers or boolean buffers you'd have to copy the data if the offset is not a multiple of 8 (done automatically by the […]). Slicing of the child data depends on the data type; for List/String/Binary arrays it should not be modified, while it might need to be sliced for struct/union.

I think the main problem with the `equal` implementation is the calculation of combined null bitmaps and which offsets to use for this combined bitmap. The lhs_start/rhs_start parameters further add to this confusion.
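A sketch of why a non-byte-aligned bitmap slice forces a copy (helper name hypothetical): byte-level slicing can only drop whole bytes, so an offset of, say, 3 bits requires shifting every byte into place.

```rust
/// Repack `num_bits` bits starting at `bit_offset` (LSB-first, as in the
/// Arrow spec) into a fresh, 0-aligned byte vector.
fn copy_bitmap_slice(bytes: &[u8], bit_offset: usize, num_bits: usize) -> Vec<u8> {
    (0..(num_bits + 7) / 8)
        .map(|byte_i| {
            (0..8).fold(0u8, |acc, bit| {
                let dst = byte_i * 8 + bit;
                if dst >= num_bits {
                    return acc; // past the logical end; leave the bit unset
                }
                let src = bit_offset + dst;
                let is_set = (bytes[src / 8] >> (src % 8)) & 1;
                acc | (is_set << bit)
            })
        })
        .collect()
}
```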
nevi-me left a comment:

One change, then I'm happy; I'll work on the nested struct slice after we've merged this.
```diff
- let lhs_pos = lhs_start + i;
- let rhs_pos = rhs_start + i;
+ let lhs_pos = lhs.offset() + lhs_start + i;
+ let rhs_pos = rhs.offset() + rhs_start + i;
```
There was a comment from @mqy about moving this out of the iter, as it was hurting performance.
It's unexpected that this wouldn't be hoisted out by the compiler, but IIWII.
nevi-me left a comment:

LGTM
Fix the equality operator of all types with offsets and nulls.

Big kudos to @mqy for identifying the bug and reducing its scope for variable-sized arrays, and to @alamb for identifying a second error.