
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Jan 16, 2021

Fix the equality operator of all types with offsets and nulls.

Big kudos to @mqy for identifying it and reducing its scope for variable-sized arrays, and to @alamb for identifying a second error.


@jorgecarleitao
Member Author

FYI @andygrove @nevi-me @alamb: as this is a relatively important bug, it could make sense to merge it before shipping.

@codecov-io

codecov-io commented Jan 16, 2021

Codecov Report

Merging #9211 (5371da4) into master (437c8c9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9211   +/-   ##
=======================================
  Coverage   81.88%   81.89%           
=======================================
  Files         215      215           
  Lines       52988    52999   +11     
=======================================
+ Hits        43391    43403   +12     
+ Misses       9597     9596    -1     
Impacted Files Coverage Δ
rust/arrow/src/array/equal/boolean.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/decimal.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/dictionary.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/fixed_binary.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/fixed_list.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/list.rs 83.58% <100.00%> (ø)
rust/arrow/src/array/equal/mod.rs 93.12% <100.00%> (+0.19%) ⬆️
rust/arrow/src/array/equal/primitive.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/structure.rs 100.00% <100.00%> (ø)
rust/arrow/src/array/equal/variable_size.rs 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 437c8c9...5371da4. Read the comment docs.

@mqy
Contributor

mqy commented Jan 16, 2021

Thanks @jorgecarleitao for taking the time to fix this bug so promptly, this is really great!
Tip: would you please update the PR title by replacing "strings" with "variable-sized arrays"?

@jorgecarleitao jorgecarleitao changed the title ARROW-11239: [Rust] Fixed equality of strings with offsets. ARROW-11239: [Rust] Fixed equality of variable-sized arrays with offsets. Jan 16, 2021
@jhorstmann
Contributor

@jorgecarleitao Changes to the string and binary equals look good to me. I tried a similar test case for list arrays which is still failing, but probably for more complicated reasons; I'll create a separate ticket for that.

Contributor

I may be missing something, but I tried a few more test cases and there still seems to be something wrong.

Namely, I would expect this test to pass:

    #[test]
    fn test_string_offset_larger() {
        let a = StringArray::from(vec![Some("a"), None, Some("b"), None, Some("c")]).data();
        let b = StringArray::from(vec![None, Some("b"), None, Some("c")]).data();

        test_equal(&a.slice(2,2), &b.slice(0,2), false);
        test_equal(&a.slice(2,2), &b.slice(1,2), true);
        test_equal(&a.slice(2,2), &b.slice(2,2), false);
    }

But instead it fails:

---- array::equal::tests::test_string_offset_larger stdout ----
thread 'array::equal::tests::test_string_offset_larger' panicked at 'assertion failed: `(left == right)`
  left: `false`,
 right: `true`: 
ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 2, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90504480, len: 24, data: [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90504500, len: 3, data: [97, 98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90504580, len: 1, data: [21] }, offset: 0 } }) }
ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90604700, len: 20, data: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90604880, len: 2, data: [98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90604780, len: 1, data: [10] }, offset: 0 } }) }', arrow/src/array/equal/mod.rs:457:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@alamb
Contributor

alamb commented Jan 16, 2021

Thanks for working on this @jorgecarleitao and huge thanks for distilling this problem down for us @mqy

BTW, I think that when we fix this issue it would be a good candidate to backport to any 3.0 patchset we release (as it is a correctness bug) cc @andygrove

@jorgecarleitao jorgecarleitao changed the title ARROW-11239: [Rust] Fixed equality of variable-sized arrays with offsets. ARROW-11239: [Rust] Fixed equality with offsets and nulls Jan 17, 2021
@jorgecarleitao
Member Author

@alamb, that comment led to the discovery of a broader class of bugs in all our equality operators when offsets and nulls are present in the array, derived from confusing bit offsets with byte offsets.

I pushed a fix for all of them. I added tests for strings and primitives, but it seems that we are not covering this case (in tests) for the other types either.

Member Author

while reviewing, got this one also. xD

@jorgecarleitao
Member Author

> @jorgecarleitao Changes to the string and binary equals look good to me. I tried a similar test case for list arrays which is still failing, but probably for more complicated reasons; I'll create a separate ticket for that.

Could you check if the error is still there? I think I found it while fixing the one @alamb reported.

@nevi-me
Contributor

nevi-me commented Jan 17, 2021

Thanks @jorgecarleitao @mqy for picking this up, and working on it.

Firstly, I apologise for missing this, or not being diligent enough to look at offsets. I was looking at making existing tests pass, and so I didn't check if we were already testing for offsets.

I've been looking at the issues and this PR. Could the challenge of having to bookkeep offsets be because we have an offset field in the Buffer, but seldom update it when slicing data?

If we use the below as an example:

let arr = Int8Array::from(vec![None, Some(2), None, None, Some(5)]);
let arr2 = arr.slice(1, 3);
dbg!(arr2.data());

The output is:

ArrayData {
    data_type: Int8,
    len: 3,
    null_count: 2,
    offset: 1,
    buffers: [
        Buffer {
            data: Bytes { ptr: 0x14ae09540, len: 5, data: [
                0,
                2,
                0,
                0,
                5,
            ] },
            offset: 0,
        },
    ],
    child_data: [],
    null_bitmap: Some(
        Bitmap {
            bits: Buffer {
                data: Bytes { ptr: 0x14ae094c0, len: 1, data: [
                    18,
                ] },
                offset: 0,
            },
        },
    ),
}

Note that the Buffer::offset of both the data and null buffers is still 0. This makes sense, as we perform a zero-copy clone (the arr and arr2 Bytes point to the same ptr: 0x*** location).

For ArrayData::slice, we pass the ArrayData::offset to the various Buffer::slice calls, so we effectively compensate for Buffer::offset not being changed.
Where the issue comes in is when we use utils like bit_util::get_bit, because we get the bit from the raw data (&[u8]) and pass an index.
I truly missed this, but looking at 0eae886, I see that I managed to account for the offset in the decimal test case. I think if I'd seen that issue in another place, I would have recognised the pattern.
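
As a standalone sketch of the footgun (only get_bit mirrors arrow's bit_util::get_bit; everything else here is illustrative, not the arrow API):

    // Same contract as arrow's bit_util::get_bit: index into raw bitmap
    // bytes, least-significant bit first, with no knowledge of any offset.
    fn get_bit(data: &[u8], i: usize) -> bool {
        (data[i / 8] & (1 << (i % 8))) != 0
    }

    fn main() {
        // Validity bitmap of [None, Some(2), None, None, Some(5)]: bits 1
        // and 4 set, i.e. the byte 18 from the dbg! output above. Slicing
        // at slot 1 only bumps the logical offset; the bytes are unchanged.
        let bits = [18u8];
        let offset = 1; // ArrayData::offset of the slice
        // Wrong: asks for the parent's slot 0, which is null.
        assert!(!get_bit(&bits, 0));
        // Right: add the slot offset to reach the slice's slot 0 (valid).
        assert!(get_bit(&bits, offset + 0));
    }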

Anyway, the above behaviour poses a bigger issue: nested arrays lose the offset, as it is only stored on the parent ArrayData.

Imagine an array with the type [a]struct<[b]struct<[c]struct<[d]primitive>>>. If we slice into the parent struct, its child_data buffers do not carry the parent's ArrayData::offset.
The first function I can think of that would be negatively impacted by this is the one that creates a RecordBatch from a StructArray:

impl From<&StructArray> for RecordBatch {
    fn from(struct_array: &StructArray) -> Self {
        if let DataType::Struct(fields) = struct_array.data_type() {
            // Narrator: the struct's fields rely on the struct's offset, which we do not pass down
            let schema = Schema::new(fields.clone());
            let columns = struct_array.boxed_fields.clone();
            RecordBatch {
                schema: Arc::new(schema),
                columns,
            }
        } else {
            unreachable!()
        }
    }
}

Solution: I propose that we explore propagating the ArrayData::offset to buffers, child_data and null_buffers. I can work on that if my suggestion is sound.


The remaining item, which is more relevant to this PR, is that we can:

  • Audit the use of bit_util::get_bit to ensure that all the places we call it from include the offset (I've only seen compute::kernels::comparison.rs).
  • Document the use of get_bit to warn users about the footgun.
  • Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.

What are your thoughts?

CC: @andygrove @paddyhoran @alamb @jhorstmann @Dandandan

@mqy
Contributor

mqy commented Jan 17, 2021

I saw some performance regression:

  • equal_nulls_512 regressed about +22%
  • equal_string_nulls_512 regressed about +17%.

It looks like this is caused by lhs_start + lhs.offset() being executed for every bit, and likewise for the rhs.

For both clarity and performance, perhaps it makes sense to define variables like let lhs_bits_start = lhs_start + lhs.offset(); above (0..len).all(... and call get_bit() with lhs_bits_start + i.
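
Roughly, as a standalone sketch (illustrative names, not the PR's exact code):

    // Minimal stand-in for arrow's bit_util::get_bit.
    fn get_bit(data: &[u8], i: usize) -> bool {
        (data[i / 8] & (1 << (i % 8))) != 0
    }

    fn nulls_equal(
        lhs_bits: &[u8], lhs_offset: usize, lhs_start: usize,
        rhs_bits: &[u8], rhs_offset: usize, rhs_start: usize,
        len: usize,
    ) -> bool {
        // Hoisted: the offset additions happen once per side, not once per bit.
        let lhs_bits_start = lhs_start + lhs_offset;
        let rhs_bits_start = rhs_start + rhs_offset;
        (0..len).all(|i| {
            get_bit(lhs_bits, lhs_bits_start + i) == get_bit(rhs_bits, rhs_bits_start + i)
        })
    }

    fn main() {
        // The same three validity bits, stored one bit apart: equal.
        assert!(nulls_equal(&[0b0101], 0, 0, &[0b1010], 1, 0, 3));
    }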

@mqy
Contributor

mqy commented Jan 17, 2021

Thanks @nevi-me

> Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.

A search on GitHub with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only a handful of projects using arrow::bit_util:

@nevi-me
Contributor

nevi-me commented Jan 17, 2021

> Thanks @nevi-me
>
> > Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
>
> A search on GitHub with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only a handful of projects using arrow::bit_util:

Thanks @mqy

CC @ritchie46 @liurenjie1024

@jorgecarleitao
Member Author

@nevi-me, no need to apologize: no one saw this, and we also had no tests covering it.

Wrt the offsets, I admit that I had a branch where I removed Buffer::offset altogether. However, I saw some perf implications somewhere and abandoned the idea. We could just take the perf hit to remove this complexity, even though IMO the buffer offsets are not very painful to deal with, as they are opaque to the ArrayData: even bit operations automatically take that offset into account. The slot offset (ArrayData) is indeed more challenging, but it is part of the spec, so we need to deal with it.

One aspect here is that unfortunately we have a lot of units of measurement:

  1. Offset measured in slots (ArrayData::offset)
  2. Offset measured in bytes (Buffer::offset, ArrayData::offset * size_of)
  3. Offset measured in bits (for bitmap operations)

Overall, this adds complexity to how we reason about the code. I am not sure we have an easy way to address this, as the different units are indeed needed. One idea is to declare each measurement as a different type and perform explicit casts (which the compiler will optimize out, as they are all represented by usize), so that we have the compiler on our side when mixing different types of offsets (e.g. slot offset vs byte offset).
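
A sketch of that idea (all names here are hypothetical, not arrow APIs):

    use std::mem::size_of;

    // One zero-cost wrapper per unit of measurement.
    #[derive(Clone, Copy, Debug, PartialEq)]
    struct SlotOffset(usize);
    #[derive(Clone, Copy, Debug, PartialEq)]
    struct ByteOffset(usize);
    #[derive(Clone, Copy, Debug, PartialEq)]
    struct BitOffset(usize);

    impl SlotOffset {
        // Conversions are the only way across units, so every cast is explicit.
        fn to_bytes<T>(self) -> ByteOffset {
            ByteOffset(self.0 * size_of::<T>())
        }
        fn to_bits(self) -> BitOffset {
            BitOffset(self.0) // one validity bit per slot
        }
    }

    fn main() {
        let offset = SlotOffset(3);
        assert_eq!(offset.to_bytes::<u32>(), ByteOffset(12));
        assert_eq!(offset.to_bits(), BitOffset(3));
        // Mixing units, e.g. adding a SlotOffset to a ByteOffset, would
        // now fail to compile instead of silently reading the wrong slot.
    }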

@jhorstmann
Contributor

jhorstmann commented Jan 17, 2021

@jorgecarleitao @alamb The test case I had is not yet fixed; for reference it's [ARROW-11267] with

    #[test]
    fn test_list_different_offsets() {
        let a = create_list_array(&[Some(&[0, 0]), Some(&[1, 2]), Some(&[3, 4])]);
        let b = create_list_array(&[Some(&[1, 2]), Some(&[3, 4]), Some(&[5, 6])]);
        let a_slice = a.slice(1, 2);
        let b_slice = b.slice(0, 2);
        test_equal(&a_slice, &b_slice, true);
    } 

I'm currently thinking it's related to the child_logical_null_buffer and logical_list_bitmap functions. One problem with that implementation is that it creates an all-valid bitmap as a default when there is no null bitmap, in order to simplify the following calculations. This newly created bitmap then has an offset of 0, while for actual nullable arrays the offset is that of the parent data parameter. Maybe one approach would be to always pass tuples of (offset, Buffer) for the null buffers.
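
A standalone sketch of that idea (hypothetical types, not the arrow API): carrying the bit offset next to the bytes means a parent's bitmap and a synthesized one are read the same way, and the all-valid case needs no fake bitmap at all.

    // (bit offset, bitmap bytes); None means "all slots valid".
    type NullBits<'a> = Option<(usize, &'a [u8])>;

    fn is_valid(nulls: NullBits<'_>, i: usize) -> bool {
        match nulls {
            None => true, // no bitmap to materialize
            Some((offset, bits)) => {
                let j = offset + i;
                (bits[j / 8] & (1 << (j % 8))) != 0
            }
        }
    }

    fn main() {
        let bits = [0b0000_0010u8];
        assert!(is_valid(None, 5)); // default: valid, no fake bitmap needed
        assert!(!is_valid(Some((0, &bits)), 0)); // parent view: slot 0 is null
        assert!(is_valid(Some((1, &bits)), 0)); // sliced view: slot 0 is valid
    }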

@ritchie46
Contributor

> Thanks @nevi-me
>
> > Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
>
> A search on GitHub with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only a handful of projects using arrow::bit_util:
>
> Thanks @mqy
>
> CC @ritchie46 @liurenjie1024

Thanks for the heads up @nevi-me. I want to hook into this discussion because I don't really know where else I could. In view of the upcoming 3.0 release, I started refactoring to use arrow 3.0 (using git dependencies), and I must say, it is becoming an increasingly painful process. What is Arrow's view on backwards compatibility?

Could logic in the public API carry a deprecated flag for a release cycle, or could there be pre-releases, so that third parties are able to stay in sync?

@jorgecarleitao
Member Author

@jhorstmann, then I think it is related to what @nevi-me was saying about the children. I am not sure we can have a fix ready for 3.0.0, though :(

@rdettai I understand that concern. Could you describe which areas give you the most pain from backward-incompatible changes? I would be fine with a 1-release deprecation when feasible, at least on the arrow crate.

My feeling is that we will need at least 2 more releases to stabilize arrow + parquet. In the next release I will be working on cleaning up the arrow crate's API (I have already been doing this in the previous one). Unfortunately, there is still some work ahead, as some of the very fundamental parts of the crate need some refactoring to support the high performance that we aim for (e.g. #9076 is backward incompatible).

The two areas that I will focus on short-term are Buffer / MutableBuffer and all (currently safe but actually unsafe) APIs. I will then jump to the creation of Arrays (which is based on Buffer), and then to kernels (which are based on the creation of Arrays / Buffers).

@nevi-me, @alamb, @andygrove, do we merge this in 3.0, or should we postpone? If yes, we should probably flag this on the mailing list. If not, then no action needed :)

@ritchie46
Contributor

ritchie46 commented Jan 17, 2021

> @rdettai I understand that concern. Could you describe which areas give you the most pain from backward-incompatible changes? I would be fine with a 1-release deprecation when feasible, at least on the arrow crate.

It is mostly about (builder/buffer-related) functionality being removed without an immediately clear substitute. For instance ArrayBuilder::append_data, and several other API methods/traits. Another one was BooleanType no longer being a primitive (I can understand the reason behind it). My crate uses a lot of the API surface, and 3 months of accumulated changes can be a real mess all at once.

I think the first case would be less problematic if it were possible to get mutable references to some inner types, as I could rebuild some of the logic and gracefully update.

@nevi-me
Contributor

nevi-me commented Jan 17, 2021

@jhorstmann I'll have a look at that one that's still failing.

@ritchie46 I share Jorge's sentiments in that there are still more changes and stabilisation that we need to do. We'll be more pedantic about the changes that we make, and perhaps start collating them into some CHANGELOG-type document, so that users who rely on using arrow from git (there seem to be many) can at least know what to expect.

@jorgecarleitao it's a serious enough regression to merge with 3.0.0, after we apply the hoisting that @mqy has suggested.

@mqy
Contributor

mqy commented Jan 18, 2021

Quoting from https://github.com/apache/arrow#implementation-status:

> The official Arrow libraries in this repository are in different stages of implementing
> the Arrow format and related features. See our current feature matrix on git master.

The above words almost admit that it's difficult to release all the Arrow libraries together with the same features enabled, to say nothing of quality. So perhaps it's reasonable to release the most featured library (cpp?) as X.0.0 first, and let the other libraries follow up.

Contributor

There is another lhs/rhs mixup a few lines above (148); GitHub won't let me comment on that line:

        // get a ref of the parent null buffer bytes, to use in testing for nullness
        let lhs_null_bytes = rhs_nulls.unwrap().as_slice();
                             ^

Contributor

Confirmed, at line 147

Member Author

Great catch. Pushed a fix for it.

@alamb
Contributor

alamb commented Jan 18, 2021

I agree that the Rust arrow implementation is likely to need several more months of non-trivial changes to get to a point where we want to be more constrained with backwards compatibility.

Given the subtlety of this change, if this bug also exists in the 2.0.0 release of arrow, I don't think we should merge it into 3.0.0 at the last minute. If it was introduced in 3.0.0, I do think we should merge it to avoid releasing a regression.

@alamb
Contributor

alamb commented Jan 18, 2021

> Solution: I propose that we explore propagating the ArrayData::offset to buffers, child_data and null_buffers. I can work on that if my suggestion is sound.

@nevi-me -- I think that sounds like a good idea, though I suspect it may be hard to accomplish without some non-trivial performance implications.

Contributor

@alamb alamb left a comment

I think the code looks good to me -- but the test coverage seems a bit lacking (aka we made changes to the equality comparisons for structs, strings, fixed sizes, dictionaries, and decimals, and yet only added tests for primitive types).

@jorgecarleitao do you plan to add tests? Would you like help doing so?

@nevi-me
Contributor

nevi-me commented Jan 28, 2021

@jorgecarleitao in order to propagate the offsets to child data and buffers, we could do the below instead of let mut new_data = self.clone(); inside ArrayData::slice() in data.rs:

ArrayData::new(
    self.data_type.clone(),
    length,
    None, // null count: left to be recomputed
    // slice the null bitmap by the same offset
    self.null_buffer().map(|buf| {
        buf.slice(offset)
    }),
    new_offset,
    // slice every data buffer by the offset
    self.buffers.iter().map(|buf| {
        buf.slice(offset)
    }).collect(),
    // recursively slice the children
    self.child_data.iter().map(|data| {
        Arc::new(data.slice(offset, length))
    }).collect()
)

I'm still running benchmarks to see the performance hit, but for bench::array_slice, I only see a 3% penalty.

I tried this change directly on master, so not on top of this PR. I'm also getting failures from memory not being aligned, but I haven't looked at that yet.

@jorgecarleitao
Member Author

jorgecarleitao commented Jan 28, 2021

Thanks @nevi-me. IMO the idea is good, but I think that, in Rust's terminology, that implementation would be unsound.

Buffer::offset is measured in bytes, but ArrayData::offset is measured in slots. So, slicing a buffer in slots will lead to an unaligned buffer. E.g. a buffer representing N u32 values has 4 bytes per slot, and doing ArrayData::slice(1, 1) would cause the buffer to contain 3 bytes.
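
A standalone illustration of the mismatch (plain byte arithmetic, not arrow's Buffer):

    use std::mem::size_of;

    fn main() {
        // Four u32 slots serialized little-endian: 16 bytes.
        let values: [u32; 4] = [10, 20, 30, 40];
        let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
        let slot_offset = 1usize;
        // Wrong: using the slot offset as a byte offset leaves 15 bytes,
        // which is not a whole number of u32 values.
        assert_eq!(bytes[slot_offset..].len() % size_of::<u32>(), 3);
        // Right: convert slots to bytes first; 12 bytes = 3 whole values.
        let byte_offset = slot_offset * size_of::<u32>();
        assert_eq!(bytes[byte_offset..].len() % size_of::<u32>(), 0);
    }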

In the particular case of ListArray, I think that we should only offset the ArrayData and not the child array: the offsets buffer (ArrayData::buffers[0]) has all the information we need to extract the correct items from the child array. Of course we must use it to access the items, but IMO we already do that in ListArray::value_offset. We may not be doing that in the equality, though.
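
A standalone sketch (plain slices, not arrow types) of why the child can stay untouched:

    fn main() {
        // The lists [[1, 2], [3], [4, 5, 6]] as an offsets buffer plus
        // child values.
        let offsets = [0usize, 2, 3, 6];
        let values = [1, 2, 3, 4, 5, 6];
        // Slice the list array so it starts at its second slot; the child
        // array and the offsets buffer are left untouched.
        let array_offset = 1;
        // Reading logical slot 0 of the slice goes through the offsets.
        let (start, end) = (offsets[array_offset], offsets[array_offset + 1]);
        assert_eq!(&values[start..end], &[3]);
    }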

In the case of StructArray, we have two options: increase the offset of the children by an equal amount and only support StructArray with ArrayData::offset = 0, or increase ArrayData::offset and change the equality code to take that into account.

In general, a child's ArrayData is insufficient on its own, either because the parent has a non-None null buffer, or because the parent has an offset. So, as far as I understand, we will always need to use the parent's accessors to interact with child objects.

@nevi-me
Contributor

nevi-me commented Jan 28, 2021

> but I think that, in Rust's terminology, that implementation would be unsound

No worries, I'll fix the code to account for the variations mentioned. I was doing a quick impl to see what the impact on the slice benchmark would look like.

For lists, we might be further constrained by whether the list's child field is a struct, list or primitive. If we don't offset the list's child array, we could have a similar case to the problems with structs. So perhaps an acceptable way of slicing lists could still be to propagate the offset and length down, but use the list's offset buffer to determine those values for the child.

Structs are fine, because as long as we implement the correct logic for lists inside ArrayData::slice(), structs of lists will retain the behaviour of lists.

With the data buffers, we could opt not to populate their offsets and use the ArrayData::offset like you suggest, but I think that with the null buffer we would still need to populate the offsets.


I don't know if there's anything still pending in this PR; I'm happy with the changes so far, and can approve it. Then I'll work on the changes above in a separate PR so it doesn't sully and hold up this one.

What do you think @jorgecarleitao @alamb @mqy @jhorstmann ?

@jhorstmann
Contributor

In addition to what @jorgecarleitao mentioned, for Null Buffers or Boolean Buffers you'd have to copy data if the offset is not a multiple of 8 (done automatically by the bit_slice method).
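
A standalone sketch of why (illustrative; arrow's bit_slice does the real work):

    // Byte-aligned bit offsets can share bytes; unaligned ones must re-pack.
    fn bit_slice(bits: &[u8], offset: usize, len: usize) -> Vec<u8> {
        if offset % 8 == 0 {
            // Aligned: the slice is a sub-range of whole bytes (in arrow
            // this is the case that can stay zero-copy).
            bits[offset / 8..(offset + len + 7) / 8].to_vec()
        } else {
            // Unaligned: copy, stitching each output byte from two inputs.
            let mut out = vec![0u8; (len + 7) / 8];
            for i in 0..len {
                let j = offset + i;
                if bits[j / 8] & (1 << (j % 8)) != 0 {
                    out[i / 8] |= 1 << (i % 8);
                }
            }
            out
        }
    }

    fn main() {
        let bits = [0b1010_1100u8];
        // Four bits starting at bit 2 (LSB first): 1, 1, 0, 1 -> 0b1011.
        assert_eq!(bit_slice(&bits, 2, 4), vec![0b0000_1011]);
    }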

Slicing of the child data depends on the data type: for List/String/Binary arrays it should not be modified; it might need to be sliced for struct/union.

I think the main problem with the equal implementation is the calculation of combined null bitmaps and which offsets to use for this combined bitmap. The lhs_start/rhs_start parameters further add to this confusion.

Contributor

@nevi-me nevi-me left a comment

One change, then I'm happy; I'll work on the nested struct slice after we've merged this

-let lhs_pos = lhs_start + i;
-let rhs_pos = rhs_start + i;
+let lhs_pos = lhs.offset() + lhs_start + i;
+let rhs_pos = rhs.offset() + rhs_start + i;
Contributor

There was a comment from @mqy about moving this out of the iter, as it was hurting performance.
It's unexpected that this wouldn't be hoisted out by the compiler, but IIWII.

Contributor

@nevi-me nevi-me left a comment

LGTM
