[Parquet] Do not panic when trying to skip records in delta encoded files using non-standard block sizes#9794
etseidl wants to merge 4 commits into apache:main from fix_delta_skip
Conversation
```diff
-let mut skip_buffer = vec![T::T::default(); mini_block_batch_size];
+let mut skip_buffer = vec![T::T::default(); self.values_per_mini_block];
```
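For context, values_per_mini_block is not read directly from the page but derived from two fields of the DELTA_BINARY_PACKED page header (the block size in values and the mini-block count per block), so a malformed file can make it arbitrarily large. A hedged sketch of that derivation (the function name is ours, not the crate's):

```rust
/// Illustrative helper: derive values_per_mini_block from the two
/// DELTA_BINARY_PACKED header fields. Per the Parquet spec the mini-block
/// count must evenly divide the block size; a file that violates this is
/// malformed, so we return None rather than trusting it.
fn values_per_mini_block(block_size: usize, mini_blocks_per_block: usize) -> Option<usize> {
    if mini_blocks_per_block == 0 || block_size % mini_blocks_per_block != 0 {
        return None;
    }
    Some(block_size / mini_blocks_per_block)
}
```

Typical writers emit a block size of 128 with 4 mini-blocks, giving 32 values per mini-block; the file added by this PR uses other, still spec-legal, combinations.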
This might panic if values_per_mini_block is crazy. Sadly try_with_capacity is still unstable.
Vec::default() + Vec::try_reserve() + Vec::resize() would be a way to handle allocation errors, but I think enforcing a smaller limit than available memory in set_data could be a good idea. I'm thinking something between 64k and 1 million values.
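A minimal sketch of the Vec::default() + try_reserve() + resize() pattern suggested above, assuming a plain string error for illustration (the crate would use its own error type, and the helper name is hypothetical):

```rust
/// Hypothetical fallible-allocation helper: surface an oversized request as
/// an error instead of letting `vec![..; n]` abort on allocation failure.
fn try_alloc_skip_buffer<T: Default + Clone>(len: usize) -> Result<Vec<T>, String> {
    let mut buf: Vec<T> = Vec::default();
    // try_reserve (stable since Rust 1.57) reports capacity overflow or
    // allocation failure as an Err instead of panicking/aborting
    buf.try_reserve(len)
        .map_err(|e| format!("allocation failed: {e}"))?;
    // resize will not reallocate here because capacity was already reserved
    buf.resize(len, T::default());
    Ok(buf)
}
```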
Maybe the skip_buffer could also be stored inside the struct and reused, to avoid allocations per skip.
I just saw someone else already opened a separate issue about moving the skip_buffer (#9792)
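The buffer-reuse idea (now tracked separately in #9792) could look roughly like this; the struct and field names are illustrative only, not the decoder's actual layout:

```rust
/// Illustrative decoder holding its scratch buffer, so repeated skips
/// reuse one allocation instead of allocating per call.
struct Decoder {
    values_per_mini_block: usize,
    skip_buffer: Vec<i64>, // retained across skip() calls
}

impl Decoder {
    /// Hypothetical accessor used by skip(): resize only adjusts the length;
    /// capacity is kept between calls, so steady-state skips allocate nothing.
    fn skip_scratch(&mut self) -> &mut [i64] {
        self.skip_buffer.resize(self.values_per_mini_block, 0);
        &mut self.skip_buffer
    }
}
```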
That's a good idea @jhorstmann...kill two birds with one stone.
> I think enforcing a smaller limit than available memory in set_data could be a good idea. I'm thinking something between 64k to 1 million values.
That's probably a reasonable thing to do as well. Let's start with a million.
This also adds a smallish test file...I don't know if it would be better to try adding it to parquet-testing. The file is actually a rewrite of the parquet-testing
run benchmark arrow_reader env:
BENCH_FILTER: /binary.*skip
🤖 Arrow criterion benchmark running (GKE): comparing fix_delta_skip (855e870) to b93240a (merge-base)
🤖 Arrow criterion benchmark completed (GKE)
Looks like all of the benches that use the skip_buffer regressed :( Investigating... Update:
run benchmark arrow_reader env:
BENCH_FILTER: /binary.*skip
🤖 Arrow criterion benchmark running (GKE): comparing fix_delta_skip (83dbfcc) to b93240a (merge-base)
I've now opted for erroring in skip, rather than set_data, since this is only an issue on skip. The decoder handles weird block sizes just fine. @jhorstmann what do you think?
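A sketch of that approach: bound values_per_mini_block inside skip and return an error instead of allocating an enormous buffer. The constant (one million, per the discussion above), the helper name, and the string error type are placeholders for whatever the crate actually uses:

```rust
/// Illustrative cap from the discussion above; not the crate's actual constant.
const MAX_VALUES_PER_MINI_BLOCK: usize = 1_000_000;

/// Hypothetical validation performed at the top of skip(): reject absurd
/// mini-block sizes before sizing the skip buffer.
fn validate_skip_buffer_len(values_per_mini_block: usize) -> Result<usize, String> {
    if values_per_mini_block > MAX_VALUES_PER_MINI_BLOCK {
        Err(format!(
            "values_per_mini_block ({values_per_mini_block}) exceeds limit \
             of {MAX_VALUES_PER_MINI_BLOCK}"
        ))
    } else {
        Ok(values_per_mini_block)
    }
}
```

Doing the check in skip rather than set_data means decoding of such files keeps working unchanged; only the skip path, which actually allocates the buffer, can fail.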
🤖 Arrow criterion benchmark completed (GKE)
Which issue does this PR close?
DeltaBitPackDecoder::skip could panic on "non-standard" miniblocks #9793.

Rationale for this change
DeltaBitPackDecoder::skip uses some magic numbers when sizing the buffer used for skipping. Files that use non-standard miniblock sizes will cause skip to panic.

What changes are included in this PR?
Replace hard-coded values with values_per_miniblock when creating the skip buffer. values_per_miniblock is calculated from values in the delta header.

Are these changes tested?
Yes, a test with a file using non-standard sizes is added to the arrow_reader tests.
Are there any user-facing changes?
No.