Implement PARQUET-2249: Introduce IEEE 754 total order#9619
Conversation
| // For floating point we need to compare NaN values until we encounter a non-NaN | ||
| // value which then becomes the new min/max. After this, only non-NaN values are | ||
| // evaluated. If all values are NaN, then the min/max NaNs as determined by | ||
| // IEEE 754 total order are returned. |
There was a problem hiding this comment.
This has me a bit worried. I need to do some benchmarking to make sure all the complicated NaN logic isn't killing performance.
There was a problem hiding this comment.
Agree, this method is on the hot path. I had a look at optimizing it, but could not get the compiler to generate nice auto-vectorized code for nan-handling yet. I think we can try optimizations in a followup, it would be more important to get the semantics correct first and make sure there are tests for edge cases.
If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.
Does the current code correctly distinguish different NaN payloads according to their sign and bit patterns?
(Solved, github was hiding the changes to compare_greater in mod.rs)
|
Tests will fail until apache/parquet-testing#104 is merged. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment was marked as outdated.
This comment was marked as outdated.
|
run benchmark arrow_writer env:
BENCH_FILTER: list |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
This comment was marked as outdated.
This comment was marked as outdated.
|
run benchmark arrow_writer env: |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (570f6de) to edfb9ab (merge-base) diff File an issue against this benchmark runner |
It took me a little while to figure out how to do this (see the trail of disaster comments I left on apache/arrow-rs#9619) Add instructions to the README for running specific benchmark suites with filters.
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
The only thing that looks suspicious is some of the However I think @HippoBaro also saw some potential strangeness with those runs, so maybe it is a measurement thing |
Indeed. From #9619 (comment) |
Shout out to @Xuanwo who wrote all of the |
|
Are we waiting on anything else before we merge this pr? |
I think the possible regressions have been addressed, so from a technical standpoint, no. But procedurally, I'm not sure. While the parquet-format PR relating to this is merged, there has not yet been a formal release of parquet-format containing this change. So the question is do we jump the gun by a few days/weeks while a release is bundled? I think the only consequence of being a bit early is we'll write files where readers can't use float statistics...but that's the case right now anyway. If we don't merge now, then we wait until Aug/Sept. I don't have a strong preference either way. |
Roger -- given that I will defer to you (aka not merge this) as I have enough things to try and drive consensus on LOL |
|
Shall i mark it draft and add next release tag? |
|
👍 |
Which issue does this PR close?
Rationale for this change
This takes the implementation done by @Xuanwo (#8158) and updates it to the new thrift format and recent changes to the original proposal (apache/parquet-format#514).
What changes are included in this PR?
Adds needed thrift structures as well as NaN counts for pages and column chunks.
Are these changes tested?
Yes, new tests added (more may be needed).
Are there any user-facing changes?
Yes. This adds the
IEEE_754_TOTAL_ORDERvariant to the publicColumnOrderenum, andTOTAL_ORDERto the publicSortOrderenum, which are technically breaking API changes. This also adds anan_countargument to the public functionColumnIndexBuilder::append.Non-breaking changes include adding
nan_countfields to theValueStatisticsandColumnIndexstructs.Behaviorally, this will now order all floating point statistics using IEEE754 total order, and all floating point columns will use the new
IEEE_754_TOTAL_ORDERcolumn order in theFileMetaData.column_orderslist. Newer readers that recognize the new column order may now safely use the statistics for chunk and page pruning. Older readers will not recognize this column order and should continue to ignore the statistics.