[T1] Allow fixed length encoding for min/max and deprecate encoding_stats #252
alkis wants to merge 1 commit into apache:master
src/main/thrift/parquet.thrift
Outdated
 * Only one pair of max_value/min_value, max1/min1, max2/min2, max4/min4,
 * max8/min8 can be set. The pair is determined by the physical type of the
 * column. Floating point values are bitcasted to integers. Variable length
 * values are set in min_value/max_value.
Could you please update the docs to say that, for backwards compatibility, readers should check min_value/max_value if the fixed-width field is not set?
Rewrote this to be clearer.
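A minimal sketch of the reader-side fallback requested above, assuming a reader that prefers the fixed-width pair and falls back to `min_value`/`max_value` when it is absent. The `Statistics` dataclass and `read_min_max` helper are hypothetical stand-ins for illustration, not the generated Thrift classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    """Hypothetical stand-in for a subset of the Thrift Statistics struct."""
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    min8: Optional[int] = None
    max8: Optional[int] = None


def read_min_max(stats):
    """Return (kind, min, max), or None when no usable pair is set."""
    # Prefer the fixed-width pair when the writer set it.
    if stats.min8 is not None and stats.max8 is not None:
        return ("fixed", stats.min8, stats.max8)
    # Files from older writers only carry min_value/max_value.
    if stats.min_value is not None and stats.max_value is not None:
        return ("plain", stats.min_value, stats.max_value)
    return None
```

With this shape, a reader treats a missing `min8`/`max8` as "not written" rather than as an error, which keeps older files readable.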
 7: optional bool is_max_value_exact;
 /** If true, min_value is the actual minimum value for a column */
 8: optional bool is_min_value_exact;
 9: optional i64 max8;
Did you intentionally elide min1/max1? (They are still mentioned above.)
Yes, I removed them because they provide little benefit and do not justify the added complexity. In Thrift these are ULEB128-encoded, so it makes no difference on the wire. For FlatBuffers this would make a difference, though.
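To illustrate the point about wire size, here is a small sketch assuming zigzag plus ULEB128 varint encoding, as used for i16/i32/i64 in Thrift's compact protocol. The helper names are made up for this example. The encoded length depends on the magnitude of the value, not on the declared width of the field.

```python
def zigzag64(n):
    """Map a signed 64-bit integer to an unsigned one (small magnitudes stay small)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def uleb128(u):
    """Encode a non-negative integer as ULEB128: 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint(n):
    """Zigzag, then ULEB128; the result does not depend on the declared field width."""
    return uleb128(zigzag64(n))


# A small statistic costs one byte even though the field is declared as i64.
small = encode_varint(5)
# A value that needs the full 64 bits costs ten bytes.
large = encode_varint(-2**63)
```

Under this assumption, a dedicated narrow field would save nothing for small values, which matches the reasoning given for dropping min1/max1.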
 /** Set of all encodings used for pages in this column chunk.
 /**
  * DEPRECATED: use is_fully_dict_encoded instead
I would suggest making this a separate PR; I think we'd prefer to keep the changes as small and focused as possible.
Agreed. I will create another PR for the Statistics change alone if we are OK merging that now.
…tats
1. Add `min8`/`max8` fields for fixed-length binary encoding of min/max for physical types of 8 bytes or fewer.
2. Deprecate `ColumnMetaData.encoding_stats` and replace it with a bool `ColumnMetaData.is_fully_dict_encoded`.
b2caa21 to ec13f34
 * the columns ColumnOrder
 * max_value/min_value: PLAIN encoded values, sans length prefix if varlen
 * max8/min8: up to 8-bytes:
 * FLOAT, DOUBLE: bitcasted to INT32 and INT64, respectively
I think we might want to be more specific here about how values narrower than 8 bytes are translated into 8 bytes. In practice it doesn't make a difference for readers, but it would be good to limit ambiguity. I assume we do a normal cast from 1- or 4-byte integer values to 8-byte values rather than just embedding them?
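One possible reading of the conversion, sketched under the assumption raised in the comment above: FLOAT and DOUBLE are bitcast to INT32 and INT64, and narrower integers are widened with a normal sign-extending cast. The function names are hypothetical, and the PR text does not confirm that this is the intended rule.

```python
import struct


def float_to_fixed8(x):
    """Bitcast a 32-bit float to a signed INT32; Python ints keep the sign when widened."""
    (bits32,) = struct.unpack("<i", struct.pack("<f", x))
    return bits32


def double_to_fixed8(x):
    """Bitcast a 64-bit double to a signed INT64."""
    (bits64,) = struct.unpack("<q", struct.pack("<d", x))
    return bits64


def int32_to_fixed8(v):
    """Widen an INT32 to 8 bytes with a sign-extending cast (assumed, not confirmed)."""
    struct.pack("<i", v)  # raises struct.error if v does not fit in 32 bits
    (wide,) = struct.unpack("<q", struct.pack("<q", v))
    return wide


def fixed8_to_float(v):
    """Inverse of float_to_fixed8, to show the bitcast is lossless."""
    (x,) = struct.unpack("<f", struct.pack("<i", v))
    return x
```

For example, `float_to_fixed8(-1.0)` yields a negative integer because the float's sign bit becomes the INT32 sign bit, and a sign-extending cast keeps that value unchanged in 8 bytes. Embedding the raw 4 bytes into the low half of an 8-byte field would instead give a different, positive number, which is the ambiguity the comment asks the docs to rule out.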
1. Add `min8`/`max8` fields for fixed-length binary encoding of min/max for physical types of 8 bytes or fewer.
2. Deprecate `ColumnMetaData.encoding_stats` and replace it with a bool `ColumnMetaData.is_fully_dict_encoded`.

ref: Parquet Metadata evolution