Commit 8875504

arrow-row: Document dictionary handling (apache#8168)
# Which issue does this PR close?

- Related to apache#7627
- Related to apache#4811

# Rationale for this change

It was not clear to me what the expected behavior for a round trip through the row converter was for DictionaryArrays, so let's document what @tustvold says here: apache#8067 (comment)

> I think the issue is that Datafusion is not handling the fact that row encoding "hydrates" dictionaries. It should be updated to understand that List<Dictionary<...>> will be converted to List<...>, much like it already handles this for the non-nested case. Converting back to a dictionary is expensive, and likely pointless, not to mention a breaking change.

# What changes are included in this PR?

Document expected behavior with English comments and a doc test

# Are these changes tested?

Yes (doctests)

# Are there any user-facing changes?

More docs, no behavior change
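
The type mapping described in the quoted comment (a dictionary comes back as its value type, including when nested inside a list) can be sketched with a toy type enum. This is only an illustration under stated assumptions: `ToyType` and `flattened` are hypothetical names invented here, not part of `arrow-schema` or `arrow-row`, and the sketch models only the types, not the data.

```rust
/// Toy stand-in for a data type. Hypothetical: this is NOT `arrow_schema::DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int8,
    Utf8,
    List(Box<ToyType>),
    /// Dictionary(key type, value type)
    Dictionary(Box<ToyType>, Box<ToyType>),
}

/// Returns the type obtained by replacing every dictionary, at any
/// nesting depth, with its value type.
fn flattened(ty: &ToyType) -> ToyType {
    match ty {
        // A dictionary collapses to its (recursively flattened) value type.
        ToyType::Dictionary(_key, value) => flattened(value),
        // A list keeps its shape; only its element type is flattened.
        ToyType::List(inner) => ToyType::List(Box::new(flattened(inner))),
        // Non-nested, non-dictionary types are unchanged.
        other => other.clone(),
    }
}

fn main() {
    let dict = ToyType::Dictionary(Box::new(ToyType::Int8), Box::new(ToyType::Utf8));
    let list_of_dict = ToyType::List(Box::new(dict.clone()));

    // Non-nested case: Dictionary<Int8, Utf8> becomes Utf8.
    assert_eq!(flattened(&dict), ToyType::Utf8);
    // Nested case: List<Dictionary<Int8, Utf8>> becomes List<Utf8>.
    assert_eq!(flattened(&list_of_dict), ToyType::List(Box::new(ToyType::Utf8)));
}
```

A consumer that compares schemas before and after a round trip could apply a mapping of this shape to the input types first, rather than expecting the dictionary types to survive.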
1 parent ebb6ede commit 8875504

1 file changed: +33 -1 lines changed

arrow-row/src/lib.rs

Lines changed: 33 additions & 1 deletion
@@ -97,7 +97,7 @@
 //! assert_eq!(&c2_values, &["a", "f", "c", "e"]);
 //! ```
 //!
-//! # Lexsort
+//! # Lexicographic Sorts (lexsort)
 //!
 //! The row format can also be used to implement a fast multi-column / lexicographic sort
 //!
@@ -117,13 +117,41 @@
 //! }
 //! ```
 //!
+//! # Flattening Dictionaries
+//!
+//! For performance reasons, dictionary arrays are flattened ("hydrated") to their
+//! underlying values during row conversion. See [the issue] for more details.
+//!
+//! This means that the arrays that come out of [`RowConverter::convert_rows`]
+//! may not have the same data types as the input arrays. For example, a
+//! `Dictionary<Int8, Utf8>` that is encoded and then decoded will come out as a `Utf8` array.
+//!
+//! ```
+//! # use arrow_array::{Array, ArrayRef, DictionaryArray};
+//! # use arrow_array::types::Int8Type;
+//! # use arrow_row::{RowConverter, SortField};
+//! # use arrow_schema::DataType;
+//! # use std::sync::Arc;
+//! // Input is a Dictionary array
+//! let dict: DictionaryArray::<Int8Type> = ["a", "b", "c", "a", "b"].into_iter().collect();
+//! let sort_fields = vec![SortField::new(dict.data_type().clone())];
+//! let arrays = vec![Arc::new(dict) as ArrayRef];
+//! let converter = RowConverter::new(sort_fields).unwrap();
+//! // Convert to rows
+//! let rows = converter.convert_columns(&arrays).unwrap();
+//! let converted = converter.convert_rows(&rows).unwrap();
+//! // result was a Utf8 array, not a Dictionary array
+//! assert_eq!(converted[0].data_type(), &DataType::Utf8);
+//! ```
+//!
 //! [non-comparison sorts]: https://en.wikipedia.org/wiki/Sorting_algorithm#Non-comparison_sorts
 //! [radix sort]: https://en.wikipedia.org/wiki/Radix_sort
 //! [normalized for sorting]: http://wwwlgis.informatik.uni-kl.de/archiv/wwwdvs.informatik.uni-kl.de/courses/DBSREAL/SS2005/Vorlesungsunterlagen/Implementing_Sorting.pdf
 //! [`memcmp`]: https://www.man7.org/linux/man-pages/man3/memcmp.3.html
 //! [`lexsort`]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort.html
 //! [compared]: PartialOrd
 //! [compare]: PartialOrd
+//! [the issue]: https://github.com/apache/arrow-rs/issues/4811
 
 #![doc(
     html_logo_url = "https://arrow.apache.org/img/arrow-logo_chevrons_black-txt_white-bg.svg",
@@ -661,6 +689,8 @@ impl RowConverter {
     ///
     /// See [`Row`] for information on when [`Row`] can be compared
     ///
+    /// See [`Self::convert_rows`] for converting [`Rows`] back into [`ArrayRef`]
+    ///
     /// # Panics
     ///
     /// Panics if the schema of `columns` does not match that provided to [`RowConverter::new`]
@@ -768,6 +798,8 @@ impl RowConverter {
 
     /// Convert [`Rows`] columns into [`ArrayRef`]
     ///
+    /// See [`Self::convert_columns`] for converting [`ArrayRef`] into [`Rows`]
+    ///
     /// # Panics
     ///
     /// Panics if the rows were not produced by this [`RowConverter`]
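
What "flattening" means for the data itself can be illustrated without the arrow crates. The following is a minimal sketch under stated assumptions: `ToyDictionary`, `encode_rows`, and `decode_rows` are hypothetical names made up for this example, and the "row" here is simply the value's UTF-8 bytes, which is far simpler than the real `arrow-row` encoding (no nulls, no multiple columns, no sort options). It only shows that once a row carries the value bytes rather than the key, decoding can recover the values but not the original dictionary.

```rust
/// Toy dictionary-encoded string column: `keys[i]` indexes into `values`.
/// Hypothetical illustration type, not arrow's `DictionaryArray`.
struct ToyDictionary {
    keys: Vec<i8>,
    values: Vec<&'static str>,
}

impl ToyDictionary {
    /// "Hydrate" the column: replace every key with the value it refers to.
    fn hydrate(&self) -> Vec<&'static str> {
        self.keys.iter().map(|&k| self.values[k as usize]).collect()
    }
}

/// Toy row encoding: one row per element, holding the value's bytes so that
/// byte-wise comparison of rows follows the order of the values, not the keys.
fn encode_rows(col: &ToyDictionary) -> Vec<Vec<u8>> {
    col.hydrate()
        .into_iter()
        .map(|s| s.as_bytes().to_vec())
        .collect()
}

/// Decoding yields plain strings; the keys were never written into the rows.
fn decode_rows(rows: &[Vec<u8>]) -> Vec<String> {
    rows.iter()
        .map(|r| String::from_utf8(r.clone()).expect("rows hold valid UTF-8"))
        .collect()
}

fn main() {
    // The dictionary values are deliberately unsorted, so key order differs
    // from value order: key 0 -> "c", key 1 -> "a", key 2 -> "b".
    let col = ToyDictionary {
        keys: vec![0, 1, 2, 0, 1],
        values: vec!["c", "a", "b"],
    };

    let mut rows = encode_rows(&col);
    rows.sort(); // byte-wise comparison of the encoded rows

    // The round trip produces a plain string column in value order.
    assert_eq!(decode_rows(&rows), ["a", "a", "b", "c", "c"]);
}
```

In this toy model, rebuilding a dictionary after decoding would require a separate deduplication pass over the decoded values, which is the kind of extra work the quoted comment in the commit message calls expensive.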
