Description
The help page of tiledb_array_schema() says little about how to map a list of enumerations (factor levels) to enum attributes.
One would expect that passing a named list of enums, with one element per enum attribute, should work. That is not always true, and even when the call succeeds the mapping can be incorrect.
The safest option is to list every attribute, using NULL for non-enum attributes and keeping the entries in the same order as the schema attributes. This is impractical for schemas with many attributes.
Note that fromDataframe() maps all attributes, so it has no trouble marking the right attributes as enums.
In the example that follows, creating the schema with enums throws no error, but the enum mappings come out wrong.
Source schema helper
test_schema_creation <- function(enum_list) {
  sch <- tiledb_array_schema(
    domain = tiledb_domain(c(
      tiledb_dim(
        name = "id",
        domain = c(NULL, NULL),
        tile = NULL,
        type = "ASCII",
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        ))
      )
    )),
    attrs = c(
      tiledb_attr(
        name = "col1",
        type = "INT32",
        ncells = 1,
        nullable = FALSE,
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        ))
      ),
      tiledb_attr(
        name = "enum1",
        type = "INT32",
        ncells = 1,
        nullable = FALSE,
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        )),
        enumeration = TRUE
      ),
      tiledb_attr(
        name = "col2",
        type = "INT32",
        ncells = 1,
        nullable = FALSE,
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        ))
      ),
      tiledb_attr(
        name = "enum2",
        type = "INT32",
        ncells = 1,
        nullable = FALSE,
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        )),
        enumeration = TRUE
      ),
      tiledb_attr(
        name = "enum3",
        type = "INT32",
        ncells = 1,
        nullable = FALSE,
        filter_list = tiledb_filter_list(c(
          tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
        )),
        enumeration = TRUE
      )
    ),
    cell_order = "COL_MAJOR",
    tile_order = "COL_MAJOR",
    capacity = 10000,
    sparse = TRUE,
    allows_dups = FALSE,
    coords_filter_list = tiledb_filter_list(c(
      tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
    )),
    offsets_filter_list = tiledb_filter_list(c(
      tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
    )),
    validity_filter_list = tiledb_filter_list(c(
      tiledb_filter_set_option(tiledb_filter("RLE"), "COMPRESSION_LEVEL", -1)
    )),
    enumerations = enum_list
  )
  sch
}
library(tiledb) # TileDB R 0.33.1
# `sch` defines 3 enum attributes: `enum1`, `enum2` and `enum3` but `tiledb_array_create()` will set all
# attributes as enum (case 1).
# case 1: map enum attributes only [NOT OK]
uri <- tempfile()
enums <- list(
  enum1 = c("A", "B"),
  enum2 = c("yes", "no"),
  enum3 = c("aa")
)
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)
tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
#  TRUE  TRUE  TRUE  TRUE  TRUE
# case 2: map all attributes with exact order [ OK ]
uri <- tempfile()
enums <- list(
  col1 = NULL,
  enum1 = c("A", "B"),
  col2 = NULL,
  enum2 = c("yes", "no"),
  enum3 = c("aa")
)
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)
tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
# FALSE  TRUE FALSE  TRUE  TRUE
# case 3: map all attributes, but out of order [NOT OK]
uri <- tempfile()
enums <- list(
  enum1 = c("A", "B"),
  col1 = NULL,
  col2 = NULL,
  enum2 = c("yes", "no"),
  enum3 = c("aa")
)
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)
tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
#  TRUE  TRUE FALSE  TRUE  TRUE
The problem is at the C++ level in libtiledb_array_schema, which accepts an enum list of any length but maps it by position: entry i of the enum list is always attached to attribute i, whatever the entry is named. In the first example the enum list has length 3, so the mapping is applied to the first 3 attributes: col1, enum1 and col2. Surprisingly, enum2 and enum3 are also set as enums even though the C++ loop never reaches them, so the mapping for those must happen somewhere else.
Lines 1913 to 1930 in 0cb7689
for (R_xlen_t i = 0; i < nenum; i++) {
  bool nn = enumerations[i] == R_NilValue;
  if (nn == false) {
    XPtr<tiledb::Attribute> attr =
        as<XPtr<tiledb::Attribute>>(attributes[i]);
    std::vector<std::string> enums =
        as<std::vector<std::string>>(enumerations[i]);
    std::string enum_name = std::string(enumnames[i]);
    bool is_ordered = false; // default
    // 'ordered' is an attribute off the CharacterVector
    CharacterVector enumvect = enumerations[i];
    if (enumvect.hasAttribute("ordered")) {
      is_ordered = (as<bool>(enumvect.attr("ordered")) == true);
    }
    libtiledb_array_schema_set_enumeration(ctx, schema, attr, enum_name,
                                           enums, false, is_ordered);
  }
}
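To make the positional behaviour easier to see, here is a minimal standalone sketch of what that loop does, using only the standard library. It is not the tiledb-r code: `EnumEntry` and `map_by_position` are hypothetical names, attributes are reduced to their names, and the `is_null` flag stands in for an R `NULL` entry. It only models the loop shown above, so it does not explain why `enum2` and `enum3` still end up as enums in case 1.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one element of the R enumeration list.
// is_null == true plays the role of a NULL entry (no enumeration).
struct EnumEntry {
    std::string name;
    bool is_null;
    std::vector<std::string> levels;
};

// Models the positional loop: entry i is attached to attribute i,
// and the entry's own name is only used as the enumeration name.
// Returns (attribute name, enumeration name) pairs in loop order.
std::vector<std::pair<std::string, std::string>>
map_by_position(const std::vector<std::string>& attributes,
                const std::vector<EnumEntry>& enumerations) {
    std::vector<std::pair<std::string, std::string>> mapping;
    for (std::size_t i = 0;
         i < enumerations.size() && i < attributes.size(); ++i) {
        if (enumerations[i].is_null) {
            continue;  // NULL entry: attribute i gets no enumeration
        }
        mapping.emplace_back(attributes[i], enumerations[i].name);
    }
    return mapping;
}
```

Called with the attributes `col1, enum1, col2, enum2, enum3` and the three-entry list from case 1, this returns the pairs `(col1, enum1)`, `(enum1, enum2)` and `(col2, enum3)`: every enumeration lands on the wrong attribute. With the shuffled list from case 3 it returns `(col1, enum1)`, `(enum2, enum2)` and `(enum3, enum3)`, which matches `col1` being reported as an enum there.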
One solution that fixes the issue could be:
- at the R level, add names to attribute list
- at the C++ level, match the right attribute using enum name.
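As a rough illustration of the second point, the sketch below matches each enumeration to an attribute by name instead of by position. It is a standalone model using only the standard library, not a patch against tiledb-r: `NamedEnum` and `map_by_name` are hypothetical names, and throwing on an enum name that matches no attribute is an assumption about the desired behaviour, not something decided in this issue.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of the R enumeration list.
// is_null == true plays the role of a NULL entry (no enumeration).
struct NamedEnum {
    std::string name;
    bool is_null;
    std::vector<std::string> levels;
};

// Matches enumerations to attributes by name, so neither the order nor
// the length of the enumeration list matters.
// Returns attribute name -> enumeration levels.
std::map<std::string, std::vector<std::string>>
map_by_name(const std::vector<std::string>& attribute_names,
            const std::vector<NamedEnum>& enumerations) {
    const std::set<std::string> known(attribute_names.begin(),
                                      attribute_names.end());
    std::map<std::string, std::vector<std::string>> mapping;
    for (const NamedEnum& e : enumerations) {
        if (e.is_null) {
            continue;  // NULL entry: nothing to attach
        }
        if (known.count(e.name) == 0) {
            // Assumption: an unmatched name is an error rather than ignored.
            throw std::invalid_argument("no attribute named '" + e.name + "'");
        }
        mapping[e.name] = e.levels;
    }
    return mapping;
}
```

With this lookup, the three-entry list from case 1 and the shuffled list from case 3 produce the same result: only `enum1`, `enum2` and `enum3` receive enumerations.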
I will follow up with a PR for discussion.