Issue on adding enums via tiledb_array_schema() #853

@cgiachalis

The help page of tiledb_array_schema() is not very informative on how to map an enumeration (factors) list to enum attributes.

One might assume that passing a named list of enums, where each element corresponds to an enum attribute, should work, but that is not always true; and even when it does work, the mapping can be incorrect.

The safest option is to list every attribute in schema order, setting NULL for the non-enum ones. This is not ideal for schemas with many attributes.

Note that fromDataframe() maps all attributes, so it is not affected by this problem and marks the right attributes as enums.

In the example that follows, creating the schema with enums does not throw any error, but the enum mappings are wrong.

Source schema helper
test_schema_creation <- function(enum_list) {

sch <- tiledb_array_schema(
  domain = tiledb_domain(c(
    tiledb_dim(
      name = "id",
      domain = c(NULL, NULL),
      tile = NULL,
      type = "ASCII",
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
    )
  )),
  attrs = c(

    tiledb_attr(
      name = "col1",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
      ),
    tiledb_attr(
      name = "enum1",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    ),
    tiledb_attr(
      name = "col2",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      ))
    ),
    tiledb_attr(
      name = "enum2",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    ),
    tiledb_attr(
      name = "enum3",
      type = "INT32",
      ncells = 1,
      nullable = FALSE,
      filter_list = tiledb_filter_list(c(
        tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
      )),
      enumeration = TRUE
    )
  ),
  cell_order = "COL_MAJOR",
  tile_order = "COL_MAJOR",
  capacity = 10000,
  sparse = TRUE,
  allows_dups = FALSE,
  coords_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
  )),
  offsets_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("ZSTD"), "COMPRESSION_LEVEL", -1)
  )),
  validity_filter_list = tiledb_filter_list(c(
    tiledb_filter_set_option(tiledb_filter("RLE"), "COMPRESSION_LEVEL", -1)
  )),
  enumerations = enum_list
)
sch
}
library(tiledb) # TileDB R 0.33.1 

# `sch` defines 3 enum attributes (`enum1`, `enum2` and `enum3`), but after
# `tiledb_array_create()` all attributes are reported as enums (case 1).

# case 1: map enum attributes only [NOT OK]
uri <- tempfile()

enums <- list(
    enum1 =  c("A", "B"),
    enum2 = c("yes", "no"),
    enum3 = c("aa")
  )
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
#  TRUE  TRUE  TRUE  TRUE  TRUE
 

# case 2: map all attributes with exact order [ OK ]
uri <- tempfile()

enums <- list(
    col1 = NULL,
    enum1 =  c("A", "B"),
    col2 = NULL,
    enum2 = c("yes", "no"),
    enum3 = c("aa")
)
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
# FALSE  TRUE FALSE  TRUE  TRUE

# case 3: map all attributes, but out of order [NOT OK]
uri <- tempfile()
enums <- list(
    enum1 =  c("A", "B"),
    col1 = NULL, 
    col2 = NULL,
    enum2 = c("yes", "no"),
    enum3 = c("aa"))
  
sch <- test_schema_creation(enums)
tiledb_array_create(uri, sch)
arr <- tiledb_array(uri)

tiledb_array_has_enumeration(arr)
#  col1 enum1  col2 enum2 enum3
#  TRUE  TRUE FALSE  TRUE  TRUE

The problem is at the C++ level, within libtiledb_array_schema, which accepts an enum list of any length but
pairs its entries with attributes by position, as if the list covered every attribute. In the first example the
enum list has length 3, so the mapping is applied to the first 3 attributes: col1, enum1 and col2. Surprisingly,
enum2 and enum3 are also set as enums although the C++ loop never reaches them, so the mapping for those must
happen somewhere else.

TileDB-R/src/libtiledb.cpp

Lines 1913 to 1930 in 0cb7689

for (R_xlen_t i = 0; i < nenum; i++) {
  bool nn = enumerations[i] == R_NilValue;
  if (nn == false) {
    XPtr<tiledb::Attribute> attr =
        as<XPtr<tiledb::Attribute>>(attributes[i]);
    std::vector<std::string> enums =
        as<std::vector<std::string>>(enumerations[i]);
    std::string enum_name = std::string(enumnames[i]);
    bool is_ordered = false; // default
    // 'ordered' is an attribute off the CharacterVector
    CharacterVector enumvect = enumerations[i];
    if (enumvect.hasAttribute("ordered")) {
      is_ordered = (as<bool>(enumvect.attr("ordered")) == true);
    }
    libtiledb_array_schema_set_enumeration(ctx, schema, attr, enum_name,
                                           enums, false, is_ordered);
  }
}
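
To make the positional pairing in the loop above concrete, here is a minimal standalone sketch that models only that loop, using plain standard-library types instead of Rcpp or TileDB objects. The names `EnumEntry` and `pair_by_position` are hypothetical and exist only for illustration; the sketch does not model whatever other code path marks enum2 and enum3 as enums.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one element of the R-side `enumerations` list.
struct EnumEntry {
  std::string name;                 // list element name, e.g. "enum1"
  bool is_null;                     // true models an R NULL element
  std::vector<std::string> levels;  // enumeration values when not NULL
};

// Models the positional loop: entry i is applied to attribute i, and the
// entry's name is never compared with the attribute's name.
// Returns (attribute name, enum name) pairs for every non-NULL entry.
std::vector<std::pair<std::string, std::string>>
pair_by_position(const std::vector<std::string>& attr_names,
                 const std::vector<EnumEntry>& enums) {
  std::vector<std::pair<std::string, std::string>> out;
  for (std::size_t i = 0; i < enums.size() && i < attr_names.size(); i++) {
    if (!enums[i].is_null) {
      out.emplace_back(attr_names[i], enums[i].name);
    }
  }
  return out;
}
```

Called with the attribute order col1, enum1, col2, enum2, enum3 and the three-element list from case 1, this pairs col1 with the enum1 entry, enum1 with the enum2 entry, and col2 with the enum3 entry, which is the mismatch described above.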


One solution that fixes the issue could be:

  1. at the R level, add names to attribute list
  2. at the C++ level, match the right attribute using enum name.
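
The two steps above could look roughly like the following standalone sketch. It assumes the attribute names are available at the C++ level (step 1) and looks each enum entry up by name (step 2). `EnumEntry` and `pair_by_name` are hypothetical names, no Rcpp or TileDB calls are used, and throwing on an unknown name is an assumption of this sketch rather than settled behaviour.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one element of the R-side `enumerations` list.
struct EnumEntry {
  std::string name;                 // list element name, e.g. "enum1"
  bool is_null;                     // true models an R NULL element
  std::vector<std::string> levels;  // enumeration values when not NULL
};

// Matches each non-NULL enum entry to the attribute with the same name,
// regardless of the entry's position in the list.
// Returns (attribute name, enum name) pairs.
std::vector<std::pair<std::string, std::string>>
pair_by_name(const std::vector<std::string>& attr_names,
             const std::vector<EnumEntry>& enums) {
  // Build a name -> position index over the attributes.
  std::unordered_map<std::string, std::size_t> index;
  for (std::size_t i = 0; i < attr_names.size(); i++) {
    index[attr_names[i]] = i;
  }
  std::vector<std::pair<std::string, std::string>> out;
  for (const EnumEntry& e : enums) {
    if (e.is_null) continue;  // NULL entries mark non-enum attributes
    auto it = index.find(e.name);
    if (it == index.end()) {
      // Assumption: an enum with no matching attribute is an error.
      throw std::invalid_argument("no attribute named '" + e.name + "'");
    }
    out.emplace_back(attr_names[it->second], e.name);
  }
  return out;
}
```

With this lookup, the lists from case 1 and case 3 both resolve to enum1, enum2 and enum3 only, so neither a shorter list nor a reordered one changes the result.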

I will follow up with a PR for discussion.
