Describe the bug
GetStructField (native/spark-expr/src/struct_funcs/get_struct_field.rs) extracts a struct field by returning the child column directly, without applying the parent struct's null mask:
    ColumnarValue::Array(array) => {
        let struct_array = array.as_any().downcast_ref::<StructArray>().expect("A struct is expected");
        Ok(ColumnarValue::Array(Arc::clone(struct_array.column(self.ordinal))))
    }
In Arrow, a StructArray's child arrays carry their own validity, independent of the parent struct's null buffer. At a row where the struct itself is null, the child buffer can still hold a non-null value. Returning the child verbatim therefore reads a field of a NULL struct as non-null, which violates Spark semantics (a field of a null struct is null). Concretely, isnotnull(structCol.field) returns true for a row whose structCol is null.
This is a data-correctness bug for any query that accesses a field of a nullable struct read from a Parquet file in which a logically null struct still has a populated child buffer.
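The mismatch between parent and child validity can be illustrated with a small stdlib-only model. This is a sketch, not the arrow-rs API: `StructColumn`, `get_field_verbatim`, `get_field_masked`, and `sample_column` are hypothetical names, and validity is modeled as `Vec<bool>` rather than packed bitmaps.

```rust
/// Simplified stand-in for a struct column with a single i64 field.
/// Hypothetical model for illustration; not the arrow-rs StructArray.
struct StructColumn {
    /// Parent validity: true means the struct at this row is non-null.
    struct_valid: Vec<bool>,
    /// Child values, stored independently of the parent.
    field_values: Vec<i64>,
    /// Child validity: true means the field at this row is non-null.
    field_valid: Vec<bool>,
}

/// Mirrors returning the child column verbatim: the parent mask is ignored.
fn get_field_verbatim(col: &StructColumn) -> Vec<Option<i64>> {
    col.field_values
        .iter()
        .zip(&col.field_valid)
        .map(|(v, &valid)| if valid { Some(*v) } else { None })
        .collect()
}

/// Spark semantics: the field is non-null only where both the struct
/// and the field are non-null.
fn get_field_masked(col: &StructColumn) -> Vec<Option<i64>> {
    col.field_values
        .iter()
        .zip(col.struct_valid.iter().zip(&col.field_valid))
        .map(|(v, (&s, &f))| if s && f { Some(*v) } else { None })
        .collect()
}

/// Row 0: struct and field both valid.
/// Row 1: struct is null, but the child buffer still holds a valid 20.
/// Row 2: struct is valid, field is null.
fn sample_column() -> StructColumn {
    StructColumn {
        struct_valid: vec![true, false, true],
        field_values: vec![10, 20, 0],
        field_valid: vec![true, true, false],
    }
}
```

In this model, row 1 is the problematic case: the verbatim read yields `Some(20)` for a null struct, while the masked read yields `None`.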
Steps to reproduce
Read such a parquet file and filter on a struct field:
SELECT * FROM t WHERE structCol.field IS NOT NULL
Comet returns rows where structCol is null.
It surfaces in Delta: CheckpointProvider.readV2ActionsFromParquetCheckpoint runs
... .where("checkpointMetadata.version is not null or sidecar.path is not null")
over a checkpoint where those structs are all null, expecting zero rows. Rows leak through the filter instead, producing scala.MatchError: (null, null) in Delta's DeltaIncrementalSetTransactionsSuite.
Simple structs written by createDataFrame happen to align child validity with the parent, so the bug only manifests when the child buffer is populated under a null parent (e.g. a coalesce-rewritten checkpoint).
Expected behavior
A field of a NULL struct is NULL.
Additional context
Found while working on the contrib Delta native scan (#4366). The fix is included in PR #4366 (native/spark-expr/src/struct_funcs/get_struct_field.rs): it unions the parent struct's null mask into the extracted child (null where the struct is null OR the child is null) and adds a unit test. It is independent of Delta and could be reviewed/cherry-picked as a standalone core fix.
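For reference, "null where the struct is null OR the child is null" is equivalent to a bitwise AND of the two validity bitmaps when a set bit means "valid", as in Arrow. The following is a minimal stdlib-only sketch of that idea; `union_nulls` and `is_valid` are hypothetical helpers, and the actual change in PR #4366 may be written differently.

```rust
/// Combine two optional validity bitmaps (bit set = valid, least significant
/// bit first). `None` stands for "no null buffer", i.e. every row is valid.
/// Assumes both bitmaps cover the same number of rows.
fn union_nulls(parent: Option<&[u8]>, child: Option<&[u8]>) -> Option<Vec<u8>> {
    match (parent, child) {
        // Neither side has nulls, so the result has none either.
        (None, None) => None,
        // Only one side has nulls: its bitmap is the result.
        (Some(bits), None) | (None, Some(bits)) => Some(bits.to_vec()),
        // A row is valid only if it is valid in both bitmaps.
        (Some(p), Some(c)) => Some(p.iter().zip(c).map(|(a, b)| a & b).collect()),
    }
}

/// Check whether `row` is valid in an optional validity bitmap.
fn is_valid(bitmap: Option<&[u8]>, row: usize) -> bool {
    match bitmap {
        None => true,
        Some(bits) => (bits[row / 8] >> (row % 8)) & 1 == 1,
    }
}
```

With a parent bitmap of `0b101` (row 1 null) and a child bitmap of `0b011` (row 2 null), the combined bitmap is `0b001`, so only row 0 remains valid.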