GetStructField returns non-null for fields of a NULL struct (missing null-mask propagation) #4432

@schenksj

Describe the bug

GetStructField (native/spark-expr/src/struct_funcs/get_struct_field.rs) extracts a struct field by returning the child column directly, without applying the parent struct's null mask:

ColumnarValue::Array(array) => {
    let struct_array = array.as_any().downcast_ref::<StructArray>().expect("A struct is expected");
    Ok(ColumnarValue::Array(Arc::clone(struct_array.column(self.ordinal))))
}

In Arrow, a StructArray's child arrays carry their own validity, independent of the parent struct's null buffer. At a row where the struct itself is null, the child buffer can still hold a non-null value. Returning the child verbatim therefore reads a field of a NULL struct as non-null, which violates Spark semantics (a field of a null struct is null). Concretely, isnotnull(structCol.field) returns true for a row whose structCol is null.
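To make the independent-validity point concrete, here is a minimal sketch using only the standard library. `ToyStructColumn`, `get_field_verbatim`, and `leaky_example` are hypothetical names invented for illustration; they are not arrow-rs or Comet types, and the sketch only mimics the layout described above.

```rust
/// Toy stand-in for a nullable struct column with one Int32 field.
/// NOT the arrow-rs `StructArray`; it only mirrors the layout idea.
struct ToyStructColumn {
    /// Parent validity: `true` means the struct itself is non-null at that row.
    struct_valid: Vec<bool>,
    /// Child validity, stored independently of the parent's validity.
    child_valid: Vec<bool>,
    /// Child values; a slot is physically present even when the row is null.
    child_values: Vec<i32>,
}

/// Mimics the reported behavior: return the child as-is and never
/// consult `struct_valid`.
fn get_field_verbatim(col: &ToyStructColumn) -> Vec<Option<i32>> {
    col.child_valid
        .iter()
        .zip(&col.child_values)
        .map(|(&valid, &value)| if valid { Some(value) } else { None })
        .collect()
}

/// Three rows: row 0 is a valid struct, row 1 is a NULL struct whose child
/// slot is still marked valid, row 2 is a valid struct with a null field.
fn leaky_example() -> ToyStructColumn {
    ToyStructColumn {
        struct_valid: vec![true, false, true],
        child_valid: vec![true, true, false],
        child_values: vec![10, 99, 0],
    }
}
```

Calling `get_field_verbatim(&leaky_example())` reports a value at row 1 even though `struct_valid[1]` is `false`, which is the situation in which `isnotnull(structCol.field)` evaluates to true for a null struct.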

This is a data-correctness bug for any query that accesses a field of a nullable struct when the underlying data (for example, a parquet file) holds a logically null struct whose child buffer is still populated.

Steps to reproduce

Read such a parquet file and filter on a struct field:

SELECT * FROM t WHERE structCol.field IS NOT NULL

Comet returns rows where structCol is null.

It surfaces in Delta: CheckpointProvider.readV2ActionsFromParquetCheckpoint runs
... .where("checkpointMetadata.version is not null or sidecar.path is not null")
over a checkpoint where those structs are all null and expects zero rows. The leaked rows instead produce scala.MatchError: (null, null), seen in Delta's DeltaIncrementalSetTransactionsSuite.

Simple structs written by createDataFrame happen to align child validity with the parent, so the bug only manifests when the child buffer is populated under a null parent (e.g. a coalesce-rewritten checkpoint).

Expected behavior

A field of a NULL struct is NULL.

Additional context

Found while working on the contrib Delta native scan (#4366). The fix — union the parent struct's null mask into the extracted child (null where the struct is null OR the child is null), plus a unit test — is included in PR #4366 (native/spark-expr/src/struct_funcs/get_struct_field.rs). It is independent of Delta and could be reviewed/cherry-picked as a standalone core fix.
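As a rough illustration of the "null where the struct is null OR the child is null" rule, the sketch below combines two validity bitmaps in plain Rust. It is not the code from PR #4366: `union_nulls` and `is_valid` are hypothetical helpers, and the sketch assumes Arrow-style bitmaps (a set bit means valid, least-significant bit first) that are both present and of equal length.

```rust
/// Combine parent and child validity bitmaps.
/// A row is null if either side is null, so it is valid only when both
/// bits are set: the union of nulls is a bitwise AND of validity bits.
fn union_nulls(parent_validity: &[u8], child_validity: &[u8]) -> Vec<u8> {
    parent_validity
        .iter()
        .zip(child_validity)
        .map(|(p, c)| p & c)
        .collect()
}

/// Read the validity bit for `row` (least-significant bit first).
fn is_valid(bitmap: &[u8], row: usize) -> bool {
    (bitmap[row / 8] >> (row % 8)) & 1 == 1
}
```

For example, with a parent bitmap of `0b0000_0101` (row 1 is a null struct) and a child bitmap of `0b0000_0011` (row 2 is a null field), the combined bitmap is `0b0000_0001`: only row 0 remains non-null, so a filter such as `structCol.field IS NOT NULL` no longer returns the row with the null struct.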

Labels: area:expressions (Expression evaluation), priority:critical (Data corruption, silent wrong results, security issues)