Implement Spark-compatible array_distinct that preserves insertion order #3888

@andygrove

Description

Describe the enhancement

array_distinct is currently marked Incompatible because the DataFusion implementation sorts output elements, while Spark preserves insertion order of first occurrences.

For example:

  • Spark: array_distinct([2, 1, 2, 3]) → [2, 1, 3] (insertion order)
  • Comet: array_distinct([2, 1, 2, 3]) → [1, 2, 3] (sorted)

This difference means users must opt in with spark.comet.expr.ArrayDistinct.allowIncompatible=true.

Proposed approach

Implement a custom array_distinct in native/spark-expr/ that uses a hash set for deduplication while preserving insertion order, matching Spark's behavior. This would allow upgrading the support level from Incompatible to Compatible.

Key behaviors to match:

  • Preserve insertion order of first occurrence
  • De-duplicate NULL elements (keep only the first NULL)
  • De-duplicate NaN values (Float and Double)
  • Support all orderable element types
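The behaviors above can be sketched on a plain Rust slice. This is only an illustrative sketch, not the actual implementation: the real change would operate on Arrow arrays inside native/spark-expr/, and the function name and signature here are hypothetical. Floats are keyed by their bit pattern, with NaN normalized to a single canonical pattern so every NaN dedups to the first one, and only the first NULL is kept.

```rust
use std::collections::HashSet;

// Hypothetical sketch of Spark-compatible array_distinct semantics for a
// nullable f64 column, preserving the insertion order of first occurrences.
fn spark_array_distinct_f64(values: &[Option<f64>]) -> Vec<Option<f64>> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut null_seen = false;
    let mut out = Vec::with_capacity(values.len());
    for v in values {
        match v {
            None => {
                // De-duplicate NULLs: keep only the first one.
                if !null_seen {
                    null_seen = true;
                    out.push(None);
                }
            }
            Some(x) => {
                // Normalize all NaN payloads to one bit pattern so every
                // NaN compares equal; other values are keyed by raw bits.
                let key = if x.is_nan() { f64::NAN.to_bits() } else { x.to_bits() };
                if seen.insert(key) {
                    out.push(Some(*x));
                }
            }
        }
    }
    out
}
```

The same pattern generalizes to other element types by choosing an appropriate hashable key per type (for Arrow data, row bytes from a RowConverter could serve that role).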

Additional context

The current implementation delegates to DataFusion's built-in array_distinct UDF (datafusion-functions-nested, set_ops.rs), which uses RowConverter with .sorted().dedup(). A Spark-compatible version would need to use .dedup() without sorting, or use a different deduplication strategy entirely.
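The difference between the two strategies can be shown with a minimal Rust analogy over plain integers (the real code works on converted Arrow rows, so these helper names are illustrative only):

```rust
use std::collections::HashSet;

// Mirrors DataFusion's current strategy: sort, then drop consecutive
// duplicates. Deduplication works, but the original element order is lost.
fn distinct_sorted(input: &[i32]) -> Vec<i32> {
    let mut v = input.to_vec();
    v.sort();
    v.dedup();
    v
}

// Mirrors the proposed Spark-compatible strategy: a hash set tracks what
// has been seen, keeping the first occurrence of each element in order.
fn distinct_preserving(input: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    input.iter().copied().filter(|x| seen.insert(*x)).collect()
}
```

For the input [2, 1, 2, 3], the sorted variant yields [1, 2, 3] while the order-preserving variant yields [2, 1, 3], matching Spark.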

Test coverage was expanded in #3887.
