14 changes: 14 additions & 0 deletions .claude/skills/audit-comet-expression/SKILL.md
@@ -311,6 +311,20 @@ After implementing tests, tell the user how to run them:

---

## Step 8: Update the Expression Audit Log

After completing the audit (whether or not tests were added), append a row to the audit log at
`docs/source/contributor-guide/expression-audit-log.md`.

The row should include:

- Expression name
- Spark versions checked (e.g. 3.4.3, 3.5.8, 4.0.1)
- Today's date
- A brief summary of findings (behavioral differences, bugs found/fixed, tests added, known incompatibilities)
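
For example, a completed row might look like the following (the expression name and findings here are placeholders for illustration, not a real audit entry):

```markdown
| `array_remove` | 3.4.3, 3.5.8, 4.0.1 | 2026-04-02 | No behavioral differences. Added SQL tests for null handling and negative indices. |
```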

---

## Output Format

Present the audit as:
32 changes: 32 additions & 0 deletions docs/source/contributor-guide/expression-audit-log.md
@@ -0,0 +1,32 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Expression Audit Log
> **Member Author:** I think it is important to keep track of which expressions have been audited and against which Spark versions. Maintaining a markdown doc is one approach. Another approach could be to add comments and/or annotations directly in the code in the serde handlers.


This document tracks which Comet expressions have been audited against their Spark
implementations for correctness and test coverage.

Each audit compares the Comet implementation against the Spark source code for the listed
versions, reviews existing test coverage, identifies gaps, and adds missing tests where needed.

## Audited Expressions

| Expression | Spark Versions Checked | Date | Findings |
| --- | --- | --- | --- |
| `array_insert` | 3.4.3, 3.5.8, 4.0.1 | 2026-04-02 | No behavioral differences across Spark versions. Fixed `nullable()` metadata bug (did not account for `pos_expr`). Added SQL tests for multiple types (string, boolean, double, float, long, short, byte), literal arguments, null handling, negative indices, out-of-bounds padding, special float values (NaN, Infinity), multibyte UTF-8, and legacy negative index mode. Known incompatibility: pos=0 error message differs from Spark's `INVALID_INDEX_OF_ZERO`. |
2 changes: 1 addition & 1 deletion native/spark-expr/src/array_funcs/array_insert.rs
@@ -105,7 +105,7 @@ impl PhysicalExpr for ArrayInsert {
}

fn nullable(&self, input_schema: &Schema) -> DataFusionResult<bool> {
-        self.src_array_expr.nullable(input_schema)
+        Ok(self.src_array_expr.nullable(input_schema)? || self.pos_expr.nullable(input_schema)?)
}

fn evaluate(&self, batch: &RecordBatch) -> DataFusionResult<ColumnarValue> {
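The one-line change above widens the nullability: a NULL position makes the whole result NULL, so the output is nullable whenever either the source array or the position expression is nullable. A standalone sketch of that rule (the function name here is illustrative, not part of Comet):

```rust
// Illustrative nullability rule for array_insert: the result can be NULL
// if the source array can be NULL or the position can be NULL.
fn array_insert_nullable(src_nullable: bool, pos_nullable: bool) -> bool {
    src_nullable || pos_nullable
}

fn main() {
    // A nullable position alone is enough to make the result nullable,
    // which is exactly the case the fix above adds.
    assert!(array_insert_nullable(false, true));
    assert!(array_insert_nullable(true, false));
    assert!(!array_insert_nullable(false, false));
}
```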
226 changes: 224 additions & 2 deletions spark/src/test/resources/sql-tests/expressions/array/array_insert.sql
@@ -16,12 +16,234 @@
-- under the License.

-- ConfigMatrix: parquet.enable.dictionary=false,true
-- Config: spark.comet.expression.ArrayInsert.allowIncompatible=true

-- ============================================================
-- Integer arrays with column arguments
-- ============================================================

statement
CREATE TABLE test_array_insert(arr array<int>, pos int, val int) USING parquet

statement
-INSERT INTO test_array_insert VALUES (array(1, 2, 3), 2, 10), (array(1, 2, 3), 1, 10), (array(1, 2, 3), 4, 10), (array(), 1, 10), (NULL, 1, 10)
+INSERT INTO test_array_insert VALUES
+(array(1, 2, 3), 2, 10),
+(array(1, 2, 3), 1, 10),
+(array(1, 2, 3), 4, 10),
+(array(), 1, 10),
+(NULL, 1, 10),
+(array(1, 2, 3), NULL, 10),
+(array(1, 2, 3), 3, NULL)

-query spark_answer_only
+-- basic column arguments
+query
SELECT array_insert(arr, pos, val) FROM test_array_insert

-- ============================================================
-- Literal arguments (all-literal queries test native eval
-- because CometSqlFileTestSuite disables constant folding)
-- ============================================================

-- basic insert at various positions
query
SELECT array_insert(array(1, 2, 3), 2, 10)

-- prepend (pos=1)
query
SELECT array_insert(array(1, 2, 3), 1, 10)

-- append (pos = len+1)
query
SELECT array_insert(array(1, 2, 3), 4, 10)

-- insert into empty array
query
SELECT array_insert(array(), 1, 10)

-- ============================================================
-- NULL handling
-- ============================================================

-- null array
query
SELECT array_insert(CAST(NULL AS ARRAY<INT>), 1, 10)

-- null position
query
SELECT array_insert(array(1, 2, 3), CAST(NULL AS INT), 10)

-- null value (insert null into array)
query
SELECT array_insert(array(1, 2, 3), 2, CAST(NULL AS INT))

-- null value appended beyond end
query
SELECT array_insert(array(1, 2, 3, NULL), 4, CAST(NULL AS INT))

-- array with existing nulls
query
SELECT array_insert(array(1, NULL, 3), 2, 10)

-- ============================================================
-- Positive out-of-bounds (null padding)
-- ============================================================

-- one beyond end
query
SELECT array_insert(array(1, 2, 3), 5, 99)

-- far beyond end
query
SELECT array_insert(array(1, 2, 3), 10, 99)

-- ============================================================
-- Negative indices (non-legacy mode, which is the default)
-- ============================================================

-- -1 appends after last element
query
SELECT array_insert(array(1, 2, 3), -1, 10)

-- -2 inserts before last element
query
SELECT array_insert(array(1, 2, 3), -2, 10)

-- -3 inserts before second-to-last
query
SELECT array_insert(array(1, 2, 3), -3, 10)

-- -4 inserts before first element (len = 3, so -4 = before start)
query
SELECT array_insert(array(1, 2, 3), -4, 10)

-- negative beyond start (null padding)
query
SELECT array_insert(array(1, 2, 3), -5, 10)

-- far negative beyond start
query
SELECT array_insert(array(1, 2, 3), -10, 10)

-- ============================================================
-- pos=0 error
-- Note: Spark throws INVALID_INDEX_OF_ZERO, Comet throws a different message.
-- Cannot use expect_error because patterns differ. This is a known incompatibility.
-- ============================================================

-- ============================================================
-- String arrays
-- ============================================================

statement
CREATE TABLE test_array_insert_str(arr array<string>, pos int, val string) USING parquet

statement
INSERT INTO test_array_insert_str VALUES
(array('a', 'b', 'c'), 2, 'd'),
(array('a', 'b', 'c'), 1, 'd'),
(array('a', 'b', 'c'), 4, 'd'),
(array(), 1, 'x'),
(NULL, 1, 'x'),
(array('a', NULL, 'c'), 2, 'z'),
(array('a', 'b', 'c'), 3, NULL)

-- column arguments with strings
query
SELECT array_insert(arr, pos, val) FROM test_array_insert_str

-- literal string arrays
query
SELECT array_insert(array('hello', 'world'), 1, 'first')

query
SELECT array_insert(array('hello', 'world'), 3, 'last')

-- empty strings
query
SELECT array_insert(array('', 'a', ''), 2, '')

-- multibyte UTF-8 characters
query
SELECT array_insert(array('hello', 'world'), 2, 'cafe\u0301')

query
SELECT array_insert(array('abc', 'def'), 1, '中文')

-- ============================================================
-- Boolean arrays
-- ============================================================

statement
CREATE TABLE test_array_insert_bool(arr array<boolean>, pos int, val boolean) USING parquet

statement
INSERT INTO test_array_insert_bool VALUES
(array(true, false, true), 2, false),
(array(true, false), 3, true),
(NULL, 1, true),
(array(true), 1, NULL)

query
SELECT array_insert(arr, pos, val) FROM test_array_insert_bool

query
SELECT array_insert(array(true, false, true), 3, true)

-- ============================================================
-- Double arrays
-- ============================================================

statement
CREATE TABLE test_array_insert_double(arr array<double>, pos int, val double) USING parquet

statement
INSERT INTO test_array_insert_double VALUES
(array(1.1, 2.2, 3.3), 2, 4.4),
(array(1.1, 2.2), 3, 0.0),
(NULL, 1, 1.0),
(array(1.1, 2.2), 1, NULL)

query
SELECT array_insert(arr, pos, val) FROM test_array_insert_double

-- special float values
query
SELECT array_insert(array(CAST(1.0 AS DOUBLE), CAST(2.0 AS DOUBLE)), 2, CAST('NaN' AS DOUBLE))

query
SELECT array_insert(array(CAST(1.0 AS DOUBLE), CAST(2.0 AS DOUBLE)), 2, CAST('Infinity' AS DOUBLE))

query
SELECT array_insert(array(CAST(1.0 AS DOUBLE), CAST(2.0 AS DOUBLE)), 2, CAST('-Infinity' AS DOUBLE))

-- negative zero
query
SELECT array_insert(array(CAST(1.0 AS DOUBLE), CAST(2.0 AS DOUBLE)), 1, CAST(-0.0 AS DOUBLE))

-- ============================================================
-- Long arrays
-- ============================================================

query
SELECT array_insert(array(CAST(1 AS BIGINT), CAST(2 AS BIGINT)), 2, CAST(3 AS BIGINT))

-- ============================================================
-- Short arrays
-- ============================================================

query
SELECT array_insert(array(CAST(1 AS SMALLINT), CAST(2 AS SMALLINT)), 2, CAST(3 AS SMALLINT))

-- ============================================================
-- Byte arrays
-- ============================================================

query
SELECT array_insert(array(CAST(1 AS TINYINT), CAST(2 AS TINYINT)), 2, CAST(3 AS TINYINT))

-- ============================================================
-- Float arrays
-- ============================================================

query
SELECT array_insert(array(CAST(1.1 AS FLOAT), CAST(2.2 AS FLOAT)), 2, CAST(3.3 AS FLOAT))
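The negative-index tests above pin down two different mappings from a negative position to a 1-based insertion slot: by default `-1` appends after the last element, while with `spark.sql.legacy.negativeIndexInArrayInsert=true` it inserts before the last element. A minimal sketch of that mapping (an illustrative helper, not Comet's kernel; slots below 1 would trigger front null-padding, which this omits):

```rust
// Map a negative array_insert position to a 1-based insertion slot.
// `legacy` mirrors spark.sql.legacy.negativeIndexInArrayInsert.
fn effective_insert_pos(len: i64, pos: i64, legacy: bool) -> i64 {
    assert!(pos < 0, "only negative positions are mapped here");
    if legacy {
        len + pos + 1 // -1 lands BEFORE the last element
    } else {
        len + pos + 2 // -1 lands AFTER the last element
    }
}

fn main() {
    assert_eq!(effective_insert_pos(3, -1, false), 4); // append after [1, 2, 3]
    assert_eq!(effective_insert_pos(3, -1, true), 3);  // before the last element
    assert_eq!(effective_insert_pos(3, -4, false), 1); // before the first element
}
```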
@@ -0,0 +1,57 @@
-- Licensed to the Apache Software Foundation (ASF) under one
-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

-- Tests for array_insert with legacy negative index mode enabled.
-- In legacy mode, -1 inserts BEFORE the last element (not after).

-- ConfigMatrix: parquet.enable.dictionary=false,true
-- Config: spark.comet.expression.ArrayInsert.allowIncompatible=true
-- Config: spark.sql.legacy.negativeIndexInArrayInsert=true

-- -1 inserts before last element in legacy mode
query
SELECT array_insert(array(1, 2, 3), -1, 10)

-- -2 inserts before second-to-last
query
SELECT array_insert(array(1, 2, 3), -2, 10)

-- -3 inserts before first element
query
SELECT array_insert(array(1, 2, 3), -3, 10)

-- negative beyond start with null padding (legacy mode pads differently)
query
SELECT array_insert(array(1, 2, 3), -5, 10)

-- -2 with a gap: inserts 2 before the second-to-last element
query
SELECT array_insert(array(1, 3, 4), -2, 2)

-- column-based test
statement
CREATE TABLE test_ai_legacy(arr array<int>, pos int, val int) USING parquet

statement
INSERT INTO test_ai_legacy VALUES
(array(1, 2, 3), -1, 10),
(array(4, 5), -1, 20),
(array(1, 2, 3), -4, 10),
(NULL, -1, 10)

query
SELECT array_insert(arr, pos, val) FROM test_ai_legacy
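
For positive positions, the out-of-bounds tests earlier in this PR expect NULL padding between the end of the array and the inserted value. The following is a pure-Rust model of that behavior (an illustration of the Spark semantics these tests assert, not Comet's Arrow kernel):

```rust
// Model of array_insert for positive 1-based positions, including the
// NULL padding Spark applies when pos is past the end of the array.
fn array_insert(arr: &[Option<i32>], pos: usize, val: Option<i32>) -> Vec<Option<i32>> {
    assert!(pos >= 1, "pos = 0 is an error in Spark (INVALID_INDEX_OF_ZERO)");
    let mut out = arr.to_vec();
    if pos <= arr.len() + 1 {
        out.insert(pos - 1, val); // in bounds, or a plain append at len + 1
    } else {
        out.resize(pos - 1, None); // pad the gap with NULLs
        out.push(val);
    }
    out
}

fn main() {
    let arr = [Some(1), Some(2), Some(3)];
    // Matches the expected result of SELECT array_insert(array(1, 2, 3), 5, 99)
    assert_eq!(
        array_insert(&arr, 5, Some(99)),
        vec![Some(1), Some(2), Some(3), None, Some(99)]
    );
    assert_eq!(
        array_insert(&arr, 2, Some(10)),
        vec![Some(1), Some(10), Some(2), Some(3)]
    );
}
```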