[GLUTEN-10134][VL] Fix ANSI workflow: AI token limit, analyze-only artifact lookup, and --run support #11987
baibaichen wants to merge 6 commits into apache:main from
Conversation
/ansi-test

🔄 ANSI mode analysis started by @baibaichen. View run
ANSI Mode Test Analysis Report (Spark 4.1)
Note: Expression-level ANSI mode offload coverage analysis.
ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests
Overview (ANSI Offload Expression Records)
Per-Suite Summary
Failure Cause Analysis (53 failures)
Other (23 failures)
… limit Group failures by (cause, suite), use compact JSON, drop redundant fields. Reduces AI prompt from ~6200+ tokens to ~3700 tokens (limit: 8000). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
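As a rough illustration of the compression this commit describes, here is a minimal Python sketch of grouping failures by (cause, suite) and emitting compact JSON; the function and field names are hypothetical, not the actual analyze-ansi.py code.

```python
# Hypothetical sketch of the compression idea (not the real _build_ai_context):
# group failure records by (cause, suite), keep a bounded list of test names,
# and serialize without whitespace to shrink the AI prompt.
import json
from collections import defaultdict

def build_compact_context(failures, max_tests_per_group=10):
    groups = defaultdict(list)
    for f in failures:
        groups[(f["cause"], f["suite"])].append(f["test"])
    compact = [
        {"cause": cause, "suite": suite, "n": len(tests),
         "tests": tests[:max_tests_per_group]}
        for (cause, suite), tests in sorted(groups.items())
    ]
    # compact separators drop the spaces after ',' and ':' in the JSON payload
    return json.dumps(compact, separators=(",", ":"))
```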
🔄 ANSI mode analysis started by @baibaichen. View run
…in analyze-only mode analyze-only mode now finds the correct full-test run by reading hidden markers (<!-- ansi-mode:full ansi-run:ID -->) from PR comments instead of searching by branch name. Also adds --run <ID> support for both issue_comment and workflow_dispatch triggers. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
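For reference, a small Python sketch of reading such a hidden marker back out of a comment body; the regex mirrors the marker format quoted in the commit, while the function name is illustrative (the workflow itself does this in shell/jq).

```python
# Illustrative only: extract the run ID from the hidden marker
# <!-- ansi-mode:full ansi-run:ID --> embedded in a PR comment body.
import re

MARKER_RE = re.compile(r"<!--\s*ansi-mode:full\s+ansi-run:(\d+)\s*-->")

def extract_full_run_id(comment_body: str):
    """Return the embedded full-test run ID, or None if no marker is present."""
    match = MARKER_RE.search(comment_body)
    return int(match.group(1)) if match else None
```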
…nt spoofing in analyze-only mode - Fix workflow_dispatch analyze-only being misidentified as full mode - Fix paginated jq producing multiple run_id lines (use tail -1) - Restrict marker search to github-actions[bot] comments only - Validate --run input as numeric, take only first match - Add explicit permissions to check-comment job Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…lback grep -m1 stops after matching 1 line, but multiple --run tokens on the same line still produce multiple outputs. Use head -1 to guarantee a single value, with explicit empty check for inputs.run_id fallback (pipeline exit code from head would mask grep failure). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
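To make the head -1 vs grep -m1 distinction concrete, here is a hedged Python model of the intended parsing rule (the real logic lives in the workflow's shell steps; the function name and run_id_input parameter are assumptions): take only the first numeric --run token, and fall back to an explicit run_id input only when it is present and numeric.

```python
# Hypothetical Python model of the shell parsing rule described in this commit.
import re

def resolve_run_id(comment_body: str, run_id_input: str = "") -> str:
    """Return the run ID to analyze, or "" if none can be determined."""
    tokens = re.findall(r"--run\s+(\d+)", comment_body)
    if tokens:
        return tokens[0]            # first match only, even if several tokens appear
    if run_id_input.isdigit():      # explicit numeric check for the dispatch input
        return run_id_input
    return ""                       # empty result is checked explicitly by the caller
```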
…ency Cap test names per group to 10, remove redundant failure_count field, and use reverse-sorted single-page comment lookup instead of paginating all comments. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
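A hedged sketch of the lookup strategy mentioned here: scan the most recent comments newest-first and trust only github-actions[bot] authors. The dict shape follows GitHub's REST issue-comment objects, and the function name is illustrative (the workflow implements this in jq).

```python
# Illustrative only: find the newest bot-authored comment carrying the marker.
import re

MARKER_RE = re.compile(r"<!--\s*ansi-mode:full\s+ansi-run:(\d+)\s*-->")

def find_latest_full_run(comments):
    """comments: iterable of GitHub issue-comment dicts (user.login, body, created_at)."""
    newest_first = sorted(comments, key=lambda c: c["created_at"], reverse=True)
    for comment in newest_first:
        if comment.get("user", {}).get("login") != "github-actions[bot]":
            continue  # anti-spoofing: only trust bot-authored comments
        match = MARKER_RE.search(comment.get("body", ""))
        if match:
            return int(match.group(1))
    return None
```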
/ansi-analyze --run 24949958444

🔄 ANSI mode analysis started by @baibaichen. View run

/ansi-analyze --run 24949958444

🔄 ANSI mode analysis started by @baibaichen. View run

🔄 ANSI analyze-only started by @baibaichen. View run
ANSI Mode Test Analysis Report (Spark 4.1)
Note: Expression-level ANSI mode offload coverage analysis.
ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests
Overview (ANSI Offload Expression Records)
Per-Suite Summary
Failure Cause Analysis (53 failures)
Other (23 failures)
🤖 AI Deep Analysis
Key Findings
Fallback Analysis (🔴): Highest-Priority — Where Velox Is Not Handling Core ANSI Expressions
Summary: Breakdown by Expression Category
Detailed Root Cause Attribution
Review of core files shows these root causes:
Conclusion:
Failure Hotspot Table
Note: Several other suites have 1–2 failures each of varied causes.
| Type | Count | % of Failures | Interpretation |
|---|---|---|---|
| WRONG_EXCEPTION | 41* | 43.6% | Velox error thrown, but Spark's wrapper rethrows as SparkException—ANSI error type context lost |
| NO_EXCEPTION | 44 | 46.8% | Expected ANSI-mode error (overflow/invalid cast) never thrown by Velox, result returned |
| OTHER | 19 | 20.2% | Includes correctness mismatches, error summary/output mismatches, miscellaneous |
*Counting de-duped multi-test records in JSON sample
Deep Analysis: WRONG_EXCEPTION Pathways
- Typical Stack Path:
  - Velox C++ throws an std::overflow_error or similar (e.g., via VELOX_FAIL("[ARITHMETIC_OVERFLOW] ...")).
  - → The JNI/Java bridge in ColumnarBatchOutIterator.java catches it and wraps it as a generic SparkException.
  - → Error in test: "Expected ArithmeticException but got SparkException".
- Key Code Sites:
  - C++: velox/functions/sparksql/Arithmetic.cpp (throws the native exception; includes error context)
  - Java JNI Bridge: gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java
  - Scala: Tests use findCause to unwrap exceptions (but the type is lost once SparkException wraps it).
- Impact:
Exception type and context (error message with SQL text) are lost. This leads to failure of tests asserting overflow/cast error context, even if the underlying error occurs as expected.
NO_EXCEPTION: Root Cause Breakdown
| Category | Failures | Root Cause | Files |
|---|---|---|---|
| Cast | 25 | isAnsiSupported() in Velox false for most cast pairs. Falls back to try_cast; errors suppressed | SparkCastExpr.cpp (C++) |
| Arithmetic | 9 | Velox arithmetic not triggering overflow pathway due to non-ANSI code | Arithmetic.cpp (C++), fallback/EvalCtx |
| Date/Datetime | 6 | Date/timestamp parsing errors not detected in ANSI path | DateTimeFunctions.cpp (C++ Spark SQL funcs) |
| Decimal | 1 | Decimal overflow paths not ANSI-complete or not checked in Velox | Decimals.cpp, SparkCastExpr.cpp (C++) |
| Collection | 1 | Invalid element index not detected | CollectionFunctions.cpp |
| DataFrameAgg | 3 | SUM/AVG overflow not signaled | Aggregation.cpp, Decimal/Sum path |
Conclusion:
- NO_EXCEPTION for Casts = Velox not checking (or delegating) the cast error path for that pair; falls through and Spark test expects error.
- For Arithmetic/Decimal = Missing or incomplete overflow detection under ANSI mode.
- For Date/Datetime = No error on malformed input because code path missing strict throw.
Failed + Fallback (🟠) Anomalies
- None reported — This is expected; fallback to Spark should always preserve full ANSI behavior.
Fix Recommendations
1. [P0] Implement/Un-gate ANSI Cast Error Paths for Additional Cast Pairs in Velox
- Symptom: All NO_EXCEPTION failures in GlutenCastWithAnsiOnSuite (25 tests), plus many Fallbacks: validation for illegal/overflowing casts (e.g., String→Decimal, Decimal→Numeric) is bypassed — Velox currently only honors the ANSI error check for String→{Boolean, Date, Integral}.
- Root Cause: The function isAnsiSupported() in velox/functions/sparksql/specialforms/SparkCastExpr.cpp restricts the ANSI-mode branch to a tiny whitelist. All other casts revert to try_cast (which returns default/null instead of an error).
- Fix Point:
  - File: velox/functions/sparksql/specialforms/SparkCastExpr.cpp
  - Action:
    - Expand the isAnsiSupported() whitelist to include more Spark-supported ANSI cast pairs:
      - String→Decimal
      - Decimal→{Short, Int, Long, Double, etc.}
      - Numeric→Decimal (and vice versa)
      - Any pair required by failed tests
    - Ensure the casting logic throws on invalid input instead of silently using try_cast on these pairs when kSparkAnsiEnabled is true.
- Representative Tests:
  - ANSI mode: Throw exception on casting out-of-range value to byte type
  - ANSI mode: Throw exception on casting out-of-range value to decimal type
  - Many more listed under the "GlutenCastWithAnsiOnSuite" NO_EXCEPTION group
- Estimated Impact: At least 25 NO_EXCEPTION test failures, plus a major reduction in Fallback count for the cast category — an estimated ~1,500–2,000 records would turn green, dramatically raising ANSI coverage for Cast.
- Priority Rationale: Highest-impact group (25+ direct failures, ~2k fallbacks); the fix is C++-side but tightly scoped to ~1–2 files and a function list; does not require major Velox architectural change.
2. [P1] Fix Exception Wrapping Chain to Preserve Exception Type/context (WRONG_EXCEPTION)
- Symptom: For arithmetic overflows and invalid casts, Spark tests expect e.g. ArithmeticException/NumberFormatException with the original SQL context, but see only SparkException (type loss), causing 32+ failures in arithmetic/cast.
- Root Cause: The JNI/Java bridge (ColumnarBatchOutIterator.java) wraps or rethrows all native errors as a generic SparkException, losing cause/type. This violates Spark's contract for the exception class on overflow/cast errors.
- Fix Point:
  - File: gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java (translateException)
  - Action: Enhance error translation to:
    - Recognize certain Velox error signatures and wrap them as the correct Java exception type (ArithmeticException, NumberFormatException, etc.) with the SQL context preserved.
    - (Optionally) propagate the cause (Throwable) if the error message/source matches the expected Spark error.
    - (May also require tweaks to C++ error throw points to set an error code class for translation.)
- Representative Tests:
  - Add: Overflow exception should contain SQL text context
  - Divide: divide by 0 exception should contain SQL text context
  - All "Expected ArithmeticException but got SparkException"
- Estimated Impact: 32+ direct failures (WRONG_EXCEPTION); major improvement in error-context correctness for ANSI.
- Priority Rationale: Medium-high impact; the fix is single-file translation logic, but requires careful mapping of error codes and messages; Java↔native bridging is not trivial but well-bounded.
3. [P1] Surface Support for Additional Types (INTERVAL, TIMESTAMP_NTZ, Complex) in TypeNode and Validators
- Symptom: Fallbacks (🔴) for arithmetic/datetime/collection over unsupported types: INTERVAL, TIMESTAMP_NTZ, complex/nested arrays/maps. Velox never sees these due to early Scala-side rejection.
- Root Cause:
  - gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala#getTypeNode blocks these types (e.g., "Type INTERVAL_YEAR_MONTH not supported yet"), and Validators.scala disables offload for any expression involving such types.
  - Often Velox does support the primitive type but the function registration/whitelisting is missing; e.g., INTERVAL arithmetic in Velox may work C++-side but isn't exposed by the current validators.
- Fix Point:
  - File(s):
    - gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala
    - gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/validator/Validators.scala
    - If C++ Velox support is missing, e.g., for interval math, register the C++ implementation first.
  - Action:
    - Enumerate fallback cases triggered by getTypeNode and Validators.
    - Where Velox C++ supports the type, relax the "unsupported" checks and allow Substrait translation.
    - Where missing, enumerate the upstream gaps.
- Representative Tests: Not listed individually, but inferred from the high Fallback rate in datetime, arithmetic on unsupported types, and intervals.
- Estimated Impact: Could clear hundreds of currently Fallback records in datetime/arithmetic/collection.
- Priority Rationale: Moderate-high impact (hundreds of Fallbacks); the fix is split across the Scala validation and type-mapping layers, and possibly requires an upstream PR for Velox function/type registration in a few cases; not as concentrated as P0.
Summary Prioritization Table
| Tier | Fix Recommendation | Est. Impact | Fix Scope | Rationale |
|---|---|---|---|---|
| P0 | Add ANSI Cast support for more cast pairs (Velox C++) | 1,500+ | SparkCastExpr.cpp only | Very high, single-source, unblocks both fail and fallback |
| P1 | Java/C++ error type pass-through for Arithmetic/Cast errors | 32+ | Java JNI bridge | Concentrated, needed for correctness, requires precise error decode |
| P1 | Remove type gates for INTERVAL/etc. in Validators, TypeNode | 500+ | Scala + (some) C++ | Wide impact, somewhat split fix; C++ may already support some types |
Prioritized Action Steps
- [P0] Focus first on adding Cast error-path support in Velox for String→Decimal, Decimal↔Numeric, etc. (isAnsiSupported) — this alone will slash both fail and fallback rates for Cast.
- [P1] Fix the exception-wrapping chain in the Java JNI bridge, making sure error type is surfaced correctly to Spark (ArithmeticException, NumberFormatException, etc).
- [P1] Loosen Scala-side blocking of INTERVAL/TIMESTAMP_NTZ types where Velox supports the primitive/function, especially in arithmetic and datetime. If absent in Velox, file upstream or stage for later.
No Failed+Fallback (🟠) anomalies found.
References for Source Location Verification
- Confirmed from reviewing SparkCastExpr.cpp (isAnsiSupported); most cast pairs not yet covered:
  bool isAnsiSupported(...) {
    // Only (String, Boolean), (String, Date), (String, TINYINT/SMALLINT/INT/BIGINT?) supported as true
    // All other (String, Decimal), (Decimal -> X), etc. not supported — defaults to try_cast
  }
- Type fallback for INTERVAL types on the Scala side:
  case _: IntervalType => throw new GlutenNotSupportException("Type INTERVAL...")
- Validators for fallback:
  def fallbackByTimestampNTZ(...) = ...
  def fallbackComplexExpressions(...) = ...
- Exception wrapping via translateException:
  private Throwable findCause(Throwable t, ...) { ... throw new SparkException(msg, cause) }
This action plan, if followed, will raise ANSI coverage for Velox by thousands of tests and ensure correctness for critical error-path behavior.
Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.
🔄 ANSI full test started by @baibaichen. View run
ANSI Mode Test Analysis Report (Spark 4.1)
Note: Expression-level ANSI mode offload coverage analysis.
ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests
Overview (ANSI Offload Expression Records)
Per-Suite Summary
Failure Cause Analysis (53 failures)
Other (23 failures)
🤖 AI Deep Analysis
Key Findings
Fallback (🔴) Analysis – Highest Priority
Fallback indicates expressions are not offloaded to Velox at all (Spark executes them). This is the most critical issue — Gluten appears to "pass" ANSI semantics but provides no native acceleration.
Fallback Breakdown by Expression Type
Root Causes for Fallbacks
Based on code review of
Root-cause Grouping in Code
Failure Hotspot Table (Suite/Cause Concentration)
failCause Type Statistics
Root Cause Deep Analysis: WRONG_EXCEPTION
Observed pattern:
Exception Wrapping Chain:
Key Code Location:
Breakdown of NO_EXCEPTION by Root Cause
Failed+Fallback (🟠) Analysis
No Failed+Fallback (🟠) records reported in this dataset.
Fix Recommendations (max 3)
1. Correct Cast Offloading: Implement ANSI cast semantics for all on-CPU cast pairs
Symptom:
Root Cause:
Fix Point:
Representative Tests:
Estimated Impact:
Priority Rationale:
2. Exception Unwrapping: Map Velox native exceptions to matching Spark exception types in JNI bridge
Symptom:
Root Cause:
Fix Point:
Representative Tests:
Estimated Impact:
Priority Rationale:
3. Decimal Arithmetic Overflow Handling: Implement missing overflow checks for Decimal expressions in Velox
Symptom:
Root Cause:
Fix Point:
Representative Tests:
Estimated Impact:
Priority Rationale:
Summary Table of Recommendations
In summary:
If these three changes are made, the majority of all currently failing and fallback records (cast, arithmetic, decimal, and related) are likely to go green.
Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.
🔄 ANSI analyze-only started by @baibaichen. View run
ANSI Mode Test Analysis Report (Spark 4.1)
Note: Expression-level ANSI mode offload coverage analysis.
ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests
Overview (ANSI Offload Expression Records)
Per-Suite Summary
Failure Cause Analysis (53 failures)
Other (23 failures)
🤖 AI Deep Analysis
Key Findings
Fallback Analysis (🔴 - Highest Priority)
Total Fallback Records: 5,419 (12.5% of 43,258)
Root Causes by Category
Category Summary Table
Failure Hotspot Table
The largest hotspots by far are Arithmetic and Cast under ANSI mode. failCause Type Statistics
Sample interpretation:
WRONG_EXCEPTION — Deep Analysis
Symptom:
Code Path:
Fix Point:
NO_EXCEPTION Breakdown by Root Cause
Failed+Fallback (🟠) Records
None detected — as expected.
Fix Recommendations
P0: Proper Java Exception Wrapping for Native-side Failures in ANSI Mode
P1: Expand Velox ANSI Cast Support Beyond Current Whitelist
P2: Backfill Spark→Substrait Type Node Support for "Fallen Back" Cast/Arithmetic/DateTime Expressions
No Failed+Fallback (🟠) records detected — the system is, correctly, never both falling back and failing.
Summary Table: Recommendation Overview
Appendix: Code/Root Cause Evidence
End of Key Findings & Recommendations.
Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.
Summary
Fixes three bugs in the ANSI workflow (velox_backend_ansi.yml) discovered after #11975 was merged:
1. AI analysis 413 token limit: _build_ai_context() exceeded the GitHub Models API 8000-token limit. Compressed by grouping failures by (cause, suite), using compact JSON, capping test lists to 10 per group, and removing redundant fields.
2. analyze-only mode could not find artifacts: issue_comment-triggered runs register head_branch=main with no PR association in metadata. The old find-run searched by PR branch name and always got zero results. Replaced with Plan E: full-mode runs now embed a hidden marker (<!-- ansi-mode:full ansi-run:ID -->) in the PR starting comment; analyze-only reads the marker back to locate the correct run.
3. --run <ID> support: Both issue_comment (/ansi-analyze --run <ID>) and workflow_dispatch (run_id input) now accept an explicit run ID, bypassing the marker lookup.
Additional hardening (from code review)
- Fix workflow_dispatch with mode=analyze-only being misidentified as full (marker pollution)
- Restrict marker search to github-actions[bot] comments only (anti-spoofing)
- Validate --run input as numeric, take only first match
- Add explicit permissions block to check-comment job
Files changed
- .github/workflows/velox_backend_ansi.yml — artifact lookup, --run support, hidden markers, permissions
- .github/skills/ansi-analysis/analyze-ansi.py — AI context compression
Test plan
- workflow_dispatch analyze-only + --run 24949958444 → success (run)
- workflow_dispatch full mode → hidden marker written, analyze-results success (run)
- workflow_dispatch analyze-only (no run_id) → Plan E marker lookup found run 24960461588 → success (run)
- /ansi-test via issue_comment (requires merge to main first — workflow YAML is read from the default branch)
Known remaining issue
backends-velox test jobs run all suites instead of ANSI-specific ones, causing timeouts (~20h). Will be fixed in a follow-up.
🤖 Generated with Claude Code
Related issue: #10134