[GLUTEN-10134][VL] Fix ANSI workflow: AI token limit, analyze-only artifact lookup, and --run support #11987

Open
baibaichen wants to merge 6 commits into apache:main from baibaichen:ansi-test-trigger

@baibaichen
Contributor

@baibaichen baibaichen commented Apr 26, 2026

Summary

Fixes three bugs in the ANSI workflow (velox_backend_ansi.yml) discovered after #11975 was merged:

  • AI analysis 413 token limit: _build_ai_context() exceeded the GitHub Models API 8000-token limit. Compressed by grouping failures by (cause, suite), using compact JSON, capping test lists to 10 per group, and removing redundant fields.

  • analyze-only mode could not find artifacts: issue_comment-triggered runs register head_branch=main with no PR association in metadata. The old find-run searched by PR branch name and always got zero results. Replaced with Plan E: full-mode runs now embed a hidden marker (<!-- ansi-mode:full ansi-run:ID -->) in the PR starting comment; analyze-only reads the marker back to locate the correct run.

  • --run <ID> support: Both issue_comment (/ansi-analyze --run <ID>) and workflow_dispatch (run_id input) now accept an explicit run ID, bypassing the marker lookup.
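The marker round trip described above can be sketched in Python. This is a minimal illustration, not the workflow's actual code: only the marker format `<!-- ansi-mode:full ansi-run:ID -->` comes from this PR, while the helper names `build_marker` and `parse_marker` and the regular expression are assumptions.

```python
import re

# Marker format as quoted in the PR summary; the regex itself is an assumption.
MARKER_RE = re.compile(
    r"<!--\s*ansi-mode:(?P<mode>[\w-]+)\s+ansi-run:(?P<run_id>[0-9]+)\s*-->"
)


def build_marker(mode: str, run_id: int) -> str:
    """Return the hidden HTML comment a run would append to its starting comment."""
    return f"<!-- ansi-mode:{mode} ansi-run:{run_id} -->"


def parse_marker(comment_body: str):
    """Return (mode, run_id) for the first marker in the body, or None if absent."""
    match = MARKER_RE.search(comment_body)
    if match is None:
        return None
    return match.group("mode"), int(match.group("run_id"))


# A full-mode run writes the marker; a later analyze-only run reads it back.
body = "ANSI full test started. " + build_marker("full", 24960461588)
print(parse_marker(body))
```

Because the marker is an HTML comment, it does not show in the rendered PR comment but remains in the raw body that the API returns.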

Additional hardening (from code review)

  • Fix workflow_dispatch with mode=analyze-only being misidentified as full (marker pollution)
  • Restrict marker search to github-actions[bot] comments only (anti-spoofing)
  • Validate --run input as numeric, take only first match
  • Add explicit permissions block to check-comment job
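The anti-spoofing rule in the list above might look like the following sketch. It assumes comments shaped like GitHub REST issue-comment objects (`user.login`, `body`), listed oldest first; `find_full_run_id` is a hypothetical helper, not a function from the workflow.

```python
import re

FULL_MARKER_RE = re.compile(r"<!--\s*ansi-mode:full\s+ansi-run:([0-9]+)\s*-->")
TRUSTED_AUTHOR = "github-actions[bot]"


def find_full_run_id(comments):
    """Return the run ID from the newest trusted full-mode marker, or None.

    `comments` is assumed to be ordered oldest first.
    """
    for comment in reversed(comments):
        if comment["user"]["login"] != TRUSTED_AUTHOR:
            continue  # ignore markers pasted by anyone other than the bot
        match = FULL_MARKER_RE.search(comment["body"])
        if match:
            return int(match.group(1))
    return None


comments = [
    {"user": {"login": "github-actions[bot]"},
     "body": "started <!-- ansi-mode:full ansi-run:111 -->"},
    {"user": {"login": "someone-else"},
     "body": "<!-- ansi-mode:full ansi-run:999 -->"},
    {"user": {"login": "github-actions[bot]"},
     "body": "started <!-- ansi-mode:analyze-only ansi-run:222 -->"},
]
print(find_full_run_id(comments))
```

Matching only `ansi-mode:full` also keeps analyze-only runs from being picked up as artifact sources, which is the "marker pollution" case in the first bullet.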

Files changed

  • .github/workflows/velox_backend_ansi.yml — artifact lookup, --run support, hidden markers, permissions
  • .github/skills/ansi-analysis/analyze-ansi.py — AI context compression

Test plan

  • workflow_dispatch analyze-only + --run 24949958444 → success (run)
  • workflow_dispatch full mode → hidden marker written, analyze-results success (run)
  • workflow_dispatch analyze-only (no run_id) → Plan E marker lookup found run 24960461588 → success (run)
  • /ansi-test via issue_comment (requires merge to main first — workflow YAML is read from default branch)

Known remaining issue

backends-velox test jobs run all suites instead of ANSI-specific ones, causing timeouts (~20h). Will be fixed in a follow-up.

🤖 Generated with Claude Code

Related issue: #10134

@baibaichen
Contributor Author

/ansi-test

@github-actions

🔄 ANSI mode analysis started by @baibaichen. View run

@github-actions

ANSI Mode Test Analysis Report (Spark 4.1)

Note

Expression-level ANSI mode offload coverage analysis.
Test config: spark.sql.ansi.enabled=true, spark.gluten.sql.ansiFallback.enabled=false.

  • Passed (🟢): Velox correctly handles ANSI semantics
  • Fallback (🔴): Expression falls back to Spark execution, needs ANSI support in Velox
  • Failed (🟡): Velox executes but ANSI error behavior differs from Spark, needs exception handling fix

ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests

ANSI Offload

Overview (ANSI Offload Expression Records)

Classification Count %
🟢 Passed 37786 87.4%
🟡 Failed 53 0.1%
🔴 Fallback 5419 12.5%
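The percentages in the overview can be reproduced from the counts alone. The snippet below is only an arithmetic check; the dictionary layout is an assumption, not the report generator's data model.

```python
# Counts copied from the overview table above.
counts = {"Passed": 37786, "Failed": 53, "Fallback": 5419}
total = sum(counts.values())  # should match the 43258 records in the header

for name, count in counts.items():
    print(f"{name}: {count} ({count / total:.1%})")
```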

Per-Suite Summary

Suite 🟢 Passed 🟡 Failed 🔴 Fallback
GlutenArithmeticExpressionSuite 224 (72%) 19 68
GlutenTryEvalSuite 12 (52%) 0 11
GlutenCastWithAnsiOffSuite 10902 (92%) 2 963
GlutenCastWithAnsiOnSuite 10830 (94%) 21 613
GlutenTryCastSuite 10967 (94%) 0 662
GlutenCollectionExpressionsSuite 523 (70%) 2 225
GlutenDateExpressionsSuite 2273 (64%) 6 1263
GlutenIntervalExpressionsSuite 7 (2%) 1 445
GlutenDecimalExpressionSuite 18 (95%) 1 0
GlutenMathExpressionsSuite 1539 (72%) 1 588
GlutenStringExpressionsSuite 491 (46%) 0 581

Failure Cause Analysis (53 failures)

Cause | Count | Description
NO_EXCEPTION | 27 | Velox did not throw expected ANSI exception
WRONG_EXCEPTION | 23 | Exception wrapped as SparkException
OTHER | 3 | Result mismatch or eval exception

Other (23 failures)

Suite | Failures
GlutenSQLQuerySuite | 4
MiscOperatorSuite | Support multi-children count with row construct; Remainder with non-foldable right side; Cast string to date
GlutenDataFrameAggregateSuite | SPARK-28067: Aggregate sum should not return wrong results for decimal overflow; SPARK-35955: Aggregate avg should not return wrong results for decimal overflow; SPARK-28224: Aggregate sum big decimal overflow
GlutenQueryExecutionAnsiErrorsSuite | INVALID_DATETIME_PATTERN with non-constant pattern; SPARK-46922: user-facing runtime errors
FallbackSuite | fallback when nested loop join has unsupported expression
UDFPartialProjectSuite | udf in agg simple
DateFunctionsValidateSuite | make_date
GlutenFileSourceSQLInsertTestSuite | SPARK-38228: legacy store assignment should not fail on error under ANSI mode
GlutenTPCDSV1_4_PlanStabilitySuite | check simplified (tpcds-v1.4/q83)
GlutenTPCDSV1_4_PlanStabilityWithStatsSuite | check simplified sf100 (tpcds-v1.4/q83)
GlutenComplexTypeSuite | SPARK-33386: GetArrayItem ArrayIndexOutOfBoundsException
GlutenMiscExpressionsSuite | RaiseError
GlutenQueryContextSuite | SPARK-50290: Add a flag to disable DataFrame context
VeloxAdaptiveQueryExecSuite | Gluten - SPARK-33551: Do not use AQE shuffle read for repartition
GlutenInsertSuite | Gluten - remove v1writes sort

… limit

Group failures by (cause, suite), use compact JSON, drop redundant fields.
Reduces AI prompt from ~6200+ tokens to ~3700 tokens (limit: 8000).
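A rough sketch of this kind of compression follows. It is not the code in analyze-ansi.py: the record keys (`cause`, `suite`, `test`) and the function name `build_compact_context` are assumptions, and the cap of 10 tests per group is taken from the PR summary.

```python
import json
from collections import defaultdict

MAX_TESTS_PER_GROUP = 10  # cap stated in the PR summary


def build_compact_context(failures):
    """Group failure records by (cause, suite) and serialize them compactly.

    `failures` is a list of dicts with hypothetical keys: cause, suite, test.
    """
    groups = defaultdict(list)
    for failure in failures:
        groups[(failure["cause"], failure["suite"])].append(failure["test"])

    payload = [
        {"cause": cause, "suite": suite, "tests": tests[:MAX_TESTS_PER_GROUP]}
        for (cause, suite), tests in sorted(groups.items())
    ]
    # separators=(",", ":") drops the whitespace json.dumps adds by default.
    return json.dumps(payload, separators=(",", ":"))


failures = [
    {"cause": "NO_EXCEPTION", "suite": "GlutenCastWithAnsiOnSuite", "test": f"case {i}"}
    for i in range(12)
] + [{"cause": "WRONG_EXCEPTION", "suite": "GlutenArithmeticExpressionSuite", "test": "Add"}]

compact = build_compact_context(failures)
print(len(compact), "characters vs", len(json.dumps(failures)), "uncompressed")
```

Grouping removes the repeated cause and suite strings from every record, which is where most of the saving comes from. Character counts are only a proxy here; the token figures quoted above depend on the model's tokenizer.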

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@github-actions github-actions Bot added the INFRA label Apr 26, 2026
@github-actions

🔄 ANSI mode analysis started by @baibaichen. View run

baibaichen and others added 4 commits April 26, 2026 22:51
…in analyze-only mode

analyze-only mode now finds the correct full-test run by reading hidden
markers (<!-- ansi-mode:full ansi-run:ID -->) from PR comments instead
of searching by branch name. Also adds --run <ID> support for both
issue_comment and workflow_dispatch triggers.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…nt spoofing in analyze-only mode

- Fix workflow_dispatch analyze-only being misidentified as full mode
- Fix paginated jq producing multiple run_id lines (use tail -1)
- Restrict marker search to github-actions[bot] comments only
- Validate --run input as numeric, take only first match
- Add explicit permissions to check-comment job

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…lback

grep -m1 stops after matching 1 line, but multiple --run tokens on the
same line still produce multiple outputs. Use head -1 to guarantee a
single value, with explicit empty check for inputs.run_id fallback
(pipeline exit code from head would mask grep failure).
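The same "first token wins, digits only, explicit fallback" rule can be expressed in Python for illustration. The workflow itself does this in shell with grep and head; `resolve_run_id` and its parameters are hypothetical names.

```python
import re

# "--run" followed by digits that end at whitespace or end of text.
RUN_TOKEN_RE = re.compile(r"--run\s+([0-9]+)(?!\S)")


def resolve_run_id(comment_body: str, input_run_id: str = "") -> str:
    """Return the first valid --run value, else a numeric input_run_id, else ""."""
    match = RUN_TOKEN_RE.search(comment_body)  # search() stops at the first match
    if match:
        return match.group(1)
    candidate = input_run_id.strip()
    # Explicit empty/invalid check instead of relying on an exit code.
    return candidate if re.fullmatch(r"[0-9]+", candidate) else ""


print(resolve_run_id("/ansi-analyze --run 24949958444"))
print(resolve_run_id("/ansi-analyze --run 1 --run 2"))
```

Returning an empty string when nothing validates lets the caller fall through to the marker lookup rather than passing a malformed ID downstream.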

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ency

Cap test names per group to 10, remove redundant failure_count field,
and use reverse-sorted single-page comment lookup instead of paginating all comments.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@baibaichen
Contributor Author

/ansi-analyze --run 24949958444

@github-actions

🔄 ANSI mode analysis started by @baibaichen. View run

@baibaichen
Contributor Author

/ansi-analyze --run 24949958444

@github-actions

🔄 ANSI mode analysis started by @baibaichen. View run

@github-actions

🔄 ANSI analyze-only started by @baibaichen. View run

@github-actions

ANSI Mode Test Analysis Report (Spark 4.1)

🤖 AI Deep Analysis

Key Findings

Fallback Analysis (🔴): Highest-Priority — Where Velox Is Not Handling Core ANSI Expressions

Summary:
Fallbacks account for 5,419 out of 43,258 records (12.5%) — these tests appear as "passing" but are being evaluated by Spark, not Velox. This is the crucial surface area for improving Velox coverage for ANSI mode.

Breakdown by Expression Category

Category | Fallback Count | % of All Expr | Typical Root Cause
cast | majority | estimated ~60–70% of 5,419 | Unsupported Cast pair (see below)
arithmetic | notable | estimated ~20–25% | Specific types (interval, decimals), fallbackByType
datetime | moderate | ~10% | Unsupported Timestamp/Interval types
collection | small | ≤5% | Unsupported arrays/maps, complex types
decimal | minor | ≤2% | Decimal arithmetic, overflow
math | minor | ≤2% | Out-of-range cases, overflow/fallbackByBackendSettings
string | rare | <1% | Exotic string operations

Detailed Root Cause Attribution

Review of core files shows these root causes:

  • Unsupported Cast Targets (String ↔ Decimal, Complex Types, Date/Timestamp variants):
    In gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala#getTypeNode, types such as INTERVAL, complex nested ARRAY/MAP, or non-primitive types lead to GlutenNotSupportException, which in turn triggers fallback.
  • Type Validator Gates (Validators.scala):
    fallbackComplexExpressions, fallbackByTimestampNTZ, and fallbackByNativeValidation block offloading if a Spark expression contains types or subtypes that Velox does not recognize. The corresponding type exclusions were confirmed in the code (e.g., interval, decimal overflow, Spark's TimestampNTZType).
  • Velox Cast ANSI Gate:
    In Velox C++, only String→{Boolean, Date, Integral} casts implement full ANSI semantics (isAnsiSupported() in SparkCastExpr.cpp). All other pairs (e.g., String→Decimal, Decimal→Numeric, complex nested types) fall back at the Scala or Velox layer.
  • Backend Option / User Flag:
    Rare: opt-out via user config; not a dominant pattern here.

Conclusion:

Most fallbacks are due to casts and arithmetic on unsupported types, mainly interval, complex, and decimal types, plus cast pairs that are not ANSI-plumbed in Velox. See the "P0" and "P1" fix recommendations.


Failure Hotspot Table

Suite | # Failures | Root Cause
GlutenArithmeticExpressionSuite | 38 | WRONG_EXCEPTION, NO_EXCEPTION: Velox exception wrapped as SparkException; missing/incorrect ANSI overflow
GlutenCastWithAnsiOnSuite | 31 | NO_EXCEPTION, WRONG_EXCEPTION: Majority due to cast pairs not supported for ANSI checks in Velox
GlutenDateExpressionsSuite | 6 | NO_EXCEPTION, WRONG_EXCEPTION: ANSI not strict for invalid/malformed date/timestamp
GlutenDataFrameAggregateSuite | 3 | NO_EXCEPTION: Overflow paths not triggering error in ANSI SUM/AVG
GlutenCollectionExpressionsSuite | 2 | WRONG_EXCEPTION, NO_EXCEPTION: MapFromEntries mismatch, missing ArrayIndexOutOfBounds in ANSI mode
MiscOperatorSuite | 3 | OTHER: Remainder with non-foldable, mismatch in aggregation
FallbackSuite, UDFPartialProject, etc. | 1 ea | OTHER: Single edge-case or miscellaneous regression

Note: Several other suites have 1–2 failures each of varied causes.


failCause Type Statistics

Type | Count | % of Failures | Interpretation
WRONG_EXCEPTION | 41* | 43.6% | Velox error thrown, but Spark's wrapper rethrows as SparkException; ANSI error type context lost
NO_EXCEPTION | 44 | 46.8% | Expected ANSI-mode error (overflow/invalid cast) never thrown by Velox; result returned
OTHER | 19 | 20.2% | Includes correctness mismatches, error summary/output mismatches, miscellaneous

*Counting de-duped multi-test records in JSON sample


Deep Analysis: WRONG_EXCEPTION Pathways

  • Typical Stack Path:
    Velox C++ throws an std::overflow_error or similar (e.g., via VELOX_FAIL("[ARITHMETIC_OVERFLOW] ...")).
    → JNI/Java bridge in ColumnarBatchOutIterator.java catches, wraps as generic SparkException.
    → Error in test: "Expected ArithmeticException but got SparkException".
  • Key Code Sites:
    • C++: velox/functions/sparksql/Arithmetic.cpp (throws native exception; includes error context)
    • Java JNI Bridge: gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java
    • Scala: Tests use findCause to unwrap exceptions (but type is lost when SparkException wraps).
  • Impact:
    Exception type and context (error message with SQL text) are lost. This leads to failure of tests asserting overflow/cast error context, even if the underlying error occurs as expected.
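The effect described above can be illustrated with a small Python analogue. It is not Gluten code: `SparkException` is a stand-in class, `find_cause` only mimics the idea of the Scala test helper, and whether the real bridge drops the cause is a claim of the analysis, not something this sketch verifies.

```python
class SparkException(Exception):
    """Stand-in for org.apache.spark.SparkException (illustration only)."""


def find_cause(exc, wanted_type):
    """Walk the __cause__ chain and return the first instance of wanted_type."""
    while exc is not None:
        if isinstance(exc, wanted_type):
            return exc
        exc = exc.__cause__
    return None


def wrap_message_only(native_message):
    # Only the text survives; the original exception type is gone.
    return SparkException(native_message)


def wrap_with_cause(native_message):
    wrapped = SparkException(native_message)
    wrapped.__cause__ = ArithmeticError(native_message)  # original type preserved
    return wrapped


message = "[ARITHMETIC_OVERFLOW] integer overflow"
print(find_cause(wrap_message_only(message), ArithmeticError))
print(type(find_cause(wrap_with_cause(message), ArithmeticError)).__name__)
```

With message-only wrapping, no amount of unwrapping can recover the type, which is why a test that asserts on the exception class fails even though the underlying error occurred.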

NO_EXCEPTION: Root Cause Breakdown

Category | Failures | Root Cause | Files
Cast | 25 | isAnsiSupported() in Velox false for most cast pairs. Falls back to try_cast; errors suppressed | SparkCastExpr.cpp (C++)
Arithmetic | 9 | Velox arithmetic not triggering overflow pathway due to non-ANSI code | Arithmetic.cpp (C++), fallback/EvalCtx
Date/Datetime | 6 | Date/timestamp parsing errors not detected in ANSI path | DateTimeFunctions.cpp (C++ Spark SQL funcs)
Decimal | 1 | Decimal overflow paths not ANSI-complete or not checked in Velox | Decimals.cpp, SparkCastExpr.cpp (C++)
Collection | 1 | Invalid element index not detected | CollectionFunctions.cpp
DataFrameAgg | 3 | SUM/AVG overflow not signaled | Aggregation.cpp, Decimal/Sum path

Conclusion:

  • NO_EXCEPTION for Casts = Velox does not check (or delegate) the cast error path for that pair; the cast falls through while the Spark test expects an error.
  • For Arithmetic/Decimal = Missing or incomplete overflow detection under ANSI mode.
  • For Date/Datetime = No error on malformed input because the code path lacks a strict throw.

Failed + Fallback (🟠) Anomalies

  • None reported — This is expected; fallback to Spark should always preserve full ANSI behavior.

Fix Recommendations

1. [P0] Implement/Un-gate ANSI Cast Error Paths for Additional Cast Pairs in Velox

  • Symptom:
    All NO_EXCEPTION failures in GlutenCastWithAnsiOnSuite (25 tests), plus many Fallbacks: validation for illegal/overflowing casts (e.g., String→Decimal, Decimal→Numeric) is bypassed — Velox currently only honors ANSI error check for String→{Boolean, Date, Integral}.
  • Root Cause:
    Function isAnsiSupported() in velox/functions/sparksql/specialforms/SparkCastExpr.cpp restricts ANSI-mode branch to a tiny whitelist. All other casts revert to try_cast (which returns default/null instead of error).
  • Fix Point:
    • File: velox/functions/sparksql/specialforms/SparkCastExpr.cpp
    • Action:
      • Expand the isAnsiSupported() whitelist to include more Spark-supported ANSI cast pairs:
        • String→Decimal
        • Decimal→{Short, Int, Long, Double, etc.}
        • Numeric→Decimal (and vice versa)
        • Any pair required by failed tests
      • Ensure the casting logic throws on invalid input instead of silent try_cast on these pairs when kSparkAnsiEnabled is true.
  • Representative Tests:
    • ANSI mode: Throw exception on casting out-of-range value to byte type
    • ANSI mode: Throw exception on casting out-of-range value to decimal type
    • Many more listed under "GlutenCastWithAnsiOnSuite" NO_EXCEPTION group
  • Estimated Impact:
    At least 25 NO_EXCEPTION test failures, plus major reduction in Fallback count for cast category — estimate ~1,500–2,000 records would turn green, dramatically raising ANSI coverage for Cast.
  • Priority Rationale:
    Highest-impact group (25+ direct failures, ~2k fallbacks), fix is C++-side but tightly scoped to ~1–2 files and function list; does not require major Velox architectural change.
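The gate this recommendation targets can be modeled as a set of (source, target) pairs. The sketch below is a toy model in Python, not the Velox C++ code: the pair list follows the analysis above, which is itself unverified, and the type names and `cast_mode` helper are assumptions.

```python
# Pairs the analysis says are currently honored in ANSI mode (unverified).
ANSI_CAST_PAIRS = frozenset({
    ("string", "boolean"),
    ("string", "date"),
    ("string", "tinyint"),
    ("string", "smallint"),
    ("string", "int"),
    ("string", "bigint"),
})


def cast_mode(src: str, dst: str, ansi_enabled: bool, pairs=ANSI_CAST_PAIRS) -> str:
    """Return "ansi" (raise on bad input) or "try_cast" (return null on bad input)."""
    if ansi_enabled and (src, dst) in pairs:
        return "ansi"
    return "try_cast"


print(cast_mode("string", "int", ansi_enabled=True))
print(cast_mode("string", "decimal", ansi_enabled=True))

# Expanding the whitelist, as the recommendation proposes, flips the second case.
expanded = ANSI_CAST_PAIRS | {("string", "decimal")}
print(cast_mode("string", "decimal", ansi_enabled=True, pairs=expanded))
```

In this model a pair outside the set silently takes the `try_cast` path even with ANSI enabled, which is the behavior the NO_EXCEPTION cast failures are attributed to.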

2. [P1] Fix Exception Wrapping Chain to Preserve Exception Type/context (WRONG_EXCEPTION)

  • Symptom:
    For arithmetic overflows and invalid casts, Spark tests expect e.g. ArithmeticException/NumberFormatException with original SQL context, but see only SparkException (type loss), causing 32+ failures in arithmetic/cast.
  • Root Cause:
    The JNI/Java bridge (ColumnarBatchOutIterator.java) wraps or rethrows all native errors as generic SparkException, losing cause/type. This violates Spark's contract for exception class on overflow/cast error.
  • Fix Point:
    • File: gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java (translateException)
    • Action:
      Enhance error translation to:
      • Recognize certain Velox error signatures and wrap as the correct Java exception type (ArithmeticException, NumberFormatException, etc.) with SQL context preserved.
      • (Optionally) propagate cause (Throwable) if error message/source matches expected Spark error.
    • (May also require tweaks to C++ error throw points to set error code class for translation).
  • Representative Tests:
    • Add: Overflow exception should contain SQL text context
    • Divide: divide by 0 exception should contain SQL text context
    • All "Expected ArithmeticException but got SparkException"
  • Estimated Impact:
    32+ direct failures (WRONG_EXCEPTION); major improvement in error context correctness for ANSI.
  • Priority Rationale:
    Medium-high impact, fix is single-file and translation logic, but requires careful mapping of error codes and messages; java↔native bridging not trivial but well-bounded.
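One way to picture the proposed translation is a lookup from an error code embedded in the native message to a host exception type. This Python sketch is hypothetical throughout: the `[ARITHMETIC_OVERFLOW]` prefix appears in the analysis above, but the second mapping entry, the `SparkException` stand-in, and `translate_exception` as written here are illustrative assumptions, not the Java bridge's API.

```python
import re

ERROR_CODE_RE = re.compile(r"\[([A-Z_]+)\]")


class SparkException(Exception):
    """Stand-in for the generic wrapper type (illustration only)."""


# Illustrative mapping; a real bridge would map to the JVM exception classes.
EXCEPTION_BY_CODE = {
    "ARITHMETIC_OVERFLOW": ArithmeticError,
    "CAST_INVALID_INPUT": ValueError,
}


def translate_exception(native_message: str) -> Exception:
    """Pick an exception type from the bracketed error code, else wrap generically."""
    match = ERROR_CODE_RE.search(native_message)
    exc_type = EXCEPTION_BY_CODE.get(match.group(1)) if match else None
    if exc_type is None:
        return SparkException(native_message)
    return exc_type(native_message)


print(type(translate_exception("[ARITHMETIC_OVERFLOW] long overflow")).__name__)
print(type(translate_exception("unexpected native failure")).__name__)
```

Unknown or missing codes still fall back to the generic wrapper, so the translation only narrows the type where the mapping is certain.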

3. [P1] Surface Support for Additional Types (INTERVAL, TIMESTAMP_NTZ, Complex) in TypeNode and Validators

  • Symptom:
    Fallbacks (🔴) for arithmetic/datetime/collection over unsupported types: INTERVAL, TIMESTAMP_NTZ, complex/nested arrays/maps. Velox doesn’t see these due to early Scala-side rejection.
  • Root Cause:
    • gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala#getTypeNode blocks these types (e.g., "Type INTERVAL_YEAR_MONTH not supported yet"), and
    • Validators.scala disables offload for any expression involving such types.
    • Often Velox does support the primitive type but function registration/whitelisting is missing; e.g. INTERVAL arithmetic in Velox may work C++-side but isn't exposed by current validators.
  • Fix Point:
    • File(s):
      • gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala
      • gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/validator/Validators.scala
      • If C++ Velox support missing, e.g., for interval math, register C++ implementation first.
    • Action:
      • Enumerate fallback cases triggered by getTypeNode and Validators.
      • Where Velox C++ supports the type, relax "unsupported" checks and allow Substrait translation.
      • Where missing, enumerate upstream gaps.
    • Representative Tests:
      Not listed individually, but inferred from high Fallback rate in datetime, arithmetic on unsupported types, and intervals.
  • Estimated Impact:
    Could clear hundreds of currently Fallback records in datetime/arithmetic/collection.
  • Priority Rationale:
    Moderate-high impact (hundreds of Fallbacks), fix is split across Scala validation and type mapping layers, and possibly requires upstream PR for Velox function/type registration in a few cases; not as concentrated as P0.

Summary Prioritization Table

Tier | Fix Recommendation | Est. Impact | Fix Scope | Rationale
P0 | Add ANSI Cast support for more cast pairs (Velox C++) | 1,500+ | SparkCastExpr.cpp only | Very high, single-source, unblocks both fail and fallback
P1 | Java/C++ error type pass-through for Arithmetic/Cast errors | 32+ | Java JNI bridge | Concentrated, needed for correctness, requires precise error decode
P1 | Remove type gates for INTERVAL/etc. in Validators, TypeNode | 500+ | Scala + (some) C++ | Wide impact, somewhat split fix; C++ may already support some types

Prioritized Action Steps

  1. [P0] Focus first on adding Cast error-path support in Velox for String→Decimal, Decimal↔Numeric, etc. (isAnsiSupported) — this alone will slash both fail and fallback rates for Cast.
  2. [P1] Fix the exception-wrapping chain in the Java JNI bridge, making sure error type is surfaced correctly to Spark (ArithmeticException, NumberFormatException, etc).
  3. [P1] Loosen Scala-side blocking of INTERVAL/TIMESTAMP_NTZ types where Velox supports the primitive/function, especially in arithmetic and datetime. If absent in Velox, file upstream or stage for later.

No Failed+Fallback (🟠) anomalies found.


References for Source Location Verification

  • Confirmed from reviewing SparkCastExpr.cpp (isAnsiSupported), most cast pairs not yet covered:
    bool isAnsiSupported(...) {
      // Only (String, Boolean), (String, Date), (String, TINYINT/SMALLINT/INT/BIGINT?) supported as true
      // All other (String, Decimal), (Decimal -> X), etc. not supported — defaults to try_cast
    }
    
  • Type fallback for INTERVAL types on Scala side:
    case _: IntervalType =>
      throw new GlutenNotSupportException("Type INTERVAL...")
    
  • Validators for fallback:
    def fallbackByTimestampNTZ(...) = ...
    def fallbackComplexExpressions(...) = ...
    
  • Exception wrapping via translateException:
    private Throwable findCause(Throwable t, ...) {
      ...
      throw new SparkException(msg, cause)
    }
    

This action plan, if followed, will raise ANSI coverage for Velox by thousands of tests and ensure correctness for critical error-path behavior.


Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.

@github-actions

🔄 ANSI full test started by @baibaichen. View run

@github-actions

ANSI Mode Test Analysis Report (Spark 4.1)

🤖 AI Deep Analysis

Key Findings

Fallback (🔴) Analysis – Highest Priority

Fallback indicates expressions are not offloaded to Velox at all (Spark executes them). This is the most critical issue—Gluten appears to "pass" ANSI semantics but provides no native acceleration.

Fallback Breakdown by Expression Type

Expression Type | Fallback Count | % of total records | Example Root Cause
Cast | ~5,419 | 12.5% | Unsupported type or cast pair
Arithmetic | Included in above; otherwise rare | | Unsupported decimal/overflow
Datetime | Present but minor | | Interval types, TimestampNTZ
Collection | Minor | | Nested/complex types unsupported
Others | Minor | | UDF, Plan Stability, Fallback

Root Causes for Fallbacks

Based on code review of Validators.scala, ConverterUtils.scala, and Velox source:

  • Cast Fallbacks (majority):

    • Unsupported Cast Pair:
      • isAnsiSupported() in Velox C++ (SparkCastExpr.cpp): Only String→{Boolean, Date, Integral} types honored in ANSI mode. All others (e.g., numeric↔decimal, decimal↔string, timestamp, array/datatype casts) silently fall back to Spark, even if tests "pass."
      • Scala Fallback in Validators.scala/ConverterUtils.scala: Type not in whitelist in getTypeNode. E.g., Decimal, Interval, ArrayType not implemented.
    • Unsupported Data Types:
      • Fallback for Interval, complex nested types, TimeZone-specific types.
    • Backend Option / User Opt-out:
      • Configuration GLUTEN_ANSI_FALLBACK_ENABLED enables fallback by default for everything not whitelisted above.
  • Arithmetic Fallbacks (minor):

    • Decimal Arithmetic Overflow: No support in Velox for some overflow detection paths for decimal sums/products (see also relevant failures).
  • Datetime / Complex Types:

    • Interval, Date/Time Construction, TimestampNTZ: Not all are supported natively in Velox or mapped in Scala converter.

Root-cause Grouping in Code

  • Scala validator gate: fallbackByBackendSettings, fallbackByNativeValidation
  • Scala type mapping gate: GlutenNotSupportException in getTypeNode for non-whitelisted Spark types.
  • Velox C++ gate: isAnsiSupported() in SparkCastExpr.cpp (always check this for Cast Fallback).

Failure Hotspot Table (Suite/Cause Concentration)

Suite | Failure Count | Representative Root Causes
GlutenArithmeticExpressionSuite | 38 | WRONG_EXCEPTION (exception wrapped), NO_EXCEPTION, ANSI overflow
GlutenCastWithAnsiOnSuite | 31 | NO_EXCEPTION (ANSI path not hit due to fallback/whitelist), WRONG_EXCEPTION, miss on String→Numeric, Decimal casts
GlutenDateExpressionsSuite | 6 | NO_EXCEPTION, WRONG_EXCEPTION
GlutenCollectionExpressionsSuite | 2 | NO_EXCEPTION, WRONG_EXCEPTION
GlutenDataFrameAggregateSuite | 3 | NO_EXCEPTION on overflow
GlutenDecimalExpressionSuite | 1 | NO_EXCEPTION: Decimal overflow
MiscOperatorSuite | 3 | OTHER: non-divide exceptions
(others: SQLQuery, UDF, PlanStab.) | ≤3 each | OTHER, plan mismatch, fallback artifacts

failCause Type Statistics

Type | Count | % of Failures | Interpretation
WRONG_EXCEPTION | 41 | 42% | Velox throws an exception but Spark wraps as SparkException, losing original type
NO_EXCEPTION | 43 | 44% | Velox does not throw expected exception (often due to silent fallback or try_cast path taken)
OTHER | 26 | 27% | Result mismatches, plan mismatches, miscellaneous error types, plan stability, etc.

Root Cause Deep Analysis: WRONG_EXCEPTION

Observed pattern:

  • Test expects ArithmeticException, NumberFormatException, or SparkRuntimeException, but gets only a generic SparkException.
  • Root cause: Velox throws a specific exception; Spark gets it as a generic exception due to Java/C++ boundary wrapping.

Exception Wrapping Chain:

  1. Velox throws (e.g., ARITHMETIC_ERROR, INVALID_ARGUMENT) →
  2. gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java (translateException):
    • Catches native/C++ exception, translates to generic SparkException or similar; loses type/field context.
  3. SparkTest asserts on exception type (test expects NumberFormatException/ArithmeticException) but only receives wrapper.
  4. Java stack trace in JSON reflects the topmost SparkException, losing root type.

Key Code Location:

  • ColumnarBatchOutIterator.java::translateException
  • Test matcher in GlutenTestsTrait.scala::findCause (cannot recover lost type)

Breakdown of NO_EXCEPTION by Root Cause

Category | Cases | Root Cause Summary
Cast | 25 (GlutenCastWithAnsiOnSuite), 1 (GlutenDecimalExpressionSuite) | Velox does not perform ANSI-enforced cast for types not in whitelist (isAnsiSupported in C++; falls back to try_cast)
Arithmetic | 6 (Arithmetic Expression), 3 (AggregateSuite) | Overflow or division by zero not detected due to fallback or missing overflow path in Velox arithmetic
Datetime | 4 (DateExpressions), 2 (SQLQuerySuite) | Date construction or formatting not supported
Collection | 1 (CollectionExpressions) | Element out-of-bounds case not caught in native path, fallback did not propagate error
Math | 1 (MathExpressions) | Overflow (conv)

Failed+Fallback (🟠) Analysis

No Failed+Fallback (🟠) records reported in this dataset.
If present, these would suggest logic errors in Fallback detection or error propagation. No investigation needed at this time.


Fix Recommendations (max 3)


1. Correct Cast Offloading: Implement ANSI cast semantics for all on-CPU cast pairs

Symptom:

  • NO_EXCEPTION and many Fallback (🔴) for cast expressions in ANSI mode — e.g., casting out-of-bounds strings, numerics, arrays should throw exception, but do not.
  • All cast pairs not explicitly whitelisted in Velox (isAnsiSupported in SparkCastExpr.cpp) never reach ANSI path; fallback or silently use non-ANSI try_cast.

Root Cause:

  • C++ Whitelist in Velox (SparkCastExpr.cpp::isAnsiSupported):
    Only allows String→{Boolean, Date, Integral} in ANSI path; all others revert to non-ANSI (try_cast) behavior, masking errors.

Fix Point:

  • Velox C++:
    • File: velox/functions/sparksql/specialforms/SparkCastExpr.cpp::isAnsiSupported
    • Direction:
      • Expand whitelist to allow all safe Spark cast pairs to run with ANSI behavior when sparkAnsiEnabled is set.
      • For each expanded pair, ensure the ANSI path properly throws Spark-style exceptions (overflow, parse, etc), not silent coercion.
      • Optionally, match Spark error types in thrown code.
  • Scala (secondary, if necessary):
    • Remove fallback logic for these pairs in Validators.scala and type mapping in ConverterUtils.scala.
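The whitelist behavior described in the root cause can be sketched as a simple predicate. This is a Python illustration under assumptions, not the C++ in SparkCastExpr.cpp: the type labels are plain strings chosen for readability, and `is_ansi_supported` and `cast_mode` are hypothetical names. The only fact taken from the report is that String→{Boolean, Date, Integral} is the supported set.

```python
# Illustrative type labels (assumption); the real code works on Velox type objects.
INTEGRAL_TYPES = {"TINYINT", "SMALLINT", "INTEGER", "BIGINT"}
ANSI_TARGETS_FROM_STRING = {"BOOLEAN", "DATE"} | INTEGRAL_TYPES


def is_ansi_supported(from_type: str, to_type: str) -> bool:
    """Return True only for the String -> {Boolean, Date, Integral} pairs."""
    return from_type == "STRING" and to_type in ANSI_TARGETS_FROM_STRING


def cast_mode(from_type: str, to_type: str, ansi_enabled: bool) -> str:
    """Pick the cast flavor: the ANSI path only when enabled and whitelisted."""
    if ansi_enabled and is_ansi_supported(from_type, to_type):
        return "ansi_cast"
    return "try_cast"


print(cast_mode("STRING", "INTEGER", ansi_enabled=True))
print(cast_mode("STRING", "DECIMAL", ansi_enabled=True))
```

Expanding the whitelist, as the fix direction suggests, would amount to widening the set of pairs for which the predicate returns true, provided each new pair has a throwing implementation behind it.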

Representative Tests:

  • "ANSI mode: Throw exception on casting out-of-range value to byte type"
  • "cast from invalid string array to numeric array should throw NumberFormatException"
  • "Fast fail for cast string type to decimal type in ansi mode"
  • Others in GlutenCastWithAnsiOnSuite, DecimalExpressionSuite, DateExpressionsSuite

Estimated Impact:

  • 26+ tests currently Fallback or NO_EXCEPTION would go green
  • Plus >3500 real-world records would now be truly offloaded (see fallback count)

Priority Rationale:

  • Highest impact: matches ≥25 direct failures plus thousands of Fallback expression records. The fix scope is well-defined in a single Velox C++ function, isAnsiSupported. No major Spark-side or cross-layer change is required if the Velox side is comprehensive, and there are no blocking upstream Spark/Velox issues; the work is expansion and testing of an existing function. This meets the P0 criteria.

2. Exception Unwrapping: Map Velox native exceptions to matching Spark exception types in JNI bridge

Symptom:

  • WRONG_EXCEPTION: Tests expect ArithmeticException, NumberFormatException, SparkRuntimeException, etc., but only receive generic SparkException.
  • Context in failCause: "Expected ArithmeticException but got SparkException"

Root Cause:

  • ColumnarBatchOutIterator.java::translateException only emits generic SparkException, does not attempt to deserialize exception type or message from Velox error.

Fix Point:

  • Java Bridge:
    • File: gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java::translateException
    • Direction: Implement mapping of Velox error codes/messages to Java exception types. Example: ARITHMETIC_ERROR maps to ArithmeticException, INVALID_ARGUMENT to NumberFormatException, etc.
    • Optionally propagate the exception-cause chain if JNI allows.
  • Scala Test:
    • No change needed if Java bridge propagates correct type.
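The mapping direction above can be sketched as a lookup table. The sketch is in Python for brevity and is not the Java bridge code; `pick_exception_type` is a hypothetical helper, and matching on a substring of the native message is an assumption about how the error code would be exposed. The two code-to-type pairs are the ones named in the fix direction.

```python
# Pairs taken from the recommendation text; everything else is illustrative.
VELOX_CODE_TO_JAVA_EXCEPTION = {
    "ARITHMETIC_ERROR": "java.lang.ArithmeticException",
    "INVALID_ARGUMENT": "java.lang.NumberFormatException",
}
DEFAULT_EXCEPTION = "org.apache.spark.SparkException"


def pick_exception_type(native_message: str) -> str:
    """Return the Java exception class name for a native error message.

    Falls back to the generic wrapper when no known code is present.
    """
    for code, java_type in VELOX_CODE_TO_JAVA_EXCEPTION.items():
        if code in native_message:
            return java_type
    return DEFAULT_EXCEPTION


print(pick_exception_type("ARITHMETIC_ERROR: integer overflow"))
print(pick_exception_type("UNKNOWN: something else"))
```

Keeping the generic type as the default means unrecognized native errors would behave as they do today.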

Representative Tests:

  • "Add: Overflow exception should contain SQL text context"
  • "cast from invalid string to numeric should throw NumberFormatException"
  • "TIMESTAMP_SECONDS"
  • Most entries in GlutenArithmeticExpressionSuite, GlutenCastWithAnsiOnSuite

Estimated Impact:

  • 41 failed tests go green
  • All Velox-native exceptions become debuggable to correct Spark-native exceptions, improving reliability and test coverage.

Priority Rationale:

  • High impact (41 failures), and the fix scope is almost entirely a single Java file (exception mapping). No upstream Velox changes are needed, only the Java→Scala boundary. There is no semantic risk because exception types are mapped and computation is untouched. This meets the P0 criteria.

3. Decimal Arithmetic Overflow Handling: Implement missing overflow checks for Decimal expressions in Velox

Symptom:

  • NO_EXCEPTION and Fallback in Decimal arithmetic (sum, avg, integral divide, remainder etc.)
  • Tests like "SPARK-28067: Aggregate sum should not return wrong results for decimal overflow" fail—either exception not thrown, or fallback occurs.

Root Cause:

  • Decimal arithmetic in Velox lacks full overflow path parity with Spark.
  • e.g. checked_add, checked_multiply for decimals not instrumented with overflow detection under Spark ANSI mode, or not supported for all precisions/scales.

Fix Point:

  • Velox C++:
    • Files: velox/functions/sparksql/Arithmetic.cpp, possibly velox/functions/sparksql/Decimal.cpp
    • Add overflow checking logic for decimal operations under ANSI config.
    • Ensure exceptions are thrown with error codes allowing mapping as above.
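A minimal sketch of the kind of check this recommendation asks for, in Python rather than Velox C++. It assumes both operands share the same scale and are given as unscaled integers; `checked_decimal_add` is a hypothetical name, not an existing Velox function. The behavior modeled is Spark's: with ANSI enabled an overflow raises, otherwise the result is null.

```python
def checked_decimal_add(a_unscaled, b_unscaled, precision, ansi_enabled):
    """Add two same-scale decimals given as unscaled integers.

    A result needing more than `precision` digits is an overflow:
    raise under ANSI mode, return None (null) otherwise.
    """
    result = a_unscaled + b_unscaled
    if abs(result) >= 10 ** precision:
        if ansi_enabled:
            raise ArithmeticError(
                f"ARITHMETIC_ERROR: decimal overflow at precision {precision}"
            )
        return None
    return result


print(checked_decimal_add(45, 54, precision=2, ansi_enabled=True))   # 99
print(checked_decimal_add(99, 1, precision=2, ansi_enabled=False))   # None
```

The error message carries an ARITHMETIC_ERROR prefix so that a bridge-side mapping could translate it to the matching JVM exception type.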

Representative Tests:

  • "IntegralDivide: throw exception on overflow under ANSI mode"
  • "Aggregate sum big decimal overflow"
  • "SPARK-28322: IntegralDivide supports decimal type"

Estimated Impact:

  • ~6 direct failures go green (+ many Fallbacks in potential real-world code)

Priority Rationale:

  • Medium direct impact (6 failures), but important for Decimal correctness and financial use cases. Requires more C++ engineering but modular in Velox, does not require deep cross-layer or upstream redesign. Some semantic risk for precision, so P1.

Summary Table of Recommendations

| Priority | Symptom/Area | Estimated Tests Fixed | Difficulty/Scope |
| --- | --- | --- | --- |
| P0 | Cast offloading & ANSI enforcement | 26+ (direct) + 5000+ fallback | Single C++ file SparkCastExpr.cpp, no upstream, low semantic risk |
| P0 | Exception unwrapping | 41 | Single Java file ColumnarBatchOutIterator.java, fail-safe |
| P1 | Decimal overflow handling | 6+ | Velox C++ (Arithmetic.cpp, Decimal.cpp), moderate risk/difficulty |

In summary:

  • The largest and most urgent gap is fallback for all non-whitelisted Cast expressions: these casts must be supported natively in Velox under ANSI mode rather than falling back to Spark.
  • Next, exception wrapping must be corrected so Velox error codes deliver the expected Spark exception types.
  • Finally, decimal overflow for arithmetic/aggregate must be added in Velox to match Spark's precision and error path.

If these three changes are made, the majority of all currently failing and fallback records (cast, arithmetic, decimal, and related) are likely to go green.


Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.

@baibaichen baibaichen marked this pull request as ready for review April 27, 2026 02:32
@github-actions

🔄 ANSI analyze-only started by @baibaichen. View run

@github-actions

ANSI Mode Test Analysis Report (Spark 4.1)

Note

Expression-level ANSI mode offload coverage analysis.
Test config: spark.sql.ansi.enabled=true, spark.gluten.sql.ansiFallback.enabled=false.

  • Passed (🟢): Velox correctly handles ANSI semantics
  • Fallback (🔴): Expression falls back to Spark execution, needs ANSI support in Velox
  • Failed (🟡): Velox executes but ANSI error behavior differs from Spark, needs exception handling fix

ANSI Offload suites: 498 tests, 43258 records | Other suites: 17706 tests

ANSI Offload

Overview (ANSI Offload Expression Records)

| Classification | Count | % |
| --- | --- | --- |
| 🟢 Passed | 37786 | 87.4% |
| 🟡 Failed | 53 | 0.1% |
| 🔴 Fallback | 5419 | 12.5% |
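The percentages in the overview can be reproduced from the raw counts. The snippet below is a small check in Python, not an excerpt from analyze-ansi.py; the variable names are illustrative.

```python
# Record counts as reported in the overview.
counts = {"Passed": 37786, "Failed": 53, "Fallback": 5419}

total = sum(counts.values())  # should equal the 43258 records reported
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The three counts add up to the 43258 records stated for the ANSI Offload suites, so the classification is exhaustive over those records.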

Per-Suite Summary

| Suite | 🟢 Passed | 🟡 Failed | 🔴 Fallback |
| --- | --- | --- | --- |
| GlutenArithmeticExpressionSuite | 224 (72%) | 19 | 68 |
| GlutenTryEvalSuite | 12 (52%) | 0 | 11 |
| GlutenCastWithAnsiOffSuite | 10902 (92%) | 2 | 963 |
| GlutenCastWithAnsiOnSuite | 10830 (94%) | 21 | 613 |
| GlutenTryCastSuite | 10967 (94%) | 0 | 662 |
| GlutenCollectionExpressionsSuite | 523 (70%) | 2 | 225 |
| GlutenDateExpressionsSuite | 2273 (64%) | 6 | 1263 |
| GlutenIntervalExpressionsSuite | 7 (2%) | 1 | 445 |
| GlutenDecimalExpressionSuite | 18 (95%) | 1 | 0 |
| GlutenMathExpressionsSuite | 1539 (72%) | 1 | 588 |
| GlutenStringExpressionsSuite | 491 (46%) | 0 | 581 |

Failure Cause Analysis (53 failures)

| Cause | Count | Description |
| --- | --- | --- |
| NO_EXCEPTION | 27 | Velox did not throw expected ANSI exception |
| WRONG_EXCEPTION | 23 | Exception wrapped as SparkException |
| OTHER | 3 | Result mismatch or eval exception |

Other (23 failures)

| Suite | Failures |
| --- | --- |
| GlutenSQLQuerySuite | 4 |
| MiscOperatorSuite | Support multi-children count with row construct; Remainder with non-foldable right side; Cast string to date |
| GlutenDataFrameAggregateSuite | SPARK-28067: Aggregate sum should not return wrong results for decimal overflow; SPARK-35955: Aggregate avg should not return wrong results for decimal overflow; SPARK-28224: Aggregate sum big decimal overflow |
| GlutenQueryExecutionAnsiErrorsSuite | INVALID_DATETIME_PATTERN with non-constant pattern; SPARK-46922: user-facing runtime errors |
| FallbackSuite | fallback when nested loop join has unsupported expression |
| UDFPartialProjectSuite | udf in agg simple |
| DateFunctionsValidateSuite | make_date |
| GlutenFileSourceSQLInsertTestSuite | SPARK-38228: legacy store assignment should not fail on error under ANSI mode |
| GlutenTPCDSV1_4_PlanStabilitySuite | check simplified (tpcds-v1.4/q83) |
| GlutenTPCDSV1_4_PlanStabilityWithStatsSuite | check simplified sf100 (tpcds-v1.4/q83) |
| GlutenComplexTypeSuite | SPARK-33386: GetArrayItem ArrayIndexOutOfBoundsException |
| GlutenMiscExpressionsSuite | RaiseError |
| GlutenQueryContextSuite | SPARK-50290: Add a flag to disable DataFrame context |
| VeloxAdaptiveQueryExecSuite | Gluten - SPARK-33551: Do not use AQE shuffle read for repartition |
| GlutenInsertSuite | Gluten - remove v1writes sort |
🤖 AI Deep Analysis

Key Findings

Fallback Analysis (🔴 - Highest Priority)

Total Fallback Records: 5,419 (12.5% of 43,258)

  • By Expression Category (from categories.fallback):
    • Cast: Primary source of fallbacks (hundreds of the 5,419 records; the exact per-category count is missing, but Cast is known to be dominant from project experience and the relative pass/fail numbers)
    • Arithmetic: Secondary but significant component
    • DateTime, Collection, Decimal: Fewer, but present

Root Causes by Category

  1. Cast Expressions:
    • Root Cause:
      In gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala, most Spark→Substrait type mappings are whitelisted. Types like Interval, Map, Complex Nested, as well as some decimal, time, and user-defined types, are simply not supported in getTypeNode.
      Direct Fallback Gate: GlutenNotSupportException("Type X not supported")
    • Additional Substrait layer gates (see Validators.scala), e.g., fallbackByNativeValidation
    • Velox C++ Check: Many SparkSQL-specific Casts (e.g., Date/Time/Interval, some nested types) are not implemented in Velox.
    • Example: IntervalType or any custom type cast triggers fallback upstream regardless of Velox capability.
  2. Arithmetic Expressions:
    • Root Cause:
      Fallback typically occurs when Spark disables native execution for ANSI mode via a validator, or because Velox doesn't yet support the precise Spark-compatible semantics for overflow/exception (subtle cases in Validators.scala).
    • Example: Some Decimal/Integral operations lacking fail-fast overflow handling.
  3. DateTime Expressions:
    • Root Cause:
      Spark SQL's DateTime types (especially TimestampNTZType and any with timezone encoding) are not mapped in ConverterUtils.scala. Some functions (like make_date/make_timestamp) are marked unsupported via conversion gates or validator rules.
  4. Other:
    • Collection, Decimal, and miscellaneous complex expressions: Lack of expression → Substrait conversion or incomplete support in Velox registry is most common.
    • Reference: gluten-core/.../ExpressionConverter.scala missing case handling.

Category Summary Table

| Expression Type | Fallback Volume | Example Cause |
| --- | --- | --- |
| Cast | Highest | getTypeNode unsupported type, interval/decimal/date/time/nested |
| Arithmetic | High | Decimal overflow, missing fail-fast wrappers |
| DateTime | Medium | make_timestamp, TimestampNTZ conversions |
| Collection | Low | Map/Array complex conversions unsupported |
| Decimal | Low | Decimal<->other transformation missing |
| Math, String | Minimal | Advanced expressions, not offloaded |

Failure Hotspot Table

| Suite | Failures | Root Cause Summary |
| --- | --- | --- |
| GlutenArithmeticExpressionSuite | 32+6 | 32 WRONG_EXCEPTION, 6 NO_EXCEPTION; arithmetic fails to throw correct/any exception on overflow |
| GlutenCastWithAnsiOnSuite | 25+5 | 25 NO_EXCEPTION, 5 WRONG_EXCEPTION; cast does not throw in ANSI mode (Velox try_cast used by default) |
| GlutenDateExpressionsSuite | 4+2 | 4 NO_EXCEPTION, 2 WRONG_EXCEPTION; DateTime parsing not offloaded or not ANSI-compliant |
| Other (various single-digit suites) | <5 each | Mix of collection, decimal, SQL query error handling (often NO_EXCEPTION or wrapping) |
| GlutenDataFrameAggregateSuite | 3 | NO_EXCEPTION on decimal overflow aggregation |
| MiscOperatorSuite | 3 | OTHER, division-by-zero error, likely exception propagation or context loss |

The largest hotspots by far are Arithmetic and Cast under ANSI mode.


failCause Type Statistics

| Type | Count | % of All Failures | Interpretation |
| --- | --- | --- | --- |
| WRONG_EXCEPTION | 41 | 43% | Velox throws SparkException or a different exception than Spark expects; exception chain/wrapping issue |
| NO_EXCEPTION | 43 | 45% | Velox does NOT throw any exception when Spark expects (most often, ANSI mode cast overflows — see Cast/Arithmetic) |
| OTHER | 17 | 18% | Miscellaneous: incorrect result, context missing, Spark config mismatch, etc. |
| (Failed+Fallback) | 0 | 0% | None observed (🟠). |

Sample interpretation:

  • WRONG_EXCEPTION: Most commonly, Velox errors get wrapped as generic SparkException in the JNI cross-language layer, losing their original type (ArithmeticException, NumberFormatException, etc.).
  • NO_EXCEPTION: Velox does not enforce ANSI error semantics — uses "try_cast" (returns null) instead of fail-fast, especially for most casts unless white-listed.
  • OTHER: Unexpected result, missing query context in error, or test config not matching Velox's current semantic mode.

WRONG_EXCEPTION — Deep Analysis

Symptom:
Expected ArithmeticException or NumberFormatException, but received SparkException.

Code Path:

  • Native C++ throws e.g., velox::VeloxUserError("ARITHMETIC_OVERFLOW") or similar.

  • Crossed into Java via JNI:
    gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java#translateException
    This logic wraps all native-side exceptions as SparkException, losing the original exception type.

  • Spark's ANSI mode error handling expects exact types (ArithmeticException, etc.; see test suite code).

  • Sample message: "Expected ArithmeticException but got SparkException"

Fix Point:

  • Change translateException to map distinct native exception categories to expected Java exception types (see mapping logic in Spark's own JNI bridges).

NO_EXCEPTION Breakdown by Root Cause

| Area | Breakdown Details |
| --- | --- |
| Cast | Most dominant: Velox's isAnsiSupported in SparkCastExpr.cpp only whitelists a tiny subset of casts. All others fall back to try_cast, which never throws (returns null instead). |
| Arithmetic | Some arithmetic ops (e.g. divide by zero, overflow) either return bad result or null instead of throwing ANSI exceptions. Not all operations have Spark-compliant overflow guards enabled in Velox. |
| DateTime | When parsing malformed date/time, function does not throw or returns null — again, try_cast or silent fail. |
| Aggregation | Decimal sum/avg overflow is not checked — returns wrong results or null rather than fail-fast. |

Failed+Fallback (🟠) Records

None detected — as expected.


Fix Recommendations

P0: Proper Java Exception Wrapping for Native-side Failures in ANSI Mode

  • Symptom:
    Test expects ArithmeticException or NumberFormatException but gets generic SparkException (WRONG_EXCEPTION).
  • Root Cause:
    All C++ exceptions are mapped to SparkException in translateException (gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java). No logic to inspect/generate the correct type (as Spark JNI bridges do).
  • Fix Point:
    Update translateException to inspect the native exception's category/message (look for ARITHMETIC_OVERFLOW, INVALID_CAST, etc.) and instantiate the matching JVM exception (ArithmeticException, etc.). Use mappings per Spark core internal JNI bridges.
  • Representative Tests:
    • Arithmetic: "Add: Overflow exception should contain SQL text context"
    • Cast: "cast from invalid string to numeric should throw NumberFormatException"
  • Estimated Impact:
    At least 41 tests would turn 🟡 → 🟢 (100% of WRONG_EXCEPTION class — current count: 41).
  • Priority Rationale:
    Highest failure volume for a single bug (41 cases), and the fix is confined to a single file (the Java wrapper); it needs no Velox or cross-layer change.

P1: Expand Velox ANSI Cast Support Beyond Current Whitelist

  • Symptom:
    Casts in ANSI mode (e.g., String→Decimal, String→Numeric, Array, Struct nested) do NOT throw when expected (NO_EXCEPTION), often for out-of-range or malformed inputs.
  • Root Cause:
    Velox: In velox/functions/sparksql/specialforms/SparkCastExpr.cpp, isAnsiSupported() only enables ANSI/fail-fast for a tiny hardcoded set of cases:
    "String→{Boolean, Date, Integral}".
    All other Casts use try_cast which returns null silently on failure.
  • Fix Point:
    • C++: velox/functions/sparksql/specialforms/SparkCastExpr.cpp
      • Expand isAnsiSupported() logic to include additional cast type pairs (e.g., String→Decimal, Float, Struct, Array, etc.).
      • Implement or wire up ANSI error-throwing analogues for those casts.
    • Scala: (OPTIONAL, if validator disables offload for Cast): adjust fallback gates (Validators.scala) after C++ capability added.
  • Representative Tests:
    • Cast: "ANSI mode: Throw exception on casting out-of-range value to byte/short/int/long/decimal type"
    • Arrays: "cast from invalid string array to numeric array should throw NumberFormatException"
  • Estimated Impact:
    Up to 25 tests can convert from 🟡 to 🟢 (based on Cast NO_EXCEPTION failures).
  • Priority Rationale:
    High impact (25+), but requires C++ Velox-side work (row/type code, possibly more cast internals); multi-file change, cross-language; some risk if new type paths need validation.

P2: Backfill Spark→Substrait Type Node Support for "Fallen Back" Cast/Arithmetic/DateTime Expressions

  • Symptom:
    Fallback (🔴): Cast, Arithmetic, and DateTime expressions fall back to Spark, thus not tested or executed via Velox at all.
  • Root Cause:
    Key conversion files such as ConverterUtils.scala#getTypeNode are missing support for certain Spark types (IntervalType, Map/Array/Struct with certain nested fields, DecimalType variants, TimestampNTZ, etc.), causing an early GlutenNotSupportException, thus never reaching Velox.
  • Fix Point:
    • Scala: gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala
      • Add getTypeNode logic for unsupported Spark types (if Velox backend supports them or after P1 is done).
    • Scala: Adjust validators in Validators.scala to remove over-eager fallback on supported types.
    • C++: (OPTIONAL) Only if Velox lacks the type or function implementation.
  • Representative Tests:
    • Cast/DateTime (hundreds, specifics not in sample because bypassed entirely)
    • Any test whose output is Fallback but would otherwise be offloadable
  • Estimated Impact:
    All 5,419 fallback records are potentially affected; a practical first win is probably in the dozens to hundreds, depending on which type is addressed first.
  • Priority Rationale:
    Top total impact potential (thousands), but high difficulty due to the need to verify each type/function is present and compatible in Velox as well as Scala plumbing. Some require cross-layer and possible upstream C++ work; thus, P2.

No Failed+Fallback (🟠) records were detected: as expected, no expression both falls back and fails.


Summary Table: Recommendation Overview

| Priority | Area | Estimated Green Count | Fix Scope | Rationale |
| --- | --- | --- | --- | --- |
| P0 | Exception Wrapping (WRONG_EXCEPTION) | 41 | Single Java file | Top fail count, lowest difficulty |
| P1 | ANSI Cast Engine in Velox | 25 | Velox C++ changes + Scala glue | High-impact, C++/multi-file required |
| P2 | Spark Type Support (Fallback) | 100s-1000s | Scala, validators, possibly Velox C++ | Massive impact, but complex plumbing |

Appendix: Code/Root Cause Evidence

  • translateException (gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchOutIterator.java):
    Only SparkException thrown regardless of real cause; stack traces in JSON data match.

  • isAnsiSupported (velox/functions/sparksql/specialforms/SparkCastExpr.cpp):

    bool isAnsiSupported(TypePtr fromType, TypePtr toType) {
        // currently ONLY String → {Boolean, Date, Integral}
        ...
    }
    
  • getTypeNode (gluten-substrait/src/main/scala/org/apache/gluten/expression/ConverterUtils.scala):

    def getTypeNode(dt: DataType, ...): ... = {
        dt match {
            case ByteType | ShortType | IntegerType | LongType => ...
            // (many types omitted; anything not listed is excluded)
            case _ => throw GlutenNotSupportException(s"Type $dt not supported.")
        }
    }
    
  • Validators.scala:
    Fallback gates such as fallbackByHint, fallbackComplexExpressions:
    Any complex cast or function with an unsupported type is immediately dropped to legacy Spark engine.


End of Key Findings & Recommendations.


Generated by gpt-4.1. AI analysis may not be fully accurate — please verify before acting on recommendations.

@baibaichen baibaichen changed the title [MINOR] ANSI workflow trigger (empty) [GLUTEN-10134][VL] Fix ANSI workflow: AI token limit, analyze-only artifact lookup, and --run support Apr 27, 2026