Skip to content

fix: use UTC for Arrow schema timezone in SparkToColumnar conversions#3878

Open
andygrove wants to merge 1 commit intoapache:mainfrom
andygrove:fix-spark-to-columnar-timezone
Open

fix: use UTC for Arrow schema timezone in SparkToColumnar conversions#3878
andygrove wants to merge 1 commit intoapache:mainfrom
andygrove:fix-spark-to-columnar-timezone

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Apr 1, 2026

Which issue does this PR close?

Closes #2720
Closes #2733

Rationale for this change

CometSparkToColumnarExec and CometLocalTableScanExec used conf.sessionLocalTimeZone when creating Arrow schemas for timestamp columns. However, the native side always deserializes Timestamp as Timestamp(Microsecond, Some("UTC")) in serde.rs. When the session timezone is non-UTC, this causes a timezone mismatch between the Arrow data schema and the native plan schema.

The native ScanExec.make_record_batch detects this mismatch and casts the column to match. Since Arrow timestamps with timezone are always stored as UTC microseconds internally, this cast is a no-op on the data — but it adds unnecessary overhead and is inconsistent with NativeBatchReader (the Parquet scan path), which already hardcodes UTC.

What changes are included in this PR?

  • Changed CometSparkToColumnarExec and CometLocalTableScanExec to use "UTC" instead of conf.sessionLocalTimeZone for Arrow schema timezone, matching NativeBatchReader and native serde.rs
  • Removed outdated "known issues with non-UTC timezones" warnings from config descriptions for spark.comet.convert.parquet.enabled, spark.comet.convert.json.enabled, spark.comet.convert.csv.enabled, and spark.comet.sparkToColumnar.enabled
  • Moved these configs from testing to exec category since the timezone concern is resolved

How are these changes tested?

Added two tests in CometExecSuite:

  • LocalTableScanExec with timestamps in non-UTC timezone — verifies correct results with CometLocalTableScanExec when session timezone is America/Los_Angeles
  • sort on timestamps with non-UTC timezone via LocalTableScan — verifies correct results through a sort + repartition path with non-UTC timezone

All 90 tests in CometExecSuite pass.

…d CometLocalTableScanExec

CometSparkToColumnarExec and CometLocalTableScanExec used the session
timezone for Arrow schema metadata, but the native side always
deserializes Timestamp as Timestamp(Microsecond, Some("UTC")). This
causes an unnecessary cast in ScanExec when timezones don't match.

Spark's internal timestamps are always UTC microseconds, so the
timezone in the Arrow schema is purely metadata. Using UTC matches
NativeBatchReader and the native serde.

Also remove outdated "known issues with non-UTC timezones" warnings
from config descriptions and move configs from testing to exec category.

Closes apache#2720
@andygrove andygrove marked this pull request as ready for review April 1, 2026 21:23
@andygrove andygrove requested a review from parthchandra April 2, 2026 13:49
}
}

test("LocalTableScanExec with timestamps in non-UTC timezone") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need COMET_SPARK_TO_ARROW_ENABLED to be set to true for these tests?

@@ -222,31 +222,28 @@ object CometConf extends ShimCometConf {

val COMET_CONVERT_FROM_PARQUET_ENABLED: ConfigEntry[Boolean] =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any test which exercise COMET_CONVERT_FROM_PARQUET_ENABLED, COMET_CONVERT_FROM_JSON_ENABLED, and COMET_CONVERT_FROM_CSV_ENABLED, since these will no longer be 'experimental' ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[EPIC] [DISCUSS] Comet timezone handling Spark SQL test failures due to timestamp mismatch when LocalTableScan is native

2 participants