[SPARK-54890][PYTHON] Allow users to enforce timezone match for timestamp conversion #53667
+76
−10
What changes were proposed in this pull request?
A new mode controlled by a SQLConf, "spark.sql.session.enforceTimeZoneMatch", is introduced to enforce a timezone check when converting timestamps.
Under this mode, only timezone-aware datetime() can be converted from/to TimestampType() and only naive datetime() can be converted from/to TimestampNTZType().
To make this work in UDF workers where SQLConf does not exist, a new class variable is introduced in DatetimeType as the fallback config. We set this class variable when we instantiate a worker to control the behavior.
The current implementation is a PoC. Once the direction is approved, I'll fill the gaps.
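As a rough illustration of the rule described above, the check could look something like the sketch below. This is not the code in the PR: the class DatetimeTypeSketch, its enforce_timezone_match attribute, the check_timezone_match function, and the "timestamp" / "timestamp_ntz" labels are all hypothetical names chosen for the example. Only the aware/naive test follows Python's documented definition.

```python
from datetime import datetime, timezone


class DatetimeTypeSketch:
    # Hypothetical stand-in for the class-level fallback flag; a worker
    # would set it at start-up because SQLConf is not available there.
    enforce_timezone_match = False


def is_aware(dt: datetime) -> bool:
    # Python's documented definition of an aware datetime.
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


def check_timezone_match(dt: datetime, target: str) -> None:
    """Raise ValueError if dt's awareness does not match the target type.

    target is "timestamp" (stands for TimestampType) or
    "timestamp_ntz" (stands for TimestampNTZType); both labels are
    assumptions made for this sketch.
    """
    if not DatetimeTypeSketch.enforce_timezone_match:
        return  # legacy behavior: no check
    if target == "timestamp" and not is_aware(dt):
        raise ValueError("naive datetime cannot be converted to TimestampType")
    if target == "timestamp_ntz" and is_aware(dt):
        raise ValueError("aware datetime cannot be converted to TimestampNTZType")


# Usage: turn the mode on, then validate values before conversion.
DatetimeTypeSketch.enforce_timezone_match = True
check_timezone_match(datetime(2024, 1, 1, tzinfo=timezone.utc), "timestamp")
check_timezone_match(datetime(2024, 1, 1), "timestamp_ntz")
```

With the flag left at False, the function returns without checking, which mirrors the backward-compatible default described in this PR.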
TODO:
Why are the changes needed?
We have too many timezone-related issues now. It's not even possible to define how timestamps should work in Spark. Python interprets naive timestamps in the local machine timezone, which makes UDF workers very unpredictable. Spark also has a session-local timezone config, which makes the situation even more complicated.
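To make the point about naive timestamps concrete, here is a small standalone sketch (plain Python, no Spark) showing that the same naive datetime maps to different epoch values depending on the process timezone. The helper naive_epoch_under_tz is a hypothetical name for this example, and it relies on time.tzset(), which is only available on Unix-like systems.

```python
import os
import time
from datetime import datetime


def naive_epoch_under_tz(naive: datetime, tz: str) -> float:
    """Return naive.timestamp() evaluated with the process TZ set to tz."""
    previous = os.environ.get("TZ")
    os.environ["TZ"] = tz
    time.tzset()  # Unix-only: make the C library re-read TZ
    try:
        return naive.timestamp()
    finally:
        # Restore the original timezone so the demo has no side effects.
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


naive = datetime(2024, 1, 1, 12, 0, 0)
utc_epoch = naive_epoch_under_tz(naive, "UTC0")  # POSIX TZ string: UTC
pst_epoch = naive_epoch_under_tz(naive, "PST8")  # POSIX TZ string: UTC-8, no DST
print(pst_epoch - utc_epoch)
```

The two results differ by eight hours' worth of seconds, so a worker running in a different machine timezone would silently produce a different instant from the same naive value.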
The only way to make it explainable and consistent is to never mix timezone-aware and timezone-naive timestamps. If the user just wants a timestamp without a timezone, they need to use TimestampNTZType(), period.
Does this PR introduce any user-facing change?
This PR is backward compatible. It introduces a new config to change the behavior.
How was this patch tested?
For now, tested locally that an error is raised. Tests should be written in the future.
Was this patch authored or co-authored using generative AI tooling?
No