feat: make parse_url compatible #4413
Spark 3.4 error messages don't include the [INVALID_URL] error class prefix that Spark 4.x uses. Use the URL value itself as the pattern since it appears in both versions' error messages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks. This is a good improvement, but I think there are still a few remaining gaps before we can claim full compatibility. Ran the PR locally against Spark on
Quick notes on each:
Happy to share the four SQL fixture files if useful.
@andygrove thank you for testing this and finding the edge cases. Added fixes and coverage for these cases as well.
andygrove left a comment
LGTM. Thanks @parthchandra.
Which issue does this PR close?
Closes #4150.
Rationale for this change
PR #4350 wired `parse_url` through Comet's serde layer to the upstream `datafusion-spark` UDFs but marked it `Incompatible` due to divergences between the WHATWG URL Standard (`url` crate) and Spark's `java.net.URI` (RFC 3986). This PR replaces the upstream implementation with a local RFC 3986 regex-based parser that matches Spark's behavior, promoting `parse_url` from `Incompatible` to `Compatible`.
What changes are included in this PR?
New:
`native/spark-expr/src/url_funcs/parse_url.rs`
- `CometParseUrl` and `CometTryParseUrl` UDFs using RFC 3986 Appendix B regex decomposition instead of the WHATWG `url` crate
- `""` (was NULL)
- (`http://host?foo=bar`) returns `?foo=bar` (was `/?foo=bar`)
- (`http://host/`) returns `"/"` (was `""`)
- Query values with `=` (e.g., `?a=b=c`) now correctly return `b=c`
- 3-arg non-QUERY (`parse_url(url, 'HOST', 'key')`) returns NULL
- Empty port (`http://host:/path`) strips trailing colon
- Empty authority (`http:///path`) returns NULL instead of `""`
- `[INVALID_URL]` error for malformed URLs (spaces, control chars, missing scheme with `://`)
Modified:
`native/core/src/execution/jni_api.rs`
- `CometParseUrl`/`CometTryParseUrl` from `spark-expr` instead of upstream `SparkParseUrl`/`SparkTryParseUrl`
Modified:
`spark/.../serde/url.scala`
- `Incompatible` override and `incompatibleReason` (now `Compatible` by default)
Modified: SQL test files
- `parse_url.sql`: expanded from 4 fallback queries to 30+ native-execution queries covering all components, edge cases, and divergence fixes
- `parse_url_ansi.sql`: enables previously-ignored ANSI error tests with `expect_error(INVALID_URL)`
How are these changes tested?
- `parse_url.rs` covering all URL components, divergence fixes, error handling, null propagation, and edge cases (query values with `=`, 3-arg non-QUERY, empty port, empty authority, IPv6, malformed URLs)
- (`parse_url.sql`, `parse_url_ansi.sql`) verified against Spark on both `spark-4.0` and `spark-4.1` profiles
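For readers unfamiliar with the RFC 3986 Appendix B decomposition mentioned in the rationale, here is a minimal sketch of the idea. It is not the code in `parse_url.rs`: to stay dependency-free it mirrors the Appendix B regex `^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?` with plain `std` string operations instead of a regex, and the names `UriParts`, `split_uri`, and `query_value` are hypothetical. It also assumes that query key lookup cuts each `key=value` pair at the first `=`.

```rust
/// Components of a URI reference, split the way the RFC 3986 Appendix B
/// regex splits them. `None` means the component is absent; `Some("")`
/// means it is present but empty. (Hypothetical type, for illustration.)
#[derive(Debug, PartialEq)]
pub struct UriParts<'a> {
    pub scheme: Option<&'a str>,
    pub authority: Option<&'a str>,
    pub path: &'a str,
    pub query: Option<&'a str>,
    pub fragment: Option<&'a str>,
}

/// Split `input` into its five generic components without validating them.
pub fn split_uri(input: &str) -> UriParts<'_> {
    let mut rest = input;

    // scheme: a non-empty run of characters other than ':', '/', '?', '#',
    // immediately followed by ':'.
    let mut scheme = None;
    if let Some(end) = rest.find(|c: char| matches!(c, ':' | '/' | '?' | '#')) {
        if end > 0 && rest[end..].starts_with(':') {
            scheme = Some(&rest[..end]);
            rest = &rest[end + 1..];
        }
    }

    // authority: introduced by "//", runs until the next '/', '?' or '#'.
    let mut authority = None;
    if let Some(after) = rest.strip_prefix("//") {
        let end = after
            .find(|c: char| matches!(c, '/' | '?' | '#'))
            .unwrap_or(after.len());
        authority = Some(&after[..end]);
        rest = &after[end..];
    }

    // fragment: everything after the first '#'. Split it off first so that
    // a '?' inside the fragment is not mistaken for the query delimiter.
    let (before_fragment, fragment) = match rest.find('#') {
        Some(i) => (&rest[..i], Some(&rest[i + 1..])),
        None => (rest, None),
    };

    // path: up to the first '?'; query: everything after it.
    let (path, query) = match before_fragment.find('?') {
        Some(i) => (&before_fragment[..i], Some(&before_fragment[i + 1..])),
        None => (before_fragment, None),
    };

    UriParts { scheme, authority, path, query, fragment }
}

/// Look up `key` in a query string. Each pair is cut at its first '=',
/// so any later '=' characters stay in the value.
pub fn query_value<'a>(query: &'a str, key: &str) -> Option<&'a str> {
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=')?;
        if k == key { Some(v) } else { None }
    })
}
```

With this split, `http://host?foo=bar` yields an empty path and the query `foo=bar`, `http://host/` yields the path `/`, and `http:///path` yields a present-but-empty authority. Mapping those raw pieces to the results listed above (for example, returning NULL for an empty authority or stripping the trailing colon of an empty port) would be a separate step on top of the split, as would rejecting malformed input.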