fix: parse dates with a numeric offset and redundant tz abbreviation (#1227) #1348

Open
Sanjays2402 wants to merge 1 commit into scrapinghub:master from Sanjays2402:fix/redundant-tz-abbreviation

@Sanjays2402


Summary

Fixes #1227. dateparser.parse returned None for RFC 2822 email dates that
carry both a numeric UTC offset and a redundant, equivalent timezone
abbreviation in parentheses:

>>> import dateparser
>>> dateparser.parse("Thu 30 May 2024 10:13:10 -0500 (CDT)")  # before: None
datetime.datetime(2024, 5, 30, 10, 13, 10, tzinfo=<StaticTzInfo '-05:00'>)

Python's own email.utils.parsedate_to_datetime parses this string fine, so
dateparser silently failing on a very common email-header form is surprising.
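
For comparison, the standard-library behavior can be checked with a few lines. This is only a sketch of the reference behavior, not part of the change; the variable name `parsed` is illustrative.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# RFC 2822 treats the parenthesised "(CDT)" as a comment, so the numeric
# offset alone determines the resulting tzinfo.
parsed = parsedate_to_datetime("Thu, 30 May 2024 10:13:10 -0500 (CDT)")

print(parsed.isoformat())  # 2024-05-30T10:13:10-05:00
print(parsed.utcoffset() == timedelta(hours=-5))  # True
```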

Root cause

dateparser.utils.strip_braces turns (CDT) into a bare CDT token, so the
string reaching the timezone step contains two timezone tokens:
-0500 and CDT. pop_tz_offset_from_string removed only the first token
it matched (CDT), leaving the numeric -0500 stranded in the string. The
absolute parser then choked on the leftover (ValueError: Unable to parse: 0500) and the whole parse returned None.

The equivalent GMT-prefixed form already worked:

>>> dateparser.parse("Fri Sep 23 2016 10:34:51 GMT+0800 (CST)")
datetime.datetime(2016, 9, 23, 10, 34, 51, tzinfo=<StaticTzInfo '+08:00'>)

...but only by accident: the offset regex for the GMT+0800 form is
(?:UTC|GMT)\+08:?00.*$, whose trailing .*$ greedily swallows the
CST abbreviation, so both tokens are consumed by one match. A bare
numeric offset like -0500 has no such spanning regex, so its redundant
abbreviation is orphaned.
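
The difference between the two forms can be reproduced with the `re` module alone. The GMT pattern below is the one quoted above; the bare-offset pattern `bare_offset` is a simplified stand-in assumed for illustration and is not copied from dateparser's timezone tables.

```python
import re

# Pattern quoted in this PR for the GMT+0800 form: the trailing ".*$"
# consumes everything after the offset, including the abbreviation.
gmt_offset = re.compile(r"(?:UTC|GMT)\+08:?00.*$")
gmt_text = "Fri Sep 23 2016 10:34:51 GMT+0800 CST"  # "(CST)" already unbraced
remaining_gmt = gmt_offset.sub("", gmt_text)

# Hypothetical end-anchored pattern for a bare numeric offset (an assumption).
bare_offset = re.compile(r"\s-05:?00$")
bare_text = "Thu 30 May 2024 10:13:10 -0500 CDT"

# While "CDT" trails the offset, the end-anchored pattern cannot match.
match_with_abbreviation = bare_offset.search(bare_text)

# Once the abbreviation is removed and the remainder right-stripped, it can.
match_after_strip = bare_offset.search(bare_text[: -len("CDT")].rstrip())
```

Here `remaining_gmt` no longer contains `CST`, `match_with_abbreviation` is `None`, and `match_after_strip` is a match object, which is why the remainder has to be right-stripped before the second search.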

Fix

After removing the first timezone token, strip a second, adjacent token too
when it denotes the same UTC offset — the parenthesised abbreviation is
purely informational and the numeric offset is authoritative (this matches
email.utils). A conflicting second timezone is deliberately left in place,
so existing behavior is preserved for contradictory input. The remainder is
right-stripped before the follow-up search because the numeric-offset regexes
are anchored at the end of the string.
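
The rule described above can be sketched in a self-contained form. This is not the patched `pop_tz_offset_from_string`: the names `ABBREVIATIONS`, `_pop_trailing_tz`, and `pop_tz_sketch`, the three-entry abbreviation table, and both regexes are assumptions made for illustration, and the sketch only looks at trailing tokens.

```python
import re
from datetime import timedelta

# Toy abbreviation table (assumption; the real table is far larger).
ABBREVIATIONS = {
    "CDT": timedelta(hours=-5),
    "CST": timedelta(hours=-6),
    "UTC": timedelta(0),
}
NUMERIC_OFFSET = re.compile(r"(?:^|\s)([+-])(\d{2}):?(\d{2})$")
ABBREVIATION = re.compile(r"(?:^|\s)(%s)$" % "|".join(ABBREVIATIONS))


def _pop_trailing_tz(text):
    """Remove one trailing timezone token; return (remaining, offset or None)."""
    text = text.rstrip()  # both patterns are anchored at the end of the string
    match = NUMERIC_OFFSET.search(text)
    if match:
        sign = -1 if match.group(1) == "-" else 1
        offset = sign * timedelta(
            hours=int(match.group(2)), minutes=int(match.group(3))
        )
        return text[: match.start()], offset
    match = ABBREVIATION.search(text)
    if match:
        return text[: match.start()], ABBREVIATIONS[match.group(1)]
    return text, None


def pop_tz_sketch(text):
    """Pop one timezone token, plus an adjacent one only if it is equivalent."""
    remaining, offset = _pop_trailing_tz(text)
    if offset is None:
        return text, None
    rest, second = _pop_trailing_tz(remaining)
    if second is not None and second == offset:
        remaining = rest  # redundant token: same UTC offset, safe to drop
    return remaining, offset  # a conflicting token is left in the string
```

With this sketch, both `"30 May 2024 10:13:10 -0500 CDT"` and `"30 May 2024 10:13:10 CDT -0500"` reduce to `"30 May 2024 10:13:10"` with a -05:00 offset, while the conflicting `"... -0500 CST"` keeps `-0500` in the remainder.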

The change is behavior-preserving for every existing case: I verified the new
pop_tz_offset_from_string returns byte-identical (string, offset) results
to the old implementation across all 40 inputs in the existing
test_extracting_valid_offset suite plus assorted trailing-whitespace edge
cases.

Before / after

| input | before | after |
| --- | --- | --- |
| Thu, 30 May 2024 10:13:10 -0500 (CDT) | None | 2024-05-30 10:13:10-05:00 |
| 30 May 2024 10:13:10 -0500 CDT | None | 2024-05-30 10:13:10-05:00 |
| 30 May 2024 10:13:10 CDT -0500 | None | 2024-05-30 10:13:10-05:00 |
| Mon, 15 Jan 2024 09:30:00 +0000 (UTC) | None | 2024-01-15 09:30:00+00:00 |
| Fri Sep 23 2016 10:34:51 GMT+0800 (CST) | 2016-09-23 10:34:51+08:00 | unchanged |
| 30 May 2024 10:13:10 -0500 CST (conflicting) | None | None (unchanged) |

Tests

Added regression coverage at two levels:

  • tests/test_timezone_parser.py::TestTZPopping::test_timezone_deleted_from_string
    — both token orderings (-0500 CDT and CDT -0500) must leave the string
    clean.
  • tests/test_date_parser.py::TestDateParser::test_parsing_with_utc_offsets
    — full-parse cases converted to UTC (the reported string plus a +0000 (UTC) variant).

Each new case fails on master and passes with the fix (verified by stashing
only the source change):

# without the source fix:
5 failed, 14 passed
FAILED ...test_timezone_deleted_from_string_8_...0500_CDT
FAILED ...test_timezone_deleted_from_string_9_...CDT_0500
FAILED ...test_parsing_with_utc_offsets_6_...0500_CDT_
FAILED ...test_parsing_with_utc_offsets_7_...0500_CDT
FAILED ...test_parsing_with_utc_offsets_8_...0000_UTC_

# with the fix:
19 passed

Full suite green with the fix: 24205 passed, 1 skipped, 1 xfailed
(the baseline's 24200 passing tests plus the 5 new cases). ruff check and
ruff format --check are clean on all changed files.

…crapinghub#1227)

RFC 2822 email dates often carry both a numeric UTC offset and an
equivalent timezone abbreviation in parentheses, e.g.
``Thu, 30 May 2024 10:13:10 -0500 (CDT)``. ``dateparser.parse`` returned
None for these, while Python's own ``email.utils.parsedate_to_datetime``
handles them.

Root cause: ``strip_braces`` turns ``(CDT)`` into a bare ``CDT`` token, so
the string now contains two timezone tokens (``-0500`` and ``CDT``).
``pop_tz_offset_from_string`` removed only the first token it matched
(``CDT``), leaving the numeric ``-0500`` stranded in the string; the
absolute parser then failed on the leftover ``0500`` and the whole parse
returned None. The equivalent GMT-prefixed form (``GMT+0800 (CST)``)
already worked only because its offset regex greedily spans the trailing
abbreviation.

Fix: after removing the first timezone token, strip a second, adjacent
token too when it denotes the same UTC offset (the parenthesised
abbreviation is informational; the numeric offset is authoritative). A
conflicting second timezone is left in place, preserving current behavior.
The remainder is right-stripped before the follow-up search because the
numeric-offset regexes are anchored at the end of the string.

Adds regression tests for both token orderings at the pop-timezone level
and at the full-parse level; each fails without the fix and passes with it.
Development

Successfully merging this pull request may close these issues.

Can't parse "Thu 30 May 2024 10:13:10 -0500 (CDT)"
