fix: parse dates with a numeric offset and redundant tz abbreviation (#1227) by Sanjays2402 · Pull Request #1348 · scrapinghub/dateparser

Sanjays2402 · 2026-07-03T13:03:53Z

Summary

Fixes #1227. dateparser.parse returned None for RFC 2822 email dates that
carry both a numeric UTC offset and a redundant, equivalent timezone
abbreviation in parentheses:

>>> import dateparser
>>> dateparser.parse("Thu 30 May 2024 10:13:10 -0500 (CDT)")  # before: None
datetime.datetime(2024, 5, 30, 10, 13, 10, tzinfo=<StaticTzInfo '-05:00'>)

Python's own email.utils.parsedate_to_datetime parses this string fine, so
dateparser silently failing on a very common email-header form is surprising.
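
For reference, the standard-library behaviour mentioned above can be checked with a short sketch. It uses only `email.utils` and the comma-separated form of the reported date; the variable names are illustrative.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

header = "Thu, 30 May 2024 10:13:10 -0500 (CDT)"
parsed = parsedate_to_datetime(header)

# The parenthesised abbreviation is treated as a comment;
# the numeric offset determines the resulting tzinfo.
print(parsed.isoformat())
print(parsed.utcoffset() == timedelta(hours=-5))
```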

Root cause

dateparser.utils.strip_braces turns (CDT) into a bare CDT token, so the
string reaching the timezone step contains two timezone tokens:
-0500 and CDT. pop_tz_offset_from_string removed only the first token
it matched (CDT), leaving the numeric -0500 stranded in the string. The
absolute parser then choked on the leftover (ValueError: Unable to parse: 0500) and the whole parse returned None.
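
The failure mode can be reproduced with a toy model. The sketch below is not dateparser's code: `strip_braces_sketch`, `pop_first_tz_sketch`, and the two-entry `TZ_PATTERNS` table are hypothetical stand-ins that only mimic the "remove the first matching token" behaviour described above.

```python
import re

# Hypothetical, heavily simplified pattern table (assumption, not dateparser's data).
TZ_PATTERNS = [
    ("CDT", re.compile(r"(?:^|\s)CDT(?=\s|$)")),
    ("-05:00", re.compile(r"(?:^|\s)-05:?00$")),
]


def strip_braces_sketch(text):
    # Approximates the effect described for strip_braces: brackets are dropped.
    return re.sub(r"[()\[\]{}<>]", "", text)


def pop_first_tz_sketch(text):
    # Removes only the first timezone token that matches, like the old behaviour.
    for name, pattern in TZ_PATTERNS:
        if pattern.search(text):
            return pattern.sub("", text, count=1), name
    return text, None


stripped = strip_braces_sketch("Thu 30 May 2024 10:13:10 -0500 (CDT)")
remainder, matched = pop_first_tz_sketch(stripped)
print(repr(stripped))   # two timezone tokens reach the timezone step
print(repr(remainder))  # the numeric offset is still in the string
```

In this model the remainder still ends in `-0500`, which is the leftover that the date parser cannot interpret.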

The equivalent GMT-prefixed form already worked:

>>> dateparser.parse("Fri Sep 23 2016 10:34:51 GMT+0800 (CST)")
datetime.datetime(2016, 9, 23, 10, 34, 51, tzinfo=<StaticTzInfo '+08:00'>)

...but only by accident: the offset regex for the GMT+0800 form is
(?:UTC|GMT)\+08:?00.*$, whose trailing .*$ greedily swallows the
CST abbreviation, so both tokens are consumed by one match. A bare
numeric offset like -0500 has no such spanning regex, so its redundant
abbreviation is orphaned.
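
The difference between the two forms can be seen with plain `re`. The first pattern is the one quoted above; the second, `bare_offset`, is an assumed end-anchored pattern used only to illustrate why a trailing abbreviation blocks the match.

```python
import re

# Pattern quoted in the description: the trailing .*$ consumes everything after the offset.
gmt_offset = re.compile(r"(?:UTC|GMT)\+08:?00.*$")
gmt_remainder = gmt_offset.sub("", "Fri Sep 23 2016 10:34:51 GMT+0800 CST")
print(repr(gmt_remainder))  # both GMT+0800 and CST are gone

# Hypothetical end-anchored pattern for a bare numeric offset (assumption).
bare_offset = re.compile(r"-05:?00$")
bare_match = bare_offset.search("Thu 30 May 2024 10:13:10 -0500 CDT")
print(bare_match)  # no match while CDT trails the offset
```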

Fix

After removing the first timezone token, strip a second, adjacent token too
when it denotes the same UTC offset — the parenthesised abbreviation is
purely informational and the numeric offset is authoritative (this matches
email.utils). A conflicting second timezone is deliberately left in place,
so existing behavior is preserved for contradictory input. The remainder is
right-stripped before the follow-up search because the numeric-offset regexes
are anchored at the end of the string.
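
A minimal sketch of the rule described above, using a hypothetical three-entry table rather than dateparser's real timezone data. `pop_one` and `pop_tz_sketch` are illustrative names; the actual change lives in `pop_tz_offset_from_string` and may be structured differently.

```python
import re
from datetime import timedelta

# Hypothetical mini table of (pattern, offset) pairs; not dateparser's data.
TZ_TABLE = [
    (re.compile(r"(?:^|\s)CDT(?=\s|$)"), timedelta(hours=-5)),
    (re.compile(r"(?:^|\s)CST(?=\s|$)"), timedelta(hours=-6)),
    (re.compile(r"(?:^|\s)-05:?00$"), timedelta(hours=-5)),
]


def pop_one(text):
    # Remove the first matching timezone token and return its offset.
    for pattern, offset in TZ_TABLE:
        if pattern.search(text):
            return pattern.sub("", text, count=1), offset
    return text, None


def pop_tz_sketch(text):
    remainder, offset = pop_one(text)
    if offset is None:
        return text, None
    # Right-strip first so an end-anchored numeric pattern can still match.
    candidate, second = pop_one(remainder.rstrip())
    # Drop the second token only when it denotes the same UTC offset.
    if second == offset:
        remainder = candidate
    return remainder, offset


print(pop_tz_sketch("30 May 2024 10:13:10 -0500 CDT"))
print(pop_tz_sketch("30 May 2024 10:13:10 CDT -0500"))
print(pop_tz_sketch("30 May 2024 10:13:10 -0500 CST"))  # conflicting: -0500 stays
```

In the conflicting case the sketch leaves `-0500` in the remainder, mirroring the decision to keep contradictory input unparsed rather than guess which token is right.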

The change is behavior-preserving for every existing case: I verified the new
pop_tz_offset_from_string returns byte-identical (string, offset) results
to the old implementation across all 40 inputs in the existing
test_extracting_valid_offset suite plus assorted trailing-whitespace edge
cases.

Before / after

| input | before | after |
| --- | --- | --- |
| `Thu, 30 May 2024 10:13:10 -0500 (CDT)` | `None` | `2024-05-30 10:13:10-05:00` |
| `30 May 2024 10:13:10 -0500 CDT` | `None` | `2024-05-30 10:13:10-05:00` |
| `30 May 2024 10:13:10 CDT -0500` | `None` | `2024-05-30 10:13:10-05:00` |
| `Mon, 15 Jan 2024 09:30:00 +0000 (UTC)` | `None` | `2024-01-15 09:30:00+00:00` |
| `Fri Sep 23 2016 10:34:51 GMT+0800 (CST)` | `2016-09-23 10:34:51+08:00` | unchanged |
| `30 May 2024 10:13:10 -0500 CST` (conflicting) | `None` | `None` (unchanged) |

Tests

Added regression coverage at two levels:

- tests/test_timezone_parser.py::TestTZPopping::test_timezone_deleted_from_string: both token orderings (-0500 CDT and CDT -0500) must leave the string clean.
- tests/test_date_parser.py::TestDateParser::test_parsing_with_utc_offsets: full-parse cases converted to UTC (the reported string plus a +0000 (UTC) variant).

Each new case fails on master and passes with the fix (verified by stashing
only the source change):

# without the source fix:
5 failed, 14 passed
FAILED ...test_timezone_deleted_from_string_8_...0500_CDT
FAILED ...test_timezone_deleted_from_string_9_...CDT_0500
FAILED ...test_parsing_with_utc_offsets_6_...0500_CDT_
FAILED ...test_parsing_with_utc_offsets_7_...0500_CDT
FAILED ...test_parsing_with_utc_offsets_8_...0000_UTC_

# with the fix:
19 passed

Full suite green with the fix: 24205 passed, 1 skipped, 1 xfailed
(baseline was 24200 passed + 5 new cases). ruff check and ruff format --check are clean on all changed files.


Sanjays2402 wants to merge 1 commit into scrapinghub:master from Sanjays2402:fix/redundant-tz-abbreviation
