fix: Handling of site connection issues during outage#3470

Open
rippyboii wants to merge 140 commits into python-discord:main from rippyboii:main

Conversation

@rippyboii

Summary

This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load.

  • Aggregate extension + cog load failures during startup and report them as a single alert in #mod-log.
  • Add retry + exponential backoff for cogs that depend on external sites/APIs (e.g., temporary 5xx/429/timeouts), with clear mod notifications on final failure.
  • Add unit tests to validate retry behavior, error classification, and startup reporting.

During setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either:

  • stop startup unexpectedly, or
  • fail noisily and in scattered fragments, making it hard to see what broke and why.

This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).

Changes

1) Startup failure aggregation (single #mod-log alert)

  • Added utils/startup_reporting.py to format a standardized startup failure message.
  • Updated bot.py to:
    • collect extension + cog load failures (import/setup/add_cog)
    • wait for all load tasks to complete
    • send one aggregated alert summarizing all failures
  • Reporting is defensive: it does not crash if the log channel is unavailable.
  • Startup continues for non-critical failures.
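
The aggregation flow above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `load_all`, `format_startup_failures`, and the shape of the `loaders` mapping are hypothetical names, and the message format is an assumption. It relies on `asyncio.gather(..., return_exceptions=True)`, which waits for every task and returns exceptions as results rather than raising the first one.

```python
import asyncio
from collections.abc import Awaitable, Callable


def format_startup_failures(failures: dict[str, BaseException]) -> str:
    """Build one message listing every failed extension (assumed format)."""
    lines = [f"Startup failures ({len(failures)}):"]
    for name, error in sorted(failures.items()):
        lines.append(f"- {name}: {type(error).__name__}: {error}")
    return "\n".join(lines)


async def load_all(
    loaders: dict[str, Callable[[], Awaitable[None]]],
) -> dict[str, BaseException]:
    """Run every loader concurrently and collect failures instead of raising."""
    names = list(loaders)
    results = await asyncio.gather(
        *(loaders[name]() for name in names),
        return_exceptions=True,  # wait for all loads; exceptions come back as results
    )
    # Keep only the entries whose result is an exception.
    return {
        name: result
        for name, result in zip(names, results)
        if isinstance(result, BaseException)
    }
```

A caller would then send `format_startup_failures(failures)` to the log channel once, inside its own `try`/`except`, so that an unavailable channel does not interrupt startup.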

2) Retry + backoff for external/API-dependent cogs

Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.

Covered cogs include:

  • Filtering: 3 attempts with backoff (1s, 2s, 4s); retries on HTTP 429, HTTP 5xx, TimeoutError, OSError; final failure logs + alerts #mod-alerts.
  • Reminders: retry count is configured via URLs.connect_max_retries; warnings are logged to Sentry; final failure posts to #mod-log.
  • PythonNews: retries on 408, 429, 5xx, TimeoutError, OSError; on max retries logs + alerts mod_alerts and re-raises to stop startup.
  • Superstarify cogs: added retry + notification and corresponding tests.
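
A minimal sketch of the retry pattern described in this section, under stated assumptions. `HTTPStatusError`, `is_retriable`, and `retry_with_backoff` are hypothetical names and do not come from the PR; the retriable set (408, 429, 5xx, `TimeoutError`, `OSError`) follows the PythonNews bullet, and the default of 3 attempts with a 1-second base delay follows the Filtering bullet. The actual cogs may classify errors and notify moderators differently.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")

# Assumed set of retriable non-5xx statuses, taken from the PythonNews bullet.
RETRIABLE_STATUSES = {408, 429}


class HTTPStatusError(Exception):
    """Hypothetical stand-in for a client error that carries an HTTP status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retriable(error: BaseException) -> bool:
    """Classify an error as transient (worth retrying) or permanent."""
    if isinstance(error, HTTPStatusError):
        return error.status in RETRIABLE_STATUSES or 500 <= error.status < 600
    return isinstance(error, (TimeoutError, OSError))


async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call func, retrying transient failures with a doubling delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except Exception as error:
            # Give up on permanent errors or when attempts are exhausted;
            # the caller is expected to log and alert moderators here.
            if not is_retriable(error) or attempt == max_attempts:
                raise
            await sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for testability: unit tests can substitute a fake that records the delays instead of actually waiting.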

Tests / Verification

  • Added unit tests covering:
    • retry-then-success
    • max-retries then alert + failure behavior
    • non-retriable errors
    • retry classification logic
    • aggregated startup failure reporting for faulty extensions/cogs

Suggested checks:

  1. uv run task test
  2. Run the bot and simulate a faulty extension/cog load to confirm a single aggregated #mod-log startup alert.

Moderator Alert in Discord:

[image: mod_alert]

Closes #2918

rippyboii and others added 30 commits February 25, 2026 15:50
import the report markdown template from the assignment instructions.
Alerts the moderators through a discord error message if the loading of the Reminders Cog has failed.
Adds retry logic with time buff to `Filtering.cog_load()`
Changed cog_load() function to retry connecting to api if it fails initially with an exponential delay and limited max attempts.
Rewrote requirements to adhere to the assignment specifications.
Add test cases for retrying cog loads and skeleton for new functions
a-runebou and others added 25 commits March 2, 2026 20:32
fix: remove uncalled method (Closes #36)
refactor: rename variables (Closes #38)
refactor: simplify merging of lines (Closes #39)
refactor: remove explicit context helper function (Closes #45)
refactor: remove dataclass label (Closes #47)
@rippyboii rippyboii requested a review from jb3 March 2, 2026 23:43
@rippyboii
Author

Hi @jb3, thank you so much for the suggestions. We have followed your guidance and implemented most of the suggested refactors. Could you please review the PR again when you have time?

Thank you so much

Development

Successfully merging this pull request may close these issues.

Handling of site connection issues during outage.

6 participants