Add dora_monitor: Slack alerting tool for ethrex devnet #1

Open
edg-l wants to merge 6 commits into main from dora-monitor

edg-l commented May 20, 2026

Adds dora_monitor/, a Python 3.10+ tool that polls a Dora explorer API and posts Slack alerts when the tracked client (default: ethrex) has issues.

Summary

  • Detects missed slot proposals, orphaned blocks, non-canonical heads (forks), sync lag past a configurable threshold, beacon status != online, and EL version drift (deploy/rollback detection).
  • Posts a periodic heartbeat digest (default every 6h) with canonical head, all-client status counts, and per-client detail.
  • State-change-only alerting with dedup state persisted to JSON; sends a recovery alert when a condition clears.
  • EL version detection scrapes the /clients/execution HTML page because Dora's /v1/clients/execution JSON endpoint reflects devp2p-crawler connectivity (connected/disconnected), not the UI's Ready/Synchronizing/Offline status. This is documented in the README.
  • Offline/fork/sync-lag detection uses /api/v1/network/client_head_forks (CL view); an EL-only crash is detected indirectly via the paired beacon's head_slot stalling.
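
The state-change-only alerting in the summary could be sketched roughly as below. This is an illustration, not the code in this PR: the function names (`load_state`, `save_state`, `diff_alerts`), the state-file layout (a JSON list of active alert keys), and the `check:client` key format are all assumptions.

```python
import json
from pathlib import Path


def load_state(path: Path) -> set[str]:
    """Return the alert keys that were active on the previous tick."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_state(path: Path, active: set[str]) -> None:
    """Persist active alert keys so a restart does not re-fire them."""
    path.write_text(json.dumps(sorted(active)))


def diff_alerts(previous: set[str], current: set[str]) -> tuple[list[str], list[str]]:
    """Split one tick's conditions into newly fired and newly recovered keys."""
    fired = sorted(current - previous)      # condition appeared this tick
    recovered = sorted(previous - current)  # condition cleared this tick
    return fired, recovered
```

On each poll, a caller would compute the set of currently failing conditions, post one alert per `fired` key and one recovery message per `recovered` key, then save the new set. A condition that stays active across ticks appears in neither list, which is what suppresses duplicates.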

Test plan

  • Copy config.example.yaml, fill in dora_url and slack_webhook_url, run make dry-run and verify alerts print to stdout without hitting Slack.
  • Run make dry-run-once against a live Dora instance and check parsed client data looks correct.
  • Run with a real webhook and confirm a heartbeat digest posts to Slack after --force-heartbeat.
  • Set a low sync_lag_slots threshold, confirm a sync-lag alert fires and a recovery alert fires when the node catches up.
  • Run make run for a full poll cycle; verify state JSON is written and dedup prevents duplicate alerts on subsequent runs.
  • Run --reset-state and confirm alerts re-fire on next tick.

edg-l marked this pull request as ready for review May 20, 2026 09:08
edg-l added 5 commits May 20, 2026 11:13
- guard the slot-set trim against last_known_head=0 (previously the
  cutoff could go negative and silently never trim)
- pick canonical fork by client majority instead of highest head_slot
  (a minority fork can briefly be ahead during a split)
- offline alert only on status=offline; synchronizing/optimistic are
  normal transient states and were over-paging
- split Slack messages on line boundaries when they exceed 3800 chars
  instead of letting Slack silently truncate
- distinguish Slack 429 in the error log
- cap /clients/execution HTML read at 512KB to bound regex work
- clearer error on unknown YAML keys (top-level and under checks:)
- minor: docstring noting heartbeat snapshots aren't atomic, simpler
  dry-run prefix closure, cleaner status check in DoraClient._get
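
A minimal sketch of the line-boundary splitting mentioned in the notes above, assuming the 3800-character limit applies to each posted message. The function name `split_message` and the hard-slice fallback for a single over-long line are assumptions, not taken from the PR.

```python
def split_message(text: str, limit: int = 3800) -> list[str]:
    """Split text into chunks of at most `limit` chars, breaking between lines."""
    chunks: list[str] = []
    buf: list[str] = []  # lines of the chunk being built
    size = 0             # length of "\n".join(buf)
    for line in text.split("\n"):
        # A single line longer than the limit has no line boundary to
        # break on, so flush the buffer and hard-slice the line.
        while len(line) > limit:
            if buf:
                chunks.append("\n".join(buf))
                buf, size = [], 0
            chunks.append(line[:limit])
            line = line[limit:]
        added = len(line) + (1 if buf else 0)  # +1 for the joining newline
        if buf and size + added > limit:
            chunks.append("\n".join(buf))
            buf, size = [line], len(line)
        else:
            buf.append(line)
            size += added
    if buf:
        chunks.append("\n".join(buf))
    # Drop whitespace-only chunks rather than posting empty messages.
    return [c for c in chunks if c.strip()]
```

Each returned chunk would be sent as its own Slack post, in order.
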
- post heartbeats via Block Kit (header / section / divider / context)
  instead of one mrkdwn blob; action alerts stay as plain text posts
- new send_blocks() on SlackNotifier with text fallback for notifications
- collapse online + canonical + distance=0 clients into one bucket;
  surface outliers (offline, synchronizing, non-canonical, lagging)
  above the healthy bucket with status emoji per row
- status emojis: green/yellow/orange/red circles for online/sync/opt/off
- dry-run patches both send and send_blocks; --debug dumps blocks JSON
  so it can be previewed in Slack's Block Kit Builder
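
For reviewers unfamiliar with Block Kit, a heartbeat payload along these lines could be built as follows. The block types (`header`, `section`, `divider`, `context`) are standard Block Kit, but the function name, the input shapes, the wording, and the exact emoji shortcodes are assumptions for illustration and may differ from what this PR emits.

```python
import json

# Assumed mapping from client status to Slack emoji shortcodes.
STATUS_EMOJI = {
    "online": ":large_green_circle:",
    "synchronizing": ":large_yellow_circle:",
    "optimistic": ":large_orange_circle:",
    "offline": ":red_circle:",
}


def build_heartbeat_blocks(head_slot: int, outliers: list[dict], healthy: list[str]) -> list[dict]:
    """Build Block Kit blocks with outliers listed above the healthy bucket."""
    blocks: list[dict] = [
        {"type": "header", "text": {"type": "plain_text", "text": "Devnet heartbeat"}},
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*Canonical head:* slot {head_slot}"}},
    ]
    if outliers:
        rows = [
            f"{STATUS_EMOJI.get(c['status'], ':white_circle:')} `{c['name']}` "
            f"{c['status']}, distance {c['distance']}"
            for c in outliers
        ]
        blocks.append({"type": "section", "text": {"type": "mrkdwn", "text": "\n".join(rows)}})
        blocks.append({"type": "divider"})
    # Healthy clients are collapsed into one summary line.
    summary = f"{len(healthy)} clients online, canonical, distance 0"
    blocks.append({"type": "context", "elements": [{"type": "mrkdwn", "text": summary}]})
    return blocks


if __name__ == "__main__":
    demo = build_heartbeat_blocks(
        1234,
        [{"name": "ethrex-1", "status": "offline", "distance": 12}],
        ["lighthouse-1", "prysm-1"],
    )
    # "text" is the fallback shown in notifications; "blocks" is the rich layout.
    print(json.dumps({"text": "Devnet heartbeat", "blocks": demo}, indent=2))
```

The printed JSON can be pasted into Slack's Block Kit Builder to preview the layout.
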
Propagation timing routinely produces transient 1-2 slot leads or lags,
which the previous code surfaced as fork alerts (followed by a resolved
alert one tick later). A fork must now persist for fork_confirm_ticks
ticks before it alerts (default 3, or ~90s at the default 30s poll).
The tick count is persisted per client in the dedup state so it
survives restarts.
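
The confirmation logic can be pictured as a per-client counter. The sketch below assumes the ticks must be consecutive and that the counter resets as soon as the client is back on the canonical head; the PR text does not spell out either detail, and the name `update_fork_streak` is hypothetical.

```python
def update_fork_streak(
    streaks: dict[str, int],
    client: str,
    on_fork: bool,
    confirm_ticks: int = 3,
) -> bool:
    """Update one client's streak; return True only on the confirming tick."""
    if not on_fork:
        # Back on the canonical head: forget any partial streak.
        streaks.pop(client, None)
        return False
    streaks[client] = streaks.get(client, 0) + 1
    # Fire exactly once, when the streak first reaches the threshold.
    return streaks[client] == confirm_ticks
```

Because `streaks` is a plain `dict[str, int]`, it can be stored in the same JSON state file as the dedup keys, so a restart in the middle of a streak does not reset the count.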