Skip to content

Feature/alert muting#512

Open
HannahVernon wants to merge 20 commits intoerikdarlingdata:devfrom
HannahVernon:feature/alert-muting
Open

Feature/alert muting#512
HannahVernon wants to merge 20 commits intoerikdarlingdata:devfrom
HannahVernon:feature/alert-muting

Conversation

@HannahVernon
Copy link
Contributor

What does this PR do?

Adds configurable muting of Alerts. For full description, plan, and concise change-log see #493

Which component(s) does this affect?

  • [✅] Full Dashboard
  • [✅] Lite Dashboard
  • Lite Tests
  • SQL collection scripts
  • CLI Installer
  • GUI Installer
  • [✅] Documentation

How was this tested?

Manually tested both Dashboard and Lite per screenshots shown in #493

Checklist

  • [✅] I have read the contributing guide
  • [✅] My code builds with zero warnings (dotnet build -c Debug)
  • [✅] I have tested my changes against at least one SQL Server version
  • [✅] I have not introduced any hardcoded credentials or server names

HannahVernon and others added 18 commits March 9, 2026 12:36
Implement pattern-based and instance-based alert mute rules that suppress
notifications while continuing to log alerts for auditability.

Core:
- MuteRule model with AND-logic matching across server, metric, database,
  query text, wait type, and job name fields
- MuteRuleService with JSON persistence (Dashboard) and DuckDB persistence (Lite)
- Optional expiration (1h/24h/7d/permanent) with automatic skip on expired rules

Alert Pipeline:
- Mute checks wired into all 7 alert metrics in both editions
- Muted alerts logged with muted=true flag, notifications suppressed
- DuckDB schema v20: muted column on config_alert_log, config_mute_rules table

UI:
- MuteRuleDialog for creating/editing rules with match criteria and expiration
- ManageMuteRulesWindow with DataGrid listing, enable/disable/delete/purge
- Settings button in both editions to access mute rule management
- Right-click context menus in alert history: Mute This Alert / Mute Similar
- Muted alerts shown with reduced opacity and italic text in alert history

MCP:
- get_mute_rules tool added to both editions' MCP servers
- MuteRuleService registered in DI for MCP tool resolution

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Right-clicking an alert now shows 'View Details...' at the top of the
context menu, opening a read-only popup with all alert fields (time,
server, metric, values, notification type, status) and a muted indicator
banner. Implemented in both Dashboard and Lite editions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Store metric-specific detail text (query text, job names, wait types,
space breakdowns, etc.) alongside each alert log entry. The Alert Detail
Window now displays a Details section with rich contextual information
that varies by metric type:

- Long-Running Query: session ID, database, query text, CPU/reads/writes
- Long-Running Job: job name, duration, average/P95, % of average
- Blocking: blocked session count, longest wait duration
- Deadlock: new deadlock count
- Poison Wait: per-wait type breakdown (avg ms, delta, tasks)
- TempDB Space: usage breakdown by category

Both Dashboard and Lite editions updated with matching changes.
DuckDB schema bumped to v21 for the new detail_text column.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify the difference between 'Mute This Alert' (server + metric)
and 'Mute Similar Alerts' (metric only, all servers). Add documentation
for the context-sensitive View Details popup window.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Double-clicking a row in the Alert History DataGrid now opens the
Alert Detail Window, matching the existing right-click context menu
behavior. Both Dashboard and Lite editions updated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Enabled checkbox in the Manage Mute Rules DataGrid was non-clickable
because the grid had IsReadOnly=True. Changed to IsReadOnly=False on the
grid with IsReadOnly=True on individual text columns. Added CellEditEnding
handler to persist the toggle via the MuteRuleService. Both editions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CellEditEnding fires before the binding commits, so rule.Enabled still
held the old value when passed to SetRuleEnabled. Fixed by reading the
current state from the service and toggling the inverse instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaced DataGridCheckBoxColumn with a DataGridTemplateColumn containing
a regular CheckBox. The WPF DataGrid requires the first click to enter
edit mode on DataGridCheckBoxColumn, making it feel unresponsive. A
templated CheckBox with a Click handler responds immediately and persists
the change directly via the service. Reverted grid to IsReadOnly=True
since the checkbox is now independent of the edit pipeline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move BuildBlockingContextAsync and BuildDeadlockContextAsync calls before
RecordAlert in Dashboard so the rich context (session IDs, databases,
queries, wait times, lock modes) is stored as detail text instead of just
summary counts. Falls back to simple text if the context builder returns
null. Lite already had this behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MuteRuleDialog now shows only relevant pattern fields based on the
selected metric type:
- Blocking: Database, Query Text
- Deadlocks: Database
- High CPU / TempDB Space: no pattern fields
- Poison Wait: Wait Type
- Long-Running Query: Database, Query Text, Wait Type
- Long-Running Job: Job Name
- (any): all fields shown

Hidden fields are cleared on save to prevent stale matches.

ManageMuteRulesWindow Toggle and Delete buttons now return focus to the
rules grid and preserve selection index, enabling repeated operations
without re-clicking the list.

Both Dashboard and Lite editions updated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added missing metrics to the alert thresholds table (poison waits,
long-running queries, TempDB space, long-running agent jobs) and
documented the three poison wait types: THREADPOOL, RESOURCE_SEMAPHORE,
and RESOURCE_SEMAPHORE_QUERY_COMPILE.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: DataGrid SelectionChanged event bubbled up to the
TabControl's SelectionChanged handler, which called RefreshAlerts()
on every row click, replacing ItemsSource and clearing selection
before the UI could render it.

Fixes:
- Guard TabControl handler with OriginalSource check to ignore
  bubbled events from child controls
- Use visual tree hit-testing for double-click handlers instead of
  relying on SelectedItem (more robust in both editions)
- Defensive Owner handling for AlertDetailWindow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Dashboard: Add OriginalSource guard to ServerTabControl_SelectionChanged
   to prevent child control SelectionChanged events from triggering the
   handler (same bubbling bug pattern fixed earlier in Lite).

2. Lite: Add re-entrancy guard to ServerTab refresh timer to prevent
   overlapping RefreshAllDataAsync calls if a refresh takes longer than
   the 60-second timer interval. Matches Dashboard's _isRefreshing pattern.

3. Dashboard: Add missing DefaultRowStyle to WaitDrillDownWindow resources
   so the DataGrid's RowStyle reference resolves and rows get the context
   menu.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Owner

@erikdarlingdata erikdarlingdata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Alert Muting

Overall this is well-structured with full parity across both apps. Nice work on the context-sensitive dialog, the auditability (muted alerts still logged), and the MCP integration. A few things to address:


Bugs

1. [BLOCKER] DuckDB migration NOT NULL on ALTER TABLE
Lite/Database/DuckDbInitializer.cs:563 — The v20 migration uses BOOLEAN NOT NULL DEFAULT false, but DuckDB doesn't support NOT NULL on ALTER TABLE ADD COLUMN. There's even a comment in the v9 migration (line 424) warning about this exact issue. The catch block swallows the error so the column never gets added, and subsequent INSERTs that reference muted will fail.

Fix: Change to BOOLEAN DEFAULT false (drop the NOT NULL). The NOT NULL in Schema.CreateAlertLogTable for fresh installs is fine — this only affects the ALTER TABLE migration path.

2. "Mute This Alert" dialog title says "Edit Mute Rule"
Both Dashboard/Controls/AlertsHistoryContent.xaml.cs and Lite/Controls/AlertsHistoryTab.xaml.cs pass a pre-filled MuteRule to the MuteRuleDialog(MuteRule existingRule) constructor. Since existingRule != null, the dialog sets its title to "Edit Mute Rule" instead of "Create Mute Rule".

Fix: Either use the no-arg constructor and set fields after, add a bool isNew parameter, or use the AlertMuteContext constructor path.

3. "Mute This Alert" doesn't pre-fill context fields
Right-clicking an alert only pre-fills server name + metric name, not database/query text/wait type/job name. The AlertMuteContext constructor exists for rich pre-fill but is never called from the UI — the structured fields aren't on the display model. Users have to manually type pattern fields even though the detail info is visible in the alert.

This could be deferred to a follow-up, but worth noting the AlertMuteContext constructor is currently dead code.


Medium

4. Lite MuteRuleService skips the LockedConnection pattern
All DuckDB operations in Lite/Services/MuteRuleService.cs call _dbInitializer.CreateConnection() directly without AcquireReadLock()/AcquireWriteLock(). The established pattern in LocalDataService uses locks to coordinate with CHECKPOINT and archive operations. A concurrent CHECKPOINT during a mute rule write could cause issues.

5. Cache-first, persist-second risks silent data loss
Lite's MuteRuleService mutates the in-memory cache under lock, then does a best-effort async DuckDB write. If the write fails silently, the user's changes are lost on next app restart since LoadAsync() reloads from DuckDB.

Consider: attempt the DB write first, update the cache only on success. Or at minimum log a visible warning when persistence fails.

6. Poison wait mute only checks worst wait type
Lite/MainWindow.xaml.cs:1195 — The mute context only includes worst.WaitType. If a user mutes RESOURCE_SEMAPHORE but THREADPOOL is the worst wait, the mute check evaluates THREADPOOL and the alert fires. If the muted wait happens to be worst, other unmuted waits get suppressed too. This is a design choice but may surprise users.

7. MCP tools gap

  • get_alert_history in both apps doesn't include the new muted or detail_text columns
  • get_mute_rules is missing from the README tools table (line 411)
  • README tool counts (lines 49, 278) need incrementing

Low / Cosmetic

8. Hardcoded color #33D97706 on MutedBanner in AlertDetailWindow — verify it looks right on Light/Cool Breeze themes.

9. Dashboard AlertDetailWindow missing Icon attribute (Lite sets /EDD.ico).

10. Dashboard MuteRuleService.Save() swallows all exceptions silently — should at least Logger.Error() them.

11. No automatic purge of expired rules — they accumulate until manual purge. Consider purging on startup in LoadAsync().


What's Done Well

  • All 7 alert metrics covered with consistent mute check pattern
  • Muted alerts still logged for full auditability
  • Thread safety via lock(_lock) with defensive ToList() copies
  • All DuckDB queries properly parameterized (no SQL injection)
  • Schema migrations idempotent with IF NOT EXISTS
  • config_mute_rules correctly excluded from archivable tables
  • Context-sensitive dialog hides irrelevant fields per metric type
  • Clean settings integration entry point
  • Schema test updated correctly

Bug #1 is a blocker for existing Lite users upgrading. The rest can be follow-up items if needed.

erikdarlingdata

This comment was marked as resolved.

Bugs:
- Fix DuckDB migration v20 NOT NULL on ALTER TABLE (blocker for upgrades)
- Fix Mute This Alert dialog showing 'Edit' title instead of 'Create'
- Use AlertMuteContext constructor for proper field pre-fill path

Medium:
- Lite MuteRuleService now uses AcquireReadLock/AcquireWriteLock for
  CHECKPOINT coordination (matches LocalDataService pattern)
- Restructure to persist-first: attempt DB write before updating cache,
  log warnings on failure instead of silently swallowing
- Add poison wait mute limitation comment in both editions
- Add muted and detail_text columns to get_alert_history MCP tool
- Add get_mute_rules to README tools table, update tool counts

Low/Cosmetic:
- MutedBanner uses DynamicResource WarningColor instead of hardcoded color
- Add missing Icon attribute to Dashboard AlertDetailWindow
- Add Logger.Error() to Dashboard MuteRuleService.Save() catch block
- Auto-purge expired mute rules on startup in both editions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@HannahVernon
Copy link
Contributor Author

PR Review Feedback — Changes Made (commit 7dc5744)

All 11 items from the review have been addressed. Here's a summary organized by severity:

Bug Fixes

1. [BLOCKER] DuckDB NOT NULL on ALTER TABLE (DuckDbInitializer.cs:563)
Changed v20 migration from BOOLEAN NOT NULL DEFAULT false to BOOLEAN DEFAULT false. DuckDB does not support NOT NULL on ALTER TABLE ADD COLUMN — this would crash on upgrade. Matches the pattern established in the v9 migration (line 424) which already had a comment about this limitation.

2. Dialog title showing "Edit" instead of "Create" (both editions)
Changed MuteThisAlert_Click in both AlertsHistoryContent.xaml.cs (Dashboard) and AlertsHistoryTab.xaml.cs (Lite) to construct MuteRuleDialog with an AlertMuteContext instead of a MuteRule. This uses the context-aware constructor which sets the title to "Create Mute Rule" and pre-fills available fields.

3. Context pre-fill / dead code (both editions)
By switching to the AlertMuteContext constructor (item #2 above), the pre-fill code path is now exercised — ServerName and MetricName are populated from the alert row. Richer field extraction from DetailText (database, query text, etc.) is deferred to a follow-up as suggested.

Medium Priority

4. Lite MuteRuleService — missing locked connections
Added AcquireReadLock() / AcquireWriteLock() to all DuckDB operations in Lite's MuteRuleService.cs, matching the LockedConnection pattern used throughout LocalDataService. This prevents CHECKPOINT from reorganizing the database file while mute rule operations are in progress.

5. Persist-first pattern (Lite MuteRuleService)
Restructured all mutation methods (AddRuleAsync, RemoveRuleAsync, UpdateRuleAsync, SetRuleEnabledAsync) to attempt the DB write first, then update the in-memory cache. Failures are logged via AppLogger.Warn() so the feature still works for the current session even if DuckDB persistence fails.

6. Poison wait mute limitation comment (both editions)
Added explanatory comment in both MainWindow.xaml.cs files (Dashboard line ~1390, Lite line ~1195) documenting that the mute check uses the worst triggered wait type, and that muting a non-worst type won't suppress the alert.

7. MCP tool gaps (both editions + README)

  • Added muted and detail_text columns to get_alert_history results in both McpAlertTools.cs files
  • Added get_mute_rules to the Alerts row in the README tools table
  • Updated tool counts: Dashboard 27→28, Lite 31→32

Low Priority / Cosmetic

8. MutedBanner theme color (both editions)
Replaced hardcoded #33D97706 background with a theme-aware SolidColorBrush using {DynamicResource WarningColor} at 0.2 opacity in both AlertDetailWindow.xaml files. This ensures the muted banner renders correctly across Dark, Light, and CoolBreeze themes.

9. Dashboard AlertDetailWindow icon
Added Icon="/EDD.ico" to Dashboard/AlertDetailWindow.xaml to match the application icon shown on other windows.

10. Dashboard MuteRuleService.Save() error logging
Replaced silent catch {} with catch (Exception ex) { Helpers.Logger.Error(...) } in the Dashboard MuteRuleService.Save() method so persistence failures are logged.

11. Auto-purge expired rules on startup

  • Dashboard: PurgeExpiredRules() is called after Load() in the MuteRuleService constructor
  • Lite: PurgeExpiredRulesAsync() is called at the end of LoadAsync()

Files Changed (12 files, +112 / -65)

File Changes
Dashboard/AlertDetailWindow.xaml Icon, theme-aware MutedBanner
Dashboard/Controls/AlertsHistoryContent.xaml.cs Use AlertMuteContext constructor
Dashboard/MainWindow.xaml.cs Poison wait limitation comment
Dashboard/Mcp/McpAlertTools.cs Add muted + detail_text columns
Dashboard/Services/MuteRuleService.cs Error logging in Save(), auto-purge
Lite/Controls/AlertsHistoryTab.xaml.cs Use AlertMuteContext constructor
Lite/Database/DuckDbInitializer.cs Fix NOT NULL on v20 migration
Lite/MainWindow.xaml.cs Poison wait limitation comment
Lite/Mcp/McpAlertTools.cs Add muted + detail_text columns
Lite/Services/MuteRuleService.cs Locked connections, persist-first, logging, auto-purge
Lite/Windows/AlertDetailWindow.xaml Theme-aware MutedBanner
README.md Tool counts, get_mute_rules in tools table

Build: 0 errors, 0 new warnings (all warnings are pre-existing). All 6 unit tests pass.

Resolved conflict in Lite/Controls/ServerTab.xaml.cs: kept our
re-entrancy guard (_isRefreshing) and adopted upstream's
fullRefresh: false parameter for timer-based refreshes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants