
Mimir integration #42

Open
gsanchietti wants to merge 35 commits into main from mimir-integration

Conversation

@gsanchietti
Member

@gsanchietti gsanchietti commented Feb 20, 2026

📋 Description

This pull request adds Alertmanager integration based on Grafana Mimir, backend APIs for alert configuration and inspection, resolved-alert history persistence, automatic HostDown monitoring, and a system-level silence action for active alerts.

Backend API (/api/alerts)

  • GET /api/alerts/config — retrieve the current alerting configuration from Mimir as structured JSON or redacted YAML
  • POST /api/alerts/config — apply a new alerting configuration
  • DELETE /api/alerts/config — replace the tenant configuration with a blackhole-only config while keeping the built-in history webhook active
  • GET /api/alerts — list active alerts with optional filters (state, severity, system_key)
  • GET /api/alerts/totals — return active alert counters plus resolved-history totals
  • GET /api/alerts/trend — return resolved-alert trend data for the selected period
  • GET /api/systems/:id/alerts — list active alerts for a single system
  • POST /api/systems/:id/alerts/silences — create a silence for a single active system alert
  • GET /api/systems/:id/alerts/history — return paginated resolved-alert history for a single system
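
The optional filters on GET /api/alerts can be read as a conjunction of matches: an alert is returned only if it satisfies every filter that was supplied. The following is a minimal sketch of that idea, not the backend's actual filter code. It assumes the Alertmanager v2 alert shape (a `status.state` field and a `labels` map); the function name and parameters are hypothetical.

```python
from typing import Optional


def filter_alerts(alerts: list[dict],
                  state: Optional[str] = None,
                  severity: Optional[str] = None,
                  system_key: Optional[str] = None) -> list[dict]:
    """Keep the alerts that match every filter that was provided (hypothetical helper)."""
    result = []
    for alert in alerts:
        labels = alert.get("labels", {})
        # A filter left as None is simply not applied.
        if state is not None and alert.get("status", {}).get("state") != state:
            continue
        if severity is not None and labels.get("severity") != severity:
            continue
        if system_key is not None and labels.get("system_key") != system_key:
            continue
        result.append(alert)
    return result
```

An alert that lacks a label never matches a filter on that label, so a missing severity is treated as "no match" rather than as an error.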

Alerting configuration

  • AlertingConfig supports global settings, per-severity overrides, and per-system overrides
  • SMTP settings are injected server-side
  • The built-in history webhook is always included in the generated Alertmanager config
  • Email templates are available in English and Italian
  • Access to the alerting configuration and active-alert APIs is scoped by the authenticated user, plus the organization_id query parameter where the current handlers require it
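
To illustrate how global settings and the two kinds of overrides might combine, here is a small sketch. The precedence (per-system over per-severity over global) is an assumption made for the example and is not stated in this PR; the dictionary keys `global`, `severities`, `systems`, and `emails` are hypothetical and do not reflect the real AlertingConfig schema.

```python
def resolve_recipients(config: dict, severity: str, system_key: str) -> list[str]:
    """Return the most specific recipient list.

    Assumed precedence (not confirmed by the PR): system > severity > global.
    """
    system_override = config.get("systems", {}).get(system_key)
    if system_override and "emails" in system_override:
        return system_override["emails"]

    severity_override = config.get("severities", {}).get(severity)
    if severity_override and "emails" in severity_override:
        return severity_override["emails"]

    # Fall back to the global setting, or to no recipients at all.
    return config.get("global", {}).get("emails", [])
```

For the authoritative field names and precedence rules, see the OpenAPI description of POST /api/alerts/config.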

Collect service

  • POST /api/alert_history receives Alertmanager webhooks and stores resolved alerts in PostgreSQL
  • Bearer-token authentication is enforced through ALERTING_HISTORY_WEBHOOK_SECRET
  • POST /api/services/mimir/alertmanager/api/v2/alerts proxies requests from authenticated systems to Alertmanager, with X-Scope-OrgID derived server-side
  • When a system posts alerts through the collect proxy, labels.system_key is always overwritten with the authenticated system value
  • Additional system and organization context labels are injected when missing
  • POST /api/services/mimir/alertmanager/api/v2/silences proxies requests from authenticated systems to Alertmanager, with tenant scoping enforced by the server
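
The label rules for the collect proxy can be sketched as follows. This is an illustration only, not the collect service's code: the function name is hypothetical, and treating an empty label value as "missing" is an assumption.

```python
def inject_labels(alert: dict, system_key: str, context: dict[str, str]) -> dict:
    """Return a copy of the alert with server-controlled labels applied (hypothetical helper)."""
    labels = dict(alert.get("labels", {}))

    # system_key always comes from the authenticated system, never from the client.
    labels["system_key"] = system_key

    # Context labels are only filled in when the client did not provide them.
    # Assumption: an empty string counts as "missing".
    for name, value in context.items():
        if value and not labels.get(name):
            labels[name] = value

    return {**alert, "labels": labels}
```

Overwriting system_key unconditionally is what prevents one system from posting alerts on behalf of another, while the "only when missing" rule leaves client-supplied context untouched.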

Frontend

  • The system detail active-alerts card exposes a silence action for users with manage:systems
  • The silence flow uses a small confirmation modal with an optional comment and refreshes the active-alerts card after success

HostDown monitoring

  • The heartbeat monitor checks every 60 seconds
  • Systems move to inactive after exceeding HEARTBEAT_TIMEOUT_MINUTES
  • A HostDown alert is posted when inactivity persists beyond the timeout and one additional monitor interval
  • The alert is resolved automatically when the system becomes active again
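
The timing rules above can be sketched as a pure function. This is a reading of the description, not the monitor's actual code: the function and state names are hypothetical, and the sketch takes "one additional monitor interval" literally as 60 seconds past the timeout.

```python
from datetime import datetime, timedelta, timezone

MONITOR_INTERVAL = timedelta(seconds=60)  # the monitor checks every 60 seconds


def heartbeat_state(last_heartbeat: datetime, now: datetime, timeout_minutes: int) -> str:
    """Classify a system by how long it has been silent (hypothetical helper)."""
    timeout = timedelta(minutes=timeout_minutes)  # HEARTBEAT_TIMEOUT_MINUTES
    silent_for = now - last_heartbeat

    if silent_for <= timeout:
        return "active"
    if silent_for <= timeout + MONITOR_INTERVAL:
        # Past the timeout, but still within the grace interval: no alert yet.
        return "inactive"
    return "host_down"
```

The grace interval means a system that misses the timeout by a few seconds is marked inactive first and only produces a HostDown alert if it is still silent at a later check.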

Tooling and docs

  • services/mimir/scripts/alerting_config.py manages alerting config and alert queries through the MY API
  • services/mimir/scripts/alert.py fires, resolves, silences, and lists alerts through the collect proxy
  • OpenAPI, database schema, migrations, tests, and docs cover the new alerting surface

🧪 Validation

  • cd backend && make pre-commit
  • cd collect && make pre-commit
  • cd frontend && npm run pre-commit

Related issue

Implements requirements from #72 (Alarm Management - Alertmanager Integration)

@github-actions
Contributor

🔗 Redirect URIs Added to Logto

The following redirect URIs have been automatically added to the Logto application configuration:

Redirect URIs:

  • https://my-frontend-qa-pr-42.onrender.com/login-redirect
  • https://my-proxy-qa-pr-42.onrender.com/login-redirect

Post-logout redirect URIs:

  • https://my-frontend-qa-pr-42.onrender.com/login
  • https://my-proxy-qa-pr-42.onrender.com/login

These will be automatically removed when the PR is closed or merged.

@github-actions
Contributor

github-actions bot commented Feb 20, 2026

🤖 My API structural change detected

Preview documentation

Structural change details

Added (17)

  • DELETE /alerts/config
  • DELETE /services/mimir/alertmanager/api/v2/silences/{silence_id}
  • DELETE /systems/{id}/alerts/silences/{silence_id}
  • GET /alerts
  • GET /alerts/config
  • GET /alerts/totals
  • GET /alerts/trend
  • GET /services/mimir/alertmanager/api/v2/alerts
  • GET /services/mimir/alertmanager/api/v2/silences
  • GET /services/mimir/alertmanager/api/v2/silences/{silence_id}
  • GET /systems/{id}/alerts
  • GET /systems/{id}/alerts/history
  • POST /alert_history
  • POST /alerts/config
  • POST /services/mimir/alertmanager/api/v2/alerts
  • POST /services/mimir/alertmanager/api/v2/silences
  • POST /systems/{id}/alerts/silences
Powered by Bump.sh

@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 20, 2026 11:00 — with Render Active
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:13 — with Render Active
@edospadoni edospadoni temporarily deployed to mimir-integration - my-collect-qa PR #42 February 24, 2026 16:13 — with Render Destroyed
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:15 — with Render Active
@edospadoni edospadoni requested a deployment to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:38 — with Render In progress
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:40 — with Render Active
@edospadoni edospadoni had a problem deploying to mimir-integration - my-mimir-qa PR #42 February 25, 2026 06:59 — with Render Failure
@gsanchietti
Member Author

update deploy

@github-actions
Contributor

🚀 Build triggers updated!

All .render-build-trigger files have been automatically updated to ensure fresh deployments of all services in the PR preview environment.

gsanchietti and others added 30 commits April 10, 2026 14:19
…end APIs

- Rename API routes from /alerting to /alerts for RESTful consistency
- Add GET /api/systems/:id/alerts for per-system active alerts
- Add GET /api/alerts/totals and GET /api/alerts/trend endpoints
- Use RequireResourcePermission on alerts group (read:systems for GET, manage:systems for POST/DELETE)
- Fix OpenAPI paths (remove duplicate /api/ prefix), tags, and security scheme names
- Add composite index (system_key, created_at) and unique constraint (fingerprint, system_key)
- Remove dead code (DeleteConfig), rename alertmanager_history.go to alerting_history.go
- Fix collect: http client timeout, endsAt zero-time handling, timing-safe token comparison
- Fix collect Redis config: only override ParseURL values when env vars are explicitly set
- Add missing env vars to collect .env.example and render.yaml
- Add alert history webhook endpoint to OpenAPI spec
- Move scripts to services/mimir/scripts, remove hardcoded QA credentials
- Add local dev setup: docker-compose.local.yml + my-local.yaml (filesystem storage)
- Fix Mimir config: reference runtime_config.yaml, remove emoji from docker-compose
- Update copyrights to 2026
- collect/middleware: WebhookAuthMiddleware tests (valid/invalid/missing token, unconfigured, timing-safe)
- collect/methods: ReceiveAlertHistory tests (resolved, firing skipped, missing system_key, invalid body, DB error, zero-time endsAt, nullableString)
- backend/methods: filterAlerts tests (all filter combinations, missing labels, empty input)
- backend/entities: alert history repository tests with sqlmock (query, sort validation, totals owner/non-owner, trend up/down/stable)
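
For context on the timing-safe token comparison mentioned above, here is a minimal sketch of a bearer-token check. It is not the WebhookAuthMiddleware from this PR. The function name is hypothetical, and rejecting every request when no secret is configured is an assumption.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], secret: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header (hypothetical helper)."""
    if not secret:
        # Assumption: an unconfigured secret means nothing is accepted.
        return False
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    token = authorization_header[len(prefix):]
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(token.encode(), secret.encode())
```

A plain `==` comparison can return as soon as the first differing byte is found, which is what the constant-time comparison is meant to avoid.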
…_id injection

- Collect proxy injects system_id (DB UUID) label in addition to system_key
- Backend BuildTemplateFiles substitutes ${APP_URL} placeholder in templates
- Templates use localized annotations: summary_en/it and description_en/it with fallback
- Add "service" label display in all 4 HTML/TXT templates
- Add "View system" / "Visualizza sistema" CTA button linking to app_url/systems/:id
- Rewrite TXT templates with welcome-style separators and footer (info@nethesis.it)
- Align label columns in TXT templates (rename FIRING SINCE→SINCE, STARTED AT→STARTED, etc.)
- Align headers/footers with welcome email style (MSO conditionals, backgroundTable)
- Change alert_history unique constraint to (fingerprint, system_key, starts_at)
- Use ON CONFLICT DO NOTHING to avoid overwriting distinct occurrences of same alert
- Add tests for injectSystemLabels helper
- Merge full alerting integration guide into services/mimir/README.md
- Remove separate language files (docs/en/08-alerting.md, docs/it/08-alerting.md)
- Document system_id/system_key auto-injection and summary_en/summary_it/description_en/description_it conventions
- Update alert catalog examples with localized annotations
- Add user-facing alerting guide in docs/docs/features/alerting.md (EN + IT)
- Add "Alerting System" link in Docusaurus Developer Docs dropdown and footer pointing to mimir README
The unique index on (fingerprint, system_key, starts_at) was used only by the
ON CONFLICT clause and never helped any SELECT query. Removing both the index
and the clause simplifies the schema and saves index space. If Alertmanager
retries a webhook after an error, a duplicate row may occasionally be inserted,
which is an acceptable trade-off for a rare edge case.
…th system context

Organization lifecycle:
- Auto-provision default alerting config on customer/distributor/reseller creation
- Use org email from custom_data as default notification recipient
- Use org language (en/it) from custom_data for email_template_lang
- Retry config push to Mimir with backoff (1s/3s/5s) to tolerate transient errors
- Built-in history webhook is always active so alert_history works from day one
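
The retry behaviour for the config push could look roughly like this. It is a sketch under assumptions: one initial attempt followed by up to three retries after 1, 3, and 5 seconds, with the last error propagated. The function name and the injectable `sleep` parameter are hypothetical.

```python
import time
from typing import Callable, Sequence


def push_with_retry(push: Callable[[], None],
                    delays: Sequence[float] = (1.0, 3.0, 5.0),
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Call push(), waiting 1s/3s/5s between failed attempts (hypothetical helper)."""
    for delay in delays:
        try:
            push()
            return
        except Exception:
            # A real implementation would likely retry only transient errors.
            sleep(delay)
    # Final attempt: any error raised here reaches the caller.
    push()
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting.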

Collect Mimir proxy:
- Inject organization context labels (name, vat, type) in addition to system_id/key
- Inject system_name, system_fqdn, system_ipv4 from the systems table
- Replace injectSystemLabels with generic injectLabels helper
- Join distributors/resellers/customers in the org lookup query

Email templates (HTML + TXT, EN + IT):
- Two-card layout: alert card (colored) + system info card (neutral) with CTA
- Dynamic organization label based on organization_type
  - IT: CLIENTE/RIVENDITORE/DISTRIBUTORE/ORGANIZZAZIONE
  - EN: CUSTOMER/RESELLER/DISTRIBUTOR/ORGANIZATION
- Dynamic FQDN/IP label (shows whichever is available)
- Subject format: [FIRING][AlertName] - SystemKey
- Plain-text templates abbreviate long labels (RIVEND./DISTRIB./ORG.) for column alignment
- CTA "View system" button linked to APP_URL/systems/<system_id>
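
The dynamic organization label and the subject format listed above can be sketched in a few lines. The real templates are Alertmanager email templates; this Python version only illustrates the mapping. The lowercase organization_type values, the fallback to English for an unknown language, and the status parameter in the subject are assumptions.

```python
ORG_LABELS = {
    "en": {"customer": "CUSTOMER", "reseller": "RESELLER", "distributor": "DISTRIBUTOR"},
    "it": {"customer": "CLIENTE", "reseller": "RIVENDITORE", "distributor": "DISTRIBUTORE"},
}
FALLBACK_LABEL = {"en": "ORGANIZATION", "it": "ORGANIZZAZIONE"}


def organization_label(organization_type: str, lang: str) -> str:
    """Pick the label shown next to the organization name (hypothetical helper)."""
    if lang not in ORG_LABELS:
        lang = "en"  # assumption: unknown languages fall back to English
    return ORG_LABELS[lang].get(organization_type.lower(), FALLBACK_LABEL[lang])


def subject(status: str, alert_name: str, system_key: str) -> str:
    """Build a subject in the form [FIRING][AlertName] - SystemKey."""
    return f"[{status.upper()}][{alert_name}] - {system_key}"
```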
- alerting.GetConfig returns (nil, nil) when Mimir responds 404 (no config
  has ever been pushed for this tenant)
- GetAlertingConfig handler returns HTTP 200 with "config": null when the
  body is empty, so the frontend shows the "no configuration found" empty
  state instead of a 500 error
- Previously the API returned 500 "mimir returned 404: alertmanager storage
  object not found" for any org without a pushed config, which broke the UI
  for newly created orgs where auto-provisioning failed
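
The 404 handling described above follows a common pattern: "not found" becomes an absent value rather than an error, and the handler turns that absent value into a normal response. A minimal sketch, assuming a hypothetical `fetch` callable that returns a `(status, body)` pair and a response envelope that contains only the `config` key:

```python
import json
from typing import Callable, Optional, Tuple


class MimirError(Exception):
    """Raised for unexpected Mimir responses (hypothetical)."""


def get_config(fetch: Callable[[], Tuple[int, str]]) -> Optional[str]:
    """Return the stored config, or None when no config was ever pushed."""
    status, body = fetch()
    if status == 404:
        return None  # no config for this tenant: not an error
    if status != 200:
        raise MimirError(f"mimir returned {status}")
    return body


def get_alerting_config_response(fetch: Callable[[], Tuple[int, str]]) -> Tuple[int, str]:
    """Build the handler response; a missing config serializes as null."""
    config = get_config(fetch)
    return 200, json.dumps({"config": config})
```

Other non-200 statuses still surface as errors, so only the "never configured" case is turned into an empty state.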
- Update all API calls in lib/alerting.ts to use /alerts instead of /alerting:
  - GET /alerts/config, POST /alerts/config, DELETE /alerts/config
  - GET /alerts (list active alerts)
  - GET /systems/:id/alerts/history
- Replace getSystemActiveAlerts helper to use the dedicated
  GET /systems/:id/alerts endpoint instead of filtering the global alerts
  list by system_key client-side
- SystemActiveAlertsCard: switch from (organizationId, systemKey) to (systemId)
  so it no longer relies on the sanitized system_key field for unregistered
  systems
Provides make targets to manage a local Mimir instance with filesystem
storage (no S3 required), wrapping docker-compose.local.yml:

- dev-setup:   inject MIMIR_URL and alerting webhook env vars into
               backend/.env and collect/.env (idempotent)
- dev-up:      start Mimir container and wait for readiness
- dev-down:    stop container
- dev-restart: restart container
- dev-logs:    follow container logs
- dev-status:  show container status and Mimir readiness
- dev-ready:   check readiness endpoint

Update README with the local development workflow.
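
As an illustration of what "idempotent" means for dev-setup, the sketch below appends variables to the text of an .env file only when they are not already defined, so running it twice yields the same result. The real target is a make recipe; this Python helper, its name, and the example value for MIMIR_URL are hypothetical. Leaving an existing value untouched (rather than updating it) is also an assumption.

```python
def ensure_env_vars(env_text: str, wanted: dict[str, str]) -> str:
    """Append KEY=value lines for keys that are not defined yet (hypothetical helper)."""
    present = set()
    for line in env_text.splitlines():
        stripped = line.strip()
        # Ignore blank lines and comments; record the key of each assignment.
        if stripped and not stripped.startswith("#") and "=" in stripped:
            present.add(stripped.split("=", 1)[0].strip())

    missing = [f"{key}={value}" for key, value in wanted.items() if key not in present]
    if not missing:
        return env_text  # nothing to do: a second run is a no-op

    # Make sure the existing content ends with a newline before appending.
    prefix = env_text if env_text.endswith("\n") or not env_text else env_text + "\n"
    return prefix + "\n".join(missing) + "\n"
```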
- Update all API paths in alerting_config.py from /alerting/... to /alerts/...
  to match the backend API rename:
  - GET/POST/DELETE /alerts/config
  - GET /alerts (list active alerts)
  - GET /systems/:id/alerts/history
- Document the LOGTO_ENDPOINT, LOGTO_APP_ID and AUTH_BASE_URL environment
  variables in scripts/README.md, which replaced the hardcoded QA values
  removed in a previous commit
Short, tool-agnostic reference for AI coding agents working in this
monorepo. Covers components actually on the current branch (backend,
collect, sync, frontend, proxy, services/mimir) and explicitly marks
services/support and services/ssh-gateway as stubs here. API reference
defers to openapi.yaml as source of truth. Includes coding patterns,
RBAC model, alerting invariants, and a short pitfalls list.

Claude Code auto-loads CLAUDE.md; developers who use Claude Code can
create a local CLAUDE.md shim that points to this file.
The script was failing with 404 errors because it used the hardcoded default
Logto endpoint 'https://your-tenant.logto.app', which does not exist.

Changes:
- Add required CLI arguments: --tenant-id and --app-id
- Derive Logto endpoint dynamically from tenant ID
- Use the proxy URL as redirect_uri base instead of hardcoded _AUTH_BASE_URL
- Update all examples in docstring to include new arguments
- Pass tenant_id and app_id to all command functions

This allows the script to work with any MY proxy deployment by providing
the Logto tenant configuration at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ments

- Add new required arguments to all command examples
- Update full example workflow to include Logto configuration
- Document the new CLI arguments in the Common arguments table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove required constraint from --app-id argument
- Use environment variable LOGTO_APP_ID as default if set
- Fall back to standard app ID 'my_frontend_app' if not set
- Update all documentation and examples to show --app-id is now optional
- Update README table to show required/optional arguments clearly

This simplifies the CLI usage for most deployments that use the standard
frontend app ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Make --tenant-id optional with TENANT_ID environment variable fallback
- Add validation to ensure tenant_id is provided (via CLI or env var)
- Add detailed debugging in _logto_login() to identify which step fails
- Improve error messages to help user troubleshoot authentication issues
- Show which endpoint failed and provide guidance for common issues
- Display Logto endpoint, tenant ID, and app ID in error output

This helps users quickly identify if the issue is:
  1. Invalid/missing tenant ID
  2. Incorrect app ID
  3. Unregistered redirect URI
  4. Logto service unavailable

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
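
The argument handling described in the last few commits (optional --app-id with a LOGTO_APP_ID fallback, optional --tenant-id with a TENANT_ID fallback plus validation) can be sketched with argparse. This is not the script's actual code; the function name and the injectable `env` mapping are hypothetical, and the error message wording is made up.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def parse_args(argv: Sequence[str], env: Optional[Mapping[str, str]] = None) -> argparse.Namespace:
    """Parse CLI arguments with environment fallbacks (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # CLI value wins; otherwise fall back to the environment variable.
    parser.add_argument("--tenant-id", default=env.get("TENANT_ID"))
    # LOGTO_APP_ID if set, else the standard frontend app ID.
    parser.add_argument("--app-id", default=env.get("LOGTO_APP_ID") or "my_frontend_app")
    args = parser.parse_args(argv)
    if not args.tenant_id:
        # parser.error prints the message to stderr and exits with status 2.
        parser.error("tenant ID is required: pass --tenant-id or set TENANT_ID")
    return args
```

Validating after parsing, rather than marking the argument as required, is what allows the environment variable to satisfy the requirement.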
Fire the alert only for systems that have been inactive for at least
two check intervals (120 seconds), to avoid flapping.
2 participants