diff --git a/newrelic/CHANGES.md b/newrelic/CHANGES.md
index 491b8f7..cc40b95 100644
--- a/newrelic/CHANGES.md
+++ b/newrelic/CHANGES.md
@@ -268,3 +268,26 @@ cd /tmp/wt-nr-rollup/newrelic
 
 The README update is intentionally left as a manual follow-up because the
 existing list is hand-curated.
+
+## customer-db destructive-DDL trap alert + dashboard page (2026-06-11, task D3)
+
+New CRITICAL log-based alert `alerts/customer-db-destructive-ddl.json` and a
+third page ("customer-db DDL trap") on `dashboards/admin-defense.json`.
+
+- **What it watches:** the `log_statement='ddl'` trap set on postgres-customers
+  during the 2026-06-03 truehomie-db incident (ALTER SYSTEM, persists on the
+  PVC). Any `DROP DATABASE/ROLE/USER/OWNED` line from the pod
+  (`k8s_namespace_name='instant-data'`, `k8s_label_app='postgres-customers'`)
+  is balanced against the provisioner's sanctioned-drop ledger
+  (`event=provisioner.drop` from server.guardedDrop and, as of provisioner
+  PR #56, pool.deprovisionBacking with `caller='pool_reaper'`); a positive
+  delta (budget: 4 DDL statements per sanctioned shared-pg drop) pages.
+- **Why FROM Log:** metric-based NR alerting has no live Prometheus pipeline
+  in prod; the companion metric alerts (`provisioner-drop-*.json`,
+  `instant_provisioner_drop_total` — now also carrying an
+  `outcome="refused"` label from the dropguard name-convention guard) stay
+  as-is for when the pipeline lands.
+- **Upstream dependency:** provisioner PR #56 (dropguard +
+  `provisioner.drop.refused` event + pool-reaper ledger entry). The alert
+  works without it but would false-positive on pool reaps; merge #56 first.
+- **Apply:** operator runs `newrelic/apply.sh` (no auto-apply in this repo).
diff --git a/newrelic/alerts/customer-db-destructive-ddl.json b/newrelic/alerts/customer-db-destructive-ddl.json
new file mode 100644
index 0000000..73e01e9
--- /dev/null
+++ b/newrelic/alerts/customer-db-destructive-ddl.json
@@ -0,0 +1,31 @@
+{
+  "name": "postgres-customers — destructive DDL with no sanctioned provisioner drop (truehomie DDL trap)",
+  "type": "NRQL",
+  "description": "P0 DATA-LOSS. Pages when the postgres-customers pod logs a DROP DATABASE / DROP ROLE / DROP USER / DROP OWNED statement that is NOT accounted for by a sanctioned provisioner drop in the same window — the exact signature of the 2026-06-03 truehomie-db incident (an active Pro customer's db_/usr_ dropped by an unidentified, non-audited path; root cause still OPEN).\n\nSIGNAL SOURCES (both are Log records via the newrelic-logging Fluent Bit DaemonSet — this alert is deliberately FROM Log because metric-based alerting has no live Prometheus pipeline in prod):\n (1) The DDL trap: `ALTER SYSTEM SET log_statement='ddl'` + log_connections=on was set on postgres-customers during the 2026-06-03 incident response (persisted in postgresql.auto.conf on the PVC — the visible `connection received/authorized` lines in the pod log prove the setting survived restarts). Every DDL statement appears on the pod stdout as a standard postgres line, e.g.:\n 2026-06-10 18:55:14.433 UTC [1704166] LOG: statement: DROP DATABASE \"db_96edf9eed8ed42929036b63298ec5b2b\" WITH (FORCE)\n (extended-protocol clients log `LOG: execute : DROP ...` instead of `statement:` — the match is on the DROP fragment, not the prefix). Pod selector: k8s_namespace_name='instant-data', k8s_label_app='postgres-customers'.\n (2) The sanctioned-drop ledger: every legitimate customer-data drop the provisioner performs emits a structured `event=provisioner.drop` JSON log line BEFORE executing (server.guardedDrop for RPC drops; pool.deprovisionBacking with caller='pool_reaper' for hot-pool reaps — provisioner PR #56). 
Matched by attribute or raw-message so it is robust to whether NR lifts the JSON fields.\n\nQUERY SEMANTICS: count(pg DROP-DDL lines) - 4 * count(sanctioned shared-postgres provisioner.drop events). Each sanctioned shared-pg drop executes at most 4 matching statements (up to 3 DROP DATABASE attempts in the in-use retry loop + 1 DROP USER), so the budget is generous; a positive value means DDL ran on the shared customer cluster that the provisioner never announced. The truehomie class (drops with ZERO provisioner.drop events) always fires. fillValue STATIC 0 keeps the quiet state at 0; sanctioned-only windows go negative and never alert.\n\nKNOWN BLIND SPOT / FALSE POSITIVES (accepted for a P0 trap): (a) a window containing BOTH a sanctioned drop and a small unsanctioned one can be masked by the 4x budget — the burst class still fires; (b) an aggregation-window boundary can split a sanctioned drop from its DDL lines and blip a false CRITICAL — the 15-minute window makes this rare, and triage step 1 clears it in one grep; (c) an operator running a manual psql DROP (e.g. an in-cluster smoke per POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md) WILL page — that is intended behaviour: manual admin DDL on the customer cluster must be a paged, attributed event.\n\nWHEN THIS FIRES:\n 1. Pull the offending statements: `kubectl logs -n instant-data -l app=postgres-customers --since=30m | grep -iE 'DROP (DATABASE|ROLE|USER|OWNED)'` — log_connections=on means the surrounding `connection authorized: user=... host=...` lines name the role and client_addr.\n 2. Cross-check the sanctioned ledger: `kubectl logs -n instant-infra -l app=instant-provisioner --since=30m | grep provisioner.drop` (includes caller gRPC peer / pool_reaper attribution).\n 3. 
If the DROP names an ACTIVE customer db_/usr_ (check resources table status): treat as truehomie recurrence — restore per the incident memory (recreate role+db with the stored decrypted connection_url password), and capture the client_addr/application_name THIS time.\n 4. Check `provisioner.drop.refused` events (the dropguard name-convention guard, provisioner PR #56): a refusal plus a pg DDL line means something bypassed the provisioner entirely.\nRunbooks: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md, memory project_truehomie_db_drop_incident_2026_06_03.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT filter(count(*), WHERE k8s_namespace_name = 'instant-data' AND k8s_label_app = 'postgres-customers' AND (message LIKE '%DROP DATABASE%' OR message LIKE '%DROP ROLE%' OR message LIKE '%DROP USER%' OR message LIKE '%DROP OWNED%')) - 4 * filter(count(*), WHERE service = 'provisioner' AND (event = 'provisioner.drop' OR message LIKE '%\"event\":\"provisioner.drop\"%') AND (resource_type = 'RESOURCE_TYPE_POSTGRES' OR message LIKE '%RESOURCE_TYPE_POSTGRES%') AND (backend = 'shared' OR message LIKE '%\"backend\":\"shared\"%')) FROM Log"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 900,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 900,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/dashboards/admin-defense.json b/newrelic/dashboards/admin-defense.json
index 8ab70b4..28c4206 100644
--- a/newrelic/dashboards/admin-defense.json
+++ b/newrelic/dashboards/admin-defense.json
@@ -133,6 +133,112 @@
           }
         }
       ]
+    },
+    {
+      "name": "customer-db DDL trap",
+      "description": "Truehomie-db DROP incident (2026-06-03) trap surface: 
every DROP DATABASE/ROLE/USER/OWNED the postgres-customers pod logs (log_statement='ddl', set 2026-06-03, persists on the PVC) vs the provisioner's sanctioned-drop ledger (event=provisioner.drop — guardedDrop + pool_reaper, provisioner PR #56). Pairs with the customer-db-destructive-ddl.json CRITICAL alert (FROM Log — no Prometheus pipeline in prod).",
+      "widgets": [
+        {
+          "title": "UNSANCTIONED drop-DDL delta (alert signal; must be <= 0)",
+          "layout": {
+            "column": 1,
+            "row": 1,
+            "width": 4,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.billboard"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT filter(count(*), WHERE k8s_namespace_name = 'instant-data' AND k8s_label_app = 'postgres-customers' AND (message LIKE '%DROP DATABASE%' OR message LIKE '%DROP ROLE%' OR message LIKE '%DROP USER%' OR message LIKE '%DROP OWNED%')) - 4 * filter(count(*), WHERE service = 'provisioner' AND (event = 'provisioner.drop' OR message LIKE '%\"event\":\"provisioner.drop\"%') AND (resource_type = 'RESOURCE_TYPE_POSTGRES' OR message LIKE '%RESOURCE_TYPE_POSTGRES%') AND (backend = 'shared' OR message LIKE '%\"backend\":\"shared\"%')) AS 'unsanctioned drop-DDL delta' FROM Log SINCE 3 hours ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        },
+        {
+          "title": "postgres-customers DROP statements vs sanctioned provisioner drops (24h)",
+          "layout": {
+            "column": 5,
+            "row": 1,
+            "width": 8,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.line"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT filter(count(*), WHERE k8s_namespace_name = 'instant-data' AND k8s_label_app = 'postgres-customers' AND (message LIKE '%DROP DATABASE%' OR message LIKE '%DROP ROLE%' OR message LIKE '%DROP USER%' OR message LIKE '%DROP OWNED%')) AS 'pg DROP statements', filter(count(*), WHERE service = 'provisioner' AND (event = 'provisioner.drop' OR message LIKE 
'%\"event\":\"provisioner.drop\"%') AND (resource_type = 'RESOURCE_TYPE_POSTGRES' OR message LIKE '%RESOURCE_TYPE_POSTGRES%') AND (backend = 'shared' OR message LIKE '%\"backend\":\"shared\"%')) AS 'sanctioned shared-pg drops' FROM Log TIMESERIES AUTO SINCE 24 hours ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        },
+        {
+          "title": "Latest DROP statements on postgres-customers (raw trap lines, 7d)",
+          "layout": {
+            "column": 1,
+            "row": 4,
+            "width": 6,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.table"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT timestamp, message FROM Log WHERE k8s_namespace_name = 'instant-data' AND k8s_label_app = 'postgres-customers' AND (message LIKE '%DROP DATABASE%' OR message LIKE '%DROP ROLE%' OR message LIKE '%DROP USER%' OR message LIKE '%DROP OWNED%') SINCE 7 days ago LIMIT 50"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        },
+        {
+          "title": "Sanctioned drops by caller + dropguard REFUSALS (must be 0) (24h)",
+          "layout": {
+            "column": 7,
+            "row": 4,
+            "width": 6,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.table"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT filter(count(*), WHERE service = 'provisioner' AND (event = 'provisioner.drop' OR message LIKE '%\"event\":\"provisioner.drop\"%')) AS 'sanctioned drops (all types)', filter(count(*), WHERE service = 'provisioner' AND (event = 'provisioner.drop.refused' OR message LIKE '%\"event\":\"provisioner.drop.refused\"%')) AS 'dropguard refusals (bug/attack if > 0)' FROM Log SINCE 24 hours ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            }
+          }
+        }
+      ]
     }
   ]
 }
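
Note on the alert arithmetic: the QUERY SEMANTICS paragraph in the alert description can be checked offline with a small sketch. This is an illustration only, not code from this patch or the provisioner. The function names are hypothetical, and the two inputs are assumed to be (a) raw postgres-customers log lines and (b) a count of sanctioned shared-pg `provisioner.drop` events from the same 15-minute window.

```python
import re

# The same four statement fragments the alert's LIKE clauses match (case-sensitive).
DROP_DDL = re.compile(r"DROP (DATABASE|ROLE|USER|OWNED)")

# Budget from the alert description: up to 3 DROP DATABASE attempts + 1 DROP USER.
DDL_BUDGET_PER_SANCTIONED_DROP = 4


def unsanctioned_delta(pg_lines, sanctioned_drops):
    """count(pg DROP-DDL lines) - 4 * count(sanctioned shared-pg drops)."""
    ddl_count = sum(1 for line in pg_lines if DROP_DDL.search(line))
    return ddl_count - DDL_BUDGET_PER_SANCTIONED_DROP * sanctioned_drops


def would_page(pg_lines, sanctioned_drops):
    """Mirror of the CRITICAL term: operator ABOVE, threshold 0."""
    return unsanctioned_delta(pg_lines, sanctioned_drops) > 0


# Hypothetical sample lines in the shape quoted in the alert description.
truehomie_class = [
    'LOG: statement: DROP DATABASE "db_example" WITH (FORCE)',
    'LOG: statement: DROP USER "usr_example"',
]
```

With these inputs, `would_page(truehomie_class, 0)` is true (two DDL lines, no ledger entry), while `would_page(truehomie_class, 1)` is false because the delta is 2 - 4 = -2. Blind spot (a) also shows up here: four DDL lines against one sanctioned drop give a delta of exactly 0, which does not page even if two of those lines were unsanctioned.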