obs(newrelic): P0 log-based alert for unsanctioned destructive DDL on postgres-customers (truehomie DDL trap, D3) #70

Merged
mastermanas805 merged 1 commit into master from obs/d3-ddl-trap-alert
Jun 10, 2026

Conversation

@mastermanas805

What

Task D3 deliverable 4 — wires the postgres-customers DDL-logging trap (set during the 2026-06-03 truehomie-db incident) to a P0 NR alert, plus a dashboard page.

  • newrelic/alerts/customer-db-destructive-ddl.json: CRITICAL, FROM Log (metric-based NR alerting is not live in prod because there is no Prometheus pipeline). Fires when the postgres-customers pod (k8s_namespace_name='instant-data', k8s_label_app='postgres-customers') logs a DROP DATABASE/ROLE/USER/OWNED statement that is not accounted for by a sanctioned provisioner drop in the same 15-minute window (event=provisioner.drop from server.guardedDrop for RPC drops, pool.deprovisionBacking caller='pool_reaper' for hot-pool reaps). Each sanctioned shared-pg drop carries a 4× DDL budget (up to 3 retried DROP DATABASE attempts + 1 DROP USER). The truehomie signature (drops with ZERO provisioner.drop events) always pages. Full triage runbook in the condition description.
  • newrelic/dashboards/admin-defense.json — new third page "customer-db DDL trap": unsanctioned-delta billboard, DDL-vs-sanctioned timeseries, raw trap lines table, sanctioned-drops + dropguard-refusals table. Purely additive (106 insertions, 0 deletions).
  • newrelic/CHANGES.md — entry with the upstream dependency.
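The budget arithmetic in the alert bullet can be sketched as follows, assuming the condition reduces to comparing two counts taken from the same 15-minute window. The constant and function names are hypothetical, and the shipped condition is NRQL inside the alert JSON, which is not reproduced here.

```python
# Hypothetical names: this illustrates the arithmetic only, not the shipped NRQL.
DDL_BUDGET_PER_SANCTIONED_DROP = 4  # up to 3 retried DROP DATABASE + 1 DROP USER


def unsanctioned_delta(ddl_lines: int, sanctioned_drops: int) -> int:
    """DDL trap lines left over after spending the sanctioned-drop budget."""
    budget = DDL_BUDGET_PER_SANCTIONED_DROP * sanctioned_drops
    return max(0, ddl_lines - budget)


def should_page(ddl_lines: int, sanctioned_drops: int) -> bool:
    """True when the window holds more destructive DDL than the ledger explains."""
    return unsanctioned_delta(ddl_lines, sanctioned_drops) > 0
```

Under these assumptions, `should_page(2, 0)` models the truehomie signature (drops with no ledger events) and pages, while `should_page(4, 1)` models one fully retried sanctioned drop and stays quiet.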

Discovered trap log shape (live, read-only kubectl on do-nyc3-instant-prod)

  • Trap = ALTER SYSTEM SET log_statement='ddl' + log_connections=on (2026-06-03 incident response; persists in postgresql.auto.conf on the PVC — the connection received/authorized lines visible in the current pod log prove the settings survived the 2026-06-06 Recreate).
  • DDL line shape: standard postgres stderr format, e.g.
    2026-06-10 18:55:14.433 UTC [1704166] LOG: statement: DROP DATABASE "db_96edf9eed8ed42929036b63298ec5b2b"
    Extended-protocol clients log LOG: execute <name>: DROP ... instead, so the NRQL matches the DROP fragment, not the prefix. The 26,777 pod log lines from the last 96h contain ZERO statement: lines, i.e. no DDL ran in that window; the trap is armed and quiet.
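To illustrate why matching the DROP fragment covers both log prefixes, here is a minimal sketch. The regular expression is an assumption standing in for the NRQL filter, which this PR does not show; the extended-protocol and connection lines are hypothetical examples, and only the first sample line comes from the description above.

```python
import re
from typing import Optional

# Assumed pattern: match the DROP fragment only, so both the "statement:" and
# the "execute <name>:" prefixes are covered. This is not the shipped NRQL.
DESTRUCTIVE_DDL = re.compile(r"\bDROP\s+(DATABASE|ROLE|USER|OWNED)\b", re.IGNORECASE)


def destructive_ddl_kind(log_line: str) -> Optional[str]:
    """Return the dropped object kind (e.g. 'DATABASE'), or None if no match."""
    match = DESTRUCTIVE_DDL.search(log_line)
    return match.group(1).upper() if match else None


# Sample line quoted in the PR description (simple-protocol client).
SIMPLE = ('2026-06-10 18:55:14.433 UTC [1704166] LOG: statement: '
          'DROP DATABASE "db_96edf9eed8ed42929036b63298ec5b2b"')
# Hypothetical extended-protocol line, following the "execute <name>:" shape.
EXTENDED = 'LOG: execute s1: DROP USER "u_example"'
# Hypothetical log_connections line, which must not match.
CONNECTION = "LOG: connection received: host=10.0.0.1 port=5432"
```

Both `SIMPLE` and `EXTENDED` yield an object kind, while `CONNECTION` yields `None`, so a prefix-independent match keeps the trap sensitive to either client protocol.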

Ordering dependency

Provisioner PR #56 (InstaNode-dev/provisioner) adds the pool_reaper ledger entry + dropguard. Merge #56 before applying this alert, or hot-pool reaps of failed postgres items will false-positive.

Rule-17 coverage block

Symptom:        unaudited DROP DATABASE/ROLE on postgres-customers pages nobody (truehomie 2026-06-03)
Enumeration:    kubectl logs (96h) + rg over provisioner/api/worker/common for all DROP emitters; existing alert JSONs grepped for ddl/drop coverage (only metric-based provisioner-drop-*.json existed, and those are not live)
Sites found:    1 alert gap (no log-based DDL alert), 1 dashboard gap
Sites touched:  2 (alert JSON + dashboard page) + CHANGES.md
Coverage test:  newrelic JSONs validated with python json.tool (apply.test.sh is pre-existing-broken on master — committed merge-conflict markers at line 174 from PR #14 + stale dry-run count; flagged, not fixed here)
Live verified:  awaiting operator apply (newrelic/apply.sh — this repo has no auto-apply by design)
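The json.tool check in the coverage test can be approximated across a whole directory with a short script. This is a sketch under the assumption that parse success is the only property being validated; the helper name is hypothetical and it is not part of this repo.

```python
import json
from pathlib import Path


def invalid_json_files(root: Path) -> list:
    """Return (path, error) pairs for every *.json under root that fails to parse."""
    failures = []
    for path in sorted(root.rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((path, str(exc)))
    return failures
```

Calling `invalid_json_files(Path("newrelic"))` from the repo root would list any alert or dashboard file that does not parse; an empty list corresponds to every file passing `python -m json.tool`.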

Pre-existing finding (not fixed here)

newrelic/tests/apply.test.sh on master contains unresolved merge-conflict markers (line 174, from PR #14 squash) and a stale expected-count baseline (33 vs 98 JSON files) — the NR test suite has been un-runnable since that merge. Recommend a follow-up PR.
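For the follow-up PR, a committed conflict marker like the one described above could be caught with a check along these lines. This is a hypothetical helper, not something in the repo, and it assumes the standard Git marker shapes (`<<<<<<< `, `=======`, `>>>>>>> `) at the start of a line.

```python
def conflict_marker_lines(text: str) -> list:
    """Return 1-based line numbers that look like Git merge-conflict markers."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Opening and closing markers carry a label; the separator stands alone.
        if line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "=======":
            hits.append(lineno)
    return hits
```

Running it over the contents of newrelic/tests/apply.test.sh and failing on a non-empty result would flag this class of breakage before merge.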

🤖 Generated with Claude Code

… postgres-customers (truehomie DDL trap, task D3)

- alerts/customer-db-destructive-ddl.json: CRITICAL FROM Log (metric-based
  alerting has no live Prometheus pipeline in prod). Balances DROP
  DATABASE/ROLE/USER/OWNED lines from the postgres-customers pod
  (log_statement='ddl' trap set 2026-06-03, persists on the PVC) against the
  provisioner's sanctioned-drop ledger (event=provisioner.drop from
  server.guardedDrop + pool_reaper, provisioner PR #56), with a 4x DDL-budget
  per sanctioned shared-pg drop. The truehomie signature (drops with ZERO
  provisioner.drop events) always pages. Triage runbook in the description.
- dashboards/admin-defense.json: new "customer-db DDL trap" page (delta
  billboard, DDL-vs-sanctioned timeseries, raw trap lines, sanctioned +
  dropguard-refusal table). Purely additive.
- newrelic/CHANGES.md entry (upstream dependency: provisioner PR #56 —
  merge it first or pool reaps false-positive).

Operator apply required (no auto-apply in this repo): newrelic/apply.sh.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 10, 2026 19:38
@mastermanas805 mastermanas805 merged commit 58b7bca into master Jun 10, 2026
3 checks passed