obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill) by mastermanas805 · Pull Request #73 · InstaNode-dev/infra

mastermanas805 · 2026-06-11T05:09:53Z

Why

A real customer-facing failure — a customer backup failed AND a mongodb pod was OOMKilled (exit 137), losing a provisioned customer DB — went UNDETECTED for hours until a customer emailed a screenshot.

The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (#72 authored the newrelic-prometheus-agent pipeline, but it is operator-apply-pending), so every FROM Metric NR alert (~46 of them) queries an empty stream. Only FROM Log (the newrelic-logging Fluent Bit DaemonSet) + Synthetics are live, so the fixes must be LOG-based to work today.

FIX 1 — Log-based silent-failure backstops (the "never blind again" set)

5 new FROM Log alerts under newrelic/alerts/, each keyed on a real worker log line verified against the code (file:line cited in every alert description), following the exact structure of billing-charge-undeliverable.json:

| Alert | Source log line (verified) | Severity / threshold |
| --- | --- | --- |
| `customer-backup-failed-nonauth-log.json` | `jobs.customer_backup_runner.failed` w/ `reason != 'auth'`; `worker .../customer_backup_runner.go:729` (classifier `backupFailReason` :617) | WARNING ABOVE 3 / 15m (sustained; a single transient `dump` self-heals next run) |
| `backup-stuck-row-recovery-failed.json` | `jobs.customer_backup_runner.stuck_row_recovery_failed`; `:370` | CRITICAL ABOVE 0 / 10m (any occurrence = bug) |
| `deploy-failed-autopsy-log.json` | `jobs.deploy_failure_autopsy.captured` (bounded `reason` ∈ DeadlineExceeded/StartFailed/BuildFailed/BackoffLimitExceeded/ProgressDeadlineExceeded/OOMKilled/CrashLoopBackOff/ImagePullBackOff); `deploy_failure_autopsy.go:402`, pairs with `audit_log kind='deploy.failed'` | CRITICAL ABOVE 0 / 5m |
| `propagation-dead-lettered-log.json` | `jobs.propagation_runner.dead_lettered` (:892) + `…unknown_kind_dead_lettered` (:985) | CRITICAL ABOVE 0 / 5m |
| `data-tier-pod-oomkill-restart.json` | image-native startup banner of each `instant-data` stateful pod reappearing = restart: postgres-customers `database system is ready to accept connections`, mongodb `Waiting for connections`, redis-provision `Ready to accept connections`, nats `Server is ready` (pinned images pgvector:pg16 / mongo:7 / redis:7-alpine / nats:2.10-alpine). FACET `k8s_label_app`. | CRITICAL ABOVE 0 / 5m per pod |
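
The thresholds in the table read as "more than N matching log lines inside the window". The sketch below is a simplified model of that rule, not New Relic's actual evaluation engine: the `breaches` helper, its arguments, and the sample timestamps are all hypothetical, and it assumes `ABOVE` is a strict greater-than comparison over a single trailing window.

```python
from datetime import datetime, timedelta


def breaches(timestamps, threshold, window, now):
    """Return True if more than `threshold` events fall in (now - window, now].

    Hypothetical helper: models "ABOVE <threshold> / <window>" as a strict
    count comparison over one trailing window.
    """
    start = now - window
    count = sum(1 for t in timestamps if start < t <= now)
    return count > threshold  # ABOVE is modelled as strictly greater than


now = datetime(2026, 6, 11, 5, 0)

# Three non-auth backup failures inside 15 minutes: not above 3, stays quiet.
three_failures = [now - timedelta(minutes=m) for m in (1, 6, 12)]

# A fourth failure in the same window makes the failure "sustained".
four_failures = three_failures + [now - timedelta(minutes=14)]

# WARNING ABOVE 3 / 15m
print(breaches(three_failures, 3, timedelta(minutes=15), now))
print(breaches(four_failures, 3, timedelta(minutes=15), now))

# CRITICAL ABOVE 0 / 5m: a single occurrence is enough.
print(breaches([now], 0, timedelta(minutes=5), now))
```

This is why the non-auth backup alert tolerates a lone transient `dump` failure while the `ABOVE 0` alerts fire on the first matching line.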

Backup auth-credential drift (mongo SCRAM / redis WRONGPASS) is already covered by the pre-existing customer-backup-failed.json (reason='auth', CRITICAL ABOVE 0 / 5m) — worker #106 extended backupFailReason to classify all 3 dump-tool auth dialects into reason='auth', so that alert now fires on the new cases. Not duplicated.

All 6 alerts (the 5 new ones plus the pre-existing customer-backup-failed.json) are documented in a new LOG-ALERTS section of observability/METRICS-CATALOG.md (source log line + severity + NRQL key per alert).

Flagged blind spots (couldn't fully close on a log line — need #72 / kube-events)

data-tier-pod-oomkill-restart.json is a restart detector, not a true OOMKill detector. It cannot read the exit code (137) or distinguish an involuntary OOMKill from a deliberate operator rollout, so a planned restart or the maintenance-window apply will fire it once per pod (ack during the window). The authoritative reason='OOMKilled' event (K8sContainerSample / kube_pod_container_status_last_terminated_reason) needs either kube-state-metrics plus the NR Kubernetes/kube-events integration or the #72 Prometheus pipeline (feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline); neither is in prod. Until then the banner detector is the alarm, paired with the FIX-2 protection manifest that prevents the OOMKill.
The deploy + propagation LOG alerts are backstops, not replacements: when #72 lands, the FROM Metric originals carry the per-reason/kind faceting; keep both.
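
A minimal sketch of why the banner detector is blind to the restart cause. It assumes log records arrive as `(app_label, message)` pairs; the `BANNERS` table reuses the banner strings listed for each pod, while `restarted_apps` and the sample log lines are hypothetical and only illustrate the matching logic, not the actual NRQL.

```python
# Startup banner expected from each instant-data stateful pod.
BANNERS = {
    "postgres-customers": "database system is ready to accept connections",
    "mongodb": "Waiting for connections",
    "redis-provision": "Ready to accept connections",
    "nats": "Server is ready",
}


def restarted_apps(log_records):
    """Count startup banners per app label (hypothetical helper).

    Each banner seen after the pod was already running means the
    container started again, for whatever reason.
    """
    counts = {}
    for app, message in log_records:
        banner = BANNERS.get(app)
        if banner is not None and banner in message:
            counts[app] = counts.get(app, 0) + 1
    return counts


# Illustrative log lines only; real messages carry more fields.
after_oomkill = [("mongodb", "NETWORK Waiting for connections port=27017")]
after_rollout = [("mongodb", "NETWORK Waiting for connections port=27017")]

# Both causes produce the same signal: the exit code never reaches the logs.
print(restarted_apps(after_oomkill))
print(restarted_apps(after_rollout))
```

Because the two inputs are indistinguishable at the log level, the alert has to be acknowledged by hand during planned restarts.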

FIX 2 — Data-tier OOMKill PROTECTION manifest: ALREADY SHIPPED in #69 (not duplicated)

The eviction-protection work is already on master from #69 (11bc470), operator-apply-pending:

PriorityClass instant-data-critical (value 1000000, globalDefault=false) + PodDisruptionBudgets (minAvailable: 1) for postgres-customers/mongodb/redis-provision/nats — k8s/data/stateful-priority.yaml.
Memory + CPU requests/limits on all 4 pods (e.g. mongodb 256Mi/100m req, 1Gi limit → Burstable, no longer BestEffort) — k8s/data/{mongodb,redis-provision,postgres-customers,nats}.yaml.
The maintenance-window apply runbook (k8s/DATA-TIER-APPLY-RUNBOOK.md): apply one pod at a time, patch priorityClassName in the same roll, verify data intact after each restart, rollback steps.

I validated these manifests with kubectl apply --dry-run=client (clean) but did not modify or re-apply them (per instruction not to touch #69 artifacts).
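
The "Burstable, no longer BestEffort" note follows from the standard Kubernetes QoS rules. The sketch below models them for a single-container pod; `qos_class` is a hypothetical helper, quantities are compared as plain strings (so `1Gi` and `1024Mi` would wrongly count as different), and the mongodb CPU limit is left out because it is not stated here.

```python
def qos_class(requests, limits):
    """Classify a single-container pod from its resource requests/limits.

    Simplified model of the Kubernetes QoS rules:
    - BestEffort: no requests and no limits at all
    - Guaranteed: cpu and memory limits set, and requests equal to limits
      (a missing request defaults to the limit)
    - Burstable: everything else
    """
    if not requests and not limits:
        return "BestEffort"
    guaranteed = all(
        r in limits and requests.get(r, limits[r]) == limits[r]
        for r in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Before #69: no resources declared, first in line for eviction.
print(qos_class({}, {}))

# mongodb after #69: 256Mi/100m requested, 1Gi memory limit.
print(qos_class({"cpu": "100m", "memory": "256Mi"}, {"memory": "1Gi"}))
```

A Burstable pod is evicted after BestEffort pods under node memory pressure, which is the protection the requests and limits buy even without reaching Guaranteed.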

⚠️ MAINTENANCE-WINDOW APPLY WARNING (FIX 2, #69). Applying the data-tier protection RESTARTS the single-replica stateful pods. A careless restart on a single-replica pod with live data is exactly what lost the mongo DB this session. Apply in a maintenance window, ONE POD AT A TIME, and verify the data is intact after each (per k8s/DATA-TIER-APPLY-RUNBOOK.md). Do NOT auto-apply.

Validation

NR alert test suite green: bash newrelic/tests/apply.test.sh → 49 passed / 0 failed, "all 103 JSON files parse" (was 98).
apply.sh --dry-run lists all 5 new alerts (glob-derived count stays correct per #71).
typos clean on all changed files. lychee is warn-only.
kubectl apply --dry-run=client clean on the FIX-2 manifests.
YAML/JSON/docs only — no code change. Operator-apply via apply.sh; no auto-apply, no kubectl apply performed.

🤖 Generated with Claude Code

…ploy/propagation/data-tier OOMKill)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot.

The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):

- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh); no auto-apply.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mastermanas805 enabled auto-merge (squash) June 11, 2026 05:09

mastermanas805 merged commit 430667b into master Jun 11, 2026
3 checks passed

mastermanas805 merged 1 commit into master from obs/reliability-alert-backstops-and-data-tier-protection
