obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill) #73
Merged
mastermanas805 merged 1 commit on Jun 11, 2026
…ploy/propagation/data-tier OOMKill)

A customer-facing failure (a failed backup plus a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot. The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):

- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh); no auto-apply.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
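The threshold shorthand in the commit message (for example "WARNING ABOVE 3/15m" versus "CRITICAL ABOVE 0/5m") can be read as "fire when the count of matching log lines in the trailing window exceeds the threshold". The sketch below illustrates that reading only. The `breaches` helper and the timestamps are hypothetical, and New Relic's real evaluation (aggregation windows, evaluation delay) is not modeled here.

```python
from datetime import datetime, timedelta


def breaches(event_times, now, window, threshold):
    """Return True when more than `threshold` events fall inside the trailing window."""
    start = now - window
    count = sum(1 for t in event_times if start < t <= now)
    return count > threshold


# Arbitrary reference time, chosen only for the example.
now = datetime(2026, 6, 11, 12, 0)

# Four non-auth backup failures within 15 minutes: more than 3, so the WARNING would fire.
backup_failures = [now - timedelta(minutes=m) for m in (1, 4, 9, 14)]

# One dead-lettered propagation within 5 minutes: more than 0, so the CRITICAL would fire.
dead_letters = [now - timedelta(minutes=2)]

print(breaches(backup_failures, now, timedelta(minutes=15), 3))      # True
print(breaches(backup_failures[:3], now, timedelta(minutes=15), 3))  # False
print(breaches(dead_letters, now, timedelta(minutes=5), 0))          # True
```

Under this reading, the "ABOVE 3" threshold tolerates a few transient dump failures that self-heal, while "ABOVE 0" treats a single occurrence as actionable.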
Why

A real customer-facing failure went UNDETECTED for hours until a customer emailed a screenshot: a customer backup failed AND a `mongodb` pod was OOMKilled (exit 137), losing a provisioned customer DB.

The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (`newrelic-prometheus-agent` / #72 authored it but it's operator-apply-pending), so every `FROM Metric` NR alert (~46 of them) queries an empty stream. Only `FROM Log` (the `newrelic-logging` Fluent Bit DaemonSet) + Synthetics are live. So the fixes must be LOG-based to work today.

FIX 1: Log-based silent-failure backstops (the "never blind again" set)
5 new `FROM Log` alerts under `newrelic/alerts/`, each keyed on a real worker log line verified against the code (file:line cited in every alert `description`), following the exact structure of `billing-charge-undeliverable.json`:

| Alert | Keyed on (verified log line) | Severity |
| --- | --- | --- |
| `customer-backup-failed-nonauth-log.json` | `jobs.customer_backup_runner.failed` w/ `reason != 'auth'`; worker `.../customer_backup_runner.go:729` (classifier `backupFailReason` `:617`) | WARNING ABOVE 3 / 15m (sustained; transient `dump` self-heals next run) |
| `backup-stuck-row-recovery-failed.json` | `jobs.customer_backup_runner.stuck_row_recovery_failed`; `:370` | CRITICAL ABOVE 0 / 10m |
| `deploy-failed-autopsy-log.json` | `jobs.deploy_failure_autopsy.captured` (bounded `reason` ∈ DeadlineExceeded/StartFailed/BuildFailed/BackoffLimitExceeded/ProgressDeadlineExceeded/OOMKilled/CrashLoopBackOff/ImagePullBackOff); `deploy_failure_autopsy.go:402`, pairs with `audit_log kind='deploy.failed'` | CRITICAL ABOVE 0 / 5m |
| `propagation-dead-lettered-log.json` | `jobs.propagation_runner.dead_lettered` (`:892`) + `…unknown_kind_dead_lettered` (`:985`) | CRITICAL ABOVE 0 / 5m |
| `data-tier-pod-oomkill-restart.json` | Image-native startup banner of each `instant-data` stateful pod reappearing = restart: postgres-customers `database system is ready to accept connections`, mongodb `Waiting for connections`, redis-provision `Ready to accept connections`, nats `Server is ready` (pinned images pgvector:pg16 / mongo:7 / redis:7-alpine / nats:2.10-alpine). FACET `k8s_label_app`. | CRITICAL ABOVE 0 / 5m |

Backup auth-credential drift (mongo SCRAM / redis WRONGPASS) is already covered by the pre-existing `customer-backup-failed.json` (`reason='auth'`, CRITICAL ABOVE 0 / 5m). Worker #106 extended `backupFailReason` to classify all 3 dump-tool auth dialects into `reason='auth'`, so that alert now fires on the new cases. Not duplicated.

All 6 documented in a new LOG-ALERTS section of `observability/METRICS-CATALOG.md` (source log line + severity + NRQL key per alert).

Flagged blind spots (couldn't fully close on a log line; need #72 / kube-events)
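The banner-matching idea behind `data-tier-pod-oomkill-restart.json`, and the reason it is only a restart detector, can be sketched as follows. This is a minimal illustration, not the alert's NRQL: the banner strings come from the alert table, while the `STARTUP_BANNERS` mapping, the `restarted_apps` helper, and the record shape (`k8s_label_app` and `message` keys) are assumptions made for the example.

```python
# Startup banners quoted in the alert table; the dict layout is hypothetical.
STARTUP_BANNERS = {
    "postgres-customers": "database system is ready to accept connections",
    "mongodb": "Waiting for connections",
    "redis-provision": "Ready to accept connections",
    "nats": "Server is ready",
}


def restarted_apps(log_records):
    """Return the sorted apps whose own startup banner appears in the records.

    Each record is assumed to be a dict with `k8s_label_app` and `message` keys.
    """
    seen = set()
    for record in log_records:
        banner = STARTUP_BANNERS.get(record.get("k8s_label_app"))
        if banner and banner in record.get("message", ""):
            seen.add(record["k8s_label_app"])
    return sorted(seen)


records = [
    {"k8s_label_app": "mongodb", "message": "Waiting for connections"},
    {"k8s_label_app": "nats", "message": "client connected"},
    # Not a data-tier app, so its banner-like line is ignored.
    {"k8s_label_app": "api", "message": "Server is ready"},
]
print(restarted_apps(records))  # ['mongodb']
```

Nothing in a banner line carries the previous container's exit code or termination reason, so a match looks the same after an OOMKill and after a planned rollout.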
- `data-tier-pod-oomkill-restart.json` is a restart detector, not a true OOMKill detector. It cannot read the exit code (137) or distinguish an involuntary OOMKill from a deliberate operator rollout: a planned restart or the maintenance-window apply will fire it once per pod (ack during the window). The authoritative `reason='OOMKilled'` event (`K8sContainerSample` / `kube_pod_container_status_last_terminated_reason`) needs kube-state-metrics + the NR Kubernetes/kube-events integration OR the #72 Prometheus pipeline (feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline), neither in prod. Until then the banner detector is the alarm, paired with the FIX-2 protection manifest that prevents the OOMKill.
- `FROM Metric` originals carry the per-`reason`/`kind` faceting; keep both.

FIX 2: Data-tier OOMKill PROTECTION manifest, ALREADY SHIPPED in #69 (not duplicated)
The eviction-protection work is already on `master` from #69 (11bc470), operator-apply-pending:

- PriorityClass `instant-data-critical` (value 1000000, globalDefault=false) + PodDisruptionBudgets (minAvailable: 1) for postgres-customers/mongodb/redis-provision/nats: `k8s/data/stateful-priority.yaml`.
- Per-pod memory/cpu requests+limits: `k8s/data/{mongodb,redis-provision,postgres-customers,nats}.yaml`.
- Maintenance-window apply runbook (`k8s/DATA-TIER-APPLY-RUNBOOK.md`): apply one pod at a time, patch `priorityClassName` in the same roll, verify data intact after each restart, rollback steps.

I validated that these manifests pass `kubectl apply --dry-run=client` cleanly but did not modify or re-apply them (per instruction not to touch the #69 artifacts; #69 is infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook).

Validation
- `bash newrelic/tests/apply.test.sh` → 49 passed / 0 failed, "all 103 JSON files parse" (was 98).
- `apply.sh --dry-run` lists all 5 new alerts (glob-derived count stays correct per #71, fix(rbac,nr-test): grant worker jobs/pods/events read; resolve NR test conflict).
- `typos` clean on all changed files. `lychee` is warn-only.
- `kubectl apply --dry-run=client` clean on the FIX-2 manifests.
- No code change; YAML/JSON/docs only; operator-apply via `apply.sh`; no auto-apply, no `kubectl apply` performed.

🤖 Generated with Claude Code
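The "all 103 JSON files parse" check reported under Validation can be approximated locally with a short script. This is a sketch under stated assumptions, not the contents of `newrelic/tests/apply.test.sh`: the `unparseable_json` helper is hypothetical, and the demo runs against a temporary directory rather than `newrelic/alerts/`.

```python
import json
import tempfile
from pathlib import Path


def unparseable_json(root):
    """Return (total, bad): the number of *.json files under root and the names that fail to parse."""
    files = sorted(Path(root).rglob("*.json"))
    bad = []
    for path in files:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.name)
    return len(files), bad


# Demo on a throwaway directory with one valid and one truncated file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.json").write_text('{"name": "example"}', encoding="utf-8")
    Path(tmp, "broken.json").write_text('{"name": ', encoding="utf-8")
    print(unparseable_json(tmp))  # (2, ['broken.json'])
```

Pointing the helper at the alerts directory would return the file count plus an empty list when every alert definition parses.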