Skip to content

obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill)#73

Merged
mastermanas805 merged 1 commit into
masterfrom
obs/reliability-alert-backstops-and-data-tier-protection
Jun 11, 2026
Merged

obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill)#73
mastermanas805 merged 1 commit into
masterfrom
obs/reliability-alert-backstops-and-data-tier-protection

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

Why

A real customer-facing failure — a customer backup failed AND a mongodb pod was OOMKilled (exit 137), losing a provisioned customer DB — went UNDETECTED for hours until a customer emailed a screenshot.

The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 authored it but it's operator-apply-pending), so every FROM Metric NR alert (~46 of them) queries an empty stream. Only FROM Log (the newrelic-logging Fluent Bit DaemonSet) + Synthetics are live. So the fixes must be LOG-based to work today.

FIX 1 — Log-based silent-failure backstops (the "never blind again" set)

5 new FROM Log alerts under newrelic/alerts/, each keyed on a real worker log line verified against the code (file:line cited in every alert description), following the exact structure of billing-charge-undeliverable.json:

Alert Source log line (verified) Severity / threshold
customer-backup-failed-nonauth-log.json jobs.customer_backup_runner.failed w/ reason != 'auth'worker .../customer_backup_runner.go:729 (classifier backupFailReason :617) WARNING ABOVE 3 / 15m (sustained — a single transient dump self-heals next run)
backup-stuck-row-recovery-failed.json jobs.customer_backup_runner.stuck_row_recovery_failed:370 CRITICAL ABOVE 0 / 10m (any occurrence = bug)
deploy-failed-autopsy-log.json jobs.deploy_failure_autopsy.captured (bounded reason ∈ DeadlineExceeded/StartFailed/BuildFailed/BackoffLimitExceeded/ProgressDeadlineExceeded/OOMKilled/CrashLoopBackOff/ImagePullBackOff) — deploy_failure_autopsy.go:402, pairs with audit_log kind='deploy.failed' CRITICAL ABOVE 0 / 5m
propagation-dead-lettered-log.json jobs.propagation_runner.dead_lettered (:892) + …unknown_kind_dead_lettered (:985) CRITICAL ABOVE 0 / 5m
data-tier-pod-oomkill-restart.json image-native startup banner of each instant-data stateful pod reappearing = restart — postgres-customers database system is ready to accept connections, mongodb Waiting for connections, redis-provision Ready to accept connections, nats Server is ready (pinned images pgvector:pg16 / mongo:7 / redis:7-alpine / nats:2.10-alpine). FACET k8s_label_app. CRITICAL ABOVE 0 / 5m per pod

Backup auth-credential drift (mongo SCRAM / redis WRONGPASS) is already covered by the pre-existing customer-backup-failed.json (reason='auth', CRITICAL ABOVE 0 / 5m) — worker #106 extended backupFailReason to classify all 3 dump-tool auth dialects into reason='auth', so that alert now fires on the new cases. Not duplicated.

All 6 documented in a new LOG-ALERTS section of observability/METRICS-CATALOG.md (source log line + severity + NRQL key per alert).

Flagged blind spots (couldn't fully close on a log line — need #72 / kube-events)

  • data-tier-pod-oomkill-restart.json is a restart detector, not a true OOMKill detector. It cannot read the exit code (137) or distinguish an involuntary OOMKill from a deliberate operator rollout — a planned restart / the maintenance-window apply will fire it once per pod (ack during the window). The authoritative reason='OOMKilled' event (K8sContainerSample / kube_pod_container_status_last_terminated_reason) needs kube-state-metrics + the NR Kubernetes/kube-events integration OR the feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline #72 Prometheus pipeline, neither in prod. Until then the banner detector is the alarm, paired with the FIX-2 protection manifest that prevents the OOMKill.
  • The deploy + propagation LOG alerts are backstops, not replacements — when feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline #72 lands, the FROM Metric originals carry the per-reason/kind faceting; keep both.

FIX 2 — Data-tier OOMKill PROTECTION manifest: ALREADY SHIPPED in #69 (not duplicated)

The eviction-protection work is already on master from #69 (11bc470), operator-apply-pending:

  • PriorityClass instant-data-critical (value 1000000, globalDefault=false) + PodDisruptionBudgets (minAvailable: 1) for postgres-customers/mongodb/redis-provision/nats — k8s/data/stateful-priority.yaml.
  • Memory + CPU requests/limits on all 4 pods (e.g. mongodb 256Mi/100m req, 1Gi limit → Burstable, no longer BestEffort) — k8s/data/{mongodb,redis-provision,postgres-customers,nats}.yaml.
  • The maintenance-window apply runbook (k8s/DATA-TIER-APPLY-RUNBOOK.md): apply one pod at a time, patch priorityClassName in the same roll, verify data intact after each restart, rollback steps. I validated these manifests kubectl apply --dry-run=client clean but did not modify or re-apply them (per instruction not to touch infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook #69 artifacts).

⚠️ MAINTENANCE-WINDOW APPLY WARNING (FIX 2, #69). Applying the data-tier protection RESTARTS the single-replica stateful pods. A careless restart on a single-replica pod with live data is exactly what lost the mongo DB this session. Apply in a maintenance window, ONE POD AT A TIME, and verify the data is intact after each (per k8s/DATA-TIER-APPLY-RUNBOOK.md). Do NOT auto-apply.

Validation

  • NR alert test suite green: bash newrelic/tests/apply.test.sh49 passed / 0 failed, "all 103 JSON files parse" (was 98).
  • apply.sh --dry-run lists all 5 new alerts (glob-derived count stays correct per fix(rbac,nr-test): grant worker jobs/pods/events read; resolve NR test conflict #71).
  • typos clean on all changed files. lychee is warn-only.
  • kubectl apply --dry-run=client clean on the FIX-2 manifests.
  • YAML/JSON/docs only — no code change. Operator-apply via apply.sh; no auto-apply, no kubectl apply performed.

🤖 Generated with Claude Code

…ploy/propagation/data-tier OOMKill)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a
provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot.
The metric-based alerts that should have caught these are INERT: prod has NO
Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending),
so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-
logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines
(verified against worker code, file:line cited in each description):

- customer-backup-failed-nonauth-log.json — jobs.customer_backup_runner.failed
  reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals).
  Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json — stuck_row_recovery_failed, CRITICAL
  ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker
  #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json — deploy_failure_autopsy.captured, CRITICAL
  ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts
  (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json — propagation_runner.dead_lettered +
  unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert
  propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json — image-native startup banner of each
  instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET
  k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the
  customer DB this session. Flagged blind spot: a banner detector cannot read
  exitCode 137 or distinguish OOMKill from a planned rollout — the authoritative
  reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md
with the verified source log line + severity + NRQL key per alert, and the two
acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION — PriorityClass instant-data-critical, PDBs,
per-pod memory/cpu requests+limits, maintenance-window apply runbook) already
landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml +
k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl
--dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs
only; operator-apply (apply.sh) — no auto-apply.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 11, 2026 05:09
@mastermanas805 mastermanas805 merged commit 430667b into master Jun 11, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant