feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline#72
Merged
…pipeline

Prod (do-nyc3-instant-prod) shipped logs + APM + OTLP traces but had NO metrics ingestion pipeline: no Prometheus scraper pulled the services' /metrics. As a result ~46 New Relic alert conditions of the form `FROM Metric WHERE metricName LIKE 'instant_%'` were INERT: they queried an empty Metric stream. razorpay-webhook-sig-fail, the deploy-silent-failure triad, provisioner-circuit-open, pg-pool-saturation, email-delivery-ratio-low, auth/payment probes: all dark.

Ship the official newrelic-prometheus-agent (configurator initContainer + Prometheus --agent) as one stateless 2-container pod in the `newrelic` ns. It scrapes the three backend /metrics via Kubernetes pod SD and remote-writes the instant_* / provisioner_* series to NR's US Prometheus endpoint, lighting up every FROM Metric alert at once.

Verified-live scrape targets (2026-06-11):
- instant-api podIP:8080/metrics (ns instant): Bearer required
- instant-worker podIP:8091/metrics (ns instant-infra): open
- instant-provisioner podIP:8092/metrics (ns instant-infra): open

Pod SD (not static Service-DNS) because the live topology rules out static targets: instant-worker has no Service; instant-provisioner's Service exposes only :50051 (gRPC), not the :8092 metrics sidecar. The METRICS_TOKEN bearer is sent to all three (required by api, ignored by worker/provisioner) so the config stays correct if they gain gating later.

Secrets referenced by name only (never inlined): NEW_RELIC_LICENSE_KEY copied from instant-secrets/instant-infra-secrets, METRICS_TOKEN copied from the api Deployment's inline env value, into a `newrelic`-ns secret. Template ships with CHANGE_ME placeholders per repo convention.

Chosen over nri-bundle (too heavy for the 6×2cpu/4GB pool) and self-hosted Prometheus (another stateful pod to back up).

Validation:
- kubectl apply --dry-run=client passes (6 objects)
- kubeconform -strict -ignore-missing-schemas (CI validate.yml): 6/6 valid
- full-repo kubeconform: 121 resources, 0 invalid
- promtool check config on the rendered scrape config (real prom v3.12.0): SUCCESS

Docs: OBSERVABILITY-PIPELINE.md (apply + the post-apply NRQL verification gate + the high-value alerts that flip live), an APPLY-CHECKLIST.md entry, and a METRICS-CATALOG.md prerequisite note (the pipeline is a hard dependency of every Metric-based alert).

No auto-apply: operator-gated (CLAUDE.md rule 15).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request on Jun 11, 2026
…ploy/propagation/data-tier OOMKill) (#73)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot. The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):
- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh), no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
The gap this closes
Prod (`do-nyc3-instant-prod`) shipped logs + APM + OTLP traces to New Relic but had no metrics ingestion pipeline: nothing scraped the services' `/metrics`. Verified read-only this session:

Every backend emits a rich `instant_*` / `provisioner_*` Prometheus surface, but with no scraper the `Metric` event type was empty in NR. So ~46 NR alert conditions of the form `FROM Metric WHERE metricName LIKE 'instant_%'` were INERT, querying a stream that didn't exist.

What ships

The official newrelic-prometheus-agent (newrelic-prometheus-configurator `2.11.1` initContainer + Prometheus `v3.12.0` `--agent`) as one stateless 2-container pod in the `newrelic` namespace. It scrapes the three backend `/metrics` and remote-writes `instant_*` series to NR's US Prometheus endpoint (`metric-api.newrelic.com/prometheus/v1/write`), lighting up every `FROM Metric` alert at once.

Chosen over the full nri-bundle (too heavy for the thin 6×2cpu/4GB pool) and over self-hosted Prometheus (another stateful pod to back up).
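
For orientation, the scrape-and-remote-write flow can be sketched in native Prometheus agent-mode config. This is an illustration only, not the manifest in this PR: the shipped ConfigMap uses the configurator's own schema, which is rendered into a config of roughly this shape. The job name and both credential file paths are hypothetical, and only one of the three targets is shown.

```yaml
# Sketch only: native Prometheus config, NOT the configurator schema in the PR's ConfigMap.
scrape_configs:
  - job_name: instant-worker                         # hypothetical job name
    metrics_path: /metrics
    authorization:
      type: Bearer
      credentials_file: /etc/secrets/METRICS_TOKEN   # hypothetical mount path
    kubernetes_sd_configs:
      - role: pod                                    # pod SD, since instant-worker has no Service
        namespaces:
          names: [instant-infra]
    relabel_configs:
      # Keep only pods labelled app=instant-worker.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: instant-worker
        action: keep
      # Override the scrape address to podIP:8091.
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: "(.+)"
        target_label: __address__
        replacement: "$1:8091"
        action: replace

remote_write:
  - url: https://metric-api.newrelic.com/prometheus/v1/write
    authorization:
      type: Bearer
      credentials_file: /etc/secrets/NEW_RELIC_LICENSE_KEY   # hypothetical mount path
```

Reading both credentials from files keeps the token and license key out of the rendered config, in line with the by-name secret wiring used here.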
The 3 verified scrape targets (live, 2026-06-11)
| Service | Namespace, selector | Port/path | `/metrics` auth |
|---|---|---|---|
| `instant-api` | `instant`, `app=instant-api` | `8080/metrics` | Bearer (`METRICS_TOKEN`) |
| `instant-worker` | `instant-infra`, `app=instant-worker` | `8091/metrics` | open |
| `instant-provisioner` | `instant-infra`, `app=instant-provisioner` | `8092/metrics` | open |

Pod service discovery, not static Service-DNS (deviation from the brief, flagged for the operator): the live topology rules out static targets. `instant-worker` has no Service; `instant-provisioner`'s Service exposes only :50051 (gRPC), not the :8092 metrics sidecar. Pod SD discovers all three by `app` label and scrapes podIP:port directly, surviving pod evictions.

The `METRICS_TOKEN` bearer is sent to all three (required by api, harmlessly ignored by worker/provisioner's bare `promhttp` handlers) so the config stays correct if they gain gating later.

Secret wiring (referenced by name, never inlined)
One secret in the `newrelic` ns, `newrelic-prometheus-agent-secrets`, with copies of values that already exist:

- `NEW_RELIC_LICENSE_KEY`: from `instant-secrets` (ns `instant`) + `instant-infra-secrets` (ns `instant-infra`); the same INGEST key the logging DaemonSet + go-agents use
- `METRICS_TOKEN`: from the `value:` env on all three Deployments (not a k8s secret key today); the same token across api/worker/provisioner

The manifest ships a Secret template with `CHANGE_ME` (repo convention). The operator populates real values out-of-band; no secret value is committed.

Validation (no real apply: operator-gated, CLAUDE.md rule 15)
- `kubectl apply --dry-run=client` → 6 objects "created (dry run)" ✓
- `kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0` (exact CI `validate.yml`) → 6/6 valid ✓
- `promtool check config` on the rendered scrape config (real Prometheus v3.12.0) → SUCCESS ✓ (validates the k8s pod SD + bearer `credentials_file` + `__address__` port-override relabels)

Post-apply verification gate
Within ~5 min of the pod running, run in NR:
Non-zero → the 46 `FROM Metric` alerts go live. Highest-value ones that flip from inert to armed: `razorpay-webhook-sig-fail`, `deploy-job-failed-detected`, `deploy-runtime-failed-detected`, `provisioner-circuit-open`, `pg-pool-saturation`, `email-delivery-ratio-low`, `auth-probe-fail`, `payment-probe-fail`, `redis-maxmemory-regrade-failed`.

Files
- `k8s/newrelic-prometheus-agent.yaml`: SA + ClusterRole/Binding (least-priv pod SD read) + ConfigMap (configurator schema) + Deployment + Secret template
- `OBSERVABILITY-PIPELINE.md`: apply + verify runbook, secret prerequisites, the 46-alert gate
- `k8s/APPLY-CHECKLIST.md`: apply entry (incl. the placeholder-Secret clobber hazard)
- `observability/METRICS-CATALOG.md`: prerequisite note (the pipeline is a hard dependency of every Metric-based alert)

Region assumption (operator to confirm)
NR datacenter assumed US (logging → `log-api.newrelic.com`, go-agents → `otlp.nr-data.net`, both US). The configurator auto-derives US vs EU from the license-key prefix, so no manifest change is needed if later confirmed EU.

🤖 Generated with Claude Code