feat(observability): add newrelic-prometheus-agent metrics ingestion pipeline by mastermanas805 · Pull Request #72 · InstaNode-dev/infra

mastermanas805 · 2026-06-11T04:52:09Z

The gap this closes

Prod (do-nyc3-instant-prod) shipped logs + APM + OTLP traces to New Relic but had no metrics ingestion pipeline — nothing scraped the services' /metrics. Verified read-only this session:

kubectl get pods -A | grep -iE 'prometheus|nri|otel-collector'   # → empty

Every backend emits a rich instant_* / provisioner_* Prometheus surface, but with no scraper the Metric event type was empty in NR. So ~46 NR alert conditions of the form FROM Metric WHERE metricName LIKE 'instant_%' were INERT — querying a stream that didn't exist:

grep -rl "FROM Metric" newrelic/alerts/ | wc -l   # → 46

What ships

The official newrelic-prometheus-agent (newrelic-prometheus-configurator 2.11.1 initContainer + Prometheus v3.12.0 --agent) as one stateless 2-container pod in the newrelic namespace. It scrapes the /metrics endpoint of all three backends and remote-writes instant_* series to NR's US Prometheus endpoint (metric-api.newrelic.com/prometheus/v1/write), lighting up every FROM Metric alert at once.

Chosen over the full nri-bundle (too heavy for the thin 6×2cpu/4GB pool) and over self-hosted Prometheus (another stateful pod to back up).

The 3 verified scrape targets (live, 2026-06-11)

| Service | Reached via | Port / path | `/metrics` auth |
|---|---|---|---|
| `instant-api` | pod SD, ns `instant`, `app=instant-api` | `8080` `/metrics` | Bearer required (401 without `METRICS_TOKEN`) |
| `instant-worker` | pod SD, ns `instant-infra`, `app=instant-worker` | `8091` `/metrics` | open (200) |
| `instant-provisioner` | pod SD, ns `instant-infra`, `app=instant-provisioner` | `8092` `/metrics` | open (200) |

Pod service discovery, not static Service-DNS (deviation from the brief, flagged for the operator): the live topology rules out static targets — instant-worker has no Service; instant-provisioner's Service exposes only :50051 (gRPC), not the :8092 metrics sidecar. Pod SD discovers all three by app label and scrapes podIP:port directly, surviving pod evictions.
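
To make the pod SD behaviour concrete, here is a minimal Python sketch of what a label-based keep step plus the `__address__` port override amounts to. It is an illustration only, not the shipped configurator config: the `Pod` type, the `TARGETS` table, the `scrape_addresses` helper and the sample pod IPs are all hypothetical. Only the namespaces, app labels and ports come from the scrape-target table.

```python
from dataclasses import dataclass

# (namespace, app label) -> metrics port, copied from the scrape-target table.
TARGETS = {
    ("instant", "instant-api"): 8080,
    ("instant-infra", "instant-worker"): 8091,
    ("instant-infra", "instant-provisioner"): 8092,
}


@dataclass(frozen=True)
class Pod:
    namespace: str
    app: str  # value of the pod's `app` label
    ip: str   # podIP (hypothetical sample values below)


def scrape_addresses(pods):
    """Keep pods matching a known (namespace, app) pair and
    rewrite each one to a podIP:metrics_port scrape address."""
    addresses = []
    for pod in pods:
        port = TARGETS.get((pod.namespace, pod.app))
        if port is None:
            continue  # plays the role of a relabel `keep`: drop non-matching pods
        addresses.append(f"{pod.ip}:{port}")  # plays the role of the port override
    return addresses


pods = [
    Pod("instant", "instant-api", "10.0.0.11"),
    Pod("instant-infra", "instant-worker", "10.0.0.12"),
    Pod("instant-infra", "instant-provisioner", "10.0.0.13"),
    Pod("kube-system", "coredns", "10.0.0.2"),  # unrelated pod, dropped
]
print(scrape_addresses(pods))
```

Because targets are derived from whatever pods currently exist, a rescheduled pod with a new IP is picked up on the next discovery pass without any config change, which is the property static targets lack here.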

The METRICS_TOKEN bearer is sent to all three (required by api, harmlessly ignored by worker/provisioner's bare promhttp handlers) so the config stays correct if they gain gating later.

Secret wiring (referenced by name, never inlined)

One secret in the newrelic ns, newrelic-prometheus-agent-secrets, with copies of values that already exist:

| Key | Live source |
|---|---|
| `NEW_RELIC_LICENSE_KEY` | `instant-secrets` (ns `instant`) + `instant-infra-secrets` (ns `instant-infra`); same INGEST key the logging DaemonSet + go-agents use |
| `METRICS_TOKEN` | an inline `value:` env on all three Deployments (not a k8s secret key today); same token across api/worker/provisioner |

The manifest ships a Secret template with CHANGE_ME (repo convention). The operator populates real values out-of-band; no secret value is committed.
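
A pre-apply guard against shipping the template with its placeholders still in place could look like the sketch below. This is a hypothetical helper, not something included in the PR: `find_placeholders` and the sample values are invented for illustration, and only the two key names and the CHANGE_ME convention come from the description above.

```python
PLACEHOLDER = "CHANGE_ME"
REQUIRED_KEYS = ("NEW_RELIC_LICENSE_KEY", "METRICS_TOKEN")


def find_placeholders(string_data):
    """Return the required keys that are missing, empty,
    or still contain the CHANGE_ME placeholder."""
    bad = []
    for key in REQUIRED_KEYS:
        value = string_data.get(key, "")
        if not value or PLACEHOLDER in value:
            bad.append(key)
    return bad


# The committed template, as described: every value is a placeholder.
template = {"NEW_RELIC_LICENSE_KEY": "CHANGE_ME", "METRICS_TOKEN": "CHANGE_ME"}
print(find_placeholders(template))
```

An operator script could refuse to apply whenever the returned list is non-empty, so a placeholder Secret never overwrites a populated one.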

Validation (no real apply — operator-gated, CLAUDE.md rule 15)

- kubectl apply --dry-run=client → 6 objects "created (dry run)" ✓
- kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0 (exact CI validate.yml) → 6/6 valid ✓
- Full-repo kubeconform → 121 resources, 0 invalid, 0 errors ✓
- promtool check config on the rendered scrape config (real Prometheus v3.12.0) → SUCCESS ✓ (validates the k8s pod SD + bearer credentials_file + __address__ port-override relabels)

Post-apply verification gate

Within ~5 min of the pod running, run in NR:

FROM Metric SELECT count(*) WHERE metricName LIKE 'instant_%' SINCE 10 minutes ago

Non-zero → the 46 FROM Metric alerts go live. Highest-value ones that flip from inert to armed: razorpay-webhook-sig-fail, deploy-job-failed-detected, deploy-runtime-failed-detected, provisioner-circuit-open, pg-pool-saturation, email-delivery-ratio-low, auth-probe-fail, payment-probe-fail, redis-maxmemory-regrade-failed.
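
If the gate is ever scripted rather than run by hand, the pass/fail decision is small enough to sketch. The snippet below assumes the query result arrives as a list of rows with a `count` field; that shape, and the `gate_passed` helper, are assumptions made for illustration rather than anything defined in this PR. How the rows are fetched is left out on purpose.

```python
# The verification query from the gate above, unchanged.
GATE_NRQL = (
    "FROM Metric SELECT count(*) "
    "WHERE metricName LIKE 'instant_%' SINCE 10 minutes ago"
)


def gate_passed(rows):
    """Return True when the query reported at least one instant_* data point.

    `rows` is assumed to look like [{"count": 1234}]; an empty result
    or a zero count both mean the Metric stream is still empty.
    """
    if not rows:
        return False
    return rows[0].get("count", 0) > 0


print(gate_passed([{"count": 0}]))
print(gate_passed([{"count": 1234}]))
```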

Accuracy correction: backup-stale-36h and customer-backup-failed are FROM Log alerts (already firing), not metric alerts. The brief grouped backup-stale with the metric set, but that alert reads the backup CronJob's log line. Noted in the docs.

Files

- k8s/newrelic-prometheus-agent.yaml: SA + ClusterRole/Binding (least-priv pod SD read) + ConfigMap (configurator schema) + Deployment + Secret template
- OBSERVABILITY-PIPELINE.md: apply + verify runbook, secret prerequisites, the 46-alert gate
- k8s/APPLY-CHECKLIST.md: apply entry (incl. the placeholder-Secret clobber hazard)
- observability/METRICS-CATALOG.md: prerequisite note that the pipeline is a hard dependency of every Metric-based alert

Region assumption (operator to confirm)

NR datacenter assumed US (logging → log-api.newrelic.com, go-agents → otlp.nr-data.net, both US). The configurator auto-derives US vs EU from the license-key prefix, so no manifest change is needed if later confirmed EU.
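
The prefix-based region choice can be pictured with the sketch below. It is not the configurator's code: the `remote_write_url` helper and the sample keys are hypothetical, and both the rule that EU license keys begin with `eu` and the EU endpoint hostname are assumptions for the operator to confirm against New Relic's documentation. The US URL is the endpoint named earlier in this description.

```python
US_WRITE_URL = "https://metric-api.newrelic.com/prometheus/v1/write"
# Assumption: the EU endpoint follows the usual `.eu.` hostname pattern.
EU_WRITE_URL = "https://metric-api.eu.newrelic.com/prometheus/v1/write"


def remote_write_url(license_key):
    """Pick the remote-write endpoint from the license-key prefix.

    Assumption: EU-region license keys start with "eu"; any other
    key is treated as US.
    """
    if license_key.lower().startswith("eu"):
        return EU_WRITE_URL
    return US_WRITE_URL


# Hypothetical, obviously fake keys used only to exercise both branches.
print(remote_write_url("0123456789FAKEKEY"))
print(remote_write_url("eu01xxFAKEKEY"))
```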

🤖 Generated with Claude Code

…pipeline

Prod (do-nyc3-instant-prod) shipped logs + APM + OTLP traces but had NO metrics ingestion pipeline: no Prometheus scraper pulled the services' /metrics. As a result ~46 New Relic alert conditions of the form `FROM Metric WHERE metricName LIKE 'instant_%'` were INERT: they queried an empty Metric stream. razorpay-webhook-sig-fail, the deploy-silent-failure triad, provisioner-circuit-open, pg-pool-saturation, email-delivery-ratio-low, auth/payment probes: all dark.

Ship the official newrelic-prometheus-agent (configurator initContainer + Prometheus --agent) as one stateless 2-container pod in the `newrelic` ns. It scrapes the three backend /metrics via Kubernetes pod SD and remote-writes the instant_* / provisioner_* series to NR's US Prometheus endpoint, lighting up every FROM Metric alert at once.

Verified-live scrape targets (2026-06-11):
- instant-api podIP:8080/metrics (ns instant): Bearer required
- instant-worker podIP:8091/metrics (ns instant-infra): open
- instant-provisioner podIP:8092/metrics (ns instant-infra): open

Pod SD (not static Service-DNS) because the live topology rules out static targets: instant-worker has no Service; instant-provisioner's Service exposes only :50051 (gRPC), not the :8092 metrics sidecar. The METRICS_TOKEN bearer is sent to all three (required by api, ignored by worker/provisioner) so the config stays correct if they gain gating later.

Secrets referenced by name only (never inlined): NEW_RELIC_LICENSE_KEY copied from instant-secrets/instant-infra-secrets, METRICS_TOKEN copied from the api Deployment's inline env value, into a `newrelic`-ns secret. Template ships with CHANGE_ME placeholders per repo convention.

Chosen over nri-bundle (too heavy for the 6×2cpu/4GB pool) and self-hosted Prometheus (another stateful pod to back up).

Validation: kubectl apply --dry-run=client passes (6 objects); kubeconform -strict -ignore-missing-schemas (CI validate.yml): 6/6 valid; full-repo kubeconform: 121 resources, 0 invalid; promtool check config on the rendered scrape config (real prom v3.12.0): SUCCESS.

Docs: OBSERVABILITY-PIPELINE.md (apply + the post-apply NRQL verification gate + the high-value alerts that flip live), an APPLY-CHECKLIST.md entry, and a METRICS-CATALOG.md prerequisite note (the pipeline is a hard dependency of every Metric-based alert).

No auto-apply: operator-gated (CLAUDE.md rule 15).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ploy/propagation/data-tier OOMKill) (#73)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot. The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):
- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh), no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

mastermanas805 enabled auto-merge (squash) June 11, 2026 04:52

mastermanas805 merged commit 1d73c34 into master Jun 11, 2026
3 checks passed

mastermanas805 mentioned this pull request Jun 11, 2026

obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill) #73

Merged

mastermanas805 merged 1 commit into master from obs/newrelic-prometheus-agent-pipeline
