InstaNode-dev · mastermanas805 · Jun 11, 2026 · Jun 11, 2026
diff --git a/OBSERVABILITY-PIPELINE.md b/OBSERVABILITY-PIPELINE.md
@@ -0,0 +1,264 @@
+# Observability Pipeline — New Relic Prometheus Agent
+
+> **Codified 2026-06-11.** Closes the single biggest observability gap on the
+> platform: there was **no metrics ingestion pipeline** in
+> `do-nyc3-instant-prod`. Until `k8s/newrelic-prometheus-agent.yaml` is
+> applied, **~46 New Relic alert conditions are INERT.**
+
+---
+
+## The gap
+
+The prod cluster shipped three of the four telemetry streams to New Relic;
+metrics was the one missing:
+
+| Stream | How | Status |
+|---|---|---|
+| **Logs** | `newrelic-logging` Fluent Bit DaemonSet (ns `newrelic`) → `log-api.newrelic.com` | working |
+| **APM** | Go `newrelic` agent in each service | working |
+| **Traces** | OTLP exporter → `otlp.nr-data.net:4317` | working |
+| **Metrics** | *(nothing)* | **MISSING** |
+
+There was **no Prometheus scraper** pulling the services' `/metrics`. Verify:
+
+```bash
+kubectl get pods -A | grep -iE 'prometheus|nri|otel-collector'   # → empty
+```
+
+Every service emits a rich `instant_*` / `provisioner_*` Prometheus surface
+(circuit-breaker state, conversion funnel, deploy autopsy counters, pg-pool
+gauges, billing-reconciler counters, …) on `/metrics`, but nothing scraped it.
+So the entire `Metric` event type was empty in NR, and **every alert of the
+form `FROM Metric WHERE metricName LIKE 'instant_%'` queried a stream that did
+not exist.** 46 alert files under `newrelic/alerts/` are `FROM Metric`:
+
+```bash
+grep -rl "FROM Metric" newrelic/alerts/ | wc -l   # → 46
+```
+
+Log-based alerts (`FROM Log`) were unaffected. Notably, `backup-stale-36h` and
+`customer-backup-failed` are **log-based** and were already live. (The brief
+listed backup-stale as a metric alert; it is not: it reads the backup
+CronJob's log line. The genuinely inert high-value alerts are listed below.)
+
+---
+
+## What this ships
+
+`k8s/newrelic-prometheus-agent.yaml` — the **official newrelic-prometheus-agent**
+(newrelic-prometheus-configurator + Prometheus in `--agent` mode) as one
+stateless 2-container pod in the `newrelic` namespace:
+
+1. **initContainer `configurator`** (`newrelic/newrelic-prometheus-configurator:2.11.1`)
+   — reads the ConfigMap's configurator-schema `config.yaml`, injects the NR
+   license key from env, and renders a standard Prometheus scrape config into a
+   shared `emptyDir`.
+2. **container `prometheus`** (`quay.io/prometheus/prometheus:v3.12.0 --agent`)
+   — reads the rendered config, keeps a 30-min WAL only (no PVC), and
+   remote-writes to `https://metric-api.newrelic.com/prometheus/v1/write`
+   (**US datacenter**, auto-derived from the license-key region; the existing
+   logging DaemonSet and go-agents already target US endpoints:
+   `log-api.newrelic.com` and `otlp.nr-data.net`).
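+
+For orientation, the remote-write stanza in the rendered Prometheus config
+would look roughly like the fragment below. This is a sketch, not the shipped
+output: the URL comes from this document, while the `authorization` layout is
+an assumption about how the configurator renders the license key, and
+`CHANGE_ME` stands in for the injected value.
+
+```yaml
+remote_write:
+  - url: https://metric-api.newrelic.com/prometheus/v1/write
+    authorization:
+      # Placeholder only; the configurator injects the real license key.
+      credentials: CHANGE_ME
+```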
+
+**Why this shape** (not nri-bundle, not self-hosted Prometheus): the prod pool
+is thin (6×2cpu/4GB). We need exactly the three app `/metrics` endpoints, not
+full cluster-state monitoring. Agent mode forwards everything to NR, where the
+dashboards + alerts already live, so there is no stateful pod to back up.
+
+### The three scrape targets (verified live 2026-06-11)
+
+| Service | Reached via | Port | `/metrics` auth | Sample series |
+|---|---|---|---|---|
+| `instant-api` | pod SD, ns `instant`, `app=instant-api` | **8080** | **Bearer required** (401 without `METRICS_TOKEN`) | `instant_circuit_breaker_*`, `instant_conversion_funnel_*` |
+| `instant-worker` | pod SD, ns `instant-infra`, `app=instant-worker` | **8091** | open (200) | `instant_auth_probe_*`, `instant_billing_reconciler_*` |
+| `instant-provisioner` | pod SD, ns `instant-infra`, `app=instant-provisioner` | **8092** | open (200) | `instant_provisioner_circuit_*`, `instant_provisioner_drop_*` |
+
+> **Why Kubernetes pod service discovery, not static Service-DNS targets:**
+> the live topology rules out static targets. `instant-worker` has **no
+> Service** in prod (it's a background-job Deployment) — `/metrics` on :8091 is
+> reachable only by pod IP. `instant-provisioner`'s Service exposes **only
+> :50051** (gRPC); the :8092 metrics sidecar is not in the Service. Pod SD
+> (role: pod) discovers all three by `app` label and scrapes podIP:port
+> directly, surviving pod evictions/reschedules. This is the only deviation
+> from the original brief's "static targets" assumption — flagged here for the
+> operator.
+
+> **Auth wiring:** the `METRICS_TOKEN` bearer is sent to **all three** targets
+> (mounted from the secret as a file, referenced via Prometheus
+> `credentials_file`). api **requires** it (401 otherwise); worker +
+> provisioner serve a bare `promhttp` handler that **ignores** the header — so
+> the config stays correct if they gain gating later. The token is a single
+> shared value across all three services today.
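+
+A minimal sketch of what one rendered Prometheus scrape job could look like
+under the pod-SD and auth wiring described above. It is an illustration, not
+the shipped config: the relabel rules and the
+`/etc/prometheus/secrets/METRICS_TOKEN` mount path are assumptions; only the
+namespace, `app` label, port, and `credentials_file` mechanism come from this
+document.
+
+```yaml
+scrape_configs:
+  - job_name: instant-worker
+    metrics_path: /metrics
+    authorization:
+      type: Bearer
+      # Hypothetical mount path for the METRICS_TOKEN secret key.
+      credentials_file: /etc/prometheus/secrets/METRICS_TOKEN
+    kubernetes_sd_configs:
+      - role: pod
+        namespaces:
+          names: [instant-infra]
+    relabel_configs:
+      # Keep only pods labelled app=instant-worker.
+      - source_labels: [__meta_kubernetes_pod_label_app]
+        regex: instant-worker
+        action: keep
+      # Point the scrape address at podIP:8091 directly.
+      - source_labels: [__meta_kubernetes_pod_ip]
+        regex: (.+)
+        target_label: __address__
+        replacement: "${1}:8091"
+```
+
+The other two jobs would differ only in namespace, `app` label value, and
+port (8080 for `instant-api`, 8092 for `instant-provisioner`).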
+
+---
+
+## Secret prerequisites
+
+The pod references **one secret** in the `newrelic` namespace,
+`newrelic-prometheus-agent-secrets`, with two keys. The agent will not start
+without both. **The secret holds copies of live values that already exist
+elsewhere** — it does NOT introduce new credentials:
+
+| Key | Source of truth (live) | Why a copy is needed |
+|---|---|---|
+| `NEW_RELIC_LICENSE_KEY` | `instant-secrets` (ns `instant`) **and** `instant-infra-secrets` (ns `instant-infra`) — same INGEST key the logging DaemonSet + go-agents use | k8s secrets are namespace-scoped; the agent runs in `newrelic`, so it needs a copy there |
+| `METRICS_TOKEN` | an **inline `value:` env** on all three Deployments (NOT a k8s secret key today) — same token on api/worker/provisioner | the scraper needs the bearer to read api `/metrics` |
+
+> The repo `k8s/newrelic-prometheus-agent.yaml` ships a **Secret template with
+> `CHANGE_ME` placeholders** (repo convention). The operator populates the real
+> values out-of-band (commands below) — no real secret value is ever committed.
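+
+The placeholder template described above would look roughly like this. The
+name, namespace, keys, and the `stringData` + `CHANGE_ME` shape come from this
+runbook; `type: Opaque` is an assumption.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: newrelic-prometheus-agent-secrets
+  namespace: newrelic
+type: Opaque   # assumed; not stated in this runbook
+stringData:
+  # Placeholders only. Real values are set out-of-band by the operator.
+  NEW_RELIC_LICENSE_KEY: CHANGE_ME
+  METRICS_TOKEN: CHANGE_ME
+```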
+
+---
+
+## Operator apply + verify runbook
+
+This repo has **no auto-apply** (CLAUDE.md rule 15). Apply is a deliberate,
+human-driven step. Run against the prod context.
+
+### 0. Confirm context + the gap
+
+```bash
+kubectl config current-context                          # → do-nyc3-instant-prod
+kubectl get pods -A | grep -iE 'prometheus|nri|otel-collector'   # → empty (gap present)
+```
+
+### 1. Create the agent secret (real values, never committed)
+
+```bash
+# NR license key — copy from the existing instant-secrets:
+NR_LICENSE=$(kubectl get secret instant-secrets -n instant \
+  -o jsonpath='{.data.NEW_RELIC_LICENSE_KEY}' | base64 -d)
+
+# METRICS_TOKEN — copy the inline value off the api Deployment:
+METRICS_TOKEN=$(kubectl get deploy instant-api -n instant \
+  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="METRICS_TOKEN")].value}')
+
+# Sanity-check both are non-empty before creating the secret:
+[ -n "$NR_LICENSE" ] && [ -n "$METRICS_TOKEN" ] && echo "both present" || echo "MISSING — STOP"
+
+kubectl create secret generic newrelic-prometheus-agent-secrets \
+  -n newrelic \
+  --from-literal=NEW_RELIC_LICENSE_KEY="$NR_LICENSE" \
+  --from-literal=METRICS_TOKEN="$METRICS_TOKEN" \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+
+> Do **not** `kubectl apply` the `Secret` document inside
+> `k8s/newrelic-prometheus-agent.yaml` directly — it carries `CHANGE_ME`
+> placeholders and would clobber the real values. Use the `create secret`
+> command above; it is the source of truth for the live secret. (When applying
+> the full manifest in step 3, the placeholder Secret is the only object you
+> skip — see the note there.)
+
+### 2. Dry-run the manifest (must pass — same checks as CI `validate.yml`)
+
+```bash
+kubectl apply --dry-run=client -f k8s/newrelic-prometheus-agent.yaml
+# → serviceaccount / clusterrole / clusterrolebinding / configmap /
+#   deployment / secret  ... all "created (dry run)"
+
+# Optional, exactly what CI runs:
+kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0 \
+  k8s/newrelic-prometheus-agent.yaml
+```
+
+### 3. Apply (everything EXCEPT the placeholder Secret)
+
+The real secret was created in step 1. Apply the rest:
+
+```bash
+# Apply the SA, RBAC, ConfigMap, and Deployment. The placeholder Secret in the
+# file is the only object to leave out; step 1 owns the live secret.
+yq 'select(.kind != "Secret")' k8s/newrelic-prometheus-agent.yaml | kubectl apply -f -
+
+# Applying the whole file instead also submits the CHANGE_ME Secret and
+# overwrites the step-1 values; if you do that, re-run step 1 right after:
+#   kubectl apply -f k8s/newrelic-prometheus-agent.yaml
+```
+
+> The placeholder Secret has `stringData` with `CHANGE_ME`. If applied AFTER
+> step 1 it WOULD overwrite the real key with `CHANGE_ME`. Either skip it (the
+> `yq` split above) or re-run step 1 immediately after. The pod will
+> CrashLoop / fail the configurator (`ErrNoLicenseKeyFound`) if the secret
+> holds `CHANGE_ME`.
+
+### 4. Confirm the pod is up + scraping
+
+```bash
+kubectl rollout status deploy/newrelic-prometheus-agent -n newrelic --timeout=120s
+kubectl get pods -n newrelic -l app=newrelic-prometheus-agent
+
+# Confirm all three targets are UP (port-forward the agent and read its
+# Prometheus targets API):
+kubectl port-forward -n newrelic deploy/newrelic-prometheus-agent 19090:9090 &
+sleep 3
+curl -s http://localhost:19090/api/v1/targets \
+  | python3 -c "import sys,json; [print(t['labels'].get('job'), t['health']) \
+      for t in json.load(sys.stdin)['data']['activeTargets']]"
+# Expect: instant-api up / instant-worker up / instant-provisioner up
+kill %1
+```
+
+### 5. **Verification gate — the inert alerts go live**
+
+Within **~5 minutes** of the pod running, `instant_*` series appear in New
+Relic. Run in the NR query builder (account = the prod ingest account):
+
+```sql
+FROM Metric SELECT count(*) WHERE metricName LIKE 'instant_%' SINCE 10 minutes ago
+```
+
+A non-zero result is the proof. Spot-check a high-value one:
+
+```sql
+FROM Metric SELECT latest(instant_provisioner_circuit_state)
+  WHERE metricName = 'instant_provisioner_circuit_state' FACET backend SINCE 10 minutes ago
+```
+
+Once `Metric` is populated, **all 46 `FROM Metric` alert conditions become
+live** (they were querying an empty stream before). The highest-value ones
+that flip from inert to armed:
+
+| Alert (`newrelic/alerts/…`) | Catches |
+|---|---|
+| `razorpay-webhook-sig-fail.json` | forged billing webhook / `RAZORPAY_WEBHOOK_SECRET` mismatch |
+| `deploy-job-failed-detected.json` | silent build-Job failure (rule 27 triad, Bug A) |
+| `deploy-runtime-failed-detected.json` | silent runtime start-failure (twin of rule 27) |
+| `provisioner-circuit-open.json` | a provisioning backend down, fail-fast active |
+| `pg-pool-saturation.json` | Postgres connection brownout (pool > 80%) |
+| `email-delivery-ratio-low.json` | Brevo delivery ratio < 95% (the rule-12 truth surface) |
+| `auth-probe-fail.json` | synthetic login loop broken in prod (AUTH-004) |
+| `payment-probe-fail.json` | Layer-3 money heartbeat — paid revenue path broken |
+| `redis-maxmemory-regrade-failed.json` | quota not enforced on regrade |
+
+> **Note on backup durability alerting:** `backup-stale-36h.json` and
+> `customer-backup-failed.json` are **`FROM Log`** alerts and were already live
+> before this pipeline: they read the backup CronJob's log lines, not a
+> metric. This pipeline does not change them. (Corrected from the original
+> brief, which grouped backup-stale with the metric alerts.)
+
+---
+
+## Rollback
+
+```bash
+kubectl delete -f k8s/newrelic-prometheus-agent.yaml --ignore-not-found
+kubectl delete secret newrelic-prometheus-agent-secrets -n newrelic --ignore-not-found
+```
+
+Removing the agent stops metric ingestion; the `FROM Metric` alerts go inert
+again (no data → no violation). No other telemetry stream is affected.
+
+---
+
+## Open items for operator confirmation
+
+- **NR datacenter region:** assumed **US** (the logging DaemonSet targets
+  `log-api.newrelic.com` and the go-agents target `otlp.nr-data.net` — both
+  US). The configurator auto-selects US vs EU from the license-key prefix, so
+  the US `metric-api.newrelic.com` endpoint is correct for our key. If the
+  account is later confirmed EU, no manifest change is needed (the configurator
+  re-derives the endpoint from the key).
+- **provisioner :8092 not a declared containerPort.** The live provisioner
+  Deployment listens on :8092 inside the pod but does not declare it as a
+  `containerPort`. Pod-SD scraping podIP:8092 works regardless (verified live).
+  Declaring it (and a metrics Service) is a nice-to-have cleanup in
+  `k8s/provisioner/deployment.yaml`, not a prerequisite for this pipeline.
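+
+If that cleanup is picked up, the change in `k8s/provisioner/deployment.yaml`
+could be as small as the fragment below. This is a hypothetical sketch: the
+port name `metrics` is an assumption, and the fragment belongs inside the
+provisioner container's existing spec.
+
+```yaml
+# Hypothetical addition to the provisioner container spec.
+ports:
+  - name: metrics        # assumed name
+    containerPort: 8092  # the port the pod already listens on
+```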
diff --git a/k8s/APPLY-CHECKLIST.md b/k8s/APPLY-CHECKLIST.md
@@ -312,6 +312,60 @@ purely additive, so reverting only narrows the grant back to
 
 ---
 
+## New Relic Prometheus agent — metrics ingestion pipeline (2026-06-11)
+
+`k8s/newrelic-prometheus-agent.yaml` adds the **only metrics scraper** in the
+cluster. Before it, prod shipped logs + APM + OTLP traces but **no metrics**,
+so ~46 `FROM Metric WHERE metricName LIKE 'instant_%'` NR alerts were inert
+(querying an empty `Metric` stream). This manifest deploys the official
+newrelic-prometheus-agent (configurator initContainer + Prometheus `--agent`)
+in the `newrelic` namespace, scraping the three services' `/metrics` by pod
+SD and remote-writing to NR's US Prometheus endpoint.
+
+**This manifest is additive + safe to apply** — it creates net-new objects
+(SA, ClusterRole/Binding, ConfigMap, Deployment) in the `newrelic` namespace
+and touches **nothing** the api/worker/provisioner Deployments own. It does
+NOT have the `app.yaml` clobber hazards (no `:local` image, no
+`imagePullSecrets` strip).
+
+**One hazard — the placeholder Secret.** The file ships a Secret template
+(`newrelic-prometheus-agent-secrets`) with `CHANGE_ME` for
+`NEW_RELIC_LICENSE_KEY` + `METRICS_TOKEN`. Applying that document AFTER you've
+created the real secret would clobber it with `CHANGE_ME` and CrashLoop the
+agent (`ErrNoLicenseKeyFound`). Create the real secret first (copying the live
+NR license key from `instant-secrets` and the `METRICS_TOKEN` inline value off
+the api Deployment), then apply everything EXCEPT the Secret. This is the same
+`CHANGE_ME`-clobber class as `secrets.yaml` — use the same guardrail discipline
+(`scripts/safe-secret-apply.sh`).
+
+Full apply + the post-apply verification gate (the NRQL that proves
+`instant_*` series landed and the 46 alerts flipped live), plus the list of
+high-value alerts that go armed: **`../OBSERVABILITY-PIPELINE.md`**.
+
+Quick apply (real secret first, then the rest):
+
+```bash
+# real secret — values copied live, never committed:
+kubectl create secret generic newrelic-prometheus-agent-secrets -n newrelic \
+  --from-literal=NEW_RELIC_LICENSE_KEY="$(kubectl get secret instant-secrets -n instant -o jsonpath='{.data.NEW_RELIC_LICENSE_KEY}' | base64 -d)" \
+  --from-literal=METRICS_TOKEN="$(kubectl get deploy instant-api -n instant -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="METRICS_TOKEN")].value}')" \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# everything else (skip the placeholder Secret in the file):
+yq 'select(.kind != "Secret")' k8s/newrelic-prometheus-agent.yaml | kubectl apply -f -
+
+kubectl rollout status deploy/newrelic-prometheus-agent -n newrelic --timeout=120s
+```
+
+Verify (NR query builder): `FROM Metric SELECT count(*) WHERE metricName LIKE
+'instant_%' SINCE 10 minutes ago` returns non-zero within ~5 min → the 46
+`FROM Metric` alerts are live.
+
+Rollback: `kubectl delete -f k8s/newrelic-prometheus-agent.yaml
+--ignore-not-found` + delete the secret. No other telemetry stream affected.
+
+---
+
 ## Related files
 
 - `README.md` — secrets clobber warning (the same class of bug, but for
@@ -322,3 +376,7 @@ purely additive, so reverting only narrows the grant back to
   referenced by the `instanode.dev/image-pinned` labels
 - `apply-all.sh` — the bootstrap script (intended for fresh clusters,
   NOT for in-place prod updates)
+- `../OBSERVABILITY-PIPELINE.md` — the New Relic Prometheus agent apply +
+  verify runbook (metrics ingestion pipeline; the 46-alert gate)
+- `../observability/METRICS-CATALOG.md` — the metric catalog; the pipeline
+  above is its hard prerequisite (every `FROM Metric` alert depends on it)