diff --git a/k8s/APPLY-CHECKLIST.md b/k8s/APPLY-CHECKLIST.md index d449782..83aa26b 100644 --- a/k8s/APPLY-CHECKLIST.md +++ b/k8s/APPLY-CHECKLIST.md @@ -254,6 +254,64 @@ doesn't work for wildcards). The runbook has the verification commands. --- +## Worker deploy-status RBAC — jobs/pods/events read grant (2026-06-11) + +`k8s/worker-rbac.yaml`'s `instant-worker-deploy-reader` ClusterRole was +extended to grant the worker SA (`instant-worker` in `instant-infra`) the +read-only k8s access its deploy-status reconciler + failure-autopsy path +actually need. Before this, the SA could only `get apps/deployments`, so +the rule-27 silent-build-failure detector logged +`jobs.deploy_status_reconcile.job_query_failed … cannot get resource +"jobs" in API group "batch"` every ~30s and that whole detection path was +**disabled in prod**. + +The added verbs map 1:1 to live k8s calls (no over-grant — read-only, no +create/delete/patch/watch): + +| apiGroup / resource | verb | worker call (file:line) | +|---|---|---| +| `batch` / `jobs` | `get` | `deploy_status_reconcile.go:256` `BatchV1().Jobs(ns).Get` | +| `""` / `pods` | `list` | `deploy_failure_autopsy.go:208` `CoreV1().Pods(ns).List` | +| `""` / `pods/log` | `get` | `deploy_failure_autopsy.go:220,230` `CoreV1().Pods(ns).GetLogs` | +| `""` / `events` | `list` | `deploy_failure_autopsy.go:215` `CoreV1().Events(ns).List` | + +(`apps/deployments get` was already granted and is unchanged.) + +This is an RBAC-only change — no Deployment, no secrets, no image tag. It +is safe to `kubectl apply` directly (none of the app.yaml clobber hazards +above apply to a ClusterRole/ClusterRoleBinding/ServiceAccount file). + +```bash +# 1. Confirm context +kubectl config current-context # expect do-nyc3-instant-prod + +# 2. Dry-run server-side and read the diff (RBAC verbs only should change) +kubectl apply --dry-run=server -f k8s/worker-rbac.yaml + +# 3. Apply +kubectl apply -f k8s/worker-rbac.yaml + +# 4. Verify the SA can now reach jobs/pods/events in a deploy namespace. +# (Use any live instant-deploy-* ns; RBAC is cluster-scoped so the +# namespace just has to exist. All four must print "yes".) +NS=$(kubectl get ns -o name | grep -m1 'instant-deploy-' | cut -d/ -f2) +kubectl auth can-i get jobs.batch --as=system:serviceaccount:instant-infra:instant-worker -n "$NS" +kubectl auth can-i list pods --as=system:serviceaccount:instant-infra:instant-worker -n "$NS" +kubectl auth can-i get pods/log --as=system:serviceaccount:instant-infra:instant-worker -n "$NS" +kubectl auth can-i list events --as=system:serviceaccount:instant-infra:instant-worker -n "$NS" + +# 5. Confirm the error stops within one reconcile tick (~30s). This should +# print nothing after the apply: +kubectl logs -n instant-infra deploy/instant-worker --since=2m \ + | grep -i 'job_query_failed' || echo "ok — no job_query_failed in last 2m" +``` + +Rollback: `git revert` the merge commit and re-apply — the verbs are +purely additive, so reverting only narrows the grant back to +`apps/deployments get` (re-disables rule-27 detection, no other effect). + +--- + ## Related files - `README.md` — secrets clobber warning (the same class of bug, but for diff --git a/k8s/worker-rbac.yaml b/k8s/worker-rbac.yaml index cf88e82..5018f8c 100644 --- a/k8s/worker-rbac.yaml +++ b/k8s/worker-rbac.yaml @@ -47,15 +47,46 @@ subjects: namespace: instant-infra --- -# ClusterRole granting the worker permission to read Deployment objects across -# all "instant-deploy-*" namespaces. Used by DeployStatusReconciler (see -# worker/internal/jobs/deploy_status_reconcile.go) to reconcile the -# deployments.status DB column from live k8s state every 30s. +# ClusterRole granting the worker the read-only k8s access needed to reconcile +# deployment status AND capture failure autopsies across all "instant-deploy-*" +# (and "instant-stack-*") namespaces. Used by DeployStatusReconciler + +# the deploy-failure autopsy client, both wired into the same River worker via +# WithAutopsyK8s (see worker/internal/jobs/deploy_status_reconcile.go and +# worker/internal/jobs/deploy_failure_autopsy.go). # # This is intentionally cluster-scoped because per-app namespaces are created # on demand by the api process — we can't enumerate them at RBAC-bind time. -# Read-only get on deployments.apps is the tightest verb set that makes the -# reconciler work; the worker has no need to list/watch/patch and gets none. +# +# Every verb below maps to exactly one live k8s call; nothing is over-granted +# (no create/delete/patch/watch — the worker only reads): +# +# apps/deployments get ← deploy_status_reconcile.go:251 +# AppsV1().Deployments(ns).Get(...) — roll the +# deployments.status DB column forward every 30s. +# +# batch/jobs get ← deploy_status_reconcile.go:256 +# BatchV1().Jobs(ns).Get(...) — detect a Failed +# Kaniko build Job (BackoffLimitExceeded / +# DeadlineExceeded) so the row flips to `failed` +# even when the build pod is GC'd before the +# runtime Deployment exists. THIS is the grant +# whose absence disabled CLAUDE.md rule-27 +# silent-build-failure detection in prod +# (worker logged job_query_failed every tick). +# +# pods list ← deploy_failure_autopsy.go:208 +# CoreV1().Pods(ns).List(...) — find the failed +# build/runtime pods by label selector. +# +# pods/log get ← deploy_failure_autopsy.go:220,230 +# CoreV1().Pods(ns).GetLogs(...) — capture the +# last_lines of build-pod logs for the autopsy +# row + the deploy.failed audit_log / email. +# +# events list ← deploy_failure_autopsy.go:215 +# CoreV1().Events(ns).List(...) — read k8s +# Events (ImagePullBackOff, FailedScheduling, …) +# to enrich the autopsy reason/hint. apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: @@ -64,6 +95,18 @@ rules: - apiGroups: ["apps"] resources: ["deployments"] verbs: ["get"] + - apiGroups: ["batch"] + resources: ["jobs"] + verbs: ["get"] + - apiGroups: [""] + resources: ["pods"] + verbs: ["list"] + - apiGroups: [""] + resources: ["pods/log"] + verbs: ["get"] + - apiGroups: [""] + resources: ["events"] + verbs: ["list"] --- # Bind the deploy-reader ClusterRole to the instant-worker ServiceAccount. diff --git a/newrelic/tests/apply.test.sh b/newrelic/tests/apply.test.sh index fc30cb2..6ed08ee 100755 --- a/newrelic/tests/apply.test.sh +++ b/newrelic/tests/apply.test.sh @@ -93,7 +93,7 @@ assert_eq "exit code is 1" "1" "$code" assert_contains "stderr mentions NEW_RELIC_ACCOUNT_ID" "NEW_RELIC_ACCOUNT_ID" "$out" # ---------------------------------------------------------------------------- -echo "test: --dry-run with both env vars set prints all 33 names, no API calls" +echo "test: --dry-run with both env vars set prints all 98 names, no API calls" # ---------------------------------------------------------------------------- out=$(env -i PATH="$PATH" NEW_RELIC_API_KEY=fake NEW_RELIC_ACCOUNT_ID=1234567 \ "$APPLY" --dry-run 2>&1) @@ -145,10 +145,12 @@ assert_contains "dry-run prints decrypt-burst alert" "connection_url.decrypte assert_contains "dry-run prints deploy-by-team alert" "deploy failure rate > 30% (1h) faceted" "$out" assert_contains "dry-run prints backup-stuck alert" "backup.requested with no follow-up" "$out" -# No real HTTP traffic — the [dry-run] tag must appear on every name -# 12 dashboards + 21 alerts = 33 total after W10 follow-up. +# No real HTTP traffic — the [dry-run] tag must appear on every name. +# Count is derived from the JSON on disk (15 dashboards + 83 alerts = 98), +# not a hand-typed list, so it stays correct as artifacts are added/removed. +expected_count=$(ls "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json 2>/dev/null | wc -l | tr -d ' ') dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true) -assert_eq "every name prefixed with [dry-run] (33 total)" "33" "$dryrun_count" +assert_eq "every name prefixed with [dry-run] (one per JSON file)" "$expected_count" "$dryrun_count" # ---------------------------------------------------------------------------- echo "test: --dry-run without env vars still validates JSON + warns" @@ -158,24 +160,22 @@ code=$? assert_eq "exit code is 0 with no env (dry-run)" "0" "$code" assert_contains "warns about unset env" "warning" "$out" dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true) -assert_eq "still prints 33 [dry-run] entries" "33" "$dryrun_count" +assert_eq "still prints one [dry-run] entry per JSON file" "$expected_count" "$dryrun_count" # ---------------------------------------------------------------------------- echo "test: every JSON file in dashboards/ and alerts/ parses cleanly" # ---------------------------------------------------------------------------- broken=0 +total_json=0 for f in "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json; do + total_json=$((total_json+1)) if ! jq empty "$f" >/dev/null 2>&1; then echo " FAIL invalid JSON: $f" broken=$((broken+1)) fi done if [ "$broken" -eq 0 ]; then -<<<<<<< HEAD - echo " ok all 26 JSON files parse" -======= - echo " ok all 16 JSON files parse" ->>>>>>> b41339a (newrelic: dashboards + alerts for W7/W8/W9 audit kinds (W10 follow-up)) + echo " ok all $total_json JSON files parse" PASS=$((PASS+1)) else FAIL=$((FAIL+1))