From 7e39c4868460f4451a0d4651e7f026f8020b9849 Mon Sep 17 00:00:00 2001 From: Komh Date: Fri, 24 Apr 2026 23:34:14 +0000 Subject: [PATCH] =?UTF-8?q?[observability]=20PrometheusOperatorRejectedRes?= =?UTF-8?q?ources=20=E2=80=94=20ServiceMonitor=20Rejected=20for=20Using=20?= =?UTF-8?q?bearerTokenFile?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...itor_Rejected_for_Using_bearerTokenFile.md | 127 ++++++++++++++++++ 1 file changed, 127 insertions(+) create mode 100644 docs/en/solutions/PrometheusOperatorRejectedResources_ServiceMonitor_Rejected_for_Using_bearerTokenFile.md diff --git a/docs/en/solutions/PrometheusOperatorRejectedResources_ServiceMonitor_Rejected_for_Using_bearerTokenFile.md b/docs/en/solutions/PrometheusOperatorRejectedResources_ServiceMonitor_Rejected_for_Using_bearerTokenFile.md new file mode 100644 index 00000000..2d16c561 --- /dev/null +++ b/docs/en/solutions/PrometheusOperatorRejectedResources_ServiceMonitor_Rejected_for_Using_bearerTokenFile.md @@ -0,0 +1,127 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Issue + +On a cluster that ships the Cert Utils Operator (or any third-party operator that provides its own metrics scrape definition), the `PrometheusOperatorRejectedResources` alert begins firing after the operator is installed. The Prometheus Operator pod logs entries similar to: + +```text +level=warn caller=resource_selector.go:174 component=prometheusoperator + msg="skipping servicemonitor" + error="it accesses file system via bearer token file which Prometheus specification prohibits" + servicemonitor=cert-utils-operator/cert-utils-operator-controller-manager-metrics-monitor + namespace= prometheus=user-workload +``` + +The `ServiceMonitor` is never admitted into the generated scrape config, no targets are created for that operator, and its metrics are therefore missing from the user-workload Prometheus. + +## Root Cause + +In ACP `observability/monitor`, the user-workload Prometheus is configured with `arbitraryFSAccessThroughSMs.deny: true` on the `Prometheus` custom resource. That flag is a deliberate security boundary: any `ServiceMonitor` that uses a `bearerTokenFile` or a `tlsConfig.caFile` pointing at the host filesystem is rejected during rendering, because accepting it would let a tenant-owned `ServiceMonitor` arrive with a path like `/var/run/secrets/kubernetes.io/serviceaccount/token` and smuggle that file into the Prometheus scrape config. + +The Cert Utils Operator (and several other third-party operators) ship a `ServiceMonitor` that predates this tightening and still authenticates via `bearerTokenFile`: + +```yaml +# rejected by the user-workload Prometheus +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: cert-utils-operator-controller-manager-metrics-monitor +spec: + endpoints: + - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token +``` + +The modern, accepted alternative is `authorization.credentials` backed by a `Secret` reference (`bearerTokenSecret`), which the operator pulls at render time through the Kubernetes API rather than reading a file from disk. + +## Resolution + +Fix the upstream `ServiceMonitor` to use a Secret reference, or — while waiting for that fix to ship — silence the alert and work around the rejection. + +### 1. Correct fix: migrate the ServiceMonitor to authorization.credentials + +Replace the `bearerTokenFile` endpoint with a reference to a `Secret` that holds the token. The service account token secret is the natural source: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: cert-utils-operator-controller-manager-metrics-monitor + namespace: cert-utils-operator +spec: + endpoints: + - port: https + scheme: https + authorization: + credentials: + name: cert-utils-operator-metrics-token + key: token + tlsConfig: + insecureSkipVerify: true # or supply a CA via ca.secret +``` + +Create the backing Secret (a service-account-token type Secret linked to the operator's metrics-reader service account): + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: cert-utils-operator-metrics-token + namespace: cert-utils-operator + annotations: + kubernetes.io/service-account.name: cert-utils-operator-controller-manager-metrics-reader +type: kubernetes.io/service-account-token +``` + +The Prometheus Operator can now resolve the bearer token without accessing the filesystem; the `arbitraryFSAccessThroughSMs.deny` guard is satisfied and the target becomes visible. This is the permanent fix and should be sent upstream to the operator so the next release no longer trips the alert. + +### 2. If you cannot edit the upstream ServiceMonitor + +Two holding patterns are available: + +- **Silence the alert.** Create an Alertmanager silence scoped to `alertname=PrometheusOperatorRejectedResources` and the specific `servicemonitor` label, with a finite expiry and a note pointing at the upstream fix tracker. Silencing stops the paging noise without hiding the underlying rejection — the warning log line is still there. +- **Host the fixed ServiceMonitor alongside the operator.** Deploy a second, correctly-configured `ServiceMonitor` that scrapes the same Service (different `metadata.name`, same `selector`). The broken one is ignored; the new one is accepted. Remove it when the operator ships the fix. + +Do not disable `arbitraryFSAccessThroughSMs.deny` on the user-workload Prometheus. Lowering that guard cluster-wide to admit one misconfigured `ServiceMonitor` exposes every tenant namespace to the same file-smuggling risk the flag is there to prevent. + +## Diagnostic Steps + +1. Confirm the alert is firing for the user-workload Prometheus and not the platform one: + + ```bash + kubectl -n get prometheus \ + -o custom-columns=NAME:.metadata.name,FSDENY:.spec.arbitraryFSAccessThroughSMs.deny + ``` + + The alert fires on Prometheus instances with `FSDENY=true`. + +2. List Prometheus Operator pod logs and extract the offending ServiceMonitor names: + + ```bash + kubectl -n logs deploy/prometheus-operator \ + | grep 'bearer token file' \ + | awk '{for (i=1;i<=NF;i++) if ($i ~ /servicemonitor=/) print $i}' \ + | sort -u + ``` + +3. For each rejected ServiceMonitor, inspect the offending endpoint: + + ```bash + kubectl -n get servicemonitor -o json \ + | jq '.spec.endpoints[] | {port,bearerTokenFile,tlsConfig}' + ``` + +4. After applying the fix (either migrated `ServiceMonitor` or a parallel one), re-check that the target appears in Prometheus: + + ```bash + kubectl -n exec deploy/prometheus -- \ + wget -qO- 'http://localhost:9090/api/v1/targets' \ + | grep -A2 cert-utils-operator + ``` + + The target's `health` should flip from absent to `up`. The alert clears on its next evaluation cycle.