From bd2aea08ab87592e30579cb9706b44afd2ec0973 Mon Sep 17 00:00:00 2001
From: Komh
Date: Fri, 24 Apr 2026 07:55:08 +0000
Subject: [PATCH 1/2] =?UTF-8?q?[storage]=20Ceph=20PrometheusRule=20Evaluat?=
 =?UTF-8?q?ion=20Fails=20with=20"many-to-many=20matching"=20=E2=80=94=20Re?=
 =?UTF-8?q?move=20Stale=20Rule=20Versions=20and=20Duplicate=20Exporter=20P?=
 =?UTF-8?q?ods?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...le_Versions_and_Duplicate_Exporter_Pods.md | 159 ++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md

diff --git a/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md b/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md
new file mode 100644
index 00000000..45f9328d
--- /dev/null
+++ b/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md
@@ -0,0 +1,159 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+## Issue
+
+A Ceph-based storage cluster reports evaluation errors on one of its shipped alerting rules — typically the one that joins `ceph_disk_occupation` against device and pool metadata:
+
+```text
+ceph_disk_occupation: Evaluating rule failed with
+  "many-to-many matching not allowed: matching labels must be unique on one side"
+```
+
+In parallel, the storage namespace may show **two** instances of the Ceph metrics exporter pod running where only one should exist:
+
+```text
+$ kubectl -n <namespace> get pods | grep metrics
+ocs-metrics-exporter-74c467b6c7-pkqls   2/3   Running   0   48m
+ocs-metrics-exporter-85b6dd7776-jspzj   2/3   Running   0   24d
+```
+
+Both symptoms trace to the same kind of residue: old `PrometheusRule` objects left behind across previous operator versions, and older exporter deployments that were not cleanly garbage-collected. The alerts are either permanently noisy or permanently silent — the operator's latest rule expects a specific label shape that older residual rules break.
+
+## Root Cause
+
+The Ceph storage operator ships its alerting rules through `PrometheusRule` objects. Historical versions shipped different rule names: `prometheus-ceph-v14-rules`, `prometheus-ceph-v16-rules`, and so on. Across operator upgrades, newer builds produce `prometheus-ceph-rules` (without a version suffix) and the older ones are expected to be garbage-collected — but that collection does not always complete cleanly.
+
+When two rule sets coexist, each exports labels that overlap. When Prometheus evaluates a rule like `ceph_disk_occupation` that uses a vector match (a join of two metrics on a shared label), it sees multiple matching series from the different rule sets and aborts evaluation with `many-to-many matching not allowed: matching labels must be unique on one side`. Until the ambiguity is resolved, no alerts fire and no dashboards render.
+
+Duplicate `ocs-metrics-exporter` pods are a related symptom: one from the current operator's Deployment and one from a previous Deployment whose reconciliation stopped mid-upgrade. The exporter publishes labelled metrics; two exporters publish two sets of (otherwise identical) series, which aggravates the many-to-many issue on the rule side.
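+
+A minimal sketch of why a duplicate series breaks this kind of join (simplified and illustrative only; the joined metric and the matching labels here are assumptions, not the operator's actual rule):
+
+```text
+# One-to-many join: the left-hand side must be unique per match group.
+ceph_disk_occupation * on (device, instance) group_right() node_disk_io_time_seconds_total
+
+# If two rule sets (or two exporter pods) emit ceph_disk_occupation for the
+# same (device, instance), the left-hand side is no longer unique and the
+# evaluator aborts with:
+#   many-to-many matching not allowed: matching labels must be unique on one side
+```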
+
+Both residues — stale PrometheusRules and duplicate exporter pods — need cleanup. The operator will reconcile the intended state on the next loop; the residue it cannot self-correct must be deleted manually.
+
+## Resolution
+
+The fix has three parts: clean up the stale rules, clean up the duplicate exporter, and confirm the evaluation error clears. The four steps below walk through them.
+
+### Step 1 — list and identify stale PrometheusRules in the storage namespace
+
+```bash
+NS=<storage-namespace>   # the namespace hosting the ceph/rook operator
+kubectl -n "$NS" get prometheusrule -o \
+  custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'
+```
+
+Example output:
+
+```text
+NAME                        AGE
+noobaa-prometheus-rules     5y44d
+ocs-prometheus-rules        3y61d
+prometheus-ceph-rules       729d
+prometheus-ceph-v14-rules   5y44d   # <-- stale, from an old operator release
+prometheus-ceph-v16-rules   3y5d    # <-- stale, from an old operator release
+s3bucket-nearfull-alert     96d
+```
+
+Versioned rules (`prometheus-ceph-v14-rules`, `prometheus-ceph-v16-rules`) are the residue. The unversioned `prometheus-ceph-rules` is the current operator's rule set. Keep the unversioned one; delete the versioned ones.
+
+### Step 2 — delete the stale rules
+
+```bash
+kubectl -n "$NS" delete prometheusrule prometheus-ceph-v14-rules
+kubectl -n "$NS" delete prometheusrule prometheus-ceph-v16-rules
+# Repeat for any other versioned entry.
+```
+
+The operator will not recreate them — they are not in its desired state. Only the current rule set remains.
+
+### Step 3 — check the metrics exporter
+
+```bash
+kubectl -n "$NS" get pod | grep metrics
+```
+
+Zero or one exporter pod is the expected state; two is the problem shape. If two are running:
+
+```bash
+kubectl -n "$NS" scale deployment ocs-metrics-exporter --replicas=0
+
+# Wait for pods to terminate completely.
+kubectl -n "$NS" get pod -l app=ocs-metrics-exporter -w
+
+# Scale back up.
+kubectl -n "$NS" scale deployment ocs-metrics-exporter --replicas=1
+
+# Confirm a single exporter is running.
+kubectl -n "$NS" get pod | grep metrics
+```
+
+Scaling to zero forces all pods to terminate; the reconciler's next scale-up creates a single fresh pod. If two pods persist after the cycle, there may be two distinct `Deployment` objects (one from each operator generation) — inspect:
+
+```bash
+kubectl -n "$NS" get deployment | grep metrics
+```
+
+and delete the older one if it exists.
+
+### Step 4 — verify the rule evaluation recovers
+
+Wait for Prometheus to re-evaluate the rule group (the default interval is usually 30s to 1m). Then check the platform's rule-evaluation view or query directly:
+
+```bash
+# If the platform exposes Prometheus's API through a route/service.
+PROM_URL=<prometheus-url>
+curl -sk "$PROM_URL/api/v1/rules" | \
+  jq -r '.data.groups[].rules[] | select(.name == "ceph_disk_occupation") | {name, state, lastError}'
+```
+
+`state: ok` and an empty `lastError` indicate the rule now evaluates cleanly. Alerts on the rule should fire or clear as their underlying condition dictates, rather than being permanently stuck because the evaluator refused to run.
+
+### Preventive posture
+
+Follow the operator's upgrade notes when bumping versions to see whether any stale rules need to be cleaned up first; a quick check such as the sketch below catches leftovers early.
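+
+The audit can be scripted; a minimal sketch (the versioned-name pattern is an assumption, so adjust it to the operator's documented rule names):
+
+```bash
+# List any versioned ceph rule objects that should no longer exist.
+kubectl -n "$NS" get prometheusrule -o name \
+  | grep -E 'prometheus-ceph-v[0-9]+-rules' \
+  || echo "no stale versioned rules found"
+```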
+
+If the cluster has been through many upgrades, a periodic audit of `PrometheusRule` in the storage namespace against the operator's current documented rule names catches residue before it starts affecting alert fidelity.
+
+## Diagnostic Steps
+
+Inspect the Prometheus operator's log (or the user-workload Prometheus pod) for the exact rule evaluation error:
+
+```bash
+kubectl -n cpaas-monitoring logs \
+  -l app.kubernetes.io/name=prometheus \
+  --tail=500 | grep -E 'many-to-many|ceph_disk_occupation'
+```
+
+Reports of the exact rule name and the `many-to-many` string confirm this pattern.
+
+Compare the label sets of the conflicting rules to confirm they genuinely overlap:
+
+```bash
+for rule in prometheus-ceph-v14-rules prometheus-ceph-v16-rules prometheus-ceph-rules; do
+  echo "=== $rule ==="
+  kubectl -n "$NS" get prometheusrule "$rule" -o jsonpath='{.spec.groups[*].rules[*].record}' 2>/dev/null
+  echo
+done | sort -u
+```
+
+If multiple `prometheus-ceph-*` rules produce the same recording-rule name (same `record` field), the evaluator sees duplicate inputs and fails.
+
+After Steps 2 and 3, the listing shrinks to a single rule set:
+
+```bash
+kubectl -n "$NS" get prometheusrule -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
+  | grep -c '^prometheus-ceph'
+# 1
+```
+
+And the exporter:
+
+```bash
+kubectl -n "$NS" get pod | grep -c ocs-metrics-exporter
+# 1
+```
+
+Both counts at one indicate the environment is now in the shape the operator expects, and no further evaluator errors should accrue.

From d3369bfaebd9ab55a9eef8338f6cc40090c6e343 Mon Sep 17 00:00:00 2001
From: Komh
Date: Fri, 24 Apr 2026 23:37:07 +0000
Subject: [PATCH 2/2] [storage] Ceph PrometheusRule evaluation fails with
 "many-to-many matching not allowed"

---
 ...le_Versions_and_Duplicate_Exporter_Pods.md | 159 ------------------
 ..._with_many_to_many_matching_not_allowed.md | 114 +++++++++++++
 2 files changed, 114 insertions(+), 159 deletions(-)
 delete mode 100644 docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md
 create mode 100644 docs/en/solutions/Ceph_PrometheusRule_evaluation_fails_with_many_to_many_matching_not_allowed.md

diff --git a/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md b/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md
deleted file mode 100644
index 45f9328d..00000000
--- a/docs/en/solutions/Ceph_PrometheusRule_Evaluation_Fails_with_many_to_many_matching_Remove_Stale_Rule_Versions_and_Duplicate_Exporter_Pods.md
+++ /dev/null
@@ -1,159 +0,0 @@
----
-kind:
-  - Troubleshooting
-products:
-  - Alauda Container Platform
-ProductsVersion:
-  - 4.1.0,4.2.x
----
-## Issue
-
-A Ceph-based storage cluster reports evaluation errors on one of its shipped alerting rules — typically the one that joins `ceph_disk_occupation` against device and pool metadata:
-
-```text
-ceph_disk_occupation: Evaluating rule failed with
-  "many-to-many matching not allowed: matching labels must be unique on one side"
-```
-
-In parallel, the storage namespace may show **two** instances of the ceph metrics exporter pod running where only one should exist:
-
-```text
-$ kubectl -n get pods | grep metrics
-ocs-metrics-exporter-74c467b6c7-pkqls   2/3   Running   0   48m
-ocs-metrics-exporter-85b6dd7776-jspzj   2/3   Running   0   24d
-```
-
-Both symptoms trace to the same
kind of residue: old `PrometheusRule` objects left behind across previous operator versions, and older exporter deployments that were not cleanly garbage-collected. The alerts are either permanently noisy or permanently silent — the operator's latest rule expects a specific label shape that older residual rules break. - -## Root Cause - -The ceph storage operator ships its alerting rules through `PrometheusRule` objects. Historical versions shipped different rule names: `prometheus-ceph-v14-rules`, `prometheus-ceph-v16-rules`, etc. Across operator upgrades, newer builds produce `prometheus-ceph-rules` (without a version suffix) and the older ones are expected to be garbage-collected — but that collection does not always complete cleanly. - -When two rule sets coexist, each exports labels that overlap. When Prometheus evaluates a rule like `ceph_disk_occupation` that uses a vector match (join two metrics on a shared label), it sees multiple matching series from the different rule sets and aborts evaluation with `many-to-many matching not allowed: matching labels must be unique on one side`. No alerts fire, no dashboards render, until the ambiguity is resolved. - -Duplicate `ocs-metrics-exporter` pods are a related symptom: one from the current operator's Deployment and one from a previous Deployment whose reconciliation stopped mid-upgrade. The exporter publishes labelled metrics; two exporters publish two sets of (otherwise identical) series, which aggravates the many-to-many issue on the rule side. - -Both residues — stale PrometheusRules and duplicate exporter pods — need cleanup. The operator will reconcile the intended state on the next loop; the residue it cannot self-correct must be deleted manually. - -## Resolution - -Three steps: clean the stale rules, clean the duplicate exporter, confirm the evaluation error clears. - -### Step 1 — list and identify stale PrometheusRules in the storage namespace - -```bash -NS= # the namespace hosting the ceph/rook operator -kubectl -n "$NS" get prometheusrule -o \ - custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp' -``` - -Example output: - -```text -NAME AGE -noobaa-prometheus-rules 5y44d -ocs-prometheus-rules 3y61d -prometheus-ceph-rules 729d -prometheus-ceph-v14-rules 5y44d # <-- stale, from an old operator release -prometheus-ceph-v16-rules 3y5d # <-- stale, from an old operator release -s3bucket-nearfull-alert 96d -``` - -Versioned rules (`prometheus-ceph-v14-rules`, `prometheus-ceph-v16-rules`) are the residue. The unversioned `prometheus-ceph-rules` is the current operator's rule set. Keep the unversioned one; delete the versioned ones. - -### Step 2 — delete the stale rules - -```bash -kubectl -n "$NS" delete prometheusrule prometheus-ceph-v14-rules -kubectl -n "$NS" delete prometheusrule prometheus-ceph-v16-rules -# Repeat for any other versioned entry. -``` - -The operator will not recreate them — they are not in its desired state. Only the current rule set remains. - -### Step 3 — check the metrics exporter - -```bash -kubectl -n "$NS" get pod | grep metrics -``` - -Zero or one exporter pod is the expected state; two is the problem shape. If two are running: - -```bash -kubectl -n "$NS" scale deployment ocs-metrics-exporter --replicas=0 - -# Wait for pods to terminate completely. -kubectl -n "$NS" get pod -l app=ocs-metrics-exporter -w - -# Scale back up. -kubectl -n "$NS" scale deployment ocs-metrics-exporter --replicas=1 - -# Confirm a single exporter is running. 
-kubectl -n "$NS" get pod | grep metrics -``` - -The `scale to 0` forces all pods to terminate; the reconciler's next scale-up creates a single fresh pod. If two pods persist after the cycle, there may be two distinct `Deployment` objects (one from each operator generation) — inspect: - -```bash -kubectl -n "$NS" get deployment | grep metrics -``` - -and delete the older one if it exists. - -### Step 4 — verify the rule evaluation recovers - -Wait for Prometheus to re-evaluate the rule group (default interval is usually 30s-1m). Then check the platform's rule-evaluation view or query directly: - -```bash -# If the platform exposes Prometheus's API through a route/service. -PROM_URL= -curl -sk "$PROM_URL/api/v1/rules" | \ - jq -r '.data.groups[].rules[] | select(.name == "ceph_disk_occupation") | {name, state, lastError}' -``` - -`state: ok` and empty `lastError` indicates the rule now evaluates cleanly. Alerts on the rule should fire or clear as their underlying condition dictates, rather than being permanently stuck because the evaluator refused to run. - -### Preventive posture - -Follow the operator's upgrade notes when bumping versions to see if any stale rules need pre-clean. If the cluster has been through many upgrades, a periodic audit of `PrometheusRule` in the storage namespace against the operator's current documented rule names catches residue before it starts affecting alert fidelity. - -## Diagnostic Steps - -Inspect the Prometheus operator's log (or the user-workload Prometheus pod) for the exact rule evaluation error: - -```bash -kubectl -n cpaas-monitoring logs \ - -l app.kubernetes.io/name=prometheus \ - --tail=500 | grep -E 'many-to-many|ceph_disk_occupation' -``` - -Reports of the exact alert name and the `many-to-many` string confirm this pattern. - -Compare the label sets of the conflicting rules to confirm they genuinely overlap: - -```bash -for rule in prometheus-ceph-v14-rules prometheus-ceph-v16-rules prometheus-ceph-rules; do - echo "=== $rule ===" - kubectl -n "$NS" get prometheusrule "$rule" -o jsonpath='{.spec.groups[*].rules[*].record}' 2>/dev/null - echo -done | sort -u -``` - -If multiple `prometheus-ceph-*` rules produce the same recording-rule name (same `record` field), the evaluator sees duplicate inputs and fails. - -After Step 2 + Step 3, the listing shrinks to a single rule set: - -```bash -kubectl -n "$NS" get prometheusrule -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ - | grep -c '^prometheus-ceph' -# 1 -``` - -And the exporter: - -```bash -kubectl -n "$NS" get pod | grep -c ocs-metrics-exporter -# 1 -``` - -Both at one indicates the environment is now in the shape the operator expects, and no further evaluator errors should accrue. diff --git a/docs/en/solutions/Ceph_PrometheusRule_evaluation_fails_with_many_to_many_matching_not_allowed.md b/docs/en/solutions/Ceph_PrometheusRule_evaluation_fails_with_many_to_many_matching_not_allowed.md new file mode 100644 index 00000000..fd3dc32b --- /dev/null +++ b/docs/en/solutions/Ceph_PrometheusRule_evaluation_fails_with_many_to_many_matching_not_allowed.md @@ -0,0 +1,114 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- +## Issue + +On a cluster running the ACP Ceph storage system, the Prometheus instance responsible for platform monitoring begins logging rule-evaluation failures for the `ceph.rules` rule group, and the `PrometheusRuleFailures` alert fires. 
+The Prometheus log entry includes the PromQL expression for the recording rule that joins `ceph_disk_occupation` against node-level disk metrics, and shows it failing with:
+
+```text
+level=warn group=ceph.rules msg="Evaluating rule failed"
+rule="record: cluster:ceph_disk_latency:join_ceph_node_disk_irate1m
+expr: avg by (namespace) (topk by (ceph_daemon, namespace) (1,
+  label_replace(label_replace(ceph_disk_occupation{job=\"rook-ceph-mgr\"}, ...
+err="found duplicate series for the match group {device=\"sdb\"} on the
+left hand-side of the operation:
+  [{__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.2\", device=\"sdb\", ...},
+   {__name__=\"ceph_disk_occupation\", ceph_daemon=\"osd.0\", device=\"sdb\", ...}];
+many-to-many matching not allowed: matching labels must be unique on one side"
+```
+
+Because the rule never successfully evaluates, any Ceph alert or dashboard that depends on the `cluster:ceph_disk_latency:…` recording series is unavailable, and the platform-level `PrometheusRuleFailures` alert stays triggered.
+
+## Root Cause
+
+Two distinct bugs in the Ceph PrometheusRule shipped by older releases can each produce the same `many-to-many matching not allowed` symptom:
+
+1. **Erroneous query in the recording rule that joins `ceph_disk_occupation`.** The rule joins `ceph_disk_occupation` to node disk metrics using `group_right` vector matching. If the underlying metric emits more than one series that shares the matcher labels (for example, multiple OSDs advertising the same `device=sdb` because the label set is not unique per OSD), PromQL cannot pick a single left-hand series and refuses the evaluation.
+
+2. **Missing `managedBy` label.** The original `ceph.rules` definition relied on the `managedBy` label being present on every `ceph_disk_occupation` sample to disambiguate multiple Ceph clusters / mgr instances inside the same namespace. When that label is not emitted, `ceph_disk_occupation` samples from two OSDs collapse onto the same match group and trigger the same duplicate-series error.
+
+Both faults live in the PrometheusRule definition that the Rook / Ceph operator installs — not in Prometheus itself and not in the Ceph cluster health. Patched revisions of the rule set that correct the query and/or add the `managedBy` join key eliminate the evaluation failure.
+
+## Resolution
+
+Upgrade the ACP Ceph storage system to a release that ships the corrected `ceph.rules` PrometheusRule. There is no runtime workaround for the bad rule: Prometheus evaluates exactly what the rule file defines, so until the rule text itself is replaced the evaluation will keep failing. Concretely:
+
+1. Check which revision of the PrometheusRule is currently installed and whether it contains the fixed query. Look at the `ceph.rules` group inside the installed `PrometheusRule` object in the Ceph storage namespace:
+
+   ```bash
+   kubectl -n <namespace> get prometheusrule \
+     -l 'app.kubernetes.io/name=rook-ceph' -o yaml \
+     | grep -A5 "name: ceph.rules"
+   ```
+
+   If the recording expression still joins without the `managedBy` label, the cluster is on an affected revision.
+
+2. Upgrade the ACP Ceph storage components via the supplied operator — the patched PrometheusRule is part of the operator's bundle and is reconciled into the namespace as soon as the operator rolls out the new version. After upgrade, verify the rule text contains the disambiguating label:
+
+   ```bash
+   kubectl -n <namespace> get prometheusrule \
+     -l 'app.kubernetes.io/name=rook-ceph' -o yaml \
+     | grep -A10 "record: cluster:ceph_disk_latency"
+   ```
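+
+   As a quick pass/fail check, counting occurrences of the join key also works (a sketch only; treating a zero count as "still unpatched" assumes the fix introduces the `managedBy` label, as described above):
+
+   ```bash
+   # Zero hits suggests the unpatched rule revision is still installed.
+   kubectl -n <namespace> get prometheusrule \
+     -l 'app.kubernetes.io/name=rook-ceph' -o yaml \
+     | grep -c 'managedBy'
+   ```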
+
+3. Force a reload of the Prometheus configuration so the corrected rule is picked up without waiting for the next auto-reload interval. On the ACP monitor stack, deleting the Prometheus pod is safe — the StatefulSet recreates it with the new rule files mounted:
+
+   ```bash
+   kubectl -n <namespace> delete pod -l app.kubernetes.io/name=prometheus
+   ```
+
+4. If the evaluation failure persists after the upgrade with the same duplicate-series error, the underlying Ceph mgr is emitting a label set that actually collides (for example, identical `device` labels on two different hosts). Inspect the raw metric to confirm:
+
+   ```bash
+   kubectl -n <namespace> exec -it <prometheus-pod> -- \
+     wget -qO- 'http://localhost:9090/api/v1/query?query=ceph_disk_occupation' \
+     | head -n 200
+   ```
+
+   Duplicate label sets that survive the fixed rule indicate a collector-level problem in the Rook / Ceph mgr configuration rather than a PrometheusRule bug; check the Ceph mgr module and any custom label relabelling in the `ServiceMonitor` pointing at `rook-ceph-mgr`.
+
+There is no safe "edit the rule in place" mitigation, because the operator reconciles the `PrometheusRule` back to its bundled definition on every cycle — manual edits to the rule YAML will revert. Upgrading the operator is the only durable fix.
+
+## Diagnostic Steps
+
+1. Confirm the alert and identify the failing rule group:
+
+   ```bash
+   kubectl -n <namespace> logs <prometheus-pod> \
+     | grep -E "group=ceph.rules|many-to-many"
+   ```
+
+2. Inspect the `PrometheusRule` that the rule evaluator is reading from, and the associated `ServiceMonitor` that feeds `ceph_disk_occupation`:
+
+   ```bash
+   kubectl -n <namespace> get prometheusrule
+   kubectl -n <namespace> get servicemonitor
+   ```
+
+3. Reproduce the failure manually against the Prometheus query API to narrow down whether the duplicate series is in `ceph_disk_occupation` itself or in the join target (`node_disk_*` series):
+
+   ```bash
+   kubectl -n <namespace> port-forward svc/prometheus 9090:9090
+   # In another shell:
+   curl -sG 'http://localhost:9090/api/v1/query' \
+     --data-urlencode 'query=count by (ceph_daemon, device, namespace) (ceph_disk_occupation{job="rook-ceph-mgr"})' \
+     | head -n 50
+   ```
+
+   A count greater than one for any `(ceph_daemon, device, namespace)` tuple confirms the label set is not unique enough for the unpatched rule.
+
+4. After upgrading the operator, verify that the rule evaluation errors have stopped by querying the `prometheus_rule_evaluation_failures_total` counter:
+
+   ```bash
+   curl -sG 'http://localhost:9090/api/v1/query' \
+     --data-urlencode 'query=increase(prometheus_rule_evaluation_failures_total{rule_group=~".*ceph.*"}[10m])'
+   ```
+
+   A stable zero confirms the fix. The `PrometheusRuleFailures` alert should self-clear after its evaluation interval.
+
\ No newline at end of file