Create critical-op PDB on-demand to avoid false monitoring alerts #3024
a-thomas-22 wants to merge 5 commits into zalando:master
|
Cannot start a pipeline due to: Click on pipeline status check Details link below for more information. |
The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems like kube-prometheus-stack because the PDB expected healthy pods but none matched.

Changes:
- Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior, including edge cases for idempotent create/delete operations

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.

Fixes zalando#3020
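The on-demand behavior described above can be sketched roughly as follows. This is a minimal illustration in plain Go, not the operator's actual code: the `Pod` type, `hasCriticalOpPods`, and `decideCriticalOpPDB` are hypothetical names, and the real sync function works against the Kubernetes API rather than in-memory values.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for a Kubernetes pod; only labels matter here.
type Pod struct {
	Name   string
	Labels map[string]string
}

// pdbAction describes what a sync pass should do with the critical-op PDB.
type pdbAction string

const (
	actionCreate pdbAction = "create"
	actionDelete pdbAction = "delete"
	actionNone   pdbAction = "none"
)

// hasCriticalOpPods reports whether any pod carries critical-operation=true.
func hasCriticalOpPods(pods []Pod) bool {
	for _, p := range pods {
		if p.Labels["critical-operation"] == "true" {
			return true
		}
	}
	return false
}

// decideCriticalOpPDB keeps the PDB only while labeled pods exist.
// Returning actionNone in the remaining cases makes repeated syncs idempotent.
func decideCriticalOpPDB(pods []Pod, pdbExists bool) pdbAction {
	labeled := hasCriticalOpPods(pods)
	switch {
	case labeled && !pdbExists:
		return actionCreate
	case !labeled && pdbExists:
		return actionDelete
	default:
		return actionNone
	}
}

func main() {
	idle := []Pod{{Name: "pg-0"}}
	upgrading := []Pod{{Name: "pg-0", Labels: map[string]string{"critical-operation": "true"}}}

	fmt.Println(decideCriticalOpPDB(idle, true))       // stale PDB, no labeled pods
	fmt.Println(decideCriticalOpPDB(upgrading, false)) // labeled pods, no PDB yet
	fmt.Println(decideCriticalOpPDB(upgrading, true))  // already in the desired state
}
```

With this shape, a PDB that matches no pods never lingers between critical operations, which is the state that triggered the false alerts.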
Force-pushed: 1caf79b to 513291c
|
Thanks for your contribution. We did not anticipate that such a PDB could cause this issue. We thought it was smart to be able to opt in and out of it when we have to 😃 Unit tests are currently failing. Can you fix them, please? |
When the PDB creation fails with an "already exists" error, the pdb variable is nil because the initial Get failed. Using pdb.ObjectMeta would cause a panic. Use the cluster method to get the PDB name instead.
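To illustrate the nil-pointer hazard this commit addresses, here is a small self-contained sketch. Everything in it is hypothetical scaffolding (`getPDB`, `createPDB`, `criticalOpPDBName`, the error values, and the name format); it only mirrors the shape of the problem: after a failed Get the pointer is nil, so the name must come from somewhere other than the object.

```go
package main

import (
	"errors"
	"fmt"
)

// PodDisruptionBudget is a hypothetical stand-in for the Kubernetes type.
type PodDisruptionBudget struct{ Name string }

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// criticalOpPDBName stands in for the cluster method that derives the PDB name.
// The name format used here is an assumption for illustration.
func criticalOpPDBName(cluster string) string {
	return "postgres-" + cluster + "-critical-op-pdb"
}

// getPDB simulates a Get that fails, so the returned pointer is nil.
func getPDB(name string) (*PodDisruptionBudget, error) { return nil, errNotFound }

// createPDB simulates a Create that races with another creator.
func createPDB(name string) error { return errAlreadyExists }

// ensurePDB returns a log-style message describing what happened.
func ensurePDB(cluster string) string {
	name := criticalOpPDBName(cluster)
	pdb, err := getPDB(name)
	if err == nil {
		return "found " + pdb.Name
	}
	// pdb is nil from here on: dereferencing pdb.Name would panic,
	// so the message uses the derived name instead.
	if err := createPDB(name); errors.Is(err, errAlreadyExists) {
		return "already exists: " + name
	} else if err != nil {
		return "create failed: " + err.Error()
	}
	return "created " + name
}

func main() {
	fmt.Println(ensurePDB("acid-minimal-cluster"))
}
```

Deriving the name up front means the error path never depends on whether the earlier lookup returned an object.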
I'm not familiar with the CI here, but I think the GHA unit tests and e2e tests are passing. The failures are from the internal Zalando CI (pipeline and script/build-postgres-operator). Build and tests also pass locally for me. I can't see the details of the failing runs. |
|
Is there anything that can be done to get this through? |
|
Can we somehow help to get this merged? We have been living with dozens of alerts for months already. |
|
For those using kube-prometheus-stack, here's a workaround to exclude the Zalando operator's critical-op PDBs from the KubePdbNotEnoughHealthyPods alert. Two pieces are needed:
1. Disable the default rule
# kube-prometheus-stack values.yaml
disabledRules:
  KubePdbNotEnoughHealthyPods: true
2. Add a replacement rule that excludes the critical-op PDBs
# kube-prometheus-stack values.yaml
additionalPrometheusRulesMap:
k8s-rules:
groups:
- name: kubernetes-apps.rules.custom
rules:
- alert: KubePdbNotEnoughHealthyPods
annotations:
description: >-
PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }}
expects {{ $value }} more healthy pods. The desired number of
healthy pods has not been met for at least 15m.
runbook_url: >-
https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepdbnotenoughhealthypods
summary: PDB does not have enough healthy pods.
expr: >-
(
kube_poddisruptionbudget_status_desired_healthy{job="kube-state-metrics", poddisruptionbudget!~".*-critical-op-pdb"}
-
kube_poddisruptionbudget_status_current_healthy{job="kube-state-metrics", poddisruptionbudget!~".*-critical-op-pdb"}
) > 0
for: 15m
labels:
              severity: warning
This disables the built-in rule and replaces it with an identical one that filters out PDBs whose names end in -critical-op-pdb. Not sure what's needed to move this forward on the operator side, but wanted to share the workaround in the meantime. |
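As a quick way to check which PDB names the `poddisruptionbudget!~".*-critical-op-pdb"` matcher keeps, here is a small Go sketch. Prometheus anchors regex matchers at both ends, so the pattern is wrapped in `^(?:...)$` to mimic that; the helper name `keptByRule` and the sample PDB names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// criticalOpPDB mirrors the fully anchored matching Prometheus applies
// to the regex in the replacement rule.
var criticalOpPDB = regexp.MustCompile(`^(?:.*-critical-op-pdb)$`)

// keptByRule reports whether a PDB name passes the negative matcher (!~),
// meaning the alert still evaluates it.
func keptByRule(pdbName string) bool {
	return !criticalOpPDB.MatchString(pdbName)
}

func main() {
	// Sample names are hypothetical.
	for _, name := range []string{
		"postgres-acid-minimal-cluster-pdb",
		"postgres-acid-minimal-cluster-critical-op-pdb",
	} {
		fmt.Printf("%s kept=%v\n", name, keptByRule(name))
	}
}
```

Because the match is anchored, only names that end in `-critical-op-pdb` are dropped; other PDBs keep alerting as before.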