Create critical-op PDB on-demand to avoid false monitoring alerts by a-thomas-22 · Pull Request #3024 · zalando/postgres-operator

a-thomas-22 · 2026-01-02T20:59:07Z

The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems like kube-prometheus-stack because the PDB expected healthy pods but none matched.

Changes:

- Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior
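
To make the on-demand behavior listed above concrete, here is a minimal Go sketch of the decision the sync step has to make. It is not the operator's code: `Pod`, `Action`, `hasCriticalOpPods` and `decideCriticalOpPDB` are hypothetical names, and the Kubernetes API is replaced by plain structs so the logic can be read in isolation.

```go
package main

// Pod is a pared-down stand-in for a Kubernetes pod (hypothetical type);
// only the labels matter for this decision.
type Pod struct {
	Name   string
	Labels map[string]string
}

// Action is what the sync step should do with the critical-op PDB.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// criticalOpLabel is the label key the PDB selector matches on.
const criticalOpLabel = "critical-operation"

// hasCriticalOpPods reports whether at least one pod carries
// critical-operation=true. Reading from a nil label map is safe in Go.
func hasCriticalOpPods(pods []Pod) bool {
	for _, p := range pods {
		if p.Labels[criticalOpLabel] == "true" {
			return true
		}
	}
	return false
}

// decideCriticalOpPDB keeps the PDB in existence only while labeled pods exist.
func decideCriticalOpPDB(pods []Pod, pdbExists bool) Action {
	labeled := hasCriticalOpPods(pods)
	switch {
	case labeled && !pdbExists:
		return ActionCreate
	case !labeled && pdbExists:
		return ActionDelete
	default:
		return ActionNone
	}
}
```

With this shape, a cluster in normal operation (no labeled pods, no PDB) yields `ActionNone`, so there is no PDB whose selector matches zero pods and nothing for the alert to fire on.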

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.
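
The explicit create/delete around the critical operation can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `pdbManager`, `recordingManager`, `runWithCriticalOpPDB` and `errUpgradeFailed` are hypothetical names, and whether the real majorVersionUpgrade also deletes the PDB when the upgrade fails is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// pdbManager is a hypothetical interface for the two PDB operations.
type pdbManager interface {
	CreateCriticalOpPDB() error
	DeleteCriticalOpPDB() error
}

// recordingManager is a demo implementation that only records call order.
type recordingManager struct{ Events []string }

func (r *recordingManager) CreateCriticalOpPDB() error {
	r.Events = append(r.Events, "create-pdb")
	return nil
}

func (r *recordingManager) DeleteCriticalOpPDB() error {
	r.Events = append(r.Events, "delete-pdb")
	return nil
}

// errUpgradeFailed is a sample failure used to exercise the error path.
var errUpgradeFailed = errors.New("upgrade failed")

// runWithCriticalOpPDB creates the PDB before op runs and removes it afterwards,
// whether op succeeded or not (assumption of this sketch).
func runWithCriticalOpPDB(m pdbManager, op func() error) (err error) {
	if err = m.CreateCriticalOpPDB(); err != nil {
		return fmt.Errorf("creating critical-op PDB: %w", err)
	}
	defer func() {
		// Report a cleanup failure only if op itself did not fail.
		if delErr := m.DeleteCriticalOpPDB(); delErr != nil && err == nil {
			err = fmt.Errorf("deleting critical-op PDB: %w", delErr)
		}
	}()
	return op()
}
```

Creating the PDB synchronously before the operation is what gives the "immediate protection" described above; the periodic sync then only has to reconcile cases this wrapper cannot see, such as an operator restart in the middle of the operation.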

Fixes #3020

zalando-robot · 2026-01-02T20:59:11Z

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems like kube-prometheus-stack because the PDB expected healthy pods but none matched.

Changes:

- Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior, including edge cases for idempotent create/delete operations

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.

Fixes zalando#3020

FxKu · 2026-01-09T12:57:55Z

Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it was smart to be able to opt in and out of it when we have to 😃

Unit tests are currently failing. Can you fix them, please?

a-thomas-22 · 2026-01-18T07:17:14Z

> Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it was smart to be able to opt in and out of it when we have to 😃
>
> Unit tests are currently failing. Can you fix them, please?

I'm not familiar with the CI here, but I think the GitHub Actions unit tests and e2e tests are passing. The failures come from the internal Zalando CI (pipeline and script/build-postgres-operator). Build and tests also pass locally for me. I can't see the details of the failing runs.

vquie · 2026-02-02T19:39:00Z

Is there anything that can be done to get this through?

vquie · 2026-03-20T12:07:31Z

Can we somehow help to get this merged? We have been living with dozens of alerts for months already.

a-thomas-22 · 2026-03-25T04:56:10Z

For those using kube-prometheus-stack, here's a workaround to exclude the Zalando operator's *-critical-op-pdb PDBs from the KubePdbNotEnoughHealthyPods alert.

Two pieces are needed:

1. Disable the default rule

# kube-prometheus-stack values.yaml
defaultRules:
  disabled:
    KubePdbNotEnoughHealthyPods: true

2. Add a replacement rule that excludes *-critical-op-pdb

# kube-prometheus-stack values.yaml
additionalPrometheusRulesMap:
  k8s-rules:
    groups:
      - name: kubernetes-apps.rules.custom
        rules:
          - alert: KubePdbNotEnoughHealthyPods
            annotations:
              description: >-
                PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }}
                expects {{ $value }} more healthy pods. The desired number of
                healthy pods has not been met for at least 15m.
              runbook_url: >-
                https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepdbnotenoughhealthypods
              summary: PDB does not have enough healthy pods.
            expr: >-
              (
                kube_poddisruptionbudget_status_desired_healthy{job="kube-state-metrics", poddisruptionbudget!~".*-critical-op-pdb"}
                -
                kube_poddisruptionbudget_status_current_healthy{job="kube-state-metrics", poddisruptionbudget!~".*-critical-op-pdb"}
              ) > 0
            for: 15m
            labels:
              severity: warning

This disables the built-in rule and replaces it with an identical one that filters out poddisruptionbudget=~".*-critical-op-pdb" from the query.
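
To sanity-check which PDB names the `!~".*-critical-op-pdb"` matcher drops, the regex can be tried outside Prometheus. The Go sketch below assumes Prometheus's documented behavior that label regex matchers are fully anchored, which is emulated here by wrapping the pattern in `^(?:...)$`. The function name `excludedFromAlert` and the PDB names in the usage note are hypothetical.

```go
package main

import "regexp"

// criticalOpPDB mirrors the matcher pattern from the replacement rule.
// Prometheus anchors label regexes on both ends; Go's regexp does not,
// so the anchors are added explicitly.
var criticalOpPDB = regexp.MustCompile(`^(?:.*-critical-op-pdb)$`)

// excludedFromAlert reports whether a PDB with this name is filtered out
// of the replacement KubePdbNotEnoughHealthyPods expression.
func excludedFromAlert(pdbName string) bool {
	return criticalOpPDB.MatchString(pdbName)
}
```

For example, `excludedFromAlert("postgres-demo-critical-op-pdb")` is true, while `excludedFromAlert("postgres-demo-pdb")` is false, so regular PDBs are still covered by the alert.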

Not sure what's needed to move this forward on the operator side, but wanted to share the workaround in the meantime.

a-thomas-22 force-pushed the fix/critical-op-pdb-on-demand branch from 1caf79b to 513291c January 2, 2026 21:05

a-thomas-22 marked this pull request as ready for review January 2, 2026 21:08

a-thomas-22 requested review from FxKu, Jan-M, hughcapet, idanovinda, jopadi, mikkeloscar and sdudoladov as code owners January 2, 2026 21:08

FxKu added the minor label Jan 8, 2026

FxKu added this to the 1.15.2 milestone Jan 8, 2026

FxKu added this to Postgres Operator Jan 8, 2026

FxKu moved this to Waiting for review in Postgres Operator Jan 8, 2026

Fix nil pointer dereference in syncCriticalOpPodDisruptionBudget

0f2cb12

When the PDB creation fails with "already exists" error, the pdb variable is nil since the initial Get failed. Using pdb.ObjectMeta would cause a panic. Use the cluster method to get the PDB name instead.
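
The failure mode described in this commit message can be reproduced in a small, self-contained Go sketch. It is a model rather than the operator's code: `PDB`, `pdbClient`, `staleClient`, `criticalOpPDBName` and `ensureCriticalOpPDB` are hypothetical names, the naming scheme is assumed, and the Kubernetes client is replaced by an interface with sentinel errors.

```go
package main

import "errors"

// PDB is a pared-down stand-in for a PodDisruptionBudget (hypothetical type).
type PDB struct{ Name string }

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// pdbClient is a hypothetical abstraction over the Get and Create calls.
type pdbClient interface {
	Get(name string) (*PDB, error)
	Create(pdb *PDB) error
}

// staleClient simulates the problematic sequence: Get reports "not found",
// yet Create then fails with "already exists".
type staleClient struct{}

func (staleClient) Get(string) (*PDB, error) { return nil, errNotFound }
func (staleClient) Create(*PDB) error        { return errAlreadyExists }

// criticalOpPDBName stands in for the cluster method that derives the name.
// The naming scheme used here is an assumption.
func criticalOpPDBName(cluster string) string {
	return cluster + "-critical-op-pdb"
}

func ensureCriticalOpPDB(c pdbClient, cluster string) (string, error) {
	name := criticalOpPDBName(cluster)
	pdb, err := c.Get(name)
	if err == nil {
		return "found " + pdb.Name, nil // pdb is non-nil on this path
	}
	if !errors.Is(err, errNotFound) {
		return "", err
	}
	// From here on pdb is nil: dereferencing pdb.Name would panic.
	if err := c.Create(&PDB{Name: name}); err != nil {
		if errors.Is(err, errAlreadyExists) {
			// Use the derived name, never the nil pdb, in the message.
			return "already exists: " + name, nil
		}
		return "", err
	}
	return "created " + name, nil
}
```

Calling `ensureCriticalOpPDB(staleClient{}, "demo")` takes the "already exists" branch and returns a message built from the derived name, which is the substitution the commit describes.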

Merge branch 'master' into fix/critical-op-pdb-on-demand

b2f5bf0

Merge branch 'master' into fix/critical-op-pdb-on-demand

2ccbd66

Merge branch 'master' into fix/critical-op-pdb-on-demand

f213f53

a-thomas-22 wants to merge 5 commits into zalando:master from a-thomas-22:fix/critical-op-pdb-on-demand
