K8SPG-374: handle standby lag detection errors by pooknull · Pull Request #1462 · percona/percona-postgresql-operator

pooknull · 2026-02-26T12:24:33Z

https://perconadev.atlassian.net/browse/K8SPG-374

DESCRIPTION

This PR improves standby lag detection by handling two errors that can occur when the source cluster is paused.

If the primary pod cannot be identified during lag detection, the operator sets the following condition on the PerconaPGCluster resource:

 			Type:    postgrescluster.ConditionStandbyLagging,
 			Status:  metav1.ConditionUnknown,
 			Reason:  "PrimaryNotFound",
 			Message: "Cannot find primary for replication lag calculation",

If the lag detection query returns no rows or NULL (for example, when pg_stat_wal_receiver is empty), the operator sets the following condition on the PerconaPGCluster resource:

 			Type:    postgrescluster.ConditionStandbyLagging,
 			Status:  metav1.ConditionUnknown,
 			Reason:  "InvalidLagQueryOutput",
 			Message: "Invalid output from lag query. The WAL receiver is probably not active",

Additionally, this PR moves the log message "Requeuing standby cluster for lag check" from INFO to DEBUG.

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are all needed new/changed options added to the Helm Chart?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PG version?
Does the change support oldest and newest supported Kubernetes version?


Copilot

Pull request overview

This pull request enhances error handling for standby lag detection in PostgreSQL cluster replication. It introduces sentinel errors and graceful error handling for transient conditions that can occur during cluster initialization or when replication is not yet established.

Changes:

Added sentinel errors ErrPrimaryPodNotFound and ErrInvalidLagQueryOutput for better error classification
Enhanced error handling in reconcileStandbyLag to set condition status to Unknown for recoverable error scenarios
Added empty string validation before parsing lag values from database queries
Reduced logging verbosity for periodic requeue operations
Removed unused fmt import from pgbackup controller

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
percona/controller/pgcluster/standby.go	Implements improved error handling for standby lag detection with sentinel errors, graceful handling of transient conditions, and empty string validation for query outputs
percona/controller/pgbackup/controller.go	Removes unused `fmt` import (cleanup)


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

JNKPercona · 2026-02-26T17:34:15Z

Test Name	Result	Time
backup-enable-disable	passed	00:05:51
builtin-extensions	passed	00:06:19
cert-manager-tls	passed	00:05:00
custom-envs	passed	00:19:25
custom-extensions	failure	00:14:16
custom-tls	passed	00:07:34
database-init-sql	passed	00:04:08
demand-backup	passed	00:23:14
demand-backup-offline-snapshot	passed	00:13:27
dynamic-configuration	passed	00:04:07
finalizers	passed	00:06:43
init-deploy	passed	00:02:46
huge-pages	passed	00:02:57
monitoring	passed	00:07:05
monitoring-pmm3	passed	00:08:13
one-pod	passed	00:05:56
operator-self-healing	passed	00:10:15
pg-tde	passed	00:08:55
pitr	passed	00:12:10
scaling	passed	00:05:07
scheduled-backup	passed	00:27:14
self-healing	passed	00:08:47
sidecars	passed	00:02:34
standby-pgbackrest	passed	00:11:54
standby-streaming	passed	00:09:29
start-from-backup	passed	00:11:35
tablespaces	passed	00:07:21
telemetry-transfer	passed	00:04:38
upgrade-consistency	passed	00:06:28
upgrade-minor	passed	00:05:13
users	passed	00:04:55

Summary	Value
Tests Run	31/31
Job Duration	01:34:53
Total Test Time	04:33:52

commit: ba26542
image: perconalab/percona-postgresql-operator:PR-1462-ba26542ef

K8SPG-374: handle standby lag detection errors

4e93cf0

https://perconadev.atlassian.net/browse/K8SPG-374

Copilot AI review requested due to automatic review settings February 26, 2026 12:24

Copilot started reviewing on behalf of pooknull February 26, 2026 12:25

Copilot AI reviewed Feb 26, 2026


percona/controller/pgcluster/standby.go: 4 review comments (resolved)

pooknull added 2 commits February 26, 2026 15:57

Merge branch 'main' into K8SPG-374-fix

4777d26

add unit-test

ab5d6a7

Copilot AI review requested due to automatic review settings February 26, 2026 14:11

pooknull marked this pull request as ready for review February 26, 2026 14:11

pooknull requested review from egegunes, gkech, hors, mayankshah1607, nmarukovich and oksana-grishchenko as code owners February 26, 2026 14:11

Copilot started reviewing on behalf of pooknull February 26, 2026 14:12

Copilot AI reviewed Feb 26, 2026


Merge branch 'main' into K8SPG-374-fix

ba26542

egegunes approved these changes Feb 27, 2026


pooknull wants to merge 4 commits into main from K8SPG-374-fix


5 participants
