K8SPG-374: handle standby lag detection errors #1462

Open
pooknull wants to merge 4 commits into main from K8SPG-374-fix

@pooknull (Contributor) commented Feb 26, 2026

https://perconadev.atlassian.net/browse/K8SPG-374

DESCRIPTION

This PR improves standby lag detection by handling two errors that can occur when the source cluster is paused.

  1. If the primary pod cannot be identified during lag detection, the operator sets the following condition on the PerconaPGCluster resource:

         Type:    postgrescluster.ConditionStandbyLagging,
         Status:  metav1.ConditionUnknown,
         Reason:  "PrimaryNotFound",
         Message: "Cannot find primary for replication lag calculation",

  2. If the lag detection query returns no rows or NULL (for example, when pg_stat_wal_receiver is empty), the operator sets the following condition on the PerconaPGCluster resource:

         Type:    postgrescluster.ConditionStandbyLagging,
         Status:  metav1.ConditionUnknown,
         Reason:  "InvalidLagQueryOutput",
         Message: "Invalid output from lag query. The WAL receiver is probably not active",

Additionally, this PR moves the log message "Requeuing standby cluster for lag check" from INFO to DEBUG.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Copilot AI review requested due to automatic review settings February 26, 2026 12:24

Copilot AI left a comment

Pull request overview

This pull request enhances error handling for standby lag detection in PostgreSQL cluster replication. It introduces sentinel errors and graceful error handling for transient conditions that can occur during cluster initialization or when replication is not yet established.

Changes:

  • Added sentinel errors ErrPrimaryPodNotFound and ErrInvalidLagQueryOutput for better error classification
  • Enhanced error handling in reconcileStandbyLag to set condition status to Unknown for recoverable error scenarios
  • Added empty string validation before parsing lag values from database queries
  • Reduced logging verbosity for periodic requeue operations
  • Removed unused fmt import from pgbackup controller

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| percona/controller/pgcluster/standby.go | Implements improved error handling for standby lag detection with sentinel errors, graceful handling of transient conditions, and empty string validation for query outputs |
| percona/controller/pgbackup/controller.go | Removes unused fmt import (cleanup) |

Copilot AI review requested due to automatic review settings February 26, 2026 14:11
@pooknull pooknull marked this pull request as ready for review February 26, 2026 14:11

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@JNKPercona (Collaborator)

| Test Name | Result | Time |
| --- | --- | --- |
| backup-enable-disable | passed | 00:05:51 |
| builtin-extensions | passed | 00:06:19 |
| cert-manager-tls | passed | 00:05:00 |
| custom-envs | passed | 00:19:25 |
| custom-extensions | failure | 00:14:16 |
| custom-tls | passed | 00:07:34 |
| database-init-sql | passed | 00:04:08 |
| demand-backup | passed | 00:23:14 |
| demand-backup-offline-snapshot | passed | 00:13:27 |
| dynamic-configuration | passed | 00:04:07 |
| finalizers | passed | 00:06:43 |
| init-deploy | passed | 00:02:46 |
| huge-pages | passed | 00:02:57 |
| monitoring | passed | 00:07:05 |
| monitoring-pmm3 | passed | 00:08:13 |
| one-pod | passed | 00:05:56 |
| operator-self-healing | passed | 00:10:15 |
| pg-tde | passed | 00:08:55 |
| pitr | passed | 00:12:10 |
| scaling | passed | 00:05:07 |
| scheduled-backup | passed | 00:27:14 |
| self-healing | passed | 00:08:47 |
| sidecars | passed | 00:02:34 |
| standby-pgbackrest | passed | 00:11:54 |
| standby-streaming | passed | 00:09:29 |
| start-from-backup | passed | 00:11:35 |
| tablespaces | passed | 00:07:21 |
| telemetry-transfer | passed | 00:04:38 |
| upgrade-consistency | passed | 00:06:28 |
| upgrade-minor | passed | 00:05:13 |
| users | passed | 00:04:55 |

| Summary | Value |
| --- | --- |
| Tests Run | 31/31 |
| Job Duration | 01:34:53 |
| Total Test Time | 04:33:52 |

commit: ba26542
image: perconalab/percona-postgresql-operator:PR-1462-ba26542ef
