Add PostgreSQL observability telemetry exposure#1808
DmytroPI-dev wants to merge 10 commits into `feature/database-controllers`.
Conversation
    ); err != nil {
        return ctrl.Result{}, err
    }

Reviewer: It's another block for our reconciliation metric; maybe it's worth emitting an event on success? Or an issue?

Reviewer: Also, what about extending our status with information on whether this failed or succeeded, i.e. adding a new condition?

Author: Updated, adding new conditions for that.
CLA Assistant Lite bot: I have read the CLA Document and I hereby sign the CLA. 1 out of 2 committers have signed the CLA.
api/v4/postgrescluster_types.go (outdated)

    }

    // PostgresObservabilityOverride overrides observability configuration options for PostgresClusterClass.
    type PostgresObservabilityOverride struct {

Reviewer: For PostgresObservabilityOverride we should follow the same pattern we have for ConnectionPoolerEnabled. So maybe ConnectionPoolerMetricsEnabled and PostgreSQLMetricsEnabled?
api/v4/postgrescluster_types.go (outdated)

    PostgreSQL *FeatureDisableOverride `json:"postgresql,omitempty"`

    // +optional
    PgBouncer *FeatureDisableOverride `json:"pgbouncer,omitempty"`

Reviewer: In other providers we might not have PgBouncer (AWS, for example), so let's name it in a generic way (connectionPooler). Also, we should probably have CEL logic that doesn't allow connection pooler metrics to be enabled if the connection pooler itself is disabled.
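A CEL rule like the one the reviewer suggests could be attached with a kubebuilder `XValidation` marker on the monitoring config struct. This is only a sketch: the field names (`connectionPooler`, `connectionPoolerMetrics`) are hypothetical and would need to match the final API shape.

```go
// Hypothetical marker on the struct that holds both the pooler and
// pooler-metrics settings; rejects "metrics on, pooler off" at admission time.
// +kubebuilder:validation:XValidation:rule="!(self.connectionPoolerMetrics.enabled && !self.connectionPooler.enabled)",message="connection pooler metrics cannot be enabled when the connection pooler itself is disabled"
```

Because the rule lives in the CRD schema, the invariant is enforced by the API server for every client, not just by the operator's own reconcile logic.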
api/v4/postgresclusterclass_types.go (outdated)

    // Can be overridden in PostgresCluster CR.
    // +kubebuilder:default={}
    // +optional
    Observability *PostgresObservabilityClassConfig `json:"observability,omitempty"`

Reviewer: Similar to the previous comment :-)

Author: I didn't get you correctly 😞, which previous comment do you mean? This one?
    }

    func isConnectionPoolerMetricsEnabled(cluster *enterprisev4.PostgresCluster, class *enterprisev4.PostgresClusterClass) bool {
        if !isConnectionPoolerEnabled(cluster, class) {

Reviewer: This check shouldn't be part of this function, I believe.
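The reviewer's point, moving the pooler-enabled guard out of the metrics helper and into the caller, could look roughly like this. Names are simplified stand-ins, not the PR's actual signatures:

```go
package main

import "fmt"

// metricsOverrideEnabled resolves only the metrics override flag; it no
// longer checks whether the pooler itself is enabled (illustrative sketch).
func metricsOverrideEnabled(disabled *bool) bool {
	return disabled == nil || !*disabled
}

func main() {
	poolerEnabled := true // resolved elsewhere, e.g. by poolerEnabled in cluster.go
	var disabled *bool    // no cluster-level override set

	// The caller combines the two independent checks, so each helper
	// answers exactly one question.
	emitPoolerMetrics := poolerEnabled && metricsOverrideEnabled(disabled)
	fmt.Println(emitPoolerMetrics)
}
```

This keeps each helper single-purpose and avoids hiding a pooler-existence check inside a metrics function.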
    return override == nil || !*override
    }

    func isConnectionPoolerEnabled(cluster *enterprisev4.PostgresCluster, class *enterprisev4.PostgresClusterClass) bool {

Reviewer: Should this function be part of the connection pooler code rather than monitoring?

Author: We don't need it at all, as we have poolerEnabled in cluster.go.
    return override == nil || !*override
    }

    func buildPostgreSQLMetricsService(scheme *runtime.Scheme, cluster *enterprisev4.PostgresCluster) (*corev1.Service, error) {

Reviewer: Out of curiosity, why do we need to create a Kubernetes Service to expose this information? A Service is effectively a load balancer that uses round robin. If we have many Postgres instances, every call to that endpoint can fetch metrics from a different instance, and the values can differ depending on how users are connected. Is my understanding correct?

Author: As per the Prometheus configuration docs, the dedicated Service here is mainly a stable discovery contract for Prometheus Operator, not a client-style load balancer for metrics consumers. Prometheus' Kubernetes service discovery docs distinguish between service-level and endpoint-level discovery:

- For `service` discovery: "The address will be set to the Kubernetes DNS name of the service and respective service port."
- For `endpoints` discovery: "The endpoints role discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port."
- For `endpointslice` discovery: "The endpointslice role discovers targets from existing endpointslices. For each endpoint address referenced in the endpointslice object one target is discovered."

So the intent here is not "scrape one Service and round-robin across instances". The ServiceMonitor uses the Service as the discovery entry point, and Prometheus then discovers and scrapes the backing endpoints separately. That gives us per-instance targets, which is what we want for PostgreSQL metrics.

That is also why we'd need to create new Services instead of using the existing client-facing Services by default: they are built for application traffic, may not expose the metrics port at all, and don't provide an explicit observability contract.

Does this make sense?

Reviewer: It does, thank you! This reveals a different discussion, though. We cannot enforce usage of Prometheus Operator, and it seems we are adding it as a dependency by using ServiceMonitor. The reason is that the platform will be responsible for scraping, and SOK aims to use the OTel Collector for this. In that model we discover endpoints to be scraped by providing the appropriate annotations on the pods: https://www.dash0.com/guides/opentelemetry-prometheus-receiver#scrapeconfigs. So it seems we could simplify this code by removing the ServiceMonitor/Service and only populating the requested annotations so the OTel Collector can scrape them. Let's discuss it offline :-)
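The annotation-based approach the reviewer suggests, which the PR ultimately adopted, can be sketched like this. The helper name is illustrative (not taken from the PR), and the `prometheus.io/*` keys are the common community convention; the collector's scrape config must be set up to honor them:

```go
package main

import "fmt"

// buildScrapeAnnotations returns conventional prometheus.io/* scrape
// annotations for a pod-exposed metrics endpoint. Illustrative helper;
// the PR's actual builders are buildPostgresScrapeAnnotations and
// buildPoolerScrapeAnnotations.
func buildScrapeAnnotations(port int, path string) map[string]string {
	return map[string]string{
		"prometheus.io/scrape": "true",
		"prometheus.io/port":   fmt.Sprintf("%d", port),
		"prometheus.io/path":   path,
	}
}

func main() {
	// The PostgreSQL exporter on CNPG pods listens on 9187 per the PR description.
	a := buildScrapeAnnotations(9187, "/metrics")
	fmt.Println(a["prometheus.io/scrape"], a["prometheus.io/port"], a["prometheus.io/path"])
}
```

With annotations, discovery happens per pod rather than through a Service, so each instance is scraped individually and no Prometheus Operator CRDs are required.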
    return fmt.Errorf("building PostgreSQL metrics Service: %w", err)
    }

    live := &corev1.Service{

Reviewer: Why do we need this, can't we use desired directly?

Author: Using desired directly would mix desired state with server-populated or immutable Service fields such as clusterIP, ipFamilies, resourceVersion, and related defaults, which can cause unnecessary diffs or update failures, so we should use a minimal object instead. For Service specifically this matters more than for ConfigMap or ServiceMonitor, because Service has immutable and defaulted networking fields we would not want to stomp.
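The pattern described above can be sketched with simplified stand-in types (these are not the real `corev1.Service` structs): before updating, carry server-populated, immutable fields from the live object into the freshly built desired object so the update only touches fields the operator owns.

```go
package main

import "fmt"

// serviceSpec is an illustrative stand-in for the Service fields relevant
// to the discussion, not the real Kubernetes API type.
type serviceSpec struct {
	ClusterIP string // assigned by the API server, immutable after creation
	Ports     []int  // desired state the operator manages
}

// preserveServerFields copies server-populated fields from the live object
// into the desired one before issuing an update, so the diff reflects only
// operator-owned fields and immutable fields are never stomped.
func preserveServerFields(desired, live *serviceSpec) {
	desired.ClusterIP = live.ClusterIP
}

func main() {
	live := &serviceSpec{ClusterIP: "10.0.0.42", Ports: []int{9187}}
	desired := &serviceSpec{Ports: []int{9187}} // rebuilt from scratch each reconcile
	preserveServerFields(desired, live)
	fmt.Println(desired.ClusterIP)
}
```

Without this step, an update built purely from `desired` would try to clear `ClusterIP`, which the API server rejects because the field is immutable.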
Description

Adds PostgreSQL observability telemetry for `PostgresCluster` using Prometheus pod-annotation-based scraping. Metrics are exposed by CNPG's built-in exporters on PostgreSQL pods (port `9187`) and PgBouncer pooler pods (port `9127`). The operator controls whether annotations are injected via class- and cluster-level configuration, with no dedicated metrics `Service` or `ServiceMonitor` required for PostgreSQL or PgBouncer scraping. A `ServiceMonitor` is still supported for operator-controller metrics as an optional step.

Key Changes

- `api/v4/postgresclusterclass_types.go`: Added class-level observability configuration (`monitoring.postgresqlMetrics.enabled`, `monitoring.connectionPoolerMetrics.enabled`) that controls whether scrape annotations are injected into CNPG pods.
- `api/v4/postgrescluster_types.go`: Added cluster-level disable-only overrides (`spec.monitoring.postgresqlMetrics.disabled`, `spec.monitoring.connectionPoolerMetrics.disabled`) allowing per-cluster opt-out without changing the class.
- `pkg/postgresql/cluster/core/cluster.go`: Wired observability flag resolution into `PostgresCluster` reconciliation. When enabled, sets `InheritedMetadata.Annotations` on the CNPG `Cluster` (for PostgreSQL pods) and `Template.ObjectMeta.Annotations` on CNPG `Pooler` resources (for PgBouncer pods).
- `pkg/postgresql/cluster/core/monitoring.go`: Added `isPostgreSQLMetricsEnabled`/`isConnectionPoolerMetricsEnabled` flag resolution helpers, `buildPostgresScrapeAnnotations`/`buildPoolerScrapeAnnotations` annotation builders, and `removeScrapeAnnotations` for the disable path.
- `pkg/postgresql/cluster/core/monitoring_unit_test.go`: Added unit tests for flag resolution, scrape annotation builders, and annotation removal.
- `internal/controller/postgrescluster_controller_test.go`: Added integration tests verifying that `InheritedMetadata` annotations are set on the CNPG `Cluster` when monitoring is enabled and removed when disabled by cluster override.
- `docs/PostgreSQLObservabilityDashboard.json`: Reference Grafana dashboard covering PostgreSQL target count, RW/RO PgBouncer availability, WAL activity, database sizes, PgBouncer client load, controller reconcile metrics, and domain fleet metrics.
- `docs/postgresSQLMonitoring-e2e.md`: End-to-end validation guide for the annotation-based scraping flow on KIND.
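The disable-only override semantics described above (enabled by the class unless the cluster explicitly opts out) can be sketched like this; the type and helper are simplified stand-ins for the PR's actual structs:

```go
package main

import "fmt"

// featureDisableOverride mirrors the disable-only override idea:
// a nil Disabled pointer means "no override", true means "explicitly
// disabled" (illustrative stand-in for the PR's API types).
type featureDisableOverride struct {
	Disabled *bool
}

// resolveEnabled treats the class-level enabled flag as the default; a
// cluster-level override can only turn the feature off, never on.
func resolveEnabled(classEnabled bool, override *featureDisableOverride) bool {
	if !classEnabled {
		return false
	}
	return override == nil || override.Disabled == nil || !*override.Disabled
}

func main() {
	off := true
	fmt.Println(resolveEnabled(true, nil))                                     // class on, no override
	fmt.Println(resolveEnabled(true, &featureDisableOverride{Disabled: &off})) // cluster opts out
	fmt.Println(resolveEnabled(false, nil))                                    // class off, cluster cannot opt in
}
```

The asymmetry is deliberate: the class owner decides what is available, and individual clusters can only narrow that, which keeps fleet-wide policy in one place.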
Testing and Verification

- Added unit tests in `pkg/postgresql/cluster/core/monitoring_unit_test.go` covering flag resolution and the scrape annotation builders for PostgreSQL (port `9187`) and PgBouncer (port `9127`).
- Added integration tests in `internal/controller/postgrescluster_controller_test.go` verifying `InheritedMetadata.Annotations` presence when monitoring is enabled.

Related Issues

CPI-1853, related JIRA ticket.

Grafana screenshot: (image)

PR Checklist