Add PostgreSQL observability telemetry exposure#1808
DmytroPI-dev wants to merge 10 commits into `feature/database-controllers`.
Conversation
    ); err != nil {
        return ctrl.Result{}, err
    }

Reviewer: It's another block for our reconciliation metric; maybe it's worth emitting an event on success? Or an issue?

Reviewer: Also, what about extending our status with information on whether this failed or succeeded, i.e. adding a new condition?

Author: Updated, adding new conditions for that.
CLA Assistant Lite bot: I have read the CLA Document and I hereby sign the CLA. 1 out of 2 committers have signed the CLA.
api/v4/postgrescluster_types.go (outdated)

    }

    // PostgresObservabilityOverride overrides observability configuration options for PostgresClusterClass.
    type PostgresObservabilityOverride struct {

Reviewer: For PostgresObservabilityOverride we should follow the same pattern we have for ConnectionPoolerEnabled. So maybe ConnectionPoolerMetricsEnabled and PostgreSQLMetricsEnabled?
api/v4/postgrescluster_types.go (outdated)

    PostgreSQL *FeatureDisableOverride `json:"postgresql,omitempty"`

    // +optional
    PgBouncer *FeatureDisableOverride `json:"pgbouncer,omitempty"`

Reviewer: In other providers we might not have PgBouncer (AWS, for example), so let's name it in a generic way (connectionPooler). Also, we should probably have CEL logic that doesn't allow connection pooler metrics to be enabled if the connection pooler itself is disabled.
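A CEL rule like the one the reviewer suggests could be attached with a kubebuilder `XValidation` marker on the monitoring config struct. This is only a sketch: the field names (`connectionPooler`, `connectionPoolerMetrics`) are hypothetical and would need to match the final API shape.

```go
// Hypothetical marker on the struct that holds both the pooler and
// pooler-metrics settings; rejects "metrics on, pooler off" at admission time.
// +kubebuilder:validation:XValidation:rule="!(self.connectionPoolerMetrics.enabled && !self.connectionPooler.enabled)",message="connection pooler metrics cannot be enabled when the connection pooler itself is disabled"
```

Because the rule lives in the CRD schema, the invariant is enforced by the API server for every client, not just by the operator's own reconcile logic.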
api/v4/postgresclusterclass_types.go (outdated)

    // Can be overridden in PostgresCluster CR.
    // +kubebuilder:default={}
    // +optional
    Observability *PostgresObservabilityClassConfig `json:"observability,omitempty"`

Reviewer: Similar to the previous comment :-)

Author: I didn't get you correctly 😞, which previous comment do you mean? This one?
    }

    func isConnectionPoolerMetricsEnabled(cluster *enterprisev4.PostgresCluster, class *enterprisev4.PostgresClusterClass) bool {
        if !isConnectionPoolerEnabled(cluster, class) {

Reviewer: This check shouldn't be part of this function, I believe.
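The reviewer's point, moving the pooler-enabled guard out of the metrics helper and into the caller, could look roughly like this. Names are simplified stand-ins, not the PR's actual signatures:

```go
package main

import "fmt"

// metricsOverrideEnabled resolves only the metrics override flag; it no
// longer checks whether the pooler itself is enabled (illustrative sketch).
func metricsOverrideEnabled(disabled *bool) bool {
	return disabled == nil || !*disabled
}

func main() {
	poolerEnabled := true // resolved elsewhere, e.g. by poolerEnabled in cluster.go
	var disabled *bool    // no cluster-level override set

	// The caller combines the two independent checks, so each helper
	// answers exactly one question.
	emitPoolerMetrics := poolerEnabled && metricsOverrideEnabled(disabled)
	fmt.Println(emitPoolerMetrics)
}
```

This keeps each helper single-purpose and avoids hiding a pooler-existence check inside a metrics function.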
    return override == nil || !*override
    }

    func isConnectionPoolerEnabled(cluster *enterprisev4.PostgresCluster, class *enterprisev4.PostgresClusterClass) bool {

Reviewer: Should this function be part of the connection pooler code rather than monitoring?

Author: We don't need it at all, as we have poolerEnabled in cluster.go.
    return override == nil || !*override
    }

    func buildPostgreSQLMetricsService(scheme *runtime.Scheme, cluster *enterprisev4.PostgresCluster) (*corev1.Service, error) {

Reviewer: Out of curiosity, why do we need to create a Kubernetes Service to expose this information? A Service is effectively a load balancer that uses round robin. If we have many Postgres instances, every call to that endpoint can fetch metrics from a different instance, and the values can differ depending on how users are connected. Is my understanding correct?

Author: As per the Prometheus configuration docs, the dedicated Service here is mainly a stable discovery contract for Prometheus Operator, not a client-style load balancer for metrics consumers. Prometheus' Kubernetes service discovery docs distinguish between service-level and endpoint-level discovery:

- For `service` discovery: "The address will be set to the Kubernetes DNS name of the service and respective service port."
- For `endpoints` discovery: "The endpoints role discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port."
- For `endpointslice` discovery: "The endpointslice role discovers targets from existing endpointslices. For each endpoint address referenced in the endpointslice object one target is discovered."

So the intent here is not "scrape one Service and round-robin across instances". The ServiceMonitor uses the Service as the discovery entry point, and Prometheus then discovers and scrapes the backing endpoints separately. That gives us per-instance targets, which is what we want for PostgreSQL metrics.

That is also why we'd need to create new Services instead of using the existing client-facing Services by default: they are built for application traffic, may not expose the metrics port at all, and don't provide an explicit observability contract.

Does this make sense?

Reviewer: It does, thank you! This reveals a different discussion, though. We cannot enforce usage of Prometheus Operator, and it seems we are adding it as a dependency by using ServiceMonitor. The reason is that the platform will be responsible for scraping, and SOK aims to use the OTel Collector for this. In that model we discover endpoints to be scraped by providing the appropriate annotations on the pods: https://www.dash0.com/guides/opentelemetry-prometheus-receiver#scrapeconfigs. So it seems we could simplify this code by removing the ServiceMonitor/Service and only populating the requested annotations so the OTel Collector can scrape them. Let's discuss it offline :-)
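The annotation-based approach the reviewer suggests, which the PR ultimately adopted, can be sketched like this. The helper name is illustrative (not taken from the PR), and the `prometheus.io/*` keys are the common community convention; the collector's scrape config must be set up to honor them:

```go
package main

import "fmt"

// buildScrapeAnnotations returns conventional prometheus.io/* scrape
// annotations for a pod-exposed metrics endpoint. Illustrative helper;
// the PR's actual builders are buildPostgresScrapeAnnotations and
// buildPoolerScrapeAnnotations.
func buildScrapeAnnotations(port int, path string) map[string]string {
	return map[string]string{
		"prometheus.io/scrape": "true",
		"prometheus.io/port":   fmt.Sprintf("%d", port),
		"prometheus.io/path":   path,
	}
}

func main() {
	// The PostgreSQL exporter on CNPG pods listens on 9187 per the PR description.
	a := buildScrapeAnnotations(9187, "/metrics")
	fmt.Println(a["prometheus.io/scrape"], a["prometheus.io/port"], a["prometheus.io/path"])
}
```

With annotations, discovery happens per pod rather than through a Service, so each instance is scraped individually and no Prometheus Operator CRDs are required.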
    return fmt.Errorf("building PostgreSQL metrics Service: %w", err)
    }

    live := &corev1.Service{

Reviewer: Why do we need this, can't we use desired directly?

Author: Using desired directly would mix desired state with server-populated or immutable Service fields such as clusterIP, ipFamilies, resourceVersion, and related defaults, which can cause unnecessary diffs or update failures, so we should use a minimal object instead. For Service specifically this matters more than for ConfigMap or ServiceMonitor, because Service has immutable and defaulted networking fields we would not want to stomp.
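The pattern described above can be sketched with simplified stand-in types (these are not the real `corev1.Service` structs): before updating, carry server-populated, immutable fields from the live object into the freshly built desired object so the update only touches fields the operator owns.

```go
package main

import "fmt"

// serviceSpec is an illustrative stand-in for the Service fields relevant
// to the discussion, not the real Kubernetes API type.
type serviceSpec struct {
	ClusterIP string // assigned by the API server, immutable after creation
	Ports     []int  // desired state the operator manages
}

// preserveServerFields copies server-populated fields from the live object
// into the desired one before issuing an update, so the diff reflects only
// operator-owned fields and immutable fields are never stomped.
func preserveServerFields(desired, live *serviceSpec) {
	desired.ClusterIP = live.ClusterIP
}

func main() {
	live := &serviceSpec{ClusterIP: "10.0.0.42", Ports: []int{9187}}
	desired := &serviceSpec{Ports: []int{9187}} // rebuilt from scratch each reconcile
	preserveServerFields(desired, live)
	fmt.Println(desired.ClusterIP)
}
```

Without this step, an update built purely from `desired` would try to clear `ClusterIP`, which the API server rejects because the field is immutable.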
Description

Adds PostgreSQL observability telemetry for `PostgresCluster` using Prometheus pod-annotation-based scraping. Metrics are exposed by CNPG's built-in exporters on PostgreSQL pods (port `9187`) and PgBouncer pooler pods (port `9127`). The operator controls whether annotations are injected via class- and cluster-level configuration, with no dedicated metrics `Service` or `ServiceMonitor` required for PostgreSQL or PgBouncer scraping. A `ServiceMonitor` is still supported for operator-controller metrics as an optional step.

Key Changes

- `api/v4/postgresclusterclass_types.go`: Added class-level observability configuration (`monitoring.postgresqlMetrics.enabled`, `monitoring.connectionPoolerMetrics.enabled`) that controls whether scrape annotations are injected into CNPG pods.
- `api/v4/postgrescluster_types.go`: Added cluster-level disable-only overrides (`spec.monitoring.postgresqlMetrics.disabled`, `spec.monitoring.connectionPoolerMetrics.disabled`) allowing per-cluster opt-out without changing the class.
- `pkg/postgresql/cluster/core/cluster.go`: Wired observability flag resolution into `PostgresCluster` reconciliation. When enabled, sets `InheritedMetadata.Annotations` on the CNPG `Cluster` (for PostgreSQL pods) and `Template.ObjectMeta.Annotations` on CNPG `Pooler` resources (for PgBouncer pods).
- `pkg/postgresql/cluster/core/monitoring.go`: Added `isPostgreSQLMetricsEnabled`/`isConnectionPoolerMetricsEnabled` flag resolution helpers, `buildPostgresScrapeAnnotations`/`buildPoolerScrapeAnnotations` annotation builders, and `removeScrapeAnnotations` for the disable path.
- `pkg/postgresql/cluster/core/monitoring_unit_test.go`: Added unit tests for flag resolution, scrape annotation builders, and annotation removal.
- `internal/controller/postgrescluster_controller_test.go`: Added integration tests verifying that `InheritedMetadata` annotations are set on the CNPG `Cluster` when monitoring is enabled and removed when disabled by cluster override.
- `docs/PostgreSQLObservabilityDashboard.json`: Reference Grafana dashboard covering PostgreSQL target count, RW/RO PgBouncer availability, WAL activity, database sizes, PgBouncer client load, controller reconcile metrics, and domain fleet metrics.
- `docs/postgresSQLMonitoring-e2e.md`: End-to-end validation guide for the annotation-based scraping flow on KIND.
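The disable-only override semantics described above (enabled by the class unless the cluster explicitly opts out) can be sketched like this; the type and helper are simplified stand-ins for the PR's actual structs:

```go
package main

import "fmt"

// featureDisableOverride mirrors the disable-only override idea:
// a nil Disabled pointer means "no override", true means "explicitly
// disabled" (illustrative stand-in for the PR's API types).
type featureDisableOverride struct {
	Disabled *bool
}

// resolveEnabled treats the class-level enabled flag as the default; a
// cluster-level override can only turn the feature off, never on.
func resolveEnabled(classEnabled bool, override *featureDisableOverride) bool {
	if !classEnabled {
		return false
	}
	return override == nil || override.Disabled == nil || !*override.Disabled
}

func main() {
	off := true
	fmt.Println(resolveEnabled(true, nil))                                     // class on, no override
	fmt.Println(resolveEnabled(true, &featureDisableOverride{Disabled: &off})) // cluster opts out
	fmt.Println(resolveEnabled(false, nil))                                    // class off, cluster cannot opt in
}
```

The asymmetry is deliberate: the class owner decides what is available, and individual clusters can only narrow that, which keeps fleet-wide policy in one place.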
Testing and Verification

- Added unit tests in `pkg/postgresql/cluster/core/monitoring_unit_test.go` covering flag resolution and the scrape annotation builders for PostgreSQL (port `9187`) and PgBouncer (port `9127`).
- Added integration tests in `internal/controller/postgrescluster_controller_test.go` verifying `InheritedMetadata.Annotations` presence when monitoring is enabled.

Related Issues

CPI-1853, related JIRA ticket.

Grafana screenshot: (image)

PR Checklist