[DBMON-6664] Optimize Postgres query metrics#23823

Draft: eric-weaver wants to merge 7 commits into master from eric.weaver/DBMON-6664


@eric-weaver
Contributor

What does this PR do?

Adds a new PostgresStatementMetricsV2 collection pipeline for Postgres query metrics that replaces the naive full-table scan with a change-detection and cached-obfuscation approach. Enabled via the hidden config flag query_metrics.use_v2: true.

How the V2 algorithm works:

  1. Query pg_stat_statements(false) for integer counters only (no query text pulled from disk).
  2. DeltaDetector diffs against the previous snapshot to find rows where calls incremented.
  3. For changed queryids, ObfuscationLookup checks a two-tier LRU cache (queryid -> query_signature -> ObfuscationResult). On cache miss, fetch raw text from PG, obfuscate via FFI, cache, and discard the raw text.
  4. Merge derivative rows by (query_signature, datname, rolname) and emit.
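Step 2 can be illustrated with a minimal sketch of snapshot diffing. This is not the code in delta_detector.py: the function name `diff_snapshots`, the snapshot shape (queryid mapped to a dict of integer counters), and the choice to skip first-seen rows until a baseline exists are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical snapshot shape: queryid -> integer counters only (no query text).
Snapshot = Dict[int, Dict[str, int]]


def diff_snapshots(prev: Snapshot, curr: Snapshot) -> Tuple[Snapshot, Set[int], Set[int]]:
    """Return (derivative rows, changed queryids, vanished queryids)."""
    derivatives: Snapshot = {}
    for queryid, counters in curr.items():
        before = prev.get(queryid)
        if before is None:
            # Assumption: a first-seen row has no baseline, so it is skipped.
            continue
        if counters["calls"] <= before["calls"]:
            # Unchanged row (or a counter reset); nothing to emit in this sketch.
            continue
        # Derivative row: per-counter difference since the previous snapshot.
        derivatives[queryid] = {
            name: value - before.get(name, 0) for name, value in counters.items()
        }
    changed = set(derivatives)
    # Rows present before but absent now (for example, evicted by pgss).
    vanished = set(prev) - set(curr)
    return derivatives, changed, vanished
```

Only the queryids in `changed` would then need text resolution, which is what keeps the steady-state cost proportional to the number of active queries rather than to the size of pg_stat_statements.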

Key design properties:

  • No raw query text in steady-state memory -- only obfuscated results and signatures are cached.
  • Obfuscation is called only on cache miss -- typically only for newly appeared queries.
  • pg_stat_statements(false) skips text from disk -- reduces PG-side I/O and wire bytes.
  • Multiple queryids sharing a query_signature share one ObfuscationResult -- deduplicates cache entries.
  • Bounded memory -- both LRU tiers are capped to pg_stat_statements.max.
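The caching properties above can be sketched as two bounded LRU maps chained together. This is a simplified illustration, not the code in obfuscation_lookup.py: the class names `LRU` and `TwoTierLookup`, the `fetch_text` and `obfuscate` callables, and the assumption that an obfuscation result is a dict carrying a `query_signature` key are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Hashable, Optional


class LRU:
    """Small bounded LRU map built on OrderedDict."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


class TwoTierLookup:
    """queryid -> query_signature -> obfuscation result (hypothetical sketch)."""

    def __init__(
        self,
        max_size: int,
        fetch_text: Callable[[int], str],
        obfuscate: Callable[[str], Dict[str, str]],
    ) -> None:
        self.sig_by_queryid = LRU(max_size)
        self.result_by_sig = LRU(max_size)
        self._fetch_text = fetch_text
        self._obfuscate = obfuscate

    def resolve(self, queryid: int) -> Dict[str, str]:
        sig = self.sig_by_queryid.get(queryid)
        if sig is not None:
            cached = self.result_by_sig.get(sig)
            if cached is not None:
                return cached  # steady state: no fetch, no obfuscation
        # Cache miss: the raw text lives only inside this call.
        result = self._obfuscate(self._fetch_text(queryid))
        sig = result["query_signature"]
        shared = self.result_by_sig.get(sig)
        if shared is None:
            self.result_by_sig.put(sig, result)
            shared = result
        self.sig_by_queryid.put(queryid, sig)
        return shared
```

In this sketch, two queryids that obfuscate to the same signature end up pointing at a single cached result, and both tiers are capped by `max_size`, which the PR ties to pg_stat_statements.max.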

New files:

| File | Purpose |
| --- | --- |
| delta_detector.py | Diffs consecutive pgss snapshots, producing derivative rows and changed/vanished queryid sets |
| obfuscation_lookup.py | Two-tier LRU: queryid -> query_signature -> ObfuscationResult |
| statements_v2.py | Full V2 pipeline orchestrating the above components |
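The final pipeline step, merging derivative rows by (query_signature, datname, rolname), could look roughly like the sketch below. The function name `merge_rows` and the `METRIC_COLUMNS` subset are assumptions for illustration; the actual set of summed columns in statements_v2.py is not shown in this description.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical subset of counters to sum; the real pipeline likely covers more.
METRIC_COLUMNS = ("calls", "rows")


def merge_rows(rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Sum metric columns for rows sharing (query_signature, datname, rolname)."""
    merged: Dict[Tuple[object, object, object], Dict[str, object]] = {}
    for row in rows:
        key = (row["query_signature"], row["datname"], row["rolname"])
        target = merged.get(key)
        if target is None:
            # First row for this key: copy it so the input is not mutated.
            merged[key] = dict(row)
            continue
        for column in METRIC_COLUMNS:
            target[column] = target[column] + row[column]
    return list(merged.values())
```

Merging is needed because several queryids can normalize to the same obfuscated statement, and emitting them separately would split one logical query across multiple metric rows.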

Test coverage:

  • 18 unit tests (test_statements_v2.py) covering DeltaDetector and ObfuscationLookup in isolation
  • 13 integration tests (test_statements_v2_integration.py) mirroring key V1 integration tests: end-to-end collection, cold start, duplicate merging, error handling, pgss dealloc, WAL metrics, and internal telemetry

Output contract is identical to V1 -- same payload structure, same dbm-metrics and dbm-samples event formats, same metric names. V1 code (statements.py) is untouched.

Benchmark Results

Benchmarks were run on a local-dev Postgres stack with 4 agent variants (V1, V1-incremental, V2, V3) across Low Churn, High Churn, Cold Start, and Eviction Pressure workloads. V3 was an intermediate prototype whose logic is now collapsed into V2.

Full 4-Way Comparison (blended 60-minute averages)

| Metric | V1 | V1-Inc | V2 | V3 |
| --- | --- | --- | --- | --- |
| Collection time (avg) | 318ms | ~200ms | ~80ms | ~80ms |
| Container RSS | 231MB | 230MB | 212MB | 210MB |
| Container CPU | 103m cores | 88m cores | 75m cores | 76m cores |
| Go TotalAlloc (lifetime) | 10,049MB | 3,100MB | 1,490MB | 1,485MB |
| Go Mallocs (lifetime) | 34.4M | 14M | 10.6M | 10.5M |
| Go GC cycles | 374 | 106 | 58 | 58 |
| Go GC Pause Total | 83.3ms | 40ms | 17ms | 23ms |
| Python alloc (lifetime) | 17.5GB | ~1.3GB | ~1.1GB | ~1.1GB |

Key Improvements (V2 vs V1)

| Metric | Improvement |
| --- | --- |
| Collection latency | 4x faster (318ms -> ~80ms) |
| Python allocations | 16x less (17.5GB -> 1.1GB lifetime) |
| Go GC cycles | 6.4x fewer (374 -> 58) |
| Go TotalAlloc | 6.7x less (10GB -> 1.5GB) |
| Container CPU | 27% lower (103m -> 75m cores) |
| Container RSS | 8% lower (231MB -> 212MB) |
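For reference, the improvement figures follow directly from the V1 and V2 columns of the 4-way comparison. The short sketch below reproduces the arithmetic; the dictionary keys and helper names are illustrative only, and the approximate (~) values are taken at face value.

```python
# V1 and V2 values copied from the 4-way comparison table.
v1 = {"collection_ms": 318, "python_alloc_gb": 17.5, "gc_cycles": 374,
      "go_total_alloc_mb": 10049, "cpu_millicores": 103, "rss_mb": 231}
v2 = {"collection_ms": 80, "python_alloc_gb": 1.1, "gc_cycles": 58,
      "go_total_alloc_mb": 1490, "cpu_millicores": 75, "rss_mb": 212}


def ratio(before: float, after: float) -> float:
    """How many times smaller the 'after' value is."""
    return before / after


def percent_lower(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the 'before' value."""
    return (before - after) / before * 100


for name in ("collection_ms", "python_alloc_gb", "gc_cycles", "go_total_alloc_mb"):
    print(f"{name}: {ratio(v1[name], v2[name]):.1f}x")
for name in ("cpu_millicores", "rss_mb"):
    print(f"{name}: {percent_lower(v1[name], v2[name]):.1f}% lower")
```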

The primary driver is eliminating redundant obfuscation: V1 obfuscates every row in every 10-second collection interval (5-10k FFI calls), while V2 obfuscates only on cache miss (typically fewer than 100 calls per interval in steady state).

Motivation

The existing V1 statement metrics collection pulls the full pg_stat_statements table (including query text) every 10 seconds, obfuscates every row through a Python-to-Go FFI bridge, then discards ~95% of rows that haven't changed. This wastes PostgreSQL I/O, network bandwidth, CPU on redundant obfuscation, and memory on transient string allocations. The experimental incremental_query_metrics flag partially addressed this but was never fully productionized.

V2 was designed from scratch to be efficient by default: detect what changed first (integers only, no text), then resolve only the queries that need it. The architecture also lays groundwork for future batch FFI obfuscation support.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@datadog-prod-us1-6

datadog-prod-us1-6 Bot commented May 22, 2026

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 92.37%
Overall Coverage: 93.66% (+6.30%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f2b31c7

@eric-weaver added the qa/required label (QA is required for this PR and will generate a QA card) on May 22, 2026
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 22, 2026

Validation Report

All 21 validations passed.

| Validation | Description |
| --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file |
| ci | Validate CI configuration and Codecov settings |
| codeowners | Validate every integration has a CODEOWNERS entry |
| config | Validate default configuration files against spec.yaml |
| dep | Verify dependency pins are consistent and Agent-compatible |
| http | Validate integrations use the HTTP wrapper correctly |
| imports | Validate check imports do not use deprecated modules |
| integration-style | Validate check code style conventions |
| jmx-metrics | Validate JMX metrics definition files and config |
| labeler | Validate PR labeler config matches integration directories |
| legacy-signature | Validate no integration uses the legacy Agent check signature |
| license-headers | Validate Python files have proper license headers |
| licenses | Validate third-party license attribution list |
| metadata | Validate metadata.csv metric definitions |
| models | Validate configuration data models match spec.yaml |
| openmetrics | Validate OpenMetrics integrations disable the metric limit |
| package | Validate Python package metadata and naming |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release |
| readmes | Validate README files have required sections |
| saved-views | Validate saved view JSON file structure and fields |
| version | Validate version consistency between package and changelog |


@codecov

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 92.74809% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.51%. Comparing base (505fbd1) to head (f2b31c7).
⚠️ Report is 43 commits behind head on master.
