[DBMON-6664] Optimize Postgres query metrics by eric-weaver · Pull Request #23823 · DataDog/integrations-core

eric-weaver · 2026-05-22T21:03:05Z

What does this PR do?

Adds a new PostgresStatementMetricsV2 collection pipeline for Postgres query metrics that replaces the naive full-table scan with a change-detection and cached-obfuscation approach. Enabled via the hidden config flag query_metrics.use_v2: true.

How the V2 algorithm works:

Query pg_stat_statements(false) for integer counters only (no query text pulled from disk).
DeltaDetector diffs against the previous snapshot to find rows where calls incremented.
For changed queryids, ObfuscationLookup checks a two-tier LRU cache (queryid -> query_signature -> ObfuscationResult). On cache miss, fetch raw text from PG, obfuscate via FFI, cache, and discard the raw text.
Merge derivative rows by (query_signature, datname, rolname) and emit.
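The diff and merge steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `delta_detector.py`: the function names, the snapshot shape (a dict keyed by `(queryid, datname, rolname)`), the three counter columns, and the choice to treat first-seen rows as baseline-only are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical subset of integer counters; the real column list lives in the integration.
COUNTER_COLUMNS = ("calls", "rows", "shared_blks_hit")


def diff_snapshots(previous, current):
    """Return (derivative_rows, changed_queryids, vanished_queryids).

    Each snapshot maps (queryid, datname, rolname) to a dict of cumulative counters.
    """
    derivative_rows = {}
    for key, counters in current.items():
        before = previous.get(key)
        if before is None:
            continue  # first sighting: keep as baseline only (assumption)
        if counters["calls"] <= before["calls"]:
            continue  # calls did not increment: unchanged, or stats were reset
        delta = {col: counters[col] - before[col] for col in COUNTER_COLUMNS}
        if any(value < 0 for value in delta.values()):
            continue  # inconsistent counters: skip rather than emit negatives
        derivative_rows[key] = delta
    changed = {key[0] for key in derivative_rows}
    vanished = {key[0] for key in previous} - {key[0] for key in current}
    return derivative_rows, changed, vanished


def merge_by_signature(derivative_rows, signature_for):
    """Sum derivative rows that share (query_signature, datname, rolname)."""
    merged = defaultdict(lambda: dict.fromkeys(COUNTER_COLUMNS, 0))
    for (queryid, datname, rolname), delta in derivative_rows.items():
        target = merged[(signature_for[queryid], datname, rolname)]
        for col in COUNTER_COLUMNS:
            target[col] += delta[col]
    return dict(merged)
```

Here `signature_for` stands in for whatever ObfuscationLookup returns for each changed queryid; only the changed set needs to be resolved, which is what keeps obfuscation off the hot path.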

Key design properties:

No raw query text in steady-state memory -- only obfuscated results and signatures are cached.
Obfuscation is called only on cache miss -- typically only for newly appeared queries.
pg_stat_statements(false) skips reading query text from disk -- reduces PG-side I/O and wire bytes.
Multiple queryids sharing a query_signature share one ObfuscationResult -- deduplicates cache entries.
Bounded memory -- both LRU tiers are capped at pg_stat_statements.max.
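The caching properties listed above can be illustrated with a small two-tier LRU. This is a sketch under stated assumptions, not the contents of `obfuscation_lookup.py`: the class names, the `fetch_text` and `obfuscate` callables, and the convention that `obfuscate` returns a `(query_signature, obfuscated_text)` tuple are all hypothetical.

```python
from collections import OrderedDict


class LRU:
    """Small LRU map; OrderedDict insertion order doubles as recency order."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

    def __len__(self):
        return len(self._data)


class TwoTierLookup:
    """queryid -> query_signature -> result, with both tiers bounded."""

    def __init__(self, max_size, fetch_text, obfuscate):
        self._signature_by_queryid = LRU(max_size)
        self._result_by_signature = LRU(max_size)
        self._fetch_text = fetch_text  # hypothetical: queryid -> raw SQL text
        self._obfuscate = obfuscate    # hypothetical: raw SQL -> (signature, obfuscated)

    def resolve(self, queryid):
        signature = self._signature_by_queryid.get(queryid)
        if signature is not None:
            cached = self._result_by_signature.get(signature)
            if cached is not None:
                return cached
        # Miss in either tier: fetch the raw text, obfuscate it, and let the
        # raw text go out of scope when this method returns.
        result = self._obfuscate(self._fetch_text(queryid))
        signature = result[0]
        self._signature_by_queryid.put(queryid, signature)
        existing = self._result_by_signature.get(signature)
        if existing is None:
            self._result_by_signature.put(signature, result)
            existing = result
        return existing  # queryids with the same signature share one object
```

In this sketch `max_size` would be set from pg_stat_statements.max, so neither tier can hold more entries than the server itself tracks.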

New files:

File	Purpose
`delta_detector.py`	Diffs consecutive pgss snapshots, producing derivative rows and changed/vanished queryid sets
`obfuscation_lookup.py`	Two-tier LRU: `queryid -> query_signature -> ObfuscationResult`
`statements_v2.py`	Full V2 pipeline orchestrating the above components

Test coverage:

18 unit tests (test_statements_v2.py) covering DeltaDetector and ObfuscationLookup in isolation
13 integration tests (test_statements_v2_integration.py) mirroring key V1 integration tests: end-to-end collection, cold start, duplicate merging, error handling, pgss dealloc, WAL metrics, and internal telemetry

Output contract is identical to V1 -- same payload structure, same dbm-metrics and dbm-samples event formats, same metric names. V1 code (statements.py) is untouched.

Benchmark Results

Benchmarks were run on a local-dev Postgres stack with 4 agent variants (V1, V1-incremental, V2, V3) across Low Churn, High Churn, Cold Start, and Eviction Pressure workloads. V3 was an intermediate prototype whose logic is now collapsed into V2.

Full 4-Way Comparison (blended 60-minute averages)

Metric	V1	V1-Inc	V2	V3
Collection time (avg)	318ms	~200ms	~80ms	~80ms
Container RSS	231MB	230MB	212MB	210MB
Container CPU	103m cores	88m cores	75m cores	76m cores
Go TotalAlloc (lifetime)	10,049MB	3,100MB	1,490MB	1,485MB
Go Mallocs (lifetime)	34.4M	14M	10.6M	10.5M
Go GC cycles	374	106	58	58
Go GC Pause Total	83.3ms	40ms	17ms	23ms
Python alloc (lifetime)	17.5GB	~1.3GB	~1.1GB	~1.1GB

Key Improvements (V2 vs V1)

Metric	Improvement
Collection latency	4x faster (318ms -> ~80ms)
Python allocations	16x less (17.5GB -> 1.1GB lifetime)
Go GC cycles	6.4x fewer (374 -> 58)
Go TotalAlloc	6.7x less (10GB -> 1.5GB)
Container CPU	27% lower (103m -> 75m cores)
Container RSS	8% lower (231MB -> 212MB)
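As a sanity check, the improvement factors in this table can be recomputed from the V1 and V2 columns of the 4-way comparison. The snippet below is only arithmetic on the figures reported above; the dictionary keys and helper names are made up for the example, and the V2 collection time and Python allocation inputs are themselves approximate ("~80ms", "~1.1GB").

```python
# Figures copied from the V1 and V2 columns of the comparison table.
v1 = {"collection_ms": 318, "python_alloc_gb": 17.5, "gc_cycles": 374,
      "total_alloc_mb": 10049, "cpu_millicores": 103, "rss_mb": 231}
v2 = {"collection_ms": 80, "python_alloc_gb": 1.1, "gc_cycles": 58,
      "total_alloc_mb": 1490, "cpu_millicores": 75, "rss_mb": 212}


def times_lower(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


def percent_lower(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


for name in ("collection_ms", "python_alloc_gb", "gc_cycles", "total_alloc_mb"):
    print(f"{name}: {times_lower(v1[name], v2[name]):.1f}x")
for name in ("cpu_millicores", "rss_mb"):
    print(f"{name}: {percent_lower(v1[name], v2[name]):.0f}% lower")
```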

The primary driver is eliminating redundant obfuscation: V1 obfuscates every row every 10-second interval (5-10k FFI calls), while V2 only obfuscates on cache miss (typically < 100 per interval in steady state).

Motivation

The existing V1 statement metrics collection pulls the full pg_stat_statements table (including query text) every 10 seconds, obfuscates every row through a Python-to-Go FFI bridge, then discards the ~95% of rows that haven't changed. This wastes PostgreSQL I/O, network bandwidth, CPU on redundant obfuscation, and memory on transient string allocations. The experimental incremental_query_metrics flag partially addressed this but was never fully productionized.

V2 was designed from scratch to be efficient by default: detect what changed first (integers only, no text), then resolve only the queries that need it. The architecture also lays groundwork for future batch FFI obfuscation support.

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

datadog-prod-us1-6 · 2026-05-22T21:03:58Z

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
• Patch Coverage: 92.37%
• Overall Coverage: 93.66% (+6.30%)

This comment will be updated automatically if new data arrives.

Commit SHA: f2b31c7

dd-octo-sts · 2026-05-22T21:07:59Z

Validation Report

All 21 validations passed.


Validation	Description	Status
`agent-reqs`	Verify check versions match the Agent requirements file	✅
`ci`	Validate CI configuration and Codecov settings	✅
`codeowners`	Validate every integration has a CODEOWNERS entry	✅
`config`	Validate default configuration files against spec.yaml	✅
`dep`	Verify dependency pins are consistent and Agent-compatible	✅
`http`	Validate integrations use the HTTP wrapper correctly	✅
`imports`	Validate check imports do not use deprecated modules	✅
`integration-style`	Validate check code style conventions	✅
`jmx-metrics`	Validate JMX metrics definition files and config	✅
`labeler`	Validate PR labeler config matches integration directories	✅
`legacy-signature`	Validate no integration uses the legacy Agent check signature	✅
`license-headers`	Validate Python files have proper license headers	✅
`licenses`	Validate third-party license attribution list	✅
`metadata`	Validate metadata.csv metric definitions	✅
`models`	Validate configuration data models match spec.yaml	✅
`openmetrics`	Validate OpenMetrics integrations disable the metric limit	✅
`package`	Validate Python package metadata and naming	✅
`qa-label`	Validate the pull request declares whether it needs QA for the next Agent release	✅
`readmes`	Validate README files have required sections	✅
`saved-views`	Validate saved view JSON file structure and fields	✅
`version`	Validate version consistency between package and changelog	✅


codecov · 2026-05-22T21:15:05Z

Codecov Report

❌ Patch coverage is 92.74809% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.51%. Comparing base (505fbd1) to head (f2b31c7).
⚠️ Report is 43 commits behind head on master.


eric-weaver added 5 commits May 22, 2026 11:46

Implement query metrics v2

410c4ce

check call counts before checking negative metrics

27ee672

Cleanup comments

9c0a245

Cleanup debug logs

c57742d

Add integration tests

d8c4469

temporal-github-worker-1 Bot added agent/review-requested ecosystems/review-requested product/review-requested labels May 22, 2026

dd-octo-sts Bot added documentation integration/postgres labels May 22, 2026

Add changelog

7377ef5

eric-weaver added the qa/required label May 22, 2026

Fix license headers

f2b31c7

dd-octo-sts Bot added dev/testing dev/tooling labels May 22, 2026

eric-weaver wants to merge 7 commits into master from eric.weaver/DBMON-6664
