
@shinsj4653

Aurora PostgreSQL 17+ crashes with Bus Error (signal 7) when querying pg_buffercache. This adds a guard clause to skip collect_buffercache_metrics on Aurora PostgreSQL 17+, with a warning log for visibility.

Fixes: #21633

What does this PR do?

Adds a guard clause to skip collect_buffercache_metrics on Aurora PostgreSQL 17+ to prevent database instance crash.

Changes in postgres/datadog_checks/postgres/postgres.py:

  • Added version check (self.is_aurora and self.version >= V17) before adding BUFFERCACHE_METRICS query
  • When the check is skipped, a warning log is emitted with a reference to the issue for user visibility
  • Uses >= V17 to cover potential future versions with the same issue
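The guard described in the list above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in postgres.py: the `should_collect_buffercache` helper, the tuple-based `V17` constant, the `enabled` flag, and the logger name are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("postgres_sketch")  # hypothetical logger name

# Hypothetical stand-in for the integration's version constant; the real
# check compares version objects rather than plain tuples.
V17 = (17, 0, 0)


def should_collect_buffercache(is_aurora, version, enabled=True):
    """Return True if the buffercache query may be added to this run."""
    if not enabled:
        return False
    if is_aurora and version >= V17:
        # Skip rather than risk the Bus error (signal 7) reported in #21633.
        logger.warning(
            "Skipping buffercache metrics on Aurora PostgreSQL 17+ (see issue #21633)"
        )
        return False
    return True
```

Using `>=` rather than `==` means a later Aurora major version would also be skipped until the guard is revisited.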

Tests added in postgres/tests/test_pg_integration.py:

  • test_buffercache_metrics_skipped_on_aurora_17: Verifies buffercache metrics are NOT collected on Aurora PostgreSQL 17+
  • test_buffercache_metrics_collected_on_non_aurora_17: Verifies buffercache metrics ARE still collected on non-Aurora PostgreSQL

Motivation

Aurora PostgreSQL 17 crashes with a Bus Error (signal 7) when the Datadog Agent queries pg_buffercache. This appears to be an Aurora-specific issue with the buffer cache extension in version 17.

Error from AWS logs (reported in #21633):

LOG: server process (PID 1031) was terminated by signal 7: Bus error
DETAIL: Failed process was running: /* service='datadog-agent' */ WITH buffer_by_relfilenode AS (
    SELECT reldatabase, relfilenode, ...
    FROM pg_buffercache
    GROUP BY reldatabase, relfilenode
)

This is a workaround until AWS resolves the underlying Aurora issue.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Andrew added 2 commits December 21, 2025 23:23
…nt crash

Aurora PostgreSQL 17+ crashes with Bus Error (signal 7) when querying
pg_buffercache. This adds a guard clause to skip collect_buffercache_metrics
on Aurora PostgreSQL 17+, with a warning log for visibility.

Fixes: DataDog#21633

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 407 to 411
# Simulate Aurora PostgreSQL 17+
check.is_aurora = True
check.version = V17

check.run()


P1: Constrain Aurora 17 skip test to matching server versions

test_buffercache_metrics_skipped_on_aurora_17 sets check.is_aurora = True and check.version = V17 before check.run(), but PostgreSql.load_version (postgres/datadog_checks/postgres/postgres.py:561) overwrites self.version with the real server version during check initialization. On any CI run targeting Postgres ≤16 (as configured in postgres/hatch.toml), the guard self.version >= V17 never triggers, buffercache metrics are collected, and the subsequent assert_metric(..., count=0) expectations fail. The test should be limited to Postgres 17+ or stub out version loading so it does not depend on the runtime version.
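The second option mentioned above, stubbing out version loading, might look like the sketch below. `FakeCheck` is a hypothetical stand-in for the real check class, and its `load_version` and `run` bodies are simplified assumptions; only `unittest.mock.patch.object` is a real API here.

```python
from unittest import mock

V16 = (16, 0, 0)
V17 = (17, 0, 0)


class FakeCheck:
    """Hypothetical stand-in for the check; not the real PostgreSql class."""

    def __init__(self):
        self.is_aurora = False
        self.version = None
        self.collected_buffercache = False

    def load_version(self):
        # Pretend the CI server runs PostgreSQL 16.
        self.version = V16

    def run(self):
        self.load_version()  # overwrites any version the test set beforehand
        self.collected_buffercache = not (self.is_aurora and self.version >= V17)


def run_with_stubbed_version(check, version):
    """Run the check with load_version replaced so `version` is what run() sees."""

    def fake_load_version():
        check.version = version

    with mock.patch.object(check, "load_version", side_effect=fake_load_version):
        check.run()
    return check
```

With the stub in place, the simulated version is the one the guard evaluates, so the test no longer depends on which server version CI happens to target.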


@codecov

codecov bot commented Dec 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.35%. Comparing base (a6520fa) to head (39af0c3).


The test was failing on Postgres 10-16 because load_version() overwrites
check.version during check.run(). Changed @requires_over_10 to @requires_over_17
so the test only runs where the actual version satisfies the skip condition.
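A version-gating decorator of this kind can be sketched with the standard library's `unittest.skipUnless`. This is an illustration only: the `POSTGRES_VERSION` environment variable and the `version_at_least` helper are assumptions, and the repository's actual `requires_over_17` may be defined differently.

```python
import os
import unittest


def version_at_least(raw_version, minimum):
    """Return True if a version string such as '17' or '17.2' is >= minimum."""
    if not raw_version:
        return False
    try:
        major = int(raw_version.split(".")[0])
    except ValueError:
        # Non-numeric values such as 'latest' are treated as not matching.
        return False
    return major >= minimum


# Hypothetical environment variable naming the server version under test.
POSTGRES_VERSION = os.environ.get("POSTGRES_VERSION")

# Tests decorated with this are skipped unless the server is PostgreSQL 17+.
requires_over_17 = unittest.skipUnless(
    version_at_least(POSTGRES_VERSION, 17),
    "test requires PostgreSQL 17 or later",
)
```

Gating on the real server version keeps the test honest, at the cost of it not running at all on CI jobs that target older versions.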
@shinsj4653 shinsj4653 force-pushed the fix/issue-21633-aurora-buffercache-crash branch from 8030b76 to 6af41fd Compare December 22, 2025 12:07
@shinsj4653
Author

shinsj4653 commented Dec 22, 2025

@DataDog/database-monitoring-agent

Hi team! 👋

This PR is ready for review. Here's a quick summary:

Issue: collect_buffercache_metrics crashes the DB on Aurora PostgreSQL 17+ (#21633)

Fix:

  • Skip buffercache metrics collection when version >= V17 AND is_aurora == True
  • Fixed the test to use @requires_over_17 decorator (was incorrectly running on PG 10-16)

Changes:

  • postgres.py: Added Aurora 17+ skip logic in buffercache collection
  • test_pg_integration.py: Changed the test decorator from @requires_over_10 to @requires_over_17

All CI checks are now passing ✅

Thanks for your time!



Development

Successfully merging this pull request may close these issues.

collect_buffercache_metrics crashes DB on PostgresSQL Aurora 17

1 participant