Skip inaccessible databases in DBM schema collection instead of aborting #23807
pierreln-dd wants to merge 10 commits into
When a database is unreachable (e.g. version mismatch, restricted state), _get_cursor raises and the outer collect_schemas re-raises, discarding all schemas already collected for other databases. Wrap per-database iteration in a try/except so that a single failing database is skipped with a warning and collection continues for the rest. Move the final flush outside the loop to ensure queued rows are submitted regardless of which database was last. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
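The control flow this commit describes can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `SketchCollector`, its `failing` argument, and the in-memory `flushed` list are hypothetical stand-ins for the real collector, cursor, and submission path.

```python
import logging

logger = logging.getLogger(__name__)


class SketchCollector:
    """Hypothetical stand-in for a schema collector (not the real base class)."""

    def __init__(self, schemas, failing):
        self._schemas = schemas        # assumed shape: database name -> list of rows
        self._failing = set(failing)   # databases whose cursor cannot be opened
        self._queued_rows = []
        self.flushed = []              # payloads "submitted" so far

    def _get_rows(self, database):
        # Simulates _get_cursor raising for an unreachable database.
        if database in self._failing:
            raise ConnectionError(f"cannot open {database}")
        return self._schemas[database]

    def _flush(self):
        self.flushed.append(list(self._queued_rows))
        self._queued_rows.clear()

    def collect_schemas(self):
        skipped = []
        for database in self._schemas:
            try:
                for row in self._get_rows(database):
                    self._queued_rows.append((database, row))
            except Exception:
                # One failing database is skipped; the loop keeps going.
                logger.warning("Skipping database %s", database, exc_info=True)
                skipped.append(database)
        # The final flush sits outside the loop, so queued rows are submitted
        # even when the last database in the list is the one that failed.
        self._flush()
        return skipped
```

With `SketchCollector({"a": ["t1"], "broken": ["x"], "b": ["t2"]}, failing=["broken"])`, the rows for `a` and `b` are still flushed and `collect_schemas()` reports `broken` as skipped.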
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add _is_connection_error hook so subclasses can restrict which errors are treated as recoverable (defaults to True to preserve behavior)
- Remove redundant continue in except block
- Add exc_info=True to warning so stack trace is preserved in logs
- Track _skipped_databases_count and emit it as a gauge metric so partial collection failures are observable
- Add unit tests for error isolation and the _is_connection_error hook

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
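A minimal sketch of how such a hook might gate the skip path, assuming a much simpler collector than the real one. The names `BaseCollector`, `StrictCollector`, `_open`, and `collect` are hypothetical; only `_is_connection_error` and `_skipped_databases_count` come from the commit message.

```python
class BaseCollector:
    """Hypothetical base class illustrating the hook; not the real SchemaCollector."""

    def _is_connection_error(self, exc: Exception) -> bool:
        # Default: every exception is treated as recoverable.
        return True

    def _open(self, database):
        raise NotImplementedError

    def collect(self, databases):
        self._skipped_databases_count = 0
        collected = []
        for database in databases:
            try:
                collected.append(self._open(database))
            except Exception as exc:
                if not self._is_connection_error(exc):
                    raise  # internal errors still abort the run
                self._skipped_databases_count += 1
        return collected


class StrictCollector(BaseCollector):
    """Subclass that narrows recovery to one exception type."""

    def _is_connection_error(self, exc: Exception) -> bool:
        return isinstance(exc, ConnectionError)

    def _open(self, database):
        if database == "offline":
            raise ConnectionError("unreachable")
        if database == "buggy":
            raise RuntimeError("internal bug")
        return database
```

Here `StrictCollector().collect(["a", "offline", "b"])` skips `offline` and counts it, while a `RuntimeError` from `buggy` propagates to the caller. In the real check the counter would be submitted as a gauge; that call is omitted here.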
…add sqlserver override

- Guard final maybe_flush so no empty payload is emitted when all databases are skipped or no data was collected (fixes new behavioral regression)
- Add SQLServerSchemaCollector._is_connection_error that only catches pyodbc errors, preventing internal errors like uninitialized pre-2017 cursor from being silently swallowed and misclassified as per-database skips
- Add test asserting no payload is emitted when all databases fail

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
collect_schemas was reaching 6 levels of nesting. Extracting per-database logic into _collect_database_schemas brings the max depth to 4 and makes the high-level flow readable without scrolling past error handling. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…erminal flush

- Add PostgresSchemaCollector._is_connection_error to narrow recovery to psycopg.Error only, matching the SQL Server approach
- Remove the empty-payload guard: always emit the terminal flush so the backend snapshotting protocol receives collection_payloads_count regardless of how many databases were collected
- Emit status:partial on schema metrics when one or more databases were skipped, distinguishing partial collection from full success or hard error
- Update tests to assert status:partial tag and terminal payload presence

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
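One way the status tag could be derived from the skip count is sketched below. Only `status:partial` appears in this PR; the function name `schema_metric_tags`, the `hard_error` flag, and the `success` and `error` tag values are assumptions made for illustration.

```python
def schema_metric_tags(base_tags, skipped_databases_count, hard_error=False):
    """Return metric tags with a status tag appended (illustrative only)."""
    if hard_error:
        status = "error"      # assumed value for a run that aborted
    elif skipped_databases_count > 0:
        status = "partial"    # at least one database was skipped
    else:
        status = "success"    # assumed value for a fully successful run
    return list(base_tags) + [f"status:{status}"]
```

For example, `schema_metric_tags(["env:prod"], 2)` yields `["env:prod", "status:partial"]`.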
… customer-facing Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Rewrite sqlserver changelog to customer-facing language without implementation examples
- Remove example details from _is_connection_error docstrings
- Narrow PostgresSchemaCollector._is_connection_error to psycopg.OperationalError so only connection-level failures skip a database (ProgrammingError and other non-access errors propagate as hard failures)
- Add unit tests in sqlserver and postgres verifying the override catches driver errors but not internal exceptions

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
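The difference between matching a driver's base error class and matching only its operational errors can be illustrated with the standard-library `sqlite3` module. It is used here purely as a stand-in because it ships with Python and exposes the same DB-API 2.0 exception names; the PR itself targets `pyodbc.Error` and `psycopg.OperationalError`, and the two function names below are hypothetical.

```python
import sqlite3


def is_any_driver_error(exc: Exception) -> bool:
    # Broad check: any exception raised by the driver counts as recoverable.
    return isinstance(exc, sqlite3.Error)


def is_operational_error(exc: Exception) -> bool:
    # Narrow check: only connection-level or operational failures are
    # recoverable; ProgrammingError and non-driver exceptions are not.
    return isinstance(exc, sqlite3.OperationalError)
```

Under the narrow check, a `ProgrammingError` (for instance from a malformed query) would propagate as a hard failure instead of being recorded as a skipped database, and a non-driver exception such as `AttributeError` fails both checks.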
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2231e1168f
    while next_row:
        self._queued_rows.append(self._map_row(database, next_row))
        self._total_rows_count += 1
        next_row = self._get_next(cursor)
        self.maybe_flush(is_last_payload=False)
Avoid emitting rows for a database later skipped
When a recoverable driver error happens after this loop has already appended rows for the current database (or after maybe_flush has sent a chunk), the except below counts the database as skipped but leaves those partial rows in the snapshot. This can happen for SQL Server legacy collection because _map_row runs additional per-table queries, or for any cursor that fails during fetch, so an inaccessible/flaky database may still replace backend metadata with an incomplete schema. Buffer per-database rows and only merge/flush them after the database completes, or only apply the skip path to cursor-open failures.
Good catch. Addressed in the latest commit: unflushed rows accumulated for the current database are now discarded from _queued_rows in the except block before the skip is recorded. For the common case (database fails before or early in collection, no chunk flush yet) the snapshot stays clean. For large databases where a mid-loop maybe_flush already sent a chunk, those rows can't be recalled — a separate warning is emitted to make the incomplete snapshot visible in logs.
If an error fires after some rows have already been appended to the queue (e.g. _map_row failure on the nth table), roll back the unflushed rows for that database so they don't pollute the snapshot. Rows already sent via a mid-loop chunk flush cannot be recalled; log a warning in that case so the incomplete snapshot is visible. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
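The rollback described here can be sketched as below, under simplifying assumptions: the queue is a plain list, `flush` is any callable that receives a chunk, and `collect_database` and `flaky_rows` are hypothetical names that do not appear in the PR.

```python
import logging

logger = logging.getLogger(__name__)


def collect_database(queue, database, rows, flush, chunk_size=1000):
    """Queue rows for one database; roll back its unflushed rows on failure.

    Returns a tuple (completed, partially_flushed).
    """
    start = len(queue)  # rows before this index belong to earlier databases
    partially_flushed = False
    try:
        for row in rows:
            queue.append((database, row))
            if len(queue) >= chunk_size:
                flush(list(queue))  # mid-loop chunk flush; cannot be recalled
                queue.clear()
                start = 0
                partially_flushed = True
    except Exception:
        del queue[start:]  # drop only this database's unflushed rows
        if partially_flushed:
            logger.warning(
                "Rows for %s were already flushed; snapshot is incomplete", database
            )
        return False, partially_flushed
    return True, partially_flushed


def flaky_rows():
    """Yields one row, then fails, like a cursor that breaks during fetch."""
    yield "t1"
    raise ConnectionError("lost connection")
```

Recording the queue length before the database starts makes the rollback a single slice deletion, and resetting that index after a chunk flush keeps it valid for the rows queued afterwards.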
Validation Report: All 21 validations passed.
What does this PR do?
When a database cannot be opened during schema collection, the per-database error is now caught and the database is skipped with a warning instead of aborting collection for all databases. All other databases on the instance continue to be collected and flushed normally.
Changes across datadog_checks_base, sqlserver, and postgres:
- _collect_database_schemas (extracted helper) wraps each database in a try/except guarded by _is_connection_error. SQL Server catches pyodbc.Error; Postgres catches psycopg.OperationalError. Internal errors (logic bugs, uninitialized state) propagate as before.
- dd.<dbms>.schema.skipped_databases_count gauge emitted each run; status:partial tag on all schema metrics when at least one database was skipped.
- The skip warning preserves the stack trace in logs (exc_info=True).
- maybe_flush(is_last_payload=True) always fires so the backend receives collection_payloads_count regardless of skip count.
- Per-database logic is extracted into _collect_database_schemas to reduce nesting depth in collect_schemas.
Motivation
When database_autodiscovery discovers a database that cannot be opened, _get_cursor() raises a driver-level error. The outer collect_schemas() previously caught and re-raised this, discarding all schemas collected from databases processed earlier in the same run. Accessible databases on the same instance reported no schema data.
This affects SQL Server and Postgres integrations that inherit from the shared SchemaCollector base class.
Related escalation: SDBM-2634.
Review checklist (to be filled by reviewers)
- Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
- Add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged.