
Db ingestion improvements #27609

Open
harshach wants to merge 5 commits into main from db_ingestion_improvements

Conversation

Collaborator

@harshach harshach commented Apr 22, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Memory optimization:
    • Added _clear_thread_reflection_cache to CommonDbSourceService to flush SQLAlchemy's unbounded info_cache between schema boundaries.
    • Mitigates OOM risks during heavy metadata ingestion for connectors like Snowflake by reclaiming memory per-schema.
  • Testing:
    • Added TestSchemaBoundaryReflectionCacheClear to verify cache clearing logic and ensure thread-safe operation by isolating inspector cache purges.


harshach and others added 3 commits April 21, 2026 23:18
MultiDBSource connectors (Snowflake, BigQuery, Redshift, ...) call
CommonDbSourceService.set_inspector() once per database to re-point the
Inspector. The previous engine was simply reassigned without .dispose(),
leaking its QueuePool, dialect metadata, Inspector info_cache, and any
per-engine `event.listens_for(engine, "connect")` closures that capture
self (e.g. SnowflakeSource.set_session_query_tag). Across N databases
this accumulates linearly — a significant contributor to long-running-
pipeline memory growth.

kill_active_connections() was already in place but only disposes when
the pool has active/idle checkouts, which misses the idle-between-
switches case. Now set_inspector unconditionally disposes the old
engine, guarded for None (first call) and for dispose failures (a
teardown error must not block the new engine assignment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
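The disposal logic this commit describes can be sketched as follows. `dispose_previous_engine` and `FakeEngine` are illustrative stand-ins, not the PR's actual code; the real method lives inside `CommonDbSourceService.set_inspector`:

```python
import logging

logger = logging.getLogger(__name__)


def dispose_previous_engine(engine) -> None:
    """Unconditionally dispose the old engine on a database switch.

    Guarded for None (first call) and for dispose failures: a teardown
    error must not block assigning the new engine.
    """
    if engine is None:
        return
    try:
        # Releases the QueuePool, dialect metadata, and breaks per-engine
        # event-listener closures that would otherwise keep `self` alive.
        engine.dispose()
    except Exception as exc:
        logger.warning("Failed to dispose previous engine on DB switch: %s", exc)


class FakeEngine:
    """Stand-in for a SQLAlchemy Engine whose dispose() may raise."""

    def __init__(self, fail: bool = False):
        self.disposed = False
        self.fail = fail

    def dispose(self):
        if self.fail:
            raise RuntimeError("pool teardown failed")
        self.disposed = True
```

A failing `dispose()` is logged and swallowed, so the caller can always proceed to build and assign the next engine.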
self.context.get_global().deleted_tables is a process-wide list
extended per-schema by _get_table_names_and_types and
_get_stream_names_and_types. In incremental mode, ACCOUNT_USAGE.TABLES
returns soft-deleted rows historically, so the list grows across every
schema of every database and is consumed only at the end of each DB's
topology by mark_tables_as_deleted.

Two consequences this fix addresses:

1. Unbounded memory growth. In deployments with ~39 databases and tens
   of thousands of ghost tables, the list accumulates hundreds of MB
   of FQN strings, held for the whole run.

2. Cross-database delete replay. Entries from database N linger in the
   list while database N+1 is being processed, so delete-by-name calls
   at the end of DB N+1 receive FQNs belonging to DB N — potentially
   marking tables as deleted that still exist.

Adds _reset_per_database_state() called at the start of each DB cycle
in both configured-DB and multi-DB branches of get_database_names.
Uses list.clear() in place (rather than reassignment) so any future
code that caches a local reference to the list sees the cleared state
consistently. Thread safety is preserved by the topology runner's
ThreadPoolExecutor context manager, which joins all schema-level
workers before returning to the producer generator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
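A minimal sketch of the in-place reset described above; `reset_per_database_state` is an illustrative stand-in for the PR's `_reset_per_database_state`:

```python
def reset_per_database_state(deleted_tables: list) -> None:
    """Clear the process-wide deleted-tables list at a database boundary.

    Uses list.clear() in place, not reassignment, so any code holding a
    cached reference to the same list observes the cleared state.
    """
    deleted_tables.clear()


# Aliased references see the in-place clear; reassignment
# (deleted_tables = []) would leave the alias pointing at the
# stale, still-growing list.
deleted_tables = ["db1.schema.ghost_a", "db1.schema.ghost_b"]
alias = deleted_tables  # e.g. a local reference cached elsewhere
reset_per_database_state(deleted_tables)
assert alias == [] and alias is deleted_tables
```

This is why the commit calls out `list.clear()` over reassignment: both empty the name `deleted_tables`, but only the in-place clear also empties every alias.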
@harshach harshach requested a review from a team as a code owner April 22, 2026 06:20
Copilot AI review requested due to automatic review settings April 22, 2026 06:20
@github-actions bot added the backend and safe to test labels Apr 22, 2026
@@ -153,6 +153,16 @@ def set_inspector(self, database_name: str) -> None:
"""

kill_active_connections(self.engine)
💡 Performance: Engine may be disposed twice per database switch

kill_active_connections() already calls engine.dispose() when there are active/idle connections (see connections_utils.py). The new code at line 163 then calls self.engine.dispose() unconditionally, leading to a double-dispose on the same engine. SQLAlchemy's dispose() is idempotent so this is safe, but the redundant pool teardown is unnecessary work.

Suggested fix:

Move `kill_active_connections` into the `if self.engine is not None` guard and call `dispose()` only once, or remove the conditional dispose from `kill_active_connections` since the caller now handles it:

    if self.engine is not None:
        kill_active_connections(self.engine)
        try:
            self.engine.dispose()
        except Exception as exc:
            logger.warning(...)


Contributor

Copilot AI left a comment

Pull request overview

This PR targets long-running, multi-database ingestion memory growth by (1) scoping Snowflake incremental deleted_tables tracking to the current database and (2) disposing SQLAlchemy engines when switching databases, with unit tests to prevent regressions.

Changes:

  • Reset Snowflake per-database state (deleted_tables) before yielding each database from get_database_names().
  • Dispose the previous SQLAlchemy engine during CommonDbSourceService.set_inspector() DB switches.
  • Add unit tests covering deleted-table resets (Snowflake) and engine disposal behavior (common DB source).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ingestion/src/metadata/ingestion/source/database/snowflake/metadata.py | Adds a per-database reset hook to clear deleted_tables before each DB is processed. |
| ingestion/src/metadata/ingestion/source/database/common_db_source.py | Disposes the previous engine on DB switch to reduce resource retention across multi-DB runs. |
| ingestion/tests/unit/topology/database/test_snowflake.py | Adds regression tests asserting deleted_tables is cleared between DB yields (configured DB and discovery paths). |
| ingestion/tests/unit/topology/database/test_common_db_source.py | Adds tests verifying engine disposal (and resilience to dispose failures) on DB switch. |
Comments suppressed due to low confidence (1)

ingestion/src/metadata/ingestion/source/database/common_db_source.py:173

  • set_inspector replaces self.engine but does not rebind/replace self.session (created in __init__ via create_and_bind_thread_safe_session(self.engine)). This keeps a strong reference to the old Engine (and its dialect/event listeners), undermining the intent of disposing the old engine and potentially causing any session usage to target the wrong DB. Consider calling self.session.remove() (if present) and recreating self.session bound to the new engine after get_connection(...).
        new_service_connection = deepcopy(self.service_connection)
        new_service_connection.database = database_name
        self.engine = get_connection(new_service_connection)

        self._connection_map = {}  # Lazy init as well
        self._inspector_map = {}
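The reviewer's rebind suggestion could look roughly like the sketch below. `FakeScopedSession` and `rebind_session` are illustrative stand-ins; the real fix would use SQLAlchemy's `scoped_session.remove()` and the codebase's `create_and_bind_thread_safe_session` helper:

```python
class FakeScopedSession:
    """Minimal stand-in for a SQLAlchemy scoped_session bound to an engine."""

    def __init__(self, engine):
        self.bind = engine
        self.removed = False

    def remove(self):
        # Discards the thread-local session so no connection (or engine
        # reference) lingers after the switch.
        self.removed = True


def rebind_session(old_session, new_engine):
    """Drop the old thread-local session, then return a fresh session
    bound to the new engine, so no strong reference to the disposed
    engine survives the database switch."""
    if old_session is not None:
        old_session.remove()
    return FakeScopedSession(new_engine)
```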

Comment on lines 155 to +162

    kill_active_connections(self.engine)
    # Release the previous engine's pool, dialect state, Inspector
    # info_cache, and any per-engine event listeners that capture `self`
    # (e.g. SnowflakeSource.set_session_query_tag). Without this, each
    # database switch leaks ~tens of MB → linear growth on multi-DB
    # pipelines.
    if self.engine is not None:
        try:
Copilot AI Apr 22, 2026

kill_active_connections(self.engine) is executed unconditionally. If set_inspector is ever called before an engine is initialized (or after it has been cleared), kill_active_connections will log a warning due to engine being None. Consider guarding this call with if self.engine is not None: (similar to the new dispose guard) to avoid noisy logs and unnecessary exception handling in normal flows.

    try:
        self.engine.dispose()
    except Exception as exc:
        logger.warning(f"Failed to dispose previous engine on DB switch: {exc}")
Copilot AI Apr 22, 2026

The warning on dispose failure drops the traceback, which will make diagnosing pool teardown issues harder in production. Consider logging with exc_info=True (or logger.exception(...)) so the stack trace is captured while still proceeding with the DB switch.

Suggested change

    logger.warning(f"Failed to dispose previous engine on DB switch: {exc}")
    logger.warning(
        f"Failed to dispose previous engine on DB switch: {exc}",
        exc_info=True,
    )

harshach and others added 2 commits April 22, 2026 00:01
SQLAlchemy's Inspector.info_cache is an unbounded dict populated by
every @reflection.cache-decorated method (get_schema_columns,
get_pk_constraint, get_foreign_keys, get_unique_constraints, get_columns,
get_view_definition, get_table_ddl, ...). Inspectors are retained
per-thread in _inspector_map until close(), so the cache accumulates
every schema's entries for the full duration of each database.

On Snowflake specifically, get_schema_columns materializes the full
column catalog of a schema into a single cache entry — measured at
~50 MB per wide schema. In the 52k-tables / 50-cols-per-table worst
case observed in production, one inspector's cache crosses 1 GB before
any DB-boundary cleanup can run, OOM-killing the pod mid-database.
Multi-threaded ingestion multiplies this by the worker count.

Adds _clear_thread_reflection_cache() which clears info_cache for the
CURRENT thread's inspector only — other threads may be mid-reflection
and clearing their cache would race with in-flight reads. Wired into
get_tables_name_and_type(), the per-schema producer hook in
CommonDbSourceService, so prior-schema entries are reclaimed before
the next schema's entries are populated. The only runtime cost is
re-querying _current_database_schema once per schema, which is a
cheap single-row query.

Benefits every SQL-based connector (Snowflake, BigQuery, Redshift, ...);
Snowflake sees the biggest improvement given its dense reflection path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
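The per-thread cache clearing described above can be sketched like this. `clear_thread_reflection_cache` and `FakeInspector` are illustrative names, keyed per thread ident the way the commit describes `_inspector_map`:

```python
import threading


def clear_thread_reflection_cache(inspector_map: dict) -> None:
    """Clear the reflection info_cache for the CURRENT thread's inspector
    only. Other threads may be mid-reflection, and clearing their caches
    would race with in-flight reads."""
    inspector = inspector_map.get(threading.get_ident())
    if inspector is not None:
        inspector.info_cache.clear()


class FakeInspector:
    """Stand-in for a SQLAlchemy Inspector with its unbounded info_cache."""

    def __init__(self):
        self.info_cache = {}
```

Called at each schema boundary, this caps the cache at roughly one schema's worth of reflection entries instead of an entire database's, at the cost of one cheap re-query per schema.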

gitar-bot Bot commented Apr 22, 2026

Code Review 👍 Approved with suggestions (0 resolved / 1 finding)

Database ingestion logic has been improved to streamline resource handling. Ensure the redundant engine disposal in kill_active_connections() is addressed to prevent potential double-disposal issues during database switches.

💡 Performance: Engine may be disposed twice per database switch

📄 ingestion/src/metadata/ingestion/source/database/common_db_source.py:155 📄 ingestion/src/metadata/ingestion/source/database/common_db_source.py:161-163

kill_active_connections() already calls engine.dispose() when there are active/idle connections (see connections_utils.py). The new code at line 163 then calls self.engine.dispose() unconditionally, leading to a double-dispose on the same engine. SQLAlchemy's dispose() is idempotent so this is safe, but the redundant pool teardown is unnecessary work.

Suggested fix
Move `kill_active_connections` into the `if self.engine is not None` guard and call `dispose()` only once, or remove the conditional dispose from `kill_active_connections` since the caller now handles it:

    if self.engine is not None:
        kill_active_connections(self.engine)
        try:
            self.engine.dispose()
        except Exception as exc:
            logger.warning(...)


@sonarqubecloud

@github-actions
Contributor

🟡 Playwright Results — all passed (13 flaky)

✅ 3032 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 88 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 480 | 0 | 1 | 4 |
| 🟡 Shard 2 | 655 | 0 | 1 | 7 |
| 🟡 Shard 4 | 644 | 0 | 4 | 27 |
| ✅ Shard 5 | 611 | 0 | 0 | 42 |
| 🟡 Shard 6 | 642 | 0 | 7 | 8 |
🟡 13 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Pipeline (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain owner should able to edit description of domain (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for table -> dashboard (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Announcement create, edit & delete (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace


Labels

backend · safe to test
