Fix #26265: Add downstream lineage for ClickHouse MATERIALIZED VIEW TO clause#27622
Fix #26265: Add downstream lineage for ClickHouse MATERIALIZED VIEW TO clause#27622Jtss-ux wants to merge 2 commits intoopen-metadata:mainfrom
Conversation
…: Add downstream lineage for ClickHouse MATERIALIZED VIEW TO clause Added regex for extracting schema and table from ClickHouse MATERIALIZED VIEW DDL. Implemented a function to extract target table information from the DDL.
…lickHouse MATERIALIZED VIEW TO clausekHouse views Refactor regex for ClickHouse MATERIALIZED VIEW DDL parsing and enhance lineage extraction for downstream links.
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| def _extract_mv_to_table(ddl: str) -> Optional[tuple]: | ||
| """ | ||
| Given the DDL of a ClickHouse MATERIALIZED VIEW, return (schema, table) | ||
| for the TO clause target, or None if the DDL does not use the TO syntax. | ||
| """ | ||
| if not re.search(r"\bMATERIALIZED\s+VIEW\b", ddl, re.IGNORECASE): | ||
| return None | ||
| match = _CLICKHOUSE_MV_TO_RE.search(ddl) | ||
| if match: | ||
| return match.group("schema"), match.group("table") | ||
| return None |
There was a problem hiding this comment.
🚨 Bug: Fatally broken indentation — file will not parse as Python
The entire file has wildly inconsistent indentation that will cause IndentationError or SyntaxError at import time, making this code completely non-functional.
Examples of broken indentation:
_extract_mv_to_table(line 51): function body indented 8 spaces, butmatch = ...at 16 spaces,if match:back to 4 spaces,returnat 16 spaces. This is syntactically invalid Python._get_mv_downstream_lineage: method body starts at 16 spaces (line 100),try:drops to 8 spaces (line 113), content insidetryjumps to 24 spaces (line 114),if from_entity and to_entity:at 12 spaces (line 123),else:at column 0 (line 145),exceptat column 0 (line 156). Every block boundary is misaligned.- Docstrings use escalating indentation (each line indented further than the last) instead of consistent block indentation.
- The
filtersclass variable string content has cascading indentation that changes its whitespace value.
This appears to be a tab/space or editor misconfiguration issue. The entire file needs to be reformatted with consistent 4-space indentation.
Suggested fix:
Reformat the entire file using a standard Python formatter (black, ruff format). As a quick reference, all class method bodies should be indented at 8 spaces (4 for class + 4 for method), all `try`/`except`/`else` blocks must align with each other, and `if`/`else` blocks must align with each other. The module-level function `_extract_mv_to_table` should use 4-space indentation for its body.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
|
|
||
| try: | ||
| from_entity = self.metadata.get_by_name( | ||
| entity=self.metadata._get_entity_class("table"), # noqa: SLF001 |
There was a problem hiding this comment.
💡 Quality: Use public Table class instead of private _get_entity_class
The code calls self.metadata._get_entity_class("table") (lines 115, 119) which accesses a private method. The standard pattern across all other lineage sources is to import Table directly and pass it:
from metadata.generated.schema.entity.data.table import Table
self.metadata.get_by_name(entity=Table, fqn=...)Using the private API is fragile and inconsistent with the codebase.
Suggested fix:
Add `from metadata.generated.schema.entity.data.table import Table` to imports, then replace:
entity=self.metadata._get_entity_class("table")
with:
entity=Table
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| _CLICKHOUSE_MV_TO_RE = re.compile( | ||
| r""" | ||
| \bTO\s+ # literal TO keyword | ||
| (?:`?(?P<schema>[^`.\s]+)`?\.)? # optional schema (backtick-quoted or plain) | ||
| `?(?P<table>[^`\s(,]+)`? # table name | ||
| """, | ||
| re.IGNORECASE | re.VERBOSE, | ||
| ) |
There was a problem hiding this comment.
⚠️ Bug: Regex may false-match TO keyword inside SELECT body of MV
The _CLICKHOUSE_MV_TO_RE regex matches any TO <schema>.<table> pattern anywhere in the DDL string. In a ClickHouse MATERIALIZED VIEW, the SELECT body can contain subqueries or expressions with a TO keyword (e.g., toDate(), CAST(... TO ...), etc.). While the \b word boundary helps, function names like toDate won't match due to no space, but CAST(x AS type) TO ... or comments containing TO schema.table could produce false positives.
A more robust approach would be to anchor the regex to match TO only in the DDL preamble (before the AS SELECT or AS ( portion), for example by splitting the DDL at the AS keyword first, or by using a regex that matches the full CREATE MATERIALIZED VIEW ... TO ... AS structure.
Suggested fix:
Split the DDL before searching:
# Only search the preamble (before AS SELECT)
as_match = re.search(r'\bAS\s+(?:SELECT|\()', ddl, re.IGNORECASE)
preamble = ddl[:as_match.start()] if as_match else ddl
match = _CLICKHOUSE_MV_TO_RE.search(preamble)
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| def yield_view_lineage(self) -> Iterable[Either[AddLineageRequest]]: | ||
| """ | ||
| Extends the base view lineage processing with ClickHouse-specific | ||
| MATERIALIZED VIEW TO <schema>.<table> downstream link generation. | ||
| """ | ||
| yield from super().yield_view_lineage() | ||
|
|
||
| logger.info( | ||
| "Processing ClickHouse MATERIALIZED VIEW downstream lineage (TO clause)" | ||
| ) | ||
| for view in self.view_lineage_producer(): | ||
| yield from self._get_mv_downstream_lineage(view) |
There was a problem hiding this comment.
💡 Performance: view_lineage_producer() called twice, duplicating API calls
yield_view_lineage() first calls super().yield_view_lineage() which internally invokes view_lineage_producer() to fetch all view definitions from Elasticsearch, and then calls self.view_lineage_producer() again (line 173), making a second round of API calls to fetch the same data. For services with many views, this doubles the I/O.
Consider caching the producer results or restructuring to iterate once.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
Code Review 🚫 Blocked 0 resolved / 4 findingsAdds downstream lineage for ClickHouse MATERIALIZED VIEW TO clauses, but fatal indentation errors prevent Python execution. The implementation also contains a regex false-positive risk, redundant API calls, and improper use of private internal methods. 🚨 Bug: Fatally broken indentation — file will not parse as Python📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:51-61 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:89-103 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:145-156 The entire file has wildly inconsistent indentation that will cause Examples of broken indentation:
This appears to be a tab/space or editor misconfiguration issue. The entire file needs to be reformatted with consistent 4-space indentation. Suggested fix
|
| Compact |
|
Was this helpful? React with 👍 / 👎 | Gitar
Description
Fixes #26265
ClickHouse
MATERIALIZED VIEWdefinitions using theTO <schema>.<table>syntax were generating correct upstream lineage (via theFROMclause) but missing the downstream link to the table specified in theTOclause.Root Cause
The existing view lineage processor parses the view DDL using the generic SQL lineage parser, which understands standard
FROMsources but does not recognise the ClickHouse-specificTO <schema>.<table>syntax as a downstream target.Changes
ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py_extract_mv_to_table(ddl)- a regex helper that extracts theTOtarget from a ClickHouseMATERIALIZED VIEWDDL._get_mv_downstream_lineage(view)- yields anAddLineageRequestfrom the materialized view entity -> the TO-table entity.yield_view_lineage()to emit the extra downstream links.Checklist