Fix #26265: Add downstream lineage for ClickHouse MATERIALIZED VIEW TO clause #27622

Open
Jtss-ux wants to merge 2 commits into open-metadata:main from Jtss-ux:fix/clickhouse-mv-to-downstream-lineage

Conversation

@Jtss-ux Jtss-ux commented Apr 22, 2026

Description

Fixes #26265
ClickHouse MATERIALIZED VIEW definitions using the TO <schema>.<table> syntax were generating correct upstream lineage (via the FROM clause) but missing the downstream link to the table specified in the TO clause.

Root Cause

The existing view lineage processor parses the view DDL using the generic SQL lineage parser, which understands standard FROM sources but does not recognise the ClickHouse-specific TO <schema>.<table> syntax as a downstream target.

Changes

  • ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py
    • Added _extract_mv_to_table(ddl) - a regex helper that extracts the TO target from a ClickHouse MATERIALIZED VIEW DDL.
    • Added _get_mv_downstream_lineage(view) - yields an AddLineageRequest from the materialized view entity -> the TO-table entity.
    • Overrode yield_view_lineage() to emit the extra downstream links.
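
As a rough illustration of the extra link described above, the sketch below builds a view -> TO-table edge from a view's fully qualified name and an already extracted (schema, table) pair. `Edge` and `downstream_edge` are hypothetical names for this example and are not the PR's code, which emits an `AddLineageRequest`. The sketch assumes a four-part `service.database.schema.table` name with no quoted dots, and it assumes that an unqualified TO target can fall back to the view's own schema.

```python
from typing import NamedTuple, Optional, Tuple


class Edge(NamedTuple):
    """Hypothetical stand-in for a lineage edge (not AddLineageRequest)."""

    from_fqn: str
    to_fqn: str


def downstream_edge(view_fqn: str, target: Tuple[Optional[str], str]) -> Edge:
    """Build the materialized view -> TO-table edge.

    Assumes view_fqn has exactly four dot-separated parts and that a
    missing schema in the TO clause means "same schema as the view".
    """
    service, database, view_schema, _view_name = view_fqn.split(".")
    schema, table = target
    to_fqn = f"{service}.{database}.{schema or view_schema}.{table}"
    return Edge(from_fqn=view_fqn, to_fqn=to_fqn)


edge = downstream_edge(
    "clickhouse_prod.default.analytics.mv_events",
    ("analytics", "events_agg"),
)
print(edge.from_fqn, "->", edge.to_fqn)
```

The direction matters: the materialized view is the source of the edge and the TO table is the destination, which is the link the description reports as missing.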

Checklist

  • The code follows OpenMetadata coding standards
  • [x] Self-review completed
  • [x] No breaking changes to existing behavior

Jtss-ux added 2 commits April 22, 2026 11:50
…: Add downstream lineage for ClickHouse MATERIALIZED VIEW TO clause

Added regex for extracting schema and table from ClickHouse MATERIALIZED VIEW DDL. Implemented a function to extract target table information from the DDL.
…lickHouse MATERIALIZED VIEW TO clausekHouse views

Refactor regex for ClickHouse MATERIALIZED VIEW DDL parsing and enhance lineage extraction for downstream links.
@Jtss-ux Jtss-ux requested a review from a team as a code owner April 22, 2026 09:41
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment on lines +51 to +61
def _extract_mv_to_table(ddl: str) -> Optional[tuple]:
"""
Given the DDL of a ClickHouse MATERIALIZED VIEW, return (schema, table)
for the TO clause target, or None if the DDL does not use the TO syntax.
"""
if not re.search(r"\bMATERIALIZED\s+VIEW\b", ddl, re.IGNORECASE):
return None
match = _CLICKHOUSE_MV_TO_RE.search(ddl)
if match:
return match.group("schema"), match.group("table")
return None

🚨 Bug: Fatally broken indentation — file will not parse as Python

The entire file has wildly inconsistent indentation that will cause IndentationError or SyntaxError at import time, making this code completely non-functional.

Examples of broken indentation:

  • _extract_mv_to_table (line 51): function body indented 8 spaces, but match = ... at 16 spaces, if match: back to 4 spaces, return at 16 spaces. This is syntactically invalid Python.
  • _get_mv_downstream_lineage: method body starts at 16 spaces (line 100), try: drops to 8 spaces (line 113), content inside try jumps to 24 spaces (line 114), if from_entity and to_entity: at 12 spaces (line 123), else: at column 0 (line 145), except at column 0 (line 156). Every block boundary is misaligned.
  • Docstrings use escalating indentation (each line indented further than the last) instead of consistent block indentation.
  • The filters class variable string content has cascading indentation that changes its whitespace value.

This appears to be a tab/space or editor misconfiguration issue. The entire file needs to be reformatted with consistent 4-space indentation.
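
A quick way to confirm this class of failure without importing the module is to parse the source with the standard library. The snippet below is a minimal sketch: `BROKEN` only imitates the 8/16/4/16-space pattern described above and is not the actual file content, and `syntax_error_name` is a hypothetical helper.

```python
import ast
from typing import Optional

# Imitates the indentation pattern described in the review (8, 16, 4, 16 spaces).
BROKEN = (
    "def f(ddl):\n"
    "        text = ddl.strip()\n"
    "                match = text\n"
    "    if match:\n"
    "                return match\n"
)

# The same logic with consistent 4-space indentation.
FIXED = (
    "def f(ddl):\n"
    "    match = ddl.strip()\n"
    "    if match:\n"
    "        return match\n"
    "    return None\n"
)


def syntax_error_name(source: str) -> Optional[str]:
    """Return the exception class name if the source fails to parse, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__
    return None


print(syntax_error_name(BROKEN))
print(syntax_error_name(FIXED))
```

Running the same check over the changed file (for example by reading it and passing the text to `ast.parse`) would catch the problem before any ingestion workflow tries to import it.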

Suggested fix:

Reformat the entire file using a standard Python formatter (black, ruff format). As a quick reference, all class method bodies should be indented at 8 spaces (4 for class + 4 for method), all `try`/`except`/`else` blocks must align with each other, and `if`/`else` blocks must align with each other. The module-level function `_extract_mv_to_table` should use 4-space indentation for its body.


try:
from_entity = self.metadata.get_by_name(
entity=self.metadata._get_entity_class("table"), # noqa: SLF001

💡 Quality: Use public Table class instead of private _get_entity_class

The code calls self.metadata._get_entity_class("table") (lines 115, 119) which accesses a private method. The standard pattern across all other lineage sources is to import Table directly and pass it:

from metadata.generated.schema.entity.data.table import Table
self.metadata.get_by_name(entity=Table, fqn=...)

Using the private API is fragile and inconsistent with the codebase.

Suggested fix:

Add `from metadata.generated.schema.entity.data.table import Table` to imports, then replace:
  entity=self.metadata._get_entity_class("table")
with:
  entity=Table

Comment on lines +41 to +48
_CLICKHOUSE_MV_TO_RE = re.compile(
r"""
\bTO\s+ # literal TO keyword
(?:`?(?P<schema>[^`.\s]+)`?\.)? # optional schema (backtick-quoted or plain)
`?(?P<table>[^`\s(,]+)`? # table name
""",
re.IGNORECASE | re.VERBOSE,
)

⚠️ Bug: Regex may false-match TO keyword inside SELECT body of MV

The _CLICKHOUSE_MV_TO_RE regex matches any TO <schema>.<table> pattern anywhere in the DDL string, including the SELECT body of the MATERIALIZED VIEW. Function names such as toDate() are safe, because the pattern requires whitespace after the TO keyword. A standalone TO token is not: ORDER BY ... WITH FILL FROM ... TO ..., or a comment containing TO schema.table, could produce a false positive. Because search() returns the first match, the risk is highest for views that have no TO clause at all (for example ENGINE-based materialized views), where the first TO token found is necessarily in the body.

A more robust approach would be to anchor the regex to match TO only in the DDL preamble (before the AS SELECT or AS ( portion), for example by splitting the DDL at the AS keyword first, or by using a regex that matches the full CREATE MATERIALIZED VIEW ... TO ... AS structure.

Suggested fix:

Split the DDL before searching:

  # Only search the preamble (before AS SELECT)
  as_match = re.search(r'\bAS\s+(?:SELECT|\()', ddl, re.IGNORECASE)
  preamble = ddl[:as_match.start()] if as_match else ddl
  match = _CLICKHOUSE_MV_TO_RE.search(preamble)
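
The difference between the two approaches can be seen on a small example. The sketch below is self-contained: `_TO_RE` is a compact equivalent of the quoted pattern, and `extract_naive`, `extract_from_preamble` and both DDL strings are made up for illustration. One limitation to keep in mind is that a body starting with `AS WITH ...` would not be recognised by the split pattern, in which case the whole DDL is searched as before.

```python
import re
from typing import Optional, Tuple

Target = Optional[Tuple[Optional[str], str]]

# Compact form of the TO pattern quoted above.
_TO_RE = re.compile(
    r"\bTO\s+(?:`?(?P<schema>[^`.\s]+)`?\.)?`?(?P<table>[^`\s(,]+)`?",
    re.IGNORECASE,
)
# Start of the view body, as in the suggested fix.
_AS_BODY_RE = re.compile(r"\bAS\s+(?:SELECT|\()", re.IGNORECASE)


def _groups(match: Optional["re.Match[str]"]) -> Target:
    return (match.group("schema"), match.group("table")) if match else None


def extract_naive(ddl: str) -> Target:
    """Search the whole DDL, body included."""
    return _groups(_TO_RE.search(ddl))


def extract_from_preamble(ddl: str) -> Target:
    """Search only the text before the AS SELECT / AS ( body."""
    as_match = _AS_BODY_RE.search(ddl)
    preamble = ddl[: as_match.start()] if as_match else ddl
    return _groups(_TO_RE.search(preamble))


# ENGINE-based view with no TO clause; the comment in the body contains "to".
NO_TO_DDL = (
    "CREATE MATERIALIZED VIEW analytics.mv_daily\n"
    "ENGINE = SummingMergeTree ORDER BY day\n"
    "AS SELECT toDate(ts) AS day, count() AS n\n"
    "FROM analytics.events  -- rows are later copied to archive.events_daily\n"
    "GROUP BY day"
)
# View that really uses the TO clause.
TO_DDL = (
    "CREATE MATERIALIZED VIEW analytics.mv_daily TO analytics.daily "
    "AS SELECT toDate(ts) AS day FROM analytics.events"
)

print(extract_naive(NO_TO_DDL))          # false positive from the comment
print(extract_from_preamble(NO_TO_DDL))  # no TO clause found
print(extract_from_preamble(TO_DDL))     # genuine TO target
```

Restricting the search to the preamble keeps the regex itself unchanged, so the fix stays small and the existing pattern can be reused as is.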

Comment on lines +163 to +174
def yield_view_lineage(self) -> Iterable[Either[AddLineageRequest]]:
"""
Extends the base view lineage processing with ClickHouse-specific
MATERIALIZED VIEW TO <schema>.<table> downstream link generation.
"""
yield from super().yield_view_lineage()

logger.info(
"Processing ClickHouse MATERIALIZED VIEW downstream lineage (TO clause)"
)
for view in self.view_lineage_producer():
yield from self._get_mv_downstream_lineage(view)

💡 Performance: view_lineage_producer() called twice, duplicating API calls

yield_view_lineage() first calls super().yield_view_lineage() which internally invokes view_lineage_producer() to fetch all view definitions from Elasticsearch, and then calls self.view_lineage_producer() again (line 173), making a second round of API calls to fetch the same data. For services with many views, this doubles the I/O.

Consider caching the producer results or restructuring to iterate once.
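
One possible shape for the caching option is sketched below. `BaseLineageSource` is a hypothetical stand-in that only mimics the call pattern described in this comment; it is not the real OpenMetadata base class, and the real producer may behave differently (for example if it streams results or is consumed in worker threads). Views are plain strings here to keep the example runnable.

```python
from typing import Iterable, Iterator, List, Optional


class BaseLineageSource:
    """Hypothetical stand-in for the base lineage source; not the real class."""

    def __init__(self, views: List[str]) -> None:
        self._views = views
        self.fetch_count = 0  # counts simulated backend round trips

    def view_lineage_producer(self) -> Iterator[str]:
        self.fetch_count += 1
        yield from self._views

    def yield_view_lineage(self) -> Iterable[str]:
        for view in self.view_lineage_producer():
            yield f"upstream:{view}"


class CachingClickhouseSource(BaseLineageSource):
    """Fetches the views once and replays them for the downstream pass."""

    def __init__(self, views: List[str]) -> None:
        super().__init__(views)
        self._cached_views: Optional[List[str]] = None

    def view_lineage_producer(self) -> Iterator[str]:
        if self._cached_views is None:
            # First call: materialise the producer output exactly once.
            self._cached_views = list(super().view_lineage_producer())
        yield from self._cached_views

    def yield_view_lineage(self) -> Iterable[str]:
        yield from super().yield_view_lineage()
        for view in self.view_lineage_producer():
            yield f"downstream:{view}"


source = CachingClickhouseSource(["mv_a", "mv_b"])
results = list(source.yield_view_lineage())
print(results, source.fetch_count)
```

The trade-off is memory: the cache holds every view definition until the run finishes, so for very large services a single-pass restructuring may be preferable to caching.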

gitar-bot Bot commented Apr 22, 2026

Code Review 🚫 Blocked 0 resolved / 4 findings

Adds downstream lineage for ClickHouse MATERIALIZED VIEW TO clauses, but fatal indentation errors prevent Python execution. The implementation also contains a regex false-positive risk, redundant API calls, and improper use of private internal methods.

🚨 Bug: Fatally broken indentation — file will not parse as Python

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:51-61 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:89-103 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:145-156

⚠️ Bug: Regex may false-match TO keyword inside SELECT body of MV

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:41-48 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:56-60

💡 Quality: Use public Table class instead of private _get_entity_class

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:115 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:119

💡 Performance: view_lineage_producer() called twice, duplicating API calls

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:163-174


Development

Successfully merging this pull request may close these issues.

[Clickhouse Linage] Missing downsteam for MATERIALIZED VIEW

1 participant