Fix: ClickHouse MATERIALIZED VIEW TO lineage extraction #27628

Open
Jtss-ux wants to merge 3 commits into open-metadata:main from Jtss-ux:fix/clickhouse-materialized-view-lineage
@Jtss-ux Jtss-ux commented Apr 22, 2026

Summary

Fixes #26265

ClickHouse supports CREATE MATERIALIZED VIEW mv_name TO target_table AS SELECT ... syntax, where the TO clause designates the actual destination table for the data. The existing LineageParser passes this raw query directly to collate-sqllineage, which does not understand the TO clause and therefore misidentifies the Materialized View itself as the target — resulting in no lineage edge being drawn to the actual destination table.

Root Cause

collate-sqllineage (the underlying parser used by LineageParser) treats CREATE MATERIALIZED VIEW as a DDL statement without any special handling for the ClickHouse-specific TO clause. As a result:

  • The MV is identified as the target instead of the table specified in TO
  • No lineage edge is created from source table → destination table

Fix

Added a pre-processing step inside LineageParser.clean_raw_query that detects the ClickHouse MATERIALIZED VIEW ... TO ... AS SELECT pattern and rewrites it as a standard CREATE TABLE ... AS SELECT ... statement before it reaches LineageRunner.

This normalization approach:

  • Is minimal and non-invasive — only runs when the MATERIALIZED VIEW + TO pattern is present
  • Handles quoted and schema-qualified table names (e.g. `db`.`table`)
  • Preserves the full SELECT clause for complete source table discovery
  • Uses only the stdlib re module, no new dependencies
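
A minimal sketch of the kind of rewrite described above, under simplifying assumptions. It is not the regex from this PR: the names `rewrite_mv_to`, `_IDENT`, `_NAME`, and `_MV_TO` are hypothetical, and optional clauses such as ON CLUSTER are deliberately ignored.

```python
import re

# Hypothetical helper names; the PR's actual regex may differ.
# One identifier part: backtick-quoted, double-quoted, or bare.
_IDENT = r'(?:`[^`]+`|"[^"]+"|[\w$]+)'
# Optionally schema-qualified name, e.g. db.table or `db`.`table`.
_NAME = rf"{_IDENT}(?:\.{_IDENT})*"

_MV_TO = re.compile(
    rf"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    rf"{_NAME}\s+TO\s+(?P<target>{_NAME})\s+AS\s+(?P<select>SELECT\b.*)$",
    re.IGNORECASE | re.DOTALL,
)


def rewrite_mv_to(query: str) -> str:
    """Rewrite `CREATE MATERIALIZED VIEW mv TO t AS SELECT ...`
    as `CREATE TABLE t AS SELECT ...`.

    Queries that do not match the pattern are returned unchanged.
    """
    match = _MV_TO.match(query)
    if match is None:
        return query
    return f"CREATE TABLE {match['target']} AS {match['select']}"
```

Applied to the query from the Testing section, `rewrite_mv_to` should return `CREATE TABLE default.my_target AS SELECT * FROM default.my_source`; any statement without the MATERIALIZED VIEW + TO shape passes through untouched.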

Changes

  • ingestion/src/metadata/ingestion/lineage/parser.py: added regex transformation in clean_raw_query

Testing

Verified locally using collate-sqllineage's LineageRunner:

Before fix:
```
query = 'CREATE MATERIALIZED VIEW default.my_mv TO default.my_target AS SELECT * FROM default.my_source'
source_tables: {default.my_source}
target_tables: {default.my_mv} <-- WRONG
```

After fix (query rewritten to):
```
CREATE TABLE default.my_target AS SELECT * FROM default.my_source
source_tables: {default.my_source}
target_tables: {default.my_target} <-- CORRECT
```

Related Issue

Closes #26265

@Jtss-ux Jtss-ux requested a review from a team as a code owner April 22, 2026 11:26
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/lineage/parser.py Outdated
Comment thread ingestion/src/metadata/ingestion/lineage/parser.py Outdated
gitar-bot Bot commented Apr 22, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Refactors ClickHouse MATERIALIZED VIEW lineage extraction by moving imports to the module level, refining regex patterns to exclude engine clauses, and adding unit tests. All findings have been resolved.

✅ 3 resolved
Quality: Move import re to module level

📄 ingestion/src/metadata/ingestion/lineage/parser.py:507
The import re statement is placed inside clean_raw_query. Since re is a stdlib module and the method is a classmethod that may be called frequently, the import should be at the top of the file with the other imports. This aligns with PEP 8 and the rest of the codebase's import style.

Edge Case: Regex captures extra clauses (ENGINE/POPULATE) as target table

📄 ingestion/src/metadata/ingestion/lineage/parser.py:521
ClickHouse CREATE MATERIALIZED VIEW supports optional clauses between the TO <target> and AS SELECT, such as ENGINE = MergeTree() ORDER BY id or POPULATE. The non-greedy (.*?) in group 2 will absorb these into the target table name.

Example: CREATE MATERIALIZED VIEW mv TO target POPULATE AS SELECT * FROM src
→ group 2 = target POPULATE
→ rewritten as CREATE TABLE target POPULATE AS SELECT * FROM src

This would cause sqllineage to fail to parse the rewritten query, producing no lineage at all. While POPULATE is deprecated, ENGINE clauses are common in production ClickHouse schemas.
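
The capture problem can be reproduced with a small sketch. Both patterns below are hypothetical reconstructions for illustration, not the regex from the PR: `LOOSE` mimics a lazy wildcard for the target, and `TIGHT` shows one possible tightening (restricting the target to a single token), which differs from stopping explicitly at ENGINE/POPULATE/SETTINGS keywords.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical loose pattern: the "target" group is a lazy wildcard, so it
# keeps growing until the first " AS SELECT" and absorbs extra clauses.
LOOSE = re.compile(
    r"CREATE\s+MATERIALIZED\s+VIEW\s+\S+\s+TO\s+"
    r"(?P<target>.*?)\s+AS\s+(?P<select>SELECT\b.*)",
    FLAGS,
)

# One possible tightening: the target is a single token with no whitespace
# or "(", and anything between it and " AS SELECT" is skipped, not captured.
TIGHT = re.compile(
    r"CREATE\s+MATERIALIZED\s+VIEW\s+\S+\s+TO\s+"
    r"(?P<target>[^\s(]+).*?\s+AS\s+(?P<select>SELECT\b.*)",
    FLAGS,
)

# Query taken from the finding's example.
query = "CREATE MATERIALIZED VIEW mv TO target POPULATE AS SELECT * FROM src"

loose_target = LOOSE.match(query)["target"]  # "target POPULATE"
tight_target = TIGHT.match(query)["target"]  # "target"
```

With the loose pattern the rewritten statement would begin `CREATE TABLE target POPULATE AS ...`, which is the failure mode described above; the tight pattern yields `target` alone.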

Quality: No unit tests for ClickHouse MV TO rewrite logic

The existing test file ingestion/tests/unit/test_query_parser.py has tests for other clean_raw_query transformations (COPY GRANTS, MERGE INTO, COPY FROM, CREATE TRIGGER, etc.), but no tests were added for the new ClickHouse MATERIALIZED VIEW TO rewrite. This regex has non-trivial matching behavior that should be covered.

Suggested test cases:

  • Basic: CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT * FROM db.src
  • With IF NOT EXISTS
  • Without TO clause (should pass through unchanged)
  • With schema-qualified and backtick-quoted names
  • With ON CLUSTER clause before TO
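
The suggested cases could be sketched as unit tests along these lines. The `rewrite_mv_to` function is a hypothetical stand-in so the example runs on its own; the real logic lives in `LineageParser.clean_raw_query` and may behave differently, and the test class name is likewise an assumption.

```python
import re
import unittest

# Hypothetical stand-in for the rewrite under test, not the PR's code.
_MV_TO = re.compile(
    r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+(?:IF\s+NOT\s+EXISTS\s+)?\S+"
    r"(?:\s+ON\s+CLUSTER\s+\S+)?\s+TO\s+(?P<target>[^\s(]+)"
    r".*?\s+AS\s+(?P<select>SELECT\b.*)$",
    re.IGNORECASE | re.DOTALL,
)


def rewrite_mv_to(query: str) -> str:
    """Return CREATE TABLE <target> AS SELECT ... or the query unchanged."""
    m = _MV_TO.match(query)
    return f"CREATE TABLE {m['target']} AS {m['select']}" if m else query


class ClickHouseMvToRewriteTest(unittest.TestCase):
    EXPECTED = "CREATE TABLE db.target AS SELECT * FROM db.src"

    def test_basic(self):
        query = "CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT * FROM db.src"
        self.assertEqual(rewrite_mv_to(query), self.EXPECTED)

    def test_if_not_exists(self):
        query = (
            "CREATE MATERIALIZED VIEW IF NOT EXISTS db.mv TO db.target "
            "AS SELECT * FROM db.src"
        )
        self.assertEqual(rewrite_mv_to(query), self.EXPECTED)

    def test_without_to_passes_through(self):
        query = (
            "CREATE MATERIALIZED VIEW db.mv ENGINE = MergeTree() ORDER BY id "
            "AS SELECT * FROM db.src"
        )
        self.assertEqual(rewrite_mv_to(query), query)

    def test_backtick_quoted_names(self):
        query = (
            "CREATE MATERIALIZED VIEW `db`.`mv` TO `db`.`target` "
            "AS SELECT * FROM `db`.`src`"
        )
        self.assertEqual(
            rewrite_mv_to(query),
            "CREATE TABLE `db`.`target` AS SELECT * FROM `db`.`src`",
        )

    def test_on_cluster_before_to(self):
        query = (
            "CREATE MATERIALIZED VIEW db.mv ON CLUSTER my_cluster TO db.target "
            "AS SELECT * FROM db.src"
        )
        self.assertEqual(rewrite_mv_to(query), self.EXPECTED)
```

Saved as a module, the cases can be run with `python -m unittest`; in the repository they would presumably call `LineageParser.clean_raw_query` instead of the stand-in.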
@Jtss-ux Jtss-ux force-pushed the fix/clickhouse-materialized-view-lineage branch from 6e4e90b to 0f3b173 April 22, 2026 11:46
Jtss-ux commented Apr 22, 2026

@harshach @nikhilchennam - I've addressed the edge case flagged by Gitar (tightened the regex to stop at ENGINE/POPULATE/SETTINGS) and added 6 unit tests covering the main variants (simple TO, ENGINE clause, IF NOT EXISTS, ON CLUSTER, no-TO passthrough, and end-to-end lineage validation). The branch is also rebased on latest main. Happy to make any further adjustments!

Development

Successfully merging this pull request may close these issues.

[Clickhouse Linage] Missing downsteam for MATERIALIZED VIEW
