branch-4.1: [feat] add Parquet metadata TVF #58972 #61357

Closed
xylaaaaa wants to merge 4 commits into apache:branch-4.1 from xylaaaaa:auto-pick-58972-branch-4.1

xylaaaaa commented Mar 16, 2026

Cherry-pick #58972 to branch-4.1

What problem does this PR solve?

Add Parquet metadata TVF.

Cherry-pick commit

xylaaaaa requested a review from yiguolei as a code owner on March 16, 2026 03:10
Copilot AI review requested due to automatic review settings on March 16, 2026 03:10
Thearas commented Mar 16, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@xylaaaaa
Contributor Author

run buildall

Copilot AI left a comment

Pull request overview

This PR introduces a new Parquet metadata table-valued function (TVF) pipeline (parquet_meta + convenience aliases) that lets users query Parquet footer/schema/statistics/bloom info via the existing metadata scan framework (FE thrift params → BE MetaScanner reader), along with regression + unit tests.

Changes:

  • Add TMetadataType.PARQUET and a new thrift payload TParquetMetadataParams, then wire a new BE ParquetMetadataReader into MetaScanner.
  • Add FE catalog + Nereids bindings for parquet_meta and aliases (parquet_file_metadata, parquet_kv_metadata, parquet_bloom_probe) and parameter validation / glob expansion.
  • Add regression coverage (S3/HDFS/local) and BE unit tests for shared Parquet utility helpers.
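
The parameter validation and glob expansion mentioned above can be pictured with a small Java sketch. This is not the PR's implementation: `ParquetPathExpander` and its `expand` method are hypothetical names, the candidate list stands in for whatever file listing the FE actually performs, and `java.nio` glob matching is used here only because it is a well-known stand-in. It applies to local-style paths; remote URIs such as `s3://...` would need a different matcher.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the PR: validates the "path" property
// and expands a glob against an already-listed set of candidate files.
final class ParquetPathExpander {
    private ParquetPathExpander() {}

    static List<String> expand(String pattern, List<String> candidates) {
        // Reject a missing or blank path before doing any matching.
        if (pattern == null || pattern.trim().isEmpty()) {
            throw new IllegalArgumentException("Property 'path' must be set for parquet_meta");
        }
        // Standard JDK glob matcher; '*' does not cross directory boundaries.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> files = List.of("data/meta.parquet", "data/kvmeta.parquet", "data/notes.txt");
        System.out.println(expand("data/*.parquet", files));
    }
}
```

Validating before expanding keeps the error message tied to the user-supplied property rather than to an empty match result.
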

Reviewed changes

Copilot reviewed 18 out of 22 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy | New regression suite covering parquet_meta modes across S3/HDFS/local + error cases |
| regression-test/data/external_table_p0/tvf/test_parquet_meta_tvf.out | Golden output for the new regression suite |
| regression-test/data/external_table_p0/tvf/meta.parquet | Test Parquet file used by regression suite |
| regression-test/data/external_table_p0/tvf/kvmeta.parquet | Test Parquet file containing KV metadata |
| regression-test/data/external_table_p0/tvf/empty.parquet | Empty Parquet test file |
| regression-test/data/external_table_p0/tvf/bloommeta.parquet | Parquet file with bloom filter metadata for probe mode |
| gensrc/thrift/Types.thrift | Add TMetadataType.PARQUET |
| gensrc/thrift/PlanNodes.thrift | Add TParquetMetadataParams and attach to TMetaScanRange |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/TableValuedFunctionIf.java | Register parquet_meta + alias function names in FE TVF factory |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ParquetMetadataTableValuedFunction.java | Implement FE-side parquet_meta TVF param validation + glob expansion + scan-range building |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/TableValuedFunctionVisitor.java | Add Nereids visitor hook for ParquetMeta |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetMeta.java | Nereids table function binding for parquet_meta |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetFileMetadata.java | Nereids binding for parquet_file_metadata alias |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetKvMetadata.java | Nereids binding for parquet_kv_metadata alias |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetBloomProbe.java | Nereids binding for parquet_bloom_probe alias |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinTableValuedFunctions.java | Register new Parquet TVFs as built-ins for Nereids |
| be/test/vec/exec/format/parquet/parquet_utils_test.cpp | Unit tests for new parquet_utils helpers |
| be/src/vec/exec/scan/meta_scanner.cpp | Route TMetadataType::PARQUET to the new BE reader |
| be/src/vec/exec/format/table/parquet_utils.h | Shared constants + helper APIs for Parquet metadata formatting |
| be/src/vec/exec/format/table/parquet_utils.cpp | Implement helper routines (type strings, stats decode, path mapping, etc.) |
| be/src/vec/exec/format/table/parquet_metadata_reader.h | New BE GenericReader that reads Parquet footers and emits rows per mode |
| be/src/vec/exec/format/table/parquet_metadata_reader.cpp | Implementation of the Parquet metadata scan (schema/metadata/file/kv/bloom modes) |

Comment on lines +41 to +45
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_file_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);

Comment on lines +41 to +45
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_kv_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);

Comment on lines +41 to +45
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_bloom_probe");
            return new ParquetMetadataTableValuedFunction(arguments);
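
Each alias excerpt above calls `put` on the map returned by `getTVFProperties().getMap()`. The excerpts do not show whether that map is mutable or shared with other callers. A minimal sketch of the defensive alternative, assuming the caller would rather not touch the original map; `AliasModeInjector` and `withMode` are hypothetical names, not part of the PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: returns a copy of the TVF
// properties with the alias's mode forced, leaving the input untouched.
final class AliasModeInjector {
    private AliasModeInjector() {}

    static Map<String, String> withMode(Map<String, String> properties, String mode) {
        // Copy first so an immutable or shared input map is never modified.
        Map<String, String> copy = new HashMap<>(properties);
        // The alias always wins over any user-supplied "mode" value.
        copy.put("mode", mode);
        return copy;
    }

    public static void main(String[] args) {
        // Map.of returns an immutable map; calling put on it directly would throw.
        Map<String, String> props = Map.of("path", "/tmp/kvmeta.parquet");
        System.out.println(withMode(props, "parquet_kv_metadata"));
    }
}
```

With a helper of this shape, each alias could pass `withMode(getTVFProperties().getMap(), "parquet_kv_metadata")` to the constructor, which also keeps the three near-identical bodies to a single line each.
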
Comment on lines +200 to +204
        String scheme = null;
        try {
            scheme = new URI(parsedPath).getScheme();
        } catch (URISyntaxException e) {
            scheme = null;
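
The excerpt above treats a `URISyntaxException` the same as a path with no scheme. The standalone sketch below reproduces that behaviour with `java.net.URI` so the edge cases are easy to try; `SchemeProbe` and `schemeOf` are hypothetical names, and the sample paths are illustrative only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical helper mirroring the excerpt: a missing scheme and an
// unparseable path both come back as "no scheme".
final class SchemeProbe {
    private SchemeProbe() {}

    static Optional<String> schemeOf(String path) {
        try {
            // getScheme() returns null for scheme-less paths such as /tmp/a.parquet.
            return Optional.ofNullable(new URI(path).getScheme());
        } catch (URISyntaxException e) {
            // Characters that are illegal in a URI (a space, for example) land here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("s3://bucket/meta.parquet"));
        System.out.println(schemeOf("/tmp/meta.parquet"));
        System.out.println(schemeOf("s3://bucket/my file.parquet"));
    }
}
```

One consequence worth noting: a path that does carry a scheme but contains an illegal character is reported as having none, so any logic that branches on the scheme would handle it like a scheme-less path.
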
Comment on lines +654 to +658
        case tparquet::Type::INT32: {
            int64_t v = std::stoll(literal);
            int32_t v32 = static_cast<int32_t>(v);
            out->assign(reinterpret_cast<const char*>(&v32), sizeof(int32_t));
            return Status::OK();
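
In the excerpt above, `std::stoll` throws `std::invalid_argument` or `std::out_of_range` on bad input, and the `static_cast<int32_t>` silently narrows values that do not fit in 32 bits. Whether the surrounding code guards against either case is not visible here. The excerpt is C++; the sketch below restates the idea in Java purely for illustration, with explicit range checking. `Int32Literal` and `encode` are hypothetical names, and the little-endian layout follows Parquet's plain encoding for INT32.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not from the PR: parses a literal and encodes it as a
// 4-byte little-endian INT32, rejecting malformed or out-of-range input.
final class Int32Literal {
    private Int32Literal() {}

    static byte[] encode(String literal) {
        long wide;
        try {
            wide = Long.parseLong(literal.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not an integer literal: " + literal, e);
        }
        // Refuse to narrow silently: a wrapped value would probe for the wrong key.
        if (wide < Integer.MIN_VALUE || wide > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("out of INT32 range: " + literal);
        }
        return ByteBuffer.allocate(Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) wide)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(encode("1")));
    }
}
```

In the C++ code the equivalent would be to report a failed parse or an out-of-range value as an error status instead of letting the exception escape or the value wrap.
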
Comment on lines +40 to +44
namespace doris::vectorized {
class Block;

// Lightweight reader that surfaces Parquet footer metadata as a table-valued scan.
// It reads only file footers (no data pages) and emits either schema rows or

    } else if (params.__isset.paths && !params.paths.empty()) {
        resolved_paths.assign(params.paths.begin(), params.paths.end());
    } else {
        return Status::InvalidArgument("Property 'path' must be set for parquet_meta");
Comment on lines +49 to +53
namespace doris::vectorized {

using namespace parquet_utils;

class ParquetMetadataReader::ModeHandler {
@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 79.15% (1788/2259) |
| Line Coverage | 64.53% (31962/49530) |
| Region Coverage | 65.33% (15986/24468) |
| Branch Coverage | 55.95% (8512/15214) |

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.39% (5/209) 🎉
Increment coverage report
Complete coverage report

xylaaaaa force-pushed the auto-pick-58972-branch-4.1 branch from 96798c6 to 92ba67f on March 16, 2026 05:12
@xylaaaaa
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 79.15% (1788/2259) |
| Line Coverage | 64.47% (31933/49530) |
| Region Coverage | 65.31% (15979/24468) |
| Branch Coverage | 55.88% (8502/15214) |

@yiguolei
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.39% (5/209) 🎉
Increment coverage report
Complete coverage report

xylaaaaa and others added 3 commits March 16, 2026 09:59
### What problem does this PR solve?

Expose Parquet file metadata via a table-valued function for inspection
and debugging.

### Issue Number: close #xxx

### Related PR: #xxx

### Problem Summary:
- Add a Parquet metadata TVF so users can query Parquet file metadata
via SQL.
- Backend adds a Parquet metadata reader and scan path; frontend wires
the TVF definition.
- Enables easy inspection of partitions/row groups/column stats to aid
troubleshooting.
…kets (apache#60938)

## Summary
- adjust `test_parquet_meta_tvf` S3-mode checks to compare only stable
columns
- avoid asserting `file_name` / full S3 URI fields that vary by pipeline
bucket
- update the corresponding `.out` baseline for the changed query
projections

## Why
Different CI pipelines may use different bucket names, which causes
false failures when full URI/file name columns are compared.

## Test
- attempted: `./run-regression-test.sh --run -f
external_table_p0/tvf/test_parquet_meta_tvf -forceGenOut`
- in this environment it failed with S3 `FORBIDDEN` while reading
regression parquet files
morningman force-pushed the auto-pick-58972-branch-4.1 branch from 6a7770f to 7760f61 on March 16, 2026 17:00
@morningman
Contributor

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 79.15% (1788/2259) |
| Line Coverage | 64.50% (31947/49530) |
| Region Coverage | 65.35% (15989/24468) |
| Branch Coverage | 55.88% (8501/15214) |

@xylaaaaa
Contributor Author

run buildall

@doris-robot

BE UT Coverage Report

Increment line coverage 16.06% (169/1052) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 52.85% (19305/36531) |
| Line Coverage | 36.16% (180307/498666) |
| Region Coverage | 32.69% (139424/426491) |
| Branch Coverage | 33.72% (60733/180100) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 16.16% (170/1052) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.34% (25514/35766) |
| Line Coverage | 54.02% (268894/497782) |
| Region Coverage | 51.40% (221421/430766) |
| Branch Coverage | 52.95% (95713/180745) |

@morningman
Contributor

Picked here: #61446

morningman closed this on Mar 17, 2026