branch-4.1: [feat] add Parquet metadata TVF #58972 #61357
xylaaaaa wants to merge 4 commits into apache:branch-4.1
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
Pull request overview
This PR introduces a new Parquet metadata table-valued function (TVF) pipeline (parquet_meta + convenience aliases) that lets users query Parquet footer/schema/statistics/bloom info via the existing metadata scan framework (FE thrift params → BE MetaScanner reader), along with regression + unit tests.
Changes:
- Add `TMetadataType.PARQUET` and a new thrift payload `TParquetMetadataParams`, then wire a new BE `ParquetMetadataReader` into `MetaScanner`.
- Add FE catalog + Nereids bindings for `parquet_meta` and aliases (`parquet_file_metadata`, `parquet_kv_metadata`, `parquet_bloom_probe`) and parameter validation / glob expansion.
- Add regression coverage (S3/HDFS/local) and BE unit tests for shared Parquet utility helpers.
Reviewed changes
Copilot reviewed 18 out of 22 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy | New regression suite covering parquet_meta modes across S3/HDFS/local + error cases |
| regression-test/data/external_table_p0/tvf/test_parquet_meta_tvf.out | Golden output for the new regression suite |
| regression-test/data/external_table_p0/tvf/meta.parquet | Test Parquet file used by regression suite |
| regression-test/data/external_table_p0/tvf/kvmeta.parquet | Test Parquet file containing KV metadata |
| regression-test/data/external_table_p0/tvf/empty.parquet | Empty Parquet test file |
| regression-test/data/external_table_p0/tvf/bloommeta.parquet | Parquet file with bloom filter metadata for probe mode |
| gensrc/thrift/Types.thrift | Add TMetadataType.PARQUET |
| gensrc/thrift/PlanNodes.thrift | Add TParquetMetadataParams and attach to TMetaScanRange |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/TableValuedFunctionIf.java | Register parquet_meta + alias function names in FE TVF factory |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ParquetMetadataTableValuedFunction.java | Implement FE-side parquet_meta TVF param validation + glob expansion + scan-range building |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/TableValuedFunctionVisitor.java | Add Nereids visitor hook for ParquetMeta |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetMeta.java | Nereids table function binding for parquet_meta |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetFileMetadata.java | Nereids binding for parquet_file_metadata alias |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetKvMetadata.java | Nereids binding for parquet_kv_metadata alias |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetBloomProbe.java | Nereids binding for parquet_bloom_probe alias |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinTableValuedFunctions.java | Register new Parquet TVFs as built-ins for Nereids |
| be/test/vec/exec/format/parquet/parquet_utils_test.cpp | Unit tests for new parquet_utils helpers |
| be/src/vec/exec/scan/meta_scanner.cpp | Route TMetadataType::PARQUET to the new BE reader |
| be/src/vec/exec/format/table/parquet_utils.h | Shared constants + helper APIs for Parquet metadata formatting |
| be/src/vec/exec/format/table/parquet_utils.cpp | Implement helper routines (type strings, stats decode, path mapping, etc.) |
| be/src/vec/exec/format/table/parquet_metadata_reader.h | New BE GenericReader that reads Parquet footers and emits rows per mode |
| be/src/vec/exec/format/table/parquet_metadata_reader.cpp | Implementation of the Parquet metadata scan (schema/metadata/file/kv/bloom modes) |
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_file_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);

    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_kv_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);

    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_bloom_probe");
            return new ParquetMetadataTableValuedFunction(arguments);

    String scheme = null;
    try {
        scheme = new URI(parsedPath).getScheme();
    } catch (URISyntaxException e) {
        scheme = null;

    case tparquet::Type::INT32: {
        int64_t v = std::stoll(literal);
        int32_t v32 = static_cast<int32_t>(v);
        out->assign(reinterpret_cast<const char*>(&v32), sizeof(int32_t));
        return Status::OK();

    namespace doris::vectorized {
    class Block;

    // Lightweight reader that surfaces Parquet footer metadata as a table-valued scan.
    // It reads only file footers (no data pages) and emits either schema rows or

    } else if (params.__isset.paths && !params.paths.empty()) {
        resolved_paths.assign(params.paths.begin(), params.paths.end());
    } else {
        return Status::InvalidArgument("Property 'path' must be set for parquet_meta");
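
The excerpt above shows only the tail of the path-resolution chain: a non-empty `paths` list is copied into `resolved_paths`, and otherwise an error is returned. The first branch is not visible here. The sketch below assumes it handles a single `path` property, and uses a hypothetical `MetaParamsSketch` struct in place of the generated thrift type, so treat it as an illustration of the fallback order rather than the PR's implementation.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the thrift params; field names are assumptions.
struct MetaParamsSketch {
    std::optional<std::string> path;                // single-path form (assumed)
    std::optional<std::vector<std::string>> paths;  // list form, e.g. after glob expansion
};

// Returns the paths to scan, or std::nullopt when neither property is usable,
// which a caller would turn into an invalid-argument error.
std::optional<std::vector<std::string>> resolve_paths(const MetaParamsSketch& params) {
    if (params.path && !params.path->empty()) {
        return std::vector<std::string>{*params.path};
    }
    if (params.paths && !params.paths->empty()) {
        return *params.paths;
    }
    return std::nullopt;
}
```

Treating an empty list the same as an unset one mirrors the `!params.paths.empty()` check in the excerpt, so an empty glob expansion still surfaces as an error instead of a silent empty scan.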

    namespace doris::vectorized {

    using namespace parquet_utils;

    class ParquetMetadataReader::ModeHandler {
Force-pushed from 96798c6 to 92ba67f.
### What problem does this PR solve?

Expose Parquet file metadata via a table-valued function for inspection and debugging.

### Issue Number: close #xxx

### Related PR: #xxx

### Problem Summary:

- Add a Parquet metadata TVF so users can query Parquet file metadata via SQL.
- Backend adds a Parquet metadata reader and scan path; frontend wires the TVF definition.
- Enables easy inspection of partitions/row groups/column stats to aid troubleshooting.
…kets (apache#60938)

## Summary

- adjust `test_parquet_meta_tvf` S3-mode checks to compare only stable columns
- avoid asserting `file_name` / full S3 URI fields that vary by pipeline bucket
- update the corresponding `.out` baseline for the changed query projections

## Why

Different CI pipelines may use different bucket names, which causes false failures when full URI/file name columns are compared.

## Test

- attempted: `./run-regression-test.sh --run -f external_table_p0/tvf/test_parquet_meta_tvf -forceGenOut`
- in this environment it failed with S3 `FORBIDDEN` while reading regression parquet files
Force-pushed from 6a7770f to 7760f61.
Picked here: #61446
Cherry-pick #58972 to branch-4.1
What problem does this PR solve?
Add Parquet metadata TVF.
Cherry-pick commit: 96798c6520e - [feat] add Parquet metadata TVF (#58972)