[feature](be) Add file scanner v2 readers #65046
run buildall
Force-pushed from 6fa6952 to 47c9df5.
run buildall
/review
Automated review for PR 65046. I found four issues that should be addressed before merge: a FileScannerV2 runtime-filter rewrite correctness bug, a default-on rollout mismatch for enable_file_scanner_v2, a removed timezone regression assertion, and generated output whitespace that fails git diff --check.
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #63893

Problem Summary:

Add the file scanner v2 reader stack for external file scans, including native readers for Parquet, CSV/TEXT, JSON, JNI-backed table readers, schema projection, column mapping, predicate handling, reader statistics, page cache support, and related BE/FE integration. This also restores affected Parquet LZO regression cases by adding Doris thirdparty Arrow LZO page decompression support for file scanner v2.

The change keeps VDirectInPredicate source-compatible with existing ordinary two-argument construction by defaulting the new HybridSet child-type flag to true. Dictionary-code rewrites can still pass false explicitly, while existing runtime filter tests continue to compile with the old call shape.

Review follow-up fixes make RuntimeFilterExpr global-index slot rewriting update the executable _impl tree, document enable_file_scanner_v2 as default-on to match the FE default, and trim generated regression outputs so diff hygiene passes.

### Release note

Support file scanner v2 readers for external file scan paths, including LZO-compressed Parquet reads in the new Parquet reader path.

### Check List (For Author)

- Test: Manual test
    - Verified apache-arrow-17.0.0-lzo.patch applies with `patch -p1 --dry-run` against Arrow 17 column_reader.cc
    - Ran `bash -n thirdparty/build-thirdparty.sh thirdparty/download-thirdparty.sh`
    - Ran `build-support/clang-format.sh`
    - Ran `git diff --check`
    - Attempted `./run-be-ut.sh --run --filter='RuntimeFilterExprSamplingTest.deep_clone_clones_impl_tree'`; local sandboxed run could not complete because the BE UT script required JDK 17 setup first, then needed submodule metadata writes and GitHub access for thirdparty dependencies. Escalated retry was not approved before timeout.
    - Attempted `./run-be-ut.sh --run --filter='FileScannerV2Test.RewriteSlotRefsToGlobalIndexMatrix'`; local sandboxed run could not complete because the BE UT script needed submodule metadata writes and GitHub access for thirdparty dependencies. Escalated retry was not approved before timeout.
    - Full BE unit tests and external regression tests were not run in this local environment
- Behavior changed: Yes. Adds file scanner v2 reader behavior and enables LZO-compressed Parquet reads through the new reader path
- Does this need documentation: No
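The source-compatibility point about VDirectInPredicate can be illustrated with a defaulted constructor parameter. The following is a minimal sketch only: `DirectInPredicateSketch`, `SketchSet`, and the parameter name `set_uses_child_type` are hypothetical stand-ins, not the actual Doris classes or signatures.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <utility>

// Hypothetical stand-in for a hybrid set; not the Doris HybridSet API.
struct SketchSet {
    std::set<int> values;
};

// Hypothetical stand-in for VDirectInPredicate.
class DirectInPredicateSketch {
public:
    // The third parameter is the new flag. Defaulting it to true keeps the
    // existing two-argument call shape compiling with unchanged behavior.
    DirectInPredicateSketch(int child_slot, std::shared_ptr<SketchSet> set,
                            bool set_uses_child_type = true)
            : _child_slot(child_slot),
              _set(std::move(set)),
              _set_uses_child_type(set_uses_child_type) {}

    int child_slot() const { return _child_slot; }
    bool set_uses_child_type() const { return _set_uses_child_type; }
    bool contains(int v) const { return _set->values.count(v) > 0; }

private:
    int _child_slot;
    std::shared_ptr<SketchSet> _set;
    bool _set_uses_child_type;
};
```

Under these assumptions, an existing caller writing `DirectInPredicateSketch(0, set)` is unaffected, while a dictionary-code rewrite would pass `false` as the third argument.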
Force-pushed from 47c9df5 to 8fe2181.
run buildall
/review
Review complete. I did not find a new in-scope, non-duplicate issue to add beyond the existing inline review threads.
Scope and duplicate handling: I checked the provided review prompt, repo review skill, required AGENTS files, existing review threads/comments, focus file, changed-file list, and shared ledger before submitting. The already-raised scanner V2 runtime-filter rewrite, scanner default/docs, timezone-test, and whitespace comments were treated as duplicate-suppression context. One initially suspicious local diff in StmtExecutor was dismissed because it is absent from the authoritative live PR file list and GitHub rejected the path as unresolved.
Critical checkpoints: scanner V2 selection/session propagation, FE format planning, runtime-filter rewrite handling, aggregate/count pushdown, file-reader paths, changed tests/outputs, and style hygiene were reviewed statically. git diff --check 9d7d3a2c4f437e0c2962bde1a49848ed0535063c..HEAD is clean. I did not run builds or test suites because this worktree is not initialized and thirdparty/installed is missing in the runner.
Subagent conclusions: optimizer-rewrite and tests-session-config both completed the final convergence round on the corrected no-inline-comment set with NO_NEW_VALUABLE_FINDINGS.
    int64_t conversion_failure_null_map_offset = 0;
};

inline bool decoded_column_view_row_is_null(const DecodedColumnView& view, int64_t row) {
Why not turn DecodedColumnView into a class and make the functions below its member functions?
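A rough sketch of what this suggestion could look like follows. The fields and the `value_at` accessor are invented for illustration; only the name `decoded_column_view_row_is_null` comes from the diff, and the real DecodedColumnView has members not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for DecodedColumnView.
class DecodedColumnViewSketch {
public:
    DecodedColumnViewSketch(std::vector<int64_t> values, std::vector<uint8_t> null_map)
            : _values(std::move(values)), _null_map(std::move(null_map)) {
        // An empty null map means "no nulls"; otherwise sizes must match.
        assert(_null_map.empty() || _null_map.size() == _values.size());
    }

    int64_t row_count() const { return static_cast<int64_t>(_values.size()); }

    // Member form of the free function decoded_column_view_row_is_null(view, row).
    bool row_is_null(int64_t row) const {
        assert(row >= 0 && row < row_count());
        return !_null_map.empty() && _null_map[static_cast<size_t>(row)] != 0;
    }

    // Reads a non-null value; callers check row_is_null() first.
    int64_t value_at(int64_t row) const {
        assert(!row_is_null(row));
        return _values[static_cast<size_t>(row)];
    }

private:
    std::vector<int64_t> _values;
    std::vector<uint8_t> _null_map; // 1 = null, 0 = not null
};
```

With members, the null-map invariant is checked once in the constructor instead of being an implicit contract of every free function.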
        std::dynamic_pointer_cast<VSlotRef>(get_child(0)) != nullptr;
    }

    Status clone_node(VExprSPtr* cloned_expr) const override {
        return slot_info.is_file_slot;
    }

    Status rewrite_slot_refs_to_global_index(
- This method needs a comment, and the comment should include an example.
- Could this runtime filter be shared? If so, modifying its structure is risky.
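Both points can be sketched together: a doc comment that carries a before/after example, and a clone-then-rewrite wrapper that never mutates a tree other holders might see. This is a simplified sketch, not the Doris implementation: `ExprSketch`, `deep_clone`, `clone_and_rewrite`, and the `_sketch` suffix are hypothetical, and whether the runtime filter expression is actually shared remains the reviewer's open question.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified expression node; not the Doris VExpr API.
struct ExprSketch {
    bool is_slot_ref = false;
    int slot_id = -1;      // meaningful only for slot refs
    int column_index = -1; // position the slot ref reads from
    std::vector<std::shared_ptr<ExprSketch>> children;
};
using ExprSketchPtr = std::shared_ptr<ExprSketch>;

// Copies the node, then replaces each child pointer with its own deep copy.
ExprSketchPtr deep_clone(const ExprSketchPtr& expr) {
    auto copy = std::make_shared<ExprSketch>(*expr);
    for (auto& child : copy->children) {
        child = deep_clone(child);
    }
    return copy;
}

/// Rewrites every slot ref in `expr` so column_index is the slot's global index.
///
/// Example, with slot_to_global = {7 -> 0, 9 -> 2}:
///   before: eq(slot_ref(slot_id=9, column_index=1), literal)
///   after:  eq(slot_ref(slot_id=9, column_index=2), literal)
///
/// Returns false if a slot id is missing from the map; nodes visited before
/// the failure may already have been rewritten.
bool rewrite_slot_refs_to_global_index_sketch(
        const ExprSketchPtr& expr, const std::unordered_map<int, int>& slot_to_global) {
    if (expr->is_slot_ref) {
        auto it = slot_to_global.find(expr->slot_id);
        if (it == slot_to_global.end()) {
            return false;
        }
        expr->column_index = it->second;
    }
    for (const auto& child : expr->children) {
        if (!rewrite_slot_refs_to_global_index_sketch(child, slot_to_global)) {
            return false;
        }
    }
    return true;
}

// Clones first, so a tree that other holders may share is never mutated.
ExprSketchPtr clone_and_rewrite(const ExprSketchPtr& shared_expr,
                                const std::unordered_map<int, int>& slot_to_global) {
    ExprSketchPtr local = deep_clone(shared_expr);
    if (!rewrite_slot_refs_to_global_index_sketch(local, slot_to_global)) {
        return nullptr;
    }
    return local;
}
```

The trade-off in this sketch is one extra deep copy per rewrite in exchange for leaving the original tree untouched, including on the failure path.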
Status FileScannerV2::_build_table_conjuncts(VExprContextSPtrs* conjuncts) const {
    DORIS_CHECK(conjuncts != nullptr);
    conjuncts->clear();
    conjuncts->reserve(_conjuncts.size());
Why do the conjuncts need to be rewritten after they are obtained here? And why is the rewrite done once in every scanner rather than once in the file scan operator?
PR approved by at least one committer and no changes requested.