Skip to content

[feature](be) Add file scanner v2 readers #65046

Open
Gabriel39 wants to merge 1 commit into
masterfrom
refact_reader_branch
Open

[feature](be) Add file scanner v2 readers #65046
Gabriel39 wants to merge 1 commit into
masterfrom
refact_reader_branch

Conversation

@Gabriel39

@Gabriel39 Gabriel39 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Problem Summary: Add the file scanner v2 reader stack for external file scans, including native readers for Parquet, CSV/TEXT, JSON, JNI-backed table readers, schema projection, column mapping, predicate handling, reader statistics, page cache support, and related BE/FE integration. This also restores affected Parquet LZO regression cases by adding Doris thirdparty Arrow LZO page decompression support for file scanner v2.
Support file scanner v2 readers for external file scan paths, including LZO-compressed Parquet reads in the new Parquet reader path.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 77.39% (1896/2450)
Line Coverage 64.44% (34063/52862)
Region Coverage 64.84% (17528/27032)
Branch Coverage 54.02% (9394/17390)

@Gabriel39 Gabriel39 force-pushed the refact_reader_branch branch from 6fa6952 to 47c9df5 Compare June 30, 2026 14:04
@Gabriel39

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 60.00% (3/5) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 84.17% (13996/16628) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.58% (29571/39652)
Line Coverage 58.53% (323352/552450)
Region Coverage 55.34% (271359/490351)
Branch Coverage 56.60% (119468/211082)

@hello-stephen

Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 40.00% (2/5) 🎉
Increment coverage report
Complete coverage report

@Gabriel39

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review for PR 65046. I found four issues that should be addressed before merge: a FileScannerV2 runtime-filter rewrite correctness bug, a default-on rollout mismatch for enable_file_scanner_v2, a removed timezone regression assertion, and generated output whitespace that fails git diff --check.

Comment thread be/src/exec/scan/file_scanner_v2.cpp
Comment thread be/src/exec/operator/file_scan_operator.cpp
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #63893

Problem Summary: Add the file scanner v2 reader stack for external file scans, including native readers for Parquet, CSV/TEXT, JSON, JNI-backed table readers, schema projection, column mapping, predicate handling, reader statistics, page cache support, and related BE/FE integration. This also restores affected Parquet LZO regression cases by adding Doris thirdparty Arrow LZO page decompression support for file scanner v2.

The change keeps VDirectInPredicate source-compatible with existing ordinary two-argument construction by defaulting the new HybridSet child-type flag to true. Dictionary-code rewrites can still pass false explicitly, while existing runtime filter tests continue to compile with the old call shape.

Review follow-up fixes make RuntimeFilterExpr global-index slot rewriting update the executable _impl tree, document enable_file_scanner_v2 as default-on to match the FE default, and trim generated regression outputs so diff hygiene passes.

### Release note

Support file scanner v2 readers for external file scan paths, including LZO-compressed Parquet reads in the new Parquet reader path.

### Check List (For Author)

- Test: Manual test
    - Verified apache-arrow-17.0.0-lzo.patch applies with patch -p1 --dry-run against Arrow 17 column_reader.cc
    - Ran bash -n thirdparty/build-thirdparty.sh thirdparty/download-thirdparty.sh
    - Ran build-support/clang-format.sh
    - Ran git diff --check
    - Attempted ./run-be-ut.sh --run --filter='RuntimeFilterExprSamplingTest.deep_clone_clones_impl_tree'; local sandboxed run could not complete because the BE UT script required JDK 17 setup first, then needed submodule metadata writes and GitHub access for thirdparty dependencies. Escalated retry was not approved before timeout.
    - Attempted ./run-be-ut.sh --run --filter='FileScannerV2Test.RewriteSlotRefsToGlobalIndexMatrix'; local sandboxed run could not complete because the BE UT script needed submodule metadata writes and GitHub access for thirdparty dependencies. Escalated retry was not approved before timeout.
    - Full BE unit tests and external regression tests were not run in this local environment
- Behavior changed: Yes. Adds file scanner v2 reader behavior and enables LZO-compressed Parquet reads through the new reader path
- Does this need documentation: No
@Gabriel39 Gabriel39 force-pushed the refact_reader_branch branch from 47c9df5 to 8fe2181 Compare July 1, 2026 02:02
@Gabriel39

Copy link
Copy Markdown
Contributor Author

run buildall

@Gabriel39

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review complete. I did not find a new in-scope, non-duplicate issue to add beyond the existing inline review threads.

Scope and duplicate handling: I checked the provided review prompt, repo review skill, required AGENTS files, existing review threads/comments, focus file, changed-file list, and shared ledger before submitting. The already-raised scanner V2 runtime-filter rewrite, scanner default/docs, timezone-test, and whitespace comments were treated as duplicate-suppression context. One initially suspicious local diff in StmtExecutor was dismissed because it is absent from the authoritative live PR file list and GitHub rejected the path as unresolved.

Critical checkpoints: scanner V2 selection/session propagation, FE format planning, runtime-filter rewrite handling, aggregate/count pushdown, file-reader paths, changed tests/outputs, and style hygiene were reviewed statically. git diff --check 9d7d3a2c4f437e0c2962bde1a49848ed0535063c..HEAD is clean. I did not run builds or test suites because this worktree is not initialized and thirdparty/installed is missing in the runner.

Subagent conclusions: optimizer-rewrite and tests-session-config both completed the final convergence round on the corrected no-inline-comment set with NO_NEW_VALUABLE_FINDINGS.

int64_t conversion_failure_null_map_offset = 0;
};

inline bool decoded_column_view_row_is_null(const DecodedColumnView& view, int64_t row) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为什么不把DecodedColumnView ,变成一个class? 然后底下这堆方法是这个class的member?

std::dynamic_pointer_cast<VSlotRef>(get_child(0)) != nullptr;
}

Status clone_node(VExprSPtr* cloned_expr) const override {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个clone node 是干啥用的?

@hello-stephen

Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 77.39% (1896/2450)
Line Coverage 64.50% (34097/52862)
Region Coverage 64.91% (17546/27032)
Branch Coverage 54.07% (9402/17390)

return slot_info.is_file_slot;
}

Status rewrite_slot_refs_to_global_index(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 这个方法需要加注释,并且注释里得有例子
  2. 这个runtime filter 会不会被共享,所以我们修改他的结构是有风险的?

Status FileScannerV2::_build_table_conjuncts(VExprContextSPtrs* conjuncts) const {
DORIS_CHECK(conjuncts != nullptr);
conjuncts->clear();
conjuncts->reserve(_conjuncts.size());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥拿到conjuncts 之后要rewrite 一下? 为啥rewrite 是在每个scanner 都rewrite 一次,而不是file scan operator rewrite 一次?

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants