[refine](be) backport array and operator updates to 4.1 #65066
Open
Mryange wants to merge 6 commits into
…s in array functions (apache#63386)

Issue Number: N/A

Problem Summary: Array functions like `array_distance` and `array_join` previously required hand-written boilerplate to unwrap `Const`, `Nullable`, and plain `ColumnArray` variants before accessing element data. This led to duplicated code, manual offset arithmetic, and a proliferation of helper structs (`ConstArrayInfo`, `ColumnArrayExecutionData`, etc.).

Root cause: there was no shared abstraction for "read a row of an array column regardless of its outer wrapper". Each function solved this independently, accumulating inconsistent patterns.

This PR introduces `ColumnArrayView<PType>` (and its row-accessor `ArrayDataView<PType>`) in `be/src/core/column/column_array_view.h`. The view is created once via `ColumnArrayView::create(col)` and handles Const/Nullable unwrapping automatically. Per-row access via `operator[](row)` returns an `ArrayDataView` with `get_data()`, `size()`, and `is_null_at()`, which gives a uniform interface regardless of the underlying column shape.

For ultra-light nullable primitive loops, `ColumnArrayView` also exposes flat-access helpers (`get_data()`, `get_null_map_data()`, `row_begin()`, `row_end()`) so callers can keep wrapper unwrapping centralized while still iterating directly over the flattened buffers when benchmark data shows that per-element row-view access would regress.
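The row-view idea described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ToyArrayColumn`, `ToyRowView`, `ToyArrayView`, and `sum_row` are hypothetical names invented for illustration, the const wrapper is modeled as a plain flag, and none of this is the actual `ColumnArrayView` code from `column_array_view.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an array column: flattened elements plus per-row
// end offsets. This is NOT the Doris ColumnArray class.
struct ToyArrayColumn {
    std::vector<int64_t> data;        // flattened elements of all rows
    std::vector<uint8_t> elem_nulls;  // 1 = element is NULL, same length as data
    std::vector<size_t> offsets;      // offsets[i] = end position of row i in data
    bool is_const = false;            // true: one physical row serves every logical row
};

// Row accessor, loosely modeled on the ArrayDataView interface described above.
struct ToyRowView {
    const int64_t* data;
    const uint8_t* nulls;
    size_t len;

    size_t size() const { return len; }
    const int64_t* get_data() const { return data; }
    bool is_null_at(size_t i) const {
        assert(i < len);
        return nulls[i] != 0;
    }
};

// Built once per column; callers never look at the const flag themselves.
class ToyArrayView {
public:
    explicit ToyArrayView(const ToyArrayColumn& col) : _col(col) {}

    ToyRowView operator[](size_t row) const {
        size_t r = _col.is_const ? 0 : row;  // const column: always physical row 0
        assert(r < _col.offsets.size());
        size_t begin = (r == 0) ? 0 : _col.offsets[r - 1];
        size_t end = _col.offsets[r];
        return {_col.data.data() + begin, _col.elem_nulls.data() + begin, end - begin};
    }

private:
    const ToyArrayColumn& _col;
};

// A caller written once against the view, independent of the column shape.
int64_t sum_row(const ToyArrayView& view, size_t row) {
    ToyRowView r = view[row];
    int64_t sum = 0;
    for (size_t i = 0; i < r.size(); ++i) {
        if (!r.is_null_at(i)) sum += r.get_data()[i];
    }
    return sum;
}
```

In this toy model the offset arithmetic and the const handling live in one place, which is the duplication the commit message says the real view removes from each array function.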
**Benchmark results** (4096 rows, RELEASE build, `--benchmark_repetitions=5` on a shared host with CPU scaling enabled; raw outputs saved in `benchmark_array_view_raw_results_20260519.txt` and `benchmark_array_view_distance_split_raw_results_20260519.txt`):

**Row-view access (`operator[]` / `ArrayDataView`)**

| Scenario | Handwritten CPU (ns) | ColumnArrayView CPU (ns) | Delta |
|---|---|---|---|
| Distance Plain/Plain | 322530 | 311276 | **-3.5%** |
| Distance Const/Plain | 301473 | 289794 | **-3.9%** |
| Distance Nullable/Plain | 305970 | 313687 | +2.5% |
| Int64 Plain sum | 15971 | 16036 | +0.4% |
| Int64 WithNulls sum | 26700 | 29497 | +10.5% |
| String Plain len-sum | 16857 | 17120 | +1.6% |
| Int64 Const sum | 16051 | 16148 | +0.6% |
| Int64 Nullable sum | 16198 | 16174 | -0.1% |

**Flat-access follow-up (`get_data()` / `get_null_map_data()` / `row_begin()` / `row_end()`)**

| Scenario | Handwritten CPU (ns) | ColumnArrayView Flat CPU (ns) | Delta |
|---|---|---|---|
| Int64 WithNulls sum | 26700 | 26765 | +0.2% |
| Distance Plain/Plain | 322530 | 301274 | **-6.6%** |
| Distance Const/Plain | 301473 | 314259 | +4.2% |
| Distance Nullable/Plain | 305970 | 314077 | +2.7% |

Most production-shaped cases stay within a few percent on this shared host. The only stable double-digit regression is the synthetic `Int64 WithNulls` microbenchmark, where each element performs only `if (!null) sum += val`. The flat-access helper path removes that regression (+0.2% vs handwritten) while keeping `Const` / `Nullable` unwrapping centralized in `ColumnArrayView`.

Because these numbers were collected on a shared machine with CPU scaling enabled, the distance cases show visible run-to-run noise;

(cherry picked from commit 73b32d2)
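The `Int64 WithNulls` loop shape (`if (!null) sum += val`) over flattened buffers can be sketched as follows. This is an illustrative model only: `FlatArrays` and `sum_rows_flat` are hypothetical names, and the `row_begin()` / `row_end()` members here merely mirror the helper names listed above without claiming to match their real signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flattened layout: one data buffer, one null map, per-row end offsets.
struct FlatArrays {
    std::vector<int64_t> data;
    std::vector<uint8_t> null_map;  // 1 = element is NULL
    std::vector<size_t> offsets;    // offsets[i] = end position of row i

    size_t row_begin(size_t row) const { return row == 0 ? 0 : offsets[row - 1]; }
    size_t row_end(size_t row) const {
        assert(row < offsets.size());
        return offsets[row];
    }
};

// Null-aware per-row sum that walks the flat buffers directly, so the inner
// loop touches only two raw pointers and an index range.
std::vector<int64_t> sum_rows_flat(const FlatArrays& a) {
    const int64_t* data = a.data.data();
    const uint8_t* nulls = a.null_map.data();
    std::vector<int64_t> out(a.offsets.size(), 0);
    for (size_t row = 0; row < a.offsets.size(); ++row) {
        int64_t sum = 0;
        for (size_t i = a.row_begin(row), end = a.row_end(row); i < end; ++i) {
            if (!nulls[i]) sum += data[i];
        }
        out[row] = sum;
    }
    return out;
}
```

The sketch says nothing about actual performance; it only shows why such a loop has very little per-element work for a row-view wrapper to hide behind.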
…mns (apache#63938)

Some BE expression and storage code creates a concrete column type and then immediately casts the generic `ColumnPtr` or `MutableColumnPtr` back to the same concrete type before writing data. This adds unnecessary casts and makes the ownership intent less direct.

Root cause: several local result columns were declared as generic column pointers even though the concrete column type was already known at creation time.

This PR refines those local variables to keep concrete column pointers where the type is explicit, and directly accesses the concrete column data. It also updates the explode-numbers table function member to use a concrete column pointer. The change is limited to local refactoring and does not change runtime behavior.

Release note: None

- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [ ] No.
    - [ ] Yes.
- Does this need documentation?
    - [ ] No.
    - [ ] Yes.
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label

(cherry picked from commit a0a09b0)
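A before/after sketch of this kind of refactoring, using toy types rather than the Doris column classes. `IColumnToy`, `Int32ColumnToy`, and both `make_range_*` functions are hypothetical, and `std::shared_ptr` is an assumption standing in for whatever pointer types `ColumnPtr` and `MutableColumnPtr` actually are.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy column hierarchy, for illustration only.
struct IColumnToy {
    virtual ~IColumnToy() = default;
    virtual size_t size() const = 0;
};

struct Int32ColumnToy final : IColumnToy {
    std::vector<int32_t> data;
    size_t size() const override { return data.size(); }
    std::vector<int32_t>& get_data() { return data; }
};

// Before: the local is declared with the generic type, so it must be cast
// straight back to the type that was just constructed.
std::shared_ptr<IColumnToy> make_range_generic(int32_t n) {
    assert(n >= 0);
    std::shared_ptr<IColumnToy> col = std::make_shared<Int32ColumnToy>();
    auto& data = static_cast<Int32ColumnToy&>(*col).get_data();
    for (int32_t i = 0; i < n; ++i) data.push_back(i);
    return col;
}

// After: the local keeps the concrete type; the conversion to the generic
// pointer happens once, implicitly, at the return.
std::shared_ptr<IColumnToy> make_range_concrete(int32_t n) {
    assert(n >= 0);
    auto col = std::make_shared<Int32ColumnToy>();
    auto& data = col->get_data();
    for (int32_t i = 0; i < n; ++i) data.push_back(i);
    return col;
}
```

Both functions produce the same column, which matches the statement that the change does not alter runtime behavior.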
…#63713)

Problem Summary: Casting a JSON string with duplicated object keys to MAP kept all duplicated entries because the string-to-complex cast path returned the generic wrapper directly and skipped ColumnMap::deduplicate_keys(). This made string-to-map casts inconsistent with MAP constructor semantics where the last value wins.

Reproduction SQL:

```sql
SELECT CAST('{"a":1,"a":2}' AS MAP<STRING,INT>);
SELECT size(CAST('{"a":1,"a":2}' AS MAP<STRING,INT>));
SELECT element_at(CAST('{"a":1,"a":2}' AS MAP<STRING,INT>), 'a');
SELECT CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>);
SELECT element_at(element_at(CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>), 'outer'), 'a');
SELECT map('a',1,'a',2);
SELECT size(map('a',1,'a',2));
SELECT element_at(map('a',1,'a',2), 'a');
```

Before this fix:

```text
{"a":1, "a":2}
2
1
{"outer":{"a":1, "a":2}}
1
{"a":2}
1
2
```

After this fix:

```text
{"a":2}
1
2
{"outer":{"a":2}}
2
{"a":2}
1
2
```

(cherry picked from commit b653831)
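The "last value wins" rule can be sketched on a plain list of key/value entries. This is a minimal illustration, not the `ColumnMap::deduplicate_keys()` implementation: `Entry` and `deduplicate_last_wins` are hypothetical names, and the order of surviving entries (by position of each key's last occurrence) is an arbitrary choice made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// Keeps only the last entry for every key. Walking backwards means the first
// time a key is seen is its last occurrence in the input.
std::vector<Entry> deduplicate_last_wins(const std::vector<Entry>& entries) {
    std::unordered_set<std::string> seen;
    std::vector<Entry> kept;
    for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
        if (seen.insert(it->first).second) kept.push_back(*it);
    }
    assert(kept.size() == seen.size());
    std::reverse(kept.begin(), kept.end());  // restore forward order
    return kept;
}
```

Applied to the entries of `{"a":1,"a":2}`, this keeps a single entry `a -> 2`, in line with the "After this fix" output for the first three queries.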
Problem Summary: Some BE functions were marked with `PURE` even though their definitions are already visible in headers or they can allocate and throw exceptions. This change removes those annotations, because throwing from a `pure` function can make surrounding `catch` blocks unreliable: https://godbolt.org/z/Y7f73bKoY (cherry picked from commit b307a23)
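As a rough illustration of the kind of function this applies to, the sketch below shows a helper that allocates and can throw, written without any purity annotation. `repeat` and `try_repeat` are hypothetical examples rather than functions from the Doris code base, and the assumption that `PURE` corresponds to a GCC/Clang-style `__attribute__((pure))` is not confirmed by the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Allocates and may throw, so it carries no purity annotation. Declaring it
// with something like `__attribute__((pure))` (assumed meaning of PURE) would
// promise the compiler more than the function delivers.
std::string repeat(const std::string& s, int times) {
    if (times < 0) throw std::invalid_argument("times must be non-negative");
    std::string out;
    out.reserve(s.size() * static_cast<size_t>(times));
    for (int i = 0; i < times; ++i) out += s;
    return out;
}

// Caller that depends on the exception actually reaching its catch block.
bool try_repeat(const std::string& s, int times, std::string* out) {
    assert(out != nullptr);
    try {
        *out = repeat(s, times);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

Without the annotation, `try_repeat("ab", -1, &out)` returns `false` through the normal exception path.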
Issue Number: N/A

Problem Summary: Pipeline operator source and sink paths need a common place to validate output and input blocks. Before this change, `sink` and `get_block` were the virtual override points, so common validation either had to stay in call sites or be duplicated across operator implementations.

Root cause: the public operator data-flow entry points were also the polymorphic implementation hooks, which left no wrapper layer for shared checks.

This change makes `DataSinkOperatorXBase::sink` and `OperatorXBase::get_block` non-virtual wrappers. The wrappers run `Block::check_type_and_column()` at the source/sink boundary and then dispatch to the new virtual `sink_impl` and `get_block_impl` methods. All pipeline operator implementations, exchange operators, scan operators, and related BE test mocks are migrated to the new impl methods. The scan projection path is updated to call the base `get_block` wrapper so the shared checks still apply.

(cherry picked from commit c27fd0b)
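The wrapper-plus-hook structure described here is commonly called the non-virtual interface pattern. A minimal sketch, assuming toy types: `ToyBlock`, `ToyOperatorBase`, `GoodSource`, and `BadSource` are hypothetical, the `check()` rule (all columns have equal row counts) is invented for illustration, and validating after the hook returns is an assumption about ordering, not a statement about what `Block::check_type_and_column()` does or when it runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy block: a list of integer columns.
struct ToyBlock {
    std::vector<std::vector<int>> columns;

    // Invented consistency rule: every column holds the same number of rows.
    bool check() const {
        for (const auto& c : columns) {
            if (c.size() != columns.front().size()) return false;
        }
        return true;
    }
};

class ToyOperatorBase {
public:
    virtual ~ToyOperatorBase() = default;

    // Public, non-virtual entry point: the shared validation lives here once.
    bool get_block(ToyBlock* block) {
        assert(block != nullptr);
        if (!get_block_impl(block)) return false;
        return block->check();
    }

private:
    // Customization hook: subclasses override this, never get_block itself.
    virtual bool get_block_impl(ToyBlock* block) = 0;
};

class GoodSource : public ToyOperatorBase {
    bool get_block_impl(ToyBlock* block) override {
        block->columns = {{1, 2}, {3, 4}};
        return true;
    }
};

class BadSource : public ToyOperatorBase {
    bool get_block_impl(ToyBlock* block) override {
        block->columns = {{1, 2}, {3}};  // mismatched row counts
        return true;
    }
};
```

Because callers can only reach `get_block`, a subclass such as `BadSource` cannot bypass the shared check, which is the property the change relies on for both source and sink paths.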
What problem does this PR solve?
Related PR: #63386, #63938, #63713, #63440, #64139
Problem Summary:
Backport a small batch of BE array/function/operator updates to branch-4.1. This includes ColumnArrayView-based array access, concrete local result column pointers, string-to-map cast key deduplication, removal of unsafe PURE annotations, and operator IO wrapper hooks. Conflict resolution keeps branch-4.1-only deletions for bucketed aggregation operators and avoids introducing master-only thrift helpers.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)