compute: OffsetOptimized index_pair + strided_len caching - 2.0-2.25x arrangement lookup speedup #35159
Draft
def- wants to merge 1 commit into MaterializeInc:main from
Conversation
compute: OffsetOptimized index_pair + strided_len caching - 2.0-2.25x arrangement lookup speedup

Three optimizations to the core offset lookup data structures used in differential dataflow arrangement spines (OffsetOptimized, BytesBatch, BytesContainer in row_spine.rs):

1. OffsetStride::index_pair() returns (index(i), index(i+1)) with a single enum dispatch instead of two separate calls. For a uniform stride (common after compaction), it computes both values with one multiply.

2. A strided_len field caches strided.len() on OffsetOptimized, avoiding repeated enum dispatch on every index() call. It is updated on push_into().

3. A single-batch fast path in BytesContainer::index() avoids the iterator loop for the common single-batch case after compaction/merge.

Coverage data shows 1.58 trillion OffsetStride operations, making this the most frequently executed code path in the compute layer.

Benchmark results (10K lookups):

- DatumContainer sequential int5: 31.44µs → 15.00µs (2.10x)
- DatumContainer sequential mixed5: 29.88µs → 15.10µs (1.98x)
- DatumContainer sequential narrow1: 28.14µs → 12.52µs (2.25x)
- DatumContainer random int5: 31.67µs → 19.42µs (1.63x)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
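The index_pair() idea from item 1 can be sketched as below. This is a hypothetical reconstruction, not the actual row_spine.rs code: the enum variants and offset convention are simplified for illustration. The point is that the slice bounds of entry i are index(i)..index(i+1), and a single match can yield both, with the uniform-stride arm deriving the upper bound by one addition from the lower:

```rust
// Illustrative stand-in for an offset-stride enum (names hypothetical).
// Convention here: index(i) = stride * i, so entry i spans
// [stride * i, stride * (i + 1)).
enum OffsetStride {
    Empty,
    Zero(usize),              // a run of `len` zero offsets
    Striding(usize, usize),   // uniform (stride, len)
}

impl OffsetStride {
    fn index(&self, i: usize) -> usize {
        match self {
            OffsetStride::Empty => panic!("index out of bounds"),
            OffsetStride::Zero(_) => 0,
            OffsetStride::Striding(stride, _) => *stride * i,
        }
    }

    /// Returns (index(i), index(i + 1)) with a single enum dispatch.
    /// The uniform-stride arm does one multiply and one add, instead of
    /// two multiplies behind two separate match dispatches.
    fn index_pair(&self, i: usize) -> (usize, usize) {
        match self {
            OffsetStride::Empty => panic!("index out of bounds"),
            OffsetStride::Zero(_) => (0, 0),
            OffsetStride::Striding(stride, _) => {
                let lower = *stride * i;
                (lower, lower + *stride)
            }
        }
    }
}
```

With stride 8, index_pair(3) yields (24, 32), the same pair as two index() calls but with one dispatch.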
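Item 2, caching strided.len() in a field, can be illustrated with the following minimal sketch. All names and the stride-absorption rule are assumptions for the example, not the actual Materialize implementation; what it shows is the pattern: the boundary between the strided prefix and the spilled suffix is kept in a plain usize field, so the hot index() path compares against a field instead of dispatching on the enum just to learn its length:

```rust
// Hypothetical strided-offset store: the i-th pushed offset is
// stride * (i + 1) while the pattern holds.
enum Stride {
    Empty,
    Striding { stride: usize, len: usize },
}

impl Stride {
    /// Try to absorb `offset`; returns false if it breaks the pattern.
    fn push(&mut self, offset: usize) -> bool {
        match self {
            Stride::Empty => {
                *self = Stride::Striding { stride: offset, len: 1 };
                true
            }
            Stride::Striding { stride, len } => {
                if offset == *stride * (*len + 1) {
                    *len += 1;
                    true
                } else {
                    false
                }
            }
        }
    }
    fn index(&self, i: usize) -> usize {
        match self {
            Stride::Empty => panic!("index out of bounds"),
            Stride::Striding { stride, .. } => *stride * (i + 1),
        }
    }
}

struct OffsetOptimized {
    strided: Stride,
    spilled: Vec<usize>,   // offsets that broke the stride pattern
    strided_len: usize,    // cache of the strided prefix length
}

impl OffsetOptimized {
    fn new() -> Self {
        OffsetOptimized { strided: Stride::Empty, spilled: Vec::new(), strided_len: 0 }
    }
    fn push(&mut self, offset: usize) {
        if self.spilled.is_empty() && self.strided.push(offset) {
            self.strided_len += 1; // keep the cache in sync on every push
        } else {
            self.spilled.push(offset);
        }
    }
    fn index(&self, i: usize) -> usize {
        // Compare against the cached field; no enum dispatch needed
        // just to find the prefix/suffix boundary.
        if i < self.strided_len {
            self.strided.index(i)
        } else {
            self.spilled[i - self.strided_len]
        }
    }
}
```

The cache is cheap to maintain because the length only changes on push, while index() runs orders of magnitude more often on this path.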
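Item 3 can be sketched as below, again with hypothetical types (fixed-width items, Vec-backed batches) standing in for the real BytesBatch/BytesContainer. The shape of the change is generic: when the container holds exactly one batch, index into it directly rather than entering the loop that subtracts batch lengths to locate the owning batch:

```rust
// Simplified stand-in for a batch of byte items (fixed width here).
struct BytesBatch {
    data: Vec<u8>,
    item_len: usize,
    items: usize,
}

impl BytesBatch {
    fn index(&self, i: usize) -> &[u8] {
        &self.data[i * self.item_len..(i + 1) * self.item_len]
    }
    fn len(&self) -> usize {
        self.items
    }
}

struct BytesContainer {
    batches: Vec<BytesBatch>,
}

impl BytesContainer {
    fn index(&self, mut i: usize) -> &[u8] {
        // Fast path: after compaction/merge a container typically holds
        // a single batch, so skip the search loop entirely.
        if self.batches.len() == 1 {
            return self.batches[0].index(i);
        }
        // Slow path: walk the batches, subtracting lengths until the
        // batch owning position `i` is found.
        for batch in &self.batches {
            if i < batch.len() {
                return batch.index(i);
            }
            i -= batch.len();
        }
        panic!("index out of bounds");
    }
}
```

The fast path trades one extra branch on the multi-batch path for a loop-free lookup in the common case, which is the kind of trade that pays off when the single-batch case dominates.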
Taken and cleaned up from #35076