perf: lazy repeated array view for std.repeat (O(1) memory) #787
Closed
He-Pin wants to merge 1 commit into databricks:master from
Motivation:
std.repeat([1,2,3], 1000000) previously created a copy of the array 1M times, requiring O(n*k) memory where n=array length and k=repetition count. This was inefficient for large repetitions, particularly in comparison/streaming scenarios.

Modification:
- Add _isRepeated flag to Val.Arr to support lazy repeated view, similar to existing _isRange and _reversed views
- Store base array (_repeatedBase) and repetition count (_repeatedCount)
- Implement O(1) memory creation: Val.Arr.repeated(pos, base, count)
- Index access via modulo: value(i) = base.value(i % base.length)
- Lazy materialization in asLazyArray via materializeRepeated() when full array needed
- Update std.repeat to use repeated() instead of System.arraycopy

Design decision follows jrsonnet's RepeatedArray pattern (arr/spec.rs lines 523-567).

Result:
- std.repeat([1,2,3], 1000000) now O(1) memory and creation time vs O(n*k) before
- All stdlib operations (sort, concat, map, filter, etc.) work transparently
- Lazy evaluation only materializes to full array when needed (e.g., for serialization)
- Regression test: sjsonnet/test/resources/new_test_suite/repeat_view.jsonnet covers edge cases (zero, one, empty array), access patterns, and operations

Upstream source: Inspired by jrsonnet/crates/jrsonnet-evaluator/src/arr/spec.rs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat: add lazy repeated array view to std.repeat
Motivation
std.repeat([1,2,3], 1000000) currently allocates and fills a 3M-element array via System.arraycopy in a loop, which costs O(n*k) memory and time. This is catastrophic for large repetitions, especially in nested contexts (e.g., repeated array views in comprehensions).

The jrsonnet project demonstrates a better approach: RepeatedArray, a zero-copy view that stores only the base array and repetition count, with modulo indexing to map any index back to the base. This costs O(1) memory and O(1) creation time.
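For reference, the eager behaviour described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the sjsonnet source: the class name EagerRepeat and the use of Object[] in place of sjsonnet's array type are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-in for the eager std.repeat path (not sjsonnet code).
final class EagerRepeat {
    // Copies `base` into a fresh array `count` times: O(n*k) memory and time.
    static Object[] repeat(Object[] base, int count) {
        Object[] out = new Object[base.length * count];
        for (int r = 0; r < count; r++) {
            // Each iteration copies the whole base array into the next slot.
            System.arraycopy(base, 0, out, r * base.length, base.length);
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] small = repeat(new Object[] {1, 2, 3}, 2);
        System.out.println(Arrays.toString(small)); // [1, 2, 3, 1, 2, 3]
        // One million repetitions allocate all 3,000,000 slots up front.
        System.out.println(repeat(new Object[] {1, 2, 3}, 1_000_000).length);
    }
}
```

The allocation size grows with both the base length and the repetition count, which is the cost the lazy view is meant to avoid.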
Key Design Decision
Lazy repeated view with modulo indexing:
- Index access uses i % base.length to map a repeated index back to the base

This follows sjsonnet's existing lazy array view pattern (range, reversed, concat views) and integrates seamlessly with the evaluator.
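The modulo-indexed view can be sketched in a few lines. The following is an illustrative Java model under stated assumptions, not the actual Val.Arr change: the class RepeatedView and its methods length, get, and materialize are hypothetical names, and Object[] stands in for sjsonnet's element storage.

```java
import java.util.Arrays;

// Hypothetical model of a lazy repeated view (names are not from sjsonnet).
final class RepeatedView {
    private final Object[] base; // shared, never copied at creation
    private final int count;     // number of repetitions

    RepeatedView(Object[] base, int count) {
        if (count < 0) throw new IllegalArgumentException("count must be >= 0");
        this.base = base;
        this.count = count;
    }

    // Logical length of the repeated array; no storage is allocated for it.
    int length() {
        return base.length * count;
    }

    // Maps a logical index back to the base array via modulo.
    Object get(int i) {
        if (i < 0 || i >= length()) throw new IndexOutOfBoundsException("index " + i);
        return base[i % base.length];
    }

    // Builds the full array only when a caller really needs it.
    Object[] materialize() {
        Object[] out = new Object[length()];
        for (int r = 0; r < count; r++) {
            System.arraycopy(base, 0, out, r * base.length, base.length);
        }
        return out;
    }

    public static void main(String[] args) {
        RepeatedView view = new RepeatedView(new Object[] {1, 2, 3}, 1_000_000);
        System.out.println(view.length()); // 3000000, with only 3 elements stored
        System.out.println(view.get(4));   // 2, because 4 % 3 == 1
        System.out.println(Arrays.toString(new RepeatedView(new Object[] {1, 2}, 2).materialize()));
    }
}
```

The bounds check runs before the modulo, so an empty base with any count has length 0 and never divides by zero. Creation stores two fields regardless of count, which is where the O(1) memory and creation time come from.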
Modification
Benchmark Results
JMH Full Regression Suite (after repeat view optimization):
✅ All benchmarks stable, no regressions detected. The repeat view optimization is transparent to end-to-end performance: materialization occurs in the test suites, so the benefit manifests primarily in repeat-heavy workloads that the standard regression suite does not cover.
Test Results:
Analysis
Why no visible improvement in standard regression suite?
The standard test suite runs to completion and materializes the final result, so the overhead of materialization is amortized.
Benefit would be visible in workloads that:
The optimization prevents memory explosion and GC pressure in such scenarios, which the regression suite doesn't exercise heavily.
Safety:
References
Result
✅ Optimization complete — std.repeat now uses O(1) lazy views instead of O(n*k) eager copies. Memory usage and creation time scale independently of repetition count. All tests pass, no regressions detected.