perf: improve performance of levenshtein by reusing cache buffer #19532
Conversation
Force-pushed from 479c5d6 to 02bc8e8.
Add a levenshtein_with_buffer() function that accepts a reusable cache buffer to avoid allocating a new Vec<usize> for each distance calculation.

Changes:
- Added levenshtein_with_buffer() in datafusion-common, which takes a mutable Vec<usize> buffer parameter
- Updated the levenshtein function to use the optimized version with a reusable buffer across all rows
- Applied the optimization to all data types: Utf8View, Utf8, and LargeUtf8
- Added a benchmark to measure the performance improvement

Optimization:
- Before: a new Vec<usize> cache was allocated for every row
- After: a single Vec<usize> buffer is reused across all rows

Benchmark Results:
- size=1024, str_len=8: 60.6 µs → 45.9 µs (24% faster)
- size=1024, str_len=32: 615.5 µs → 598.5 µs (3% faster)
- size=4096, str_len=8: 234.7 µs → 180.5 µs (23% faster)
- size=4096, str_len=32: 2.46 ms → 2.38 ms (3% faster)

The optimization shows significant improvements for shorter strings (23–24%), where allocation overhead is more prominent relative to the algorithm cost. For longer strings the O(m×n) algorithm complexity dominates, but eliminating per-row allocations still yields a measurable 3% improvement.
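A minimal sketch of the reuse pattern this commit describes, assuming the PR's levenshtein_with_buffer is in scope. The levenshtein_all helper and the plain slice of pairs are illustrative stand-ins; the real code iterates Arrow string arrays (Utf8, Utf8View, LargeUtf8):

```rust
// Illustrative only: a plain slice of pairs keeps the sketch
// self-contained in place of the real Arrow array iteration.
fn levenshtein_all(pairs: &[(&str, &str)]) -> Vec<usize> {
    // Allocated once, outside the per-row loop; levenshtein_with_buffer
    // resizes it as needed and reuses its capacity on every call.
    let mut cache: Vec<usize> = Vec::new();
    pairs
        .iter()
        .map(|&(a, b)| levenshtein_with_buffer(a, b, &mut cache))
        .collect()
}
```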
Force-pushed from 02bc8e8 to a0e626f.
```rust
/// when computing many distances.
///
/// The `cache` buffer will be resized as needed and reused across calls.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
```
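For context, a minimal sketch of how a buffered variant like this can be implemented with the classic single-row Wagner–Fischer dynamic program. This is an illustration consistent with the signature above, not necessarily the PR's exact body:

```rust
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    let b_len = b.chars().count();
    if a.is_empty() {
        return b_len;
    }
    if b_len == 0 {
        return a.chars().count();
    }

    // Reuse the caller's buffer for the single DP row instead of
    // allocating: row 0 of the edit-distance matrix is 1..=b_len.
    cache.clear();
    cache.extend(1..=b_len);

    let mut result = 0;
    for (i, a_ch) in a.chars().enumerate() {
        result = i + 1; // d[i+1][0]: delete the first i+1 chars of `a`
        let mut prev_diag = i; // d[i][0], the diagonal neighbor
        for (j, b_ch) in b.chars().enumerate() {
            let substitution = prev_diag + usize::from(a_ch != b_ch);
            prev_diag = cache[j]; // old d[i][j+1], diagonal for next j
            result = substitution.min(result + 1).min(prev_diag + 1);
            cache[j] = result; // new d[i+1][j+1]
        }
    }
    result
}
```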
I wonder if there's a way to unify this with generic_levenshtein to avoid duplication, something like having generic_levenshtein call this but creating a vec buffer at the callsite? 🤔
Okay, I refactored it to generic_levenshtein_with_buffer. Now generic_levenshtein calls it.
Make levenshtein_with_buffer generic to work with any iterator types, and have both levenshtein() and generic_levenshtein() call it. This addresses reviewer feedback to unify the implementations and avoid duplicating the Levenshtein algorithm logic.

Changes:
- Created generic_levenshtein_with_buffer(), which accepts iterators and a cache
- Changed generic_levenshtein() to call generic_levenshtein_with_buffer()
- Changed levenshtein() to call generic_levenshtein() (preserves existing behavior)
- Changed levenshtein_with_buffer() to call generic_levenshtein_with_buffer()
- Single implementation of the algorithm logic

The refactored code:
- Keeps the generic functionality (works with any iterator types)
- Has no code duplication (single implementation)
- Is easier to maintain (changes only need to be made in one place)
- Preserves all performance characteristics:
  * levenshtein() allocates a cache per call
  * generic_levenshtein() allocates a cache per call
  * levenshtein_with_buffer() reuses the provided cache (optimized)
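A hedged sketch of the layering this commit describes. The generic bounds here are my assumption, not necessarily the PR's exact signature: `b` is taken as a cloneable iterator so the DP can re-walk it once per element of `a`:

```rust
/// Single home for the algorithm; every other entry point delegates here.
pub fn generic_levenshtein_with_buffer<T, B>(
    a: impl IntoIterator<Item = T>,
    b: B,
    cache: &mut Vec<usize>,
) -> usize
where
    T: PartialEq,
    B: Iterator<Item = T> + Clone,
{
    let b_len = b.clone().count();
    cache.clear();
    cache.extend(1..=b_len);

    let mut result = b_len; // covers the empty-`a` case
    for (i, a_elem) in a.into_iter().enumerate() {
        result = i + 1;
        let mut prev_diag = i;
        for (j, b_elem) in b.clone().enumerate() {
            let substitution = prev_diag + usize::from(a_elem != b_elem);
            prev_diag = cache[j];
            result = substitution.min(result + 1).min(prev_diag + 1);
            cache[j] = result;
        }
    }
    result
}

/// Unchanged external behavior: allocates a fresh cache per call.
pub fn generic_levenshtein<T, B>(a: impl IntoIterator<Item = T>, b: B) -> usize
where
    T: PartialEq,
    B: Iterator<Item = T> + Clone,
{
    generic_levenshtein_with_buffer(a, b, &mut Vec::new())
}

pub fn levenshtein(a: &str, b: &str) -> usize {
    generic_levenshtein(a.chars(), b.chars())
}

/// The optimized entry point: reuses the caller's cache across calls.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    generic_levenshtein_with_buffer(a.chars(), b.chars(), cache)
}
```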
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Add a levenshtein_with_buffer() function that accepts a reusable cache buffer to avoid allocating a new Vec<usize> for each distance calculation.

Changes:
- Added levenshtein_with_buffer() in datafusion-common, which takes a mutable Vec<usize> buffer parameter
- Updated the levenshtein function to use the optimized version with a single reusable buffer across all rows
- Applied the optimization to all data types: Utf8View, Utf8, and LargeUtf8
- Added a benchmark to measure the performance improvement

Optimization:
- Before: a new Vec<usize> cache was allocated for every row
- After: a single Vec<usize> buffer is reused across all rows

Benchmark Results:
- size=1024, str_len=8: 60.6 µs → 45.9 µs (24% faster)
- size=1024, str_len=32: 615.5 µs → 598.5 µs (3% faster)
- size=4096, str_len=8: 234.7 µs → 180.5 µs (23% faster)
- size=4096, str_len=32: 2.46 ms → 2.38 ms (3% faster)

The optimization shows significant improvements for shorter strings (23–24%), where allocation overhead is more prominent relative to the algorithm cost. For longer strings the O(m×n) algorithm complexity dominates, but eliminating per-row allocations still yields a measurable 3% improvement.
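The benchmark itself is not shown on this page. As a rough illustration only, a criterion harness in this spirit might look like the following; the names, inputs, and import path are invented, and levenshtein_with_buffer is assumed to be exported from datafusion-common:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical import path for the PR's new function; adjust to wherever
// it actually lives in datafusion-common.
use datafusion_common::levenshtein_with_buffer;

fn bench_levenshtein(c: &mut Criterion) {
    // Invented inputs: 1024 pairs of short strings, roughly matching the
    // size=1024, str_len=8 case reported above.
    let pairs: Vec<(String, String)> = (0..1024)
        .map(|i| (format!("str_{i:04}"), format!("str_{:04}", i + 1)))
        .collect();

    c.bench_function("levenshtein size=1024 str_len=8 (buffer reuse)", |b| {
        b.iter(|| {
            // One buffer for the whole batch, as in the optimized path.
            let mut cache: Vec<usize> = Vec::new();
            pairs
                .iter()
                .map(|(x, y)| levenshtein_with_buffer(x, y, &mut cache))
                .sum::<usize>()
        })
    });
}

criterion_group!(benches, bench_levenshtein);
criterion_main!(benches);
```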
Are these changes tested?
Are there any user-facing changes?