perf: improve performance of levenshtein by reusing cache buffer #19532
Conversation
Force-pushed from 479c5d6 to 02bc8e8.
Add a levenshtein_with_buffer() function that accepts a reusable cache buffer to avoid allocating a new Vec<usize> for each distance calculation.

Changes:
- Added levenshtein_with_buffer() in datafusion-common, which takes a mutable Vec<usize> buffer parameter
- Updated the levenshtein function to use the optimized version with a reusable buffer across all rows
- Applied the optimization to all data types: Utf8View, Utf8, and LargeUtf8
- Added a benchmark to measure the performance improvement

Optimization:
- Before: a new Vec<usize> cache was allocated for every row
- After: a single Vec<usize> buffer is reused across all rows

Benchmark Results:
- size=1024, str_len=8: 60.6 µs → 45.9 µs (24% faster)
- size=1024, str_len=32: 615.5 µs → 598.5 µs (3% faster)
- size=4096, str_len=8: 234.7 µs → 180.5 µs (23% faster)
- size=4096, str_len=32: 2.46 ms → 2.38 ms (3% faster)

The optimization shows significant improvements for shorter strings (23–24%), where allocation overhead is more prominent relative to the algorithm cost. For longer strings the O(m×n) algorithm complexity dominates, but eliminating per-row allocations still yields a measurable 3% improvement.
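A minimal sketch of the reuse pattern this commit describes, assuming the PR's levenshtein_with_buffer is in scope. The levenshtein_all helper and the plain slice of pairs are illustrative stand-ins; the real code iterates Arrow string arrays (Utf8, Utf8View, LargeUtf8):

```rust
// Illustrative only: a plain slice of pairs keeps the sketch
// self-contained in place of the real Arrow array iteration.
fn levenshtein_all(pairs: &[(&str, &str)]) -> Vec<usize> {
    // Allocated once, outside the per-row loop; levenshtein_with_buffer
    // resizes it as needed and reuses its capacity on every call.
    let mut cache: Vec<usize> = Vec::new();
    pairs
        .iter()
        .map(|&(a, b)| levenshtein_with_buffer(a, b, &mut cache))
        .collect()
}
```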
Force-pushed from 02bc8e8 to a0e626f.
```rust
/// when computing many distances.
///
/// The `cache` buffer will be resized as needed and reused across calls.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
```
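For context, a minimal sketch of how a buffered variant like this can be implemented with the classic single-row Wagner–Fischer dynamic program. This is an illustration consistent with the signature above, not necessarily the PR's exact body:

```rust
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    let b_len = b.chars().count();
    if a.is_empty() {
        return b_len;
    }
    if b_len == 0 {
        return a.chars().count();
    }

    // Reuse the caller's buffer for the single DP row instead of
    // allocating: row 0 of the edit-distance matrix is 1..=b_len.
    cache.clear();
    cache.extend(1..=b_len);

    let mut result = 0;
    for (i, a_ch) in a.chars().enumerate() {
        result = i + 1; // d[i+1][0]: delete the first i+1 chars of `a`
        let mut prev_diag = i; // d[i][0], the diagonal neighbor
        for (j, b_ch) in b.chars().enumerate() {
            let substitution = prev_diag + usize::from(a_ch != b_ch);
            prev_diag = cache[j]; // old d[i][j+1], diagonal for next j
            result = substitution.min(result + 1).min(prev_diag + 1);
            cache[j] = result; // new d[i+1][j+1]
        }
    }
    result
}
```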
I wonder if there's a way to unify this with generic_levenshtein to avoid duplication, something like having generic_levenshtein call this but creating a vec buffer at the callsite? 🤔
Okay, I refactored it to generic_levenshtein_with_buffer. Now generic_levenshtein calls it.
Make levenshtein_with_buffer generic to work with any iterator types, and have both levenshtein() and generic_levenshtein() call it. This addresses reviewer feedback to unify the implementations and avoid duplicating the Levenshtein algorithm logic.

Changes:
- Created generic_levenshtein_with_buffer(), which accepts iterators and a cache
- Changed generic_levenshtein() to call generic_levenshtein_with_buffer()
- Changed levenshtein() to call generic_levenshtein() (preserves existing behavior)
- Changed levenshtein_with_buffer() to call generic_levenshtein_with_buffer()
- Single implementation of the algorithm logic

The refactored code:
- Keeps the generic functionality (works with any iterator types)
- Has no code duplication (single implementation)
- Is easier to maintain (changes only need to be made in one place)
- Preserves all performance characteristics:
  * levenshtein() allocates a cache per call
  * generic_levenshtein() allocates a cache per call
  * levenshtein_with_buffer() reuses the provided cache (optimized)
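A hedged sketch of the layering this commit describes. The generic bounds here are my assumption, not necessarily the PR's exact signature: `b` is taken as a cloneable iterator so the DP can re-walk it once per element of `a`:

```rust
/// Single home for the algorithm; every other entry point delegates here.
pub fn generic_levenshtein_with_buffer<T, B>(
    a: impl IntoIterator<Item = T>,
    b: B,
    cache: &mut Vec<usize>,
) -> usize
where
    T: PartialEq,
    B: Iterator<Item = T> + Clone,
{
    let b_len = b.clone().count();
    cache.clear();
    cache.extend(1..=b_len);

    let mut result = b_len; // covers the empty-`a` case
    for (i, a_elem) in a.into_iter().enumerate() {
        result = i + 1;
        let mut prev_diag = i;
        for (j, b_elem) in b.clone().enumerate() {
            let substitution = prev_diag + usize::from(a_elem != b_elem);
            prev_diag = cache[j];
            result = substitution.min(result + 1).min(prev_diag + 1);
            cache[j] = result;
        }
    }
    result
}

/// Unchanged external behavior: allocates a fresh cache per call.
pub fn generic_levenshtein<T, B>(a: impl IntoIterator<Item = T>, b: B) -> usize
where
    T: PartialEq,
    B: Iterator<Item = T> + Clone,
{
    generic_levenshtein_with_buffer(a, b, &mut Vec::new())
}

pub fn levenshtein(a: &str, b: &str) -> usize {
    generic_levenshtein(a.chars(), b.chars())
}

/// The optimized entry point: reuses the caller's cache across calls.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    generic_levenshtein_with_buffer(a.chars(), b.chars(), cache)
}
```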
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Add a levenshtein_with_buffer() function that accepts a reusable cache buffer to avoid allocating a new Vec<usize> for each distance calculation.

Changes:
- Added levenshtein_with_buffer() in datafusion-common, which takes a mutable Vec<usize> buffer parameter
- Updated the levenshtein function to use the optimized version with a single reusable buffer across all rows
- Applied the optimization to all data types: Utf8View, Utf8, and LargeUtf8
- Added a benchmark to measure the performance improvement

Optimization:
- Before: a new Vec<usize> cache was allocated for every row
- After: a single Vec<usize> buffer is reused across all rows

Benchmark Results:
- size=1024, str_len=8: 60.6 µs → 45.9 µs (24% faster)
- size=1024, str_len=32: 615.5 µs → 598.5 µs (3% faster)
- size=4096, str_len=8: 234.7 µs → 180.5 µs (23% faster)
- size=4096, str_len=32: 2.46 ms → 2.38 ms (3% faster)

The optimization shows significant improvements for shorter strings (23–24%), where allocation overhead is more prominent relative to the algorithm cost. For longer strings the O(m×n) algorithm complexity dominates, but eliminating per-row allocations still yields a measurable 3% improvement.
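The benchmark itself is not shown on this page. As a rough illustration only, a criterion harness in this spirit might look like the following; the names, inputs, and import path are invented, and levenshtein_with_buffer is assumed to be exported from datafusion-common:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical import path for the PR's new function; adjust to wherever
// it actually lives in datafusion-common.
use datafusion_common::levenshtein_with_buffer;

fn bench_levenshtein(c: &mut Criterion) {
    // Invented inputs: 1024 pairs of short strings, roughly matching the
    // size=1024, str_len=8 case reported above.
    let pairs: Vec<(String, String)> = (0..1024)
        .map(|i| (format!("str_{i:04}"), format!("str_{:04}", i + 1)))
        .collect();

    c.bench_function("levenshtein size=1024 str_len=8 (buffer reuse)", |b| {
        b.iter(|| {
            // One buffer for the whole batch, as in the optimized path.
            let mut cache: Vec<usize> = Vec::new();
            pairs
                .iter()
                .map(|(x, y)| levenshtein_with_buffer(x, y, &mut cache))
                .sum::<usize>()
        })
    });
}

criterion_group!(benches, bench_levenshtein);
criterion_main!(benches);
```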
Are these changes tested?
Are there any user-facing changes?