
Conversation

@viirya (Member) commented Dec 28, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Add a levenshtein_with_buffer() function that accepts a reusable cache buffer to avoid allocating a new Vec<usize> for each distance calculation.

Changes:

  • Added levenshtein_with_buffer() in datafusion-common that takes a mutable
    Vec<usize> buffer parameter
  • Updated the levenshtein function to use the optimized version with a single
    buffer reused across all rows
  • Applied the optimization to all data types: Utf8View, Utf8, and LargeUtf8
  • Added a benchmark to measure the performance improvement

Optimization:

  • Before: allocated a new Vec<usize> cache for every row
  • After: a single Vec<usize> buffer reused across all rows
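The reuse pattern can be sketched as follows. This is a minimal illustration, not the actual datafusion-common code: only the levenshtein_with_buffer signature is taken from the PR, and the body is a standard single-row dynamic-programming Levenshtein.

```rust
// Sketch of a single-row Levenshtein DP that reuses a caller-provided
// buffer. Only the signature matches the PR; the body is illustrative.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() {
        return b.len();
    }
    if b.is_empty() {
        return a.len();
    }

    // Reinitialize the first DP row in place. Once the buffer has grown to
    // the longest `b` seen so far, no further allocation happens.
    cache.clear();
    cache.extend(1..=b.len());

    let mut result = b.len();
    for (i, &ca) in a.iter().enumerate() {
        result = i + 1; // distance from a[..=i] to the empty string
        let mut prev_diag = i; // diagonal cell carried over from the previous row
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let next = (result + 1) // insertion
                .min(cache[j] + 1) // deletion
                .min(prev_diag + cost); // substitution
            prev_diag = cache[j];
            cache[j] = next;
            result = next;
        }
    }
    result
}

fn main() {
    // One buffer shared across many "rows": the allocation is amortized,
    // which is exactly what the per-row numbers below measure.
    let mut cache = Vec::new();
    let rows = [("kitten", "sitting"), ("flaw", "lawn"), ("", "abc")];
    for (a, b) in rows {
        println!("{a:?} -> {b:?}: {}", levenshtein_with_buffer(a, b, &mut cache));
    }
}
```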

Benchmark Results:

  • size=1024, str_len=8: 60.6 µs → 45.9 µs (24% faster)
  • size=1024, str_len=32: 615.5 µs → 598.5 µs (3% faster)
  • size=4096, str_len=8: 234.7 µs → 180.5 µs (23% faster)
  • size=4096, str_len=32: 2.46 ms → 2.38 ms (3% faster)

The optimization shows significant improvements for shorter strings (23-24%), where allocation overhead is large relative to the cost of the algorithm itself. For longer strings the O(m×n) complexity of the algorithm dominates, but eliminating per-row allocations still yields a measurable 3% improvement.

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added labels: common (Related to common crate), functions (Changes to functions implementation) on Dec 28, 2025
@viirya force-pushed the levenshtein_optimize branch from 479c5d6 to 02bc8e8 on December 28, 2025 21:57
@viirya force-pushed the levenshtein_optimize branch from 02bc8e8 to a0e626f on December 28, 2025 21:59
@viirya viirya requested a review from andygrove December 29, 2025 01:25
/// when computing many distances.
///
/// The `cache` buffer will be resized as needed and reused across calls.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
Contributor

I wonder if there's a way to unify this with generic_levenshtein to avoid duplication, something like having generic_levenshtein call this but creating a vec buffer at the callsite? 🤔

Member Author

Okay, I refactored it to generic_levenshtein_with_buffer. Now generic_levenshtein calls it.

Make levenshtein_with_buffer generic to work with any iterator types,
and have both levenshtein() and generic_levenshtein() call it.

This addresses reviewer feedback to unify the implementations and
avoid duplicating the Levenshtein algorithm logic.

Changes:
- Created generic_levenshtein_with_buffer() that accepts iterators and cache
- Changed generic_levenshtein() to call generic_levenshtein_with_buffer()
- Changed levenshtein() to call generic_levenshtein() (preserves existing behavior)
- Changed levenshtein_with_buffer() to call generic_levenshtein_with_buffer()
- Single implementation of the algorithm logic

The refactored code:
- Keeps generic functionality (works with any iterator types)
- Has no code duplication (single implementation)
- Is easier to maintain (changes only need to be made in one place)
- Preserves all performance characteristics:
  * levenshtein() allocates cache per call
  * generic_levenshtein() allocates cache per call
  * levenshtein_with_buffer() reuses the provided cache (optimized)
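The resulting layering described above could look roughly like this. The generic bounds and wrapper signatures are assumptions for illustration; the real datafusion-common code may constrain the generics differently.

```rust
// Illustrative sketch of the unified layering: one generic DP
// implementation, three thin wrappers. Signatures are assumed,
// not the exact crate API.
pub fn generic_levenshtein_with_buffer<T, A, B>(a: A, b: B, cache: &mut Vec<usize>) -> usize
where
    T: PartialEq,
    A: IntoIterator<Item = T>,
    B: IntoIterator<Item = T>,
{
    let a: Vec<T> = a.into_iter().collect();
    let b: Vec<T> = b.into_iter().collect();
    // Reuse the caller's buffer as the single DP row.
    cache.clear();
    cache.extend(1..=b.len());
    let mut result = b.len();
    for (i, ca) in a.iter().enumerate() {
        result = i + 1;
        let mut prev_diag = i;
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            let next = (result + 1).min(cache[j] + 1).min(prev_diag + cost);
            prev_diag = cache[j];
            cache[j] = next;
            result = next;
        }
    }
    result
}

// Allocates its own cache per call (preserves existing behavior).
pub fn generic_levenshtein<T: PartialEq>(
    a: impl IntoIterator<Item = T>,
    b: impl IntoIterator<Item = T>,
) -> usize {
    generic_levenshtein_with_buffer(a, b, &mut Vec::new())
}

// String convenience wrapper, also allocating per call.
pub fn levenshtein(a: &str, b: &str) -> usize {
    generic_levenshtein(a.chars(), b.chars())
}

// The optimized path: reuses the caller's cache across rows.
pub fn levenshtein_with_buffer(a: &str, b: &str, cache: &mut Vec<usize>) -> usize {
    generic_levenshtein_with_buffer(a.chars(), b.chars(), cache)
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(generic_levenshtein([1, 2, 3], [2, 2, 3]), 1);
    let mut cache = Vec::new();
    assert_eq!(levenshtein_with_buffer("rust", "dust", &mut cache), 1);
    println!("ok");
}
```

With this shape there is a single copy of the algorithm, and the buffer-reusing entry points simply thread the caller's Vec through to it.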
@rluvaton added the performance (Make DataFusion faster) label on Dec 29, 2025