Add fuzzy/typo-tolerant matching for component name and I/O fields #2427

Open
Mbeaulne wants to merge 1 commit into 06-18-improve_component_search_scoring_relevance from 06-18-add_safe_typo_tolerance_for_names_and_io_fields

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Adds typo tolerance to the lexical search for component names and input/output fields. A query token of 4–6 characters may match within an edit distance of 1; a token of 7 or more characters may match within an edit distance of 2. Fuzzy matches score slightly lower than exact matches via a dedicated FUZZY_MATCH_BONUS_MULTIPLIER. Typo tolerance is intentionally restricted to the name and io fields: description and implementation text are excluded from fuzzy matching to avoid noisy results.

The implementation uses a standard dynamic programming Levenshtein distance algorithm with an early-exit optimisation that abandons the computation once the running row minimum exceeds the allowed distance.
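
As a rough illustration of the bounded computation described above, the following sketch shows one way to write a two-row Levenshtein distance with a length-difference guard and a row-minimum early exit. The function name `boundedLevenshtein` and its signature are assumptions made for this example; they are not taken from the PR's code.

```typescript
// Hypothetical helper (name and signature are assumptions, not the PR's code).
// Returns the exact edit distance when it is <= maxDistance; otherwise returns
// some value greater than maxDistance.
function boundedLevenshtein(a: string, b: string, maxDistance: number): number {
  // The length difference is a lower bound on the edit distance.
  if (Math.abs(a.length - b.length) > maxDistance) return maxDistance + 1;
  if (a === b) return 0;

  // previous[j] = distance between the first i-1 chars of a and the first j chars of b.
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    let rowMinimum = i;

    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      const value = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost, // substitution or match
      );
      current.push(value);
      if (value < rowMinimum) rowMinimum = value;
    }

    // Row minima never decrease from one row to the next, so once the whole
    // row exceeds the budget the final distance must exceed it too.
    if (rowMinimum > maxDistance) return maxDistance + 1;
    previous = current;
  }

  return previous[b.length];
}
```

Callers would compare the result against the budget, for example `boundedLevenshtein("filtr", "filter", 1) <= 1`.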

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Run the existing test suite — two new test cases cover the expected behaviour:
    • Confirm that queries like filtr (for filter_rows) and datset (for dataset) return the correct component.
    • Confirm that typo queries against description or implementation text (e.g. xgbost) return no results.

Additional Comments

Fuzzy matching is skipped entirely when the computed max edit distance is 0 (tokens shorter than 4 characters), keeping short-token searches fast and precise.
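
A minimal sketch of the length gate, assuming the thresholds stated in the description (4–6 characters allow 1 edit, 7 or more allow 2). The name `maxTypoDistance` appears in the review below, but the signature and the `minFuzzyLength` parameter here are assumptions. The review summarises the floor as 5 rather than 4, so the floor is left configurable instead of guessing which value the code uses.

```typescript
// Hypothetical sketch; the default floor of 4 follows the PR description.
// Passing 5 reproduces the thresholds quoted in the review summary.
function maxTypoDistance(token: string, minFuzzyLength: number = 4): number {
  if (token.length < minFuzzyLength) return 0; // too short: exact matching only
  if (token.length < 7) return 1; // medium tokens: one edit
  return 2; // 7+ characters: up to two edits
}

// A budget of 0 means the fuzzy branch can be skipped entirely.
function shouldTryFuzzy(token: string, minFuzzyLength: number = 4): boolean {
  return maxTypoDistance(token, minFuzzyLength) > 0;
}
```

With these assumptions, `shouldTryFuzzy("csv")` is false, so a three-character token never reaches the edit-distance computation.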

github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-add_safe_typo_tolerance_for_names_and_io_fields/9ae8af8

Mbeaulne commented Jun 18, 2026

Collaborator Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@Mbeaulne Mbeaulne changed the title from "Add safe typo tolerance for names and IO fields." to "Add fuzzy/typo-tolerant matching for component name and I/O fields" Jun 18, 2026
@Mbeaulne Mbeaulne marked this pull request as ready for review June 18, 2026 17:35
@Mbeaulne Mbeaulne requested a review from a team as a code owner June 18, 2026 17:35
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from bbd53a7 to 36032c1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch 2 times, most recently from 0a7d588 to e379e64 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from 36032c1 to d8e31f8 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from e379e64 to 89029f0 Compare June 18, 2026 20:49
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d8e31f8 to d4d0a60 Compare June 18, 2026 20:49
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from 89029f0 to fc80727 Compare June 18, 2026 21:02
@camielvs

Collaborator

🤖 Code review — Add fuzzy/typo-tolerant matching for name and I/O fields

Well-scoped feature. Restricting fuzzy matching to name + io (the high-precision fields), weighting it below exact/prefix (0.75×), and the length-gated edit budget (<5 → none, 5–6 → 1, 7+ → 2) are all the right instincts. The maxTypoDistance comment justifying the length-5 floor with concrete collisions (data↔date, path↔bath, list↔last) is great. The bounded Levenshtein is correct — the length-diff short-circuit and the rowMinimum > maxDistance early-out are both valid. And the negative test asserting xgbost does not fuzzy-match description/implementation text locks the boundary in place.

Findings:

  • This is the heaviest per-keystroke addition in the stack. The fuzzy branch runs precisely when !fieldText.includes(token), i.e. for the majority of (entry, token) pairs, since most entries don't contain most query tokens. For each such pair it runs Levenshtein against every name/io field token. It's bounded (length-diff guard, early-out, length-≥5 gate, two fields only, synonym-expanded token count), so fine for hundreds of components, but stacked on the double index pass from #2426 (Improve component search scoring relevance) it's worth watching as libraries grow. lexicalSearch still runs in the render path until the debounce from #2433 (Debounce component search input) lands.

  • Distance-2 on io tokens is fairly permissive. A 7+ char query token tolerating 2 edits against generic I/O names could surface the occasional surprising match. The name field is high-signal so it's lower-risk there; might be worth a quick sanity check on real I/O vocabularies, or reserving distance-2 for name only. Not blocking.

  • Nit: redundant guard in hasFuzzyTokenMatch. !fieldToken.includes(token) can never be false here — the caller only enters the fuzzy branch when the whole fieldText lacks token as a substring, so no individual field token can contain it either. The !token.includes(fieldToken) half is meaningful (skips fuzzy when the field token is a substring of the query token). Dropping the dead half would make the intent clearer.

  • Minor (carryover): synonym/stem-expanded tokens also flow through fuzzy matching, so a fuzzy hit on an expanded variant adds another 0.75× contribution, compounding the per-concept stacking noted on #2424 (Normalize component search tokens for better matching) and #2426 (Improve component search scoring relevance). Low stakes given the 0.75 weight.
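
To make the findings above concrete, here is a hedged sketch of how the fuzzy branch might be shaped. Only the names `hasFuzzyTokenMatch` and `FUZZY_MATCH_BONUS_MULTIPLIER`, the 0.75 weight, and the name/io restriction come from this thread. The `scoreToken` function, the `FieldName` type, the tokenisation regex, and the `baseWeight` parameter are assumptions for illustration. The sketch keeps only the `!token.includes(fieldToken)` guard, uses a plain (unbounded) edit distance for brevity, and assumes inputs are already lowercased.

```typescript
const FUZZY_MATCH_BONUS_MULTIPLIER = 0.75; // weight quoted in the review

type FieldName = "name" | "io" | "description" | "implementation"; // assumed field set
const FUZZY_FIELDS: ReadonlySet<FieldName> = new Set<FieldName>(["name", "io"]);

// Plain two-row Levenshtein distance; a real implementation would bound it.
function editDistance(a: string, b: string): number {
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[b.length];
}

function hasFuzzyTokenMatch(token: string, fieldTokens: string[], maxDistance: number): boolean {
  return fieldTokens.some(
    (fieldToken) =>
      // Skip fuzzy when the field token is a substring of the query token.
      !token.includes(fieldToken) && editDistance(token, fieldToken) <= maxDistance,
  );
}

// Hypothetical scoring step: exact substring hits get the full weight, fuzzy
// hits on name/io get the reduced weight, and everything else scores 0.
function scoreToken(
  token: string,
  field: FieldName,
  fieldText: string,
  baseWeight: number,
  maxDistance: number,
): number {
  if (fieldText.includes(token)) return baseWeight;
  if (maxDistance === 0 || !FUZZY_FIELDS.has(field)) return 0;
  const fieldTokens = fieldText.split(/[^a-z0-9]+/).filter(Boolean);
  return hasFuzzyTokenMatch(token, fieldTokens, maxDistance)
    ? baseWeight * FUZZY_MATCH_BONUS_MULTIPLIER
    : 0;
}
```

Under these assumptions, `scoreToken("filtr", "name", "filter_rows", 10, 1)` takes the fuzzy path, while `scoreToken("xgbost", "description", "trains an xgboost model", 10, 1)` returns 0 because description is not a fuzzy field.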

Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d4d0a60 to d403991 Compare June 24, 2026 18:11
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from fc80727 to 55e7004 Compare June 24, 2026 18:11
@Mbeaulne Mbeaulne requested a review from camielvs June 24, 2026 18:19
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d403991 to 5ff800e Compare June 24, 2026 19:52
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from 55e7004 to 3422035 Compare June 24, 2026 19:52
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from 5ff800e to e9e9957 Compare June 25, 2026 15:55
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from 3422035 to 80f3e75 Compare June 25, 2026 15:55
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from e9e9957 to 931f4ea Compare June 25, 2026 19:38
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch 2 times, most recently from 81a12de to 628e387 Compare June 25, 2026 19:43
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from 931f4ea to afd8b04 Compare June 26, 2026 13:51
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from 628e387 to 9ae8af8 Compare June 26, 2026 13:51
3 participants