
Improve component search scoring relevance #2426

Open
Mbeaulne wants to merge 1 commit into 06-18-add_synonym_groups from 06-18-improve_component_search_scoring_relevance

Conversation

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Improves the lexical search scoring model with three enhancements:

  • Prefix match boost: Partial query terms (e.g. classif) now rank components where the term is a prefix of a token higher than components where it appears only as a mid-string substring.
  • IDF-style rare token weighting: Query tokens that match fewer components are weighted more heavily than common tokens, preventing high-frequency terms from dominating scores. For example, searching train xgboost will surface components mentioning xgboost above generic train matches.
  • All-query-tokens bonus: When a component matches every token in the query (across any fields), it receives an additional score bonus, ensuring more complete matches rank above partial ones.
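
The difference between a prefix match and a mid-string substring match, and the all-query-tokens check, can be illustrated with a minimal sketch. The helper names (`splitFieldTokens`, `classifyMatch`, `matchesAllTokens`) are hypothetical and are not taken from the PR's code; the tokenization on non-alphanumeric boundaries is likewise an assumption.

```typescript
// Hypothetical helpers for illustration; the PR's actual implementation is not shown here.
type MatchKind = "prefix" | "substring" | "none";

// Split field text into lowercase word tokens on non-alphanumeric boundaries (assumption).
function splitFieldTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Classify how a query term matches a field: at the start of a token,
// somewhere inside a token, or not at all.
function classifyMatch(fieldText: string, term: string): MatchKind {
  const needle = term.toLowerCase();
  const tokens = splitFieldTokens(fieldText);
  if (tokens.some((t) => t.startsWith(needle))) return "prefix";
  if (tokens.some((t) => t.includes(needle))) return "substring";
  return "none";
}

// True when every required query token matches at least one field,
// regardless of which field each token is found in.
function matchesAllTokens(fields: string[], requiredTokens: string[]): boolean {
  return requiredTokens.every((token) =>
    fields.some((field) => classifyMatch(field, token) !== "none"),
  );
}
```

Under these assumptions, `classifyMatch("classify_rows", "classif")` yields `"prefix"` while `classifyMatch("misclassified_items", "classif")` yields `"substring"`, so a scorer could give the first a larger boost.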

The phrase match bonus previously applied only to the name field has been extended to all search fields using per-field bonus weights (FIELD_PHRASE_BONUS).
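
A per-field phrase bonus along these lines might look like the sketch below. Only the constant name `FIELD_PHRASE_BONUS` comes from the description; the field names, the bonus values, the whole-token matching, and the rule that a single token does not count as a phrase are all assumptions made for illustration.

```typescript
// Field names and bonus values are illustrative assumptions, not the PR's constants.
type SearchField = "name" | "description" | "tags";

const FIELD_PHRASE_BONUS: Record<SearchField, number> = {
  name: 5,
  description: 2,
  tags: 1,
};

// Sum the bonus of every field whose normalized text contains the
// required query tokens as one contiguous run.
function phraseBonus(
  normalizedFields: Record<SearchField, string>,
  requiredTokens: string[],
): number {
  if (requiredTokens.length < 2) return 0; // assumption: one token is not a phrase
  const phrase = requiredTokens.join(" ");
  let bonus = 0;
  for (const field of Object.keys(FIELD_PHRASE_BONUS) as SearchField[]) {
    // Pad with spaces so the phrase only matches on whole-token boundaries.
    if (` ${normalizedFields[field]} `.includes(` ${phrase} `)) {
      bonus += FIELD_PHRASE_BONUS[field];
    }
  }
  return bonus;
}
```

With these made-up weights, a component whose name and description both contain `train test split` would collect both field bonuses, while one that only mentions the words in scattered positions would collect none.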

The tokenize function has been refactored to extract a reusable uniqueTokens helper, and a new requiredQueryTokens function produces stemmed, deduplicated tokens from the raw query without synonym expansion, used for phrase and completeness checks.
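
As a rough sketch of how the two helpers could relate: `uniqueTokens` and `requiredQueryTokens` are the names used above, but their bodies here are guesses, and `naiveStem` is a toy stand-in for whatever stemmer the index actually uses.

```typescript
// Deduplicate tokens while preserving first-seen order.
function uniqueTokens(tokens: string[]): string[] {
  return Array.from(new Set(tokens));
}

// Toy stemmer for illustration only (assumption): strips a few common suffixes.
function naiveStem(token: string): string {
  for (const suffix of ["ing", "es", "s"]) {
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length);
    }
  }
  return token;
}

// Stemmed, deduplicated tokens from the raw query, with no synonym expansion.
function requiredQueryTokens(query: string): string[] {
  const rawTokens = query
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
  return uniqueTokens(rawTokens.map(naiveStem));
}
```

Keeping this list free of synonyms means the phrase and completeness checks ask "did the component match what the user typed", rather than "did it match any expansion".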

Related Issue and Pull requests

Type of Change

  • Bug fix
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Test Instructions

Three new unit tests cover the added behaviors:

  1. Search classif — verify classify_rows ranks above a component with classif as a non-prefix substring.
  2. Search train xgboost — verify the component with the rare token xgboost ranks first.
  3. Search train model — verify the component matching both tokens across fields ranks above one matching only train.

Run the test suite with:

npx jest componentSearchIndex

Additional Comments

Token weights are computed per-query using a smoothed inverse document frequency: 1 + log((N+1) / (df+1)), where N is the index size and df is the number of entries containing the token.
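
The formula can be sketched as follows. The index shape (one `Set` of tokens per entry) and the function names are simplifying assumptions; the real index may decide "contains the token" differently, for example by substring matching.

```typescript
// Smoothed inverse document frequency: 1 + log((N + 1) / (df + 1)).
function rareTokenWeight(indexSize: number, documentFrequency: number): number {
  return 1 + Math.log((indexSize + 1) / (documentFrequency + 1));
}

// Hypothetical index shape: each entry is reduced to the set of tokens it contains.
function sketchTokenWeights(
  entries: Array<Set<string>>,
  queryTokens: string[],
): Map<string, number> {
  const weights = new Map<string, number>();
  for (const token of queryTokens) {
    // Count matching entries with a plain counter instead of building a filtered array.
    let documentFrequency = 0;
    for (const entry of entries) {
      if (entry.has(token)) documentFrequency += 1;
    }
    weights.set(token, rareTokenWeight(entries.length, documentFrequency));
  }
  return weights;
}
```

A token present in every entry gets weight 1 + log(1) = 1, and the weight grows as the token becomes rarer; because df can never exceed N, the logarithm is never negative.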

@github-actions

github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-improve_component_search_scoring_relevance/afd8b04

Mbeaulne commented Jun 18, 2026

Collaborator Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.test.ts Outdated
Comment thread src/services/componentSearchIndex.test.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 2655160 to dce82a1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from bbd53a7 to 36032c1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from dce82a1 to f5a29c0 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch 2 times, most recently from d8e31f8 to d4d0a60 Compare June 18, 2026 20:49
@camielvs

Collaborator

🤖 Code review — Improve component search scoring relevance

This is the strongest PR in the stack so far. Four well-chosen relevance signals — word-boundary prefix bonus, IDF rare-token weighting, all-query-tokens bonus, and per-field phrase bonuses — and the tests are genuinely discriminating: each one is constructed so it fails if the specific signal is removed (e.g. the alphabetical tie-break deliberately favors the wrong candidate so only the bonus can flip it). The inline comments explaining that are exactly what a reviewer wants. The per-entry fieldTokensFor cache to avoid re-splitting field text inside the hot per-token loop is a nice touch.

Findings:

  • IDF down-weights common tokens but doesn't resolve the synonym/stem stacking from #2424 (Normalize component search tokens for better matching) and #2425 (Add synonym expansion to component lexical search). Scoring is still Σ over tokens of fieldWeight × tokenWeight, and tokens is the stem+synonym-expanded set. So a component containing several members of one synonym group (or both the inflected and stemmed form of a word) still accumulates a contribution per surface variant — IDF only scales each, it doesn't collapse them to one concept. If you want one logical match to count once, dedupe by group/stem before the per-token loop. IDF genuinely helps (common expansions get ~1.0 weight), so this is lower-stakes than before, but the additive stacking is still there.

  • Math.max(0, inverseFrequency) is dead code. documentFrequency ≤ index.length always, so (N+1)/(df+1) ≥ 1 and log(...) ≥ 0. The clamp can never fire. Harmless, but either drop it or add a comment if it's guarding against a future change.

  • Per-keystroke cost is now ~2× a full index pass. buildRareTokenWeights scans the whole index once per query token, then scoreEntry scans again, and the all-tokens bonus re-runs entryMatchesToken per entry. lexicalSearch runs in the DashboardComponentsV2View render path on every query change (debouncing only lands in #2433, Debounce component search input), and synonym expansion multiplies the token count. Fine for hundreds of components; worth keeping in mind as libraries grow or if entryMatchesToken's substring scans get hotter. Minor: index.filter(...).length allocates an array just to count — a plain counter avoids it.

  • Inflected multi-word phrase bonus still won't fire (carryover from #2425, Add synonym expansion to component lexical search). The index normalization interleaves [inflected, stem] (training_testing → "... training train testing test"), so the stemmed requiredTokens join "train test" is never contiguous in the index. The common non-inflected case (train test split → train_test_split) works. Low priority.
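
The interleaving issue in the last finding can be reproduced with a small sketch. It assumes the phrase check compares whole tokens, and it uses a toy stemmer (`toyStem`) plus hypothetical helper names; none of this is the PR's actual code.

```typescript
// Toy stemmer (assumption): strips a trailing "ing" from longer tokens only.
function toyStem(token: string): string {
  return token.length > 5 && token.endsWith("ing") ? token.slice(0, -3) : token;
}

function splitWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Index-side normalization as described in the finding: each token is
// followed by its stem when the two differ, giving [inflected, stem] pairs.
function normalizeForIndex(text: string): string[] {
  const out: string[] = [];
  for (const token of splitWords(text)) {
    out.push(token);
    const stem = toyStem(token);
    if (stem !== token) out.push(stem);
  }
  return out;
}

// Query-side tokens: stemmed only, no interleaving.
function stemmedQueryTokens(query: string): string[] {
  return splitWords(query).map(toyStem);
}

// True when `needle` appears in `haystack` as a contiguous run of whole tokens.
function containsTokenRun(haystack: string[], needle: string[]): boolean {
  if (needle.length === 0) return false;
  for (let start = 0; start + needle.length <= haystack.length; start++) {
    if (needle.every((token, offset) => haystack[start + offset] === token)) {
      return true;
    }
  }
  return false;
}
```

Here `normalizeForIndex("training_testing")` produces `training train testing test`, in which `train` and `test` are separated by `testing`, so the run `train test` is not found. For `train_test_split` nothing is inflected, the index tokens stay `train test split`, and the phrase matches.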

Routing the phrase/all-token bonuses through the synonym-free, stemmed requiredTokens (rather than the expanded tokens) is the correct separation — good call.

Comment thread src/services/componentSearchIndex.ts
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d4d0a60 to d403991 Compare June 24, 2026 18:11
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from f5a29c0 to 5adba4c Compare June 24, 2026 18:11
@Mbeaulne Mbeaulne requested a review from camielvs June 24, 2026 18:19
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d403991 to 5ff800e Compare June 24, 2026 19:52
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 5adba4c to 1b6c7b3 Compare June 24, 2026 19:52
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch 2 times, most recently from e9e9957 to 931f4ea Compare June 25, 2026 19:38
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 5e9dab4 to 1c2666a Compare June 25, 2026 19:38
Comment thread src/services/componentSearchIndex.ts
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from 931f4ea to afd8b04 Compare June 26, 2026 13:51
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 1c2666a to 8d10b47 Compare June 26, 2026 13:51


3 participants