Improve component search scoring relevance #2426
🤖 Code review: Improve component search scoring relevance
This is the strongest PR in the stack so far. Four well-chosen relevance signals (word-boundary prefix bonus, IDF rare-token weighting, all-query-tokens bonus, and per-field phrase bonuses), and the tests are genuinely discriminating: each one is constructed so it fails if the specific signal is removed (e.g. the alphabetical tie-break deliberately favors the wrong candidate so only the bonus can flip it). The inline comments explaining that are exactly what a reviewer wants. The per-entry
Findings:
Routing the phrase/all-token bonuses through the synonym-free, stemmed |

Description
Improves the lexical search scoring model with three enhancements:
- Queries for a partial term (e.g. `classif`) now rank components where the term is a prefix of a token higher than components where it appears only as a mid-string substring.
- Because rarer query tokens carry more weight, `train xgboost` will surface components mentioning `xgboost` above generic `train` matches.

The phrase match bonus previously applied only to the `name` field has been extended to all search fields using per-field bonus weights (`FIELD_PHRASE_BONUS`).

The `tokenize` function has been refactored to extract a reusable `uniqueTokens` helper, and a new `requiredQueryTokens` function produces stemmed, deduplicated tokens from the raw query without synonym expansion. These tokens are used for phrase and completeness checks.

Related Issue and Pull requests
Type of Change
Checklist
Test Instructions
Three new unit tests cover the added behaviors:
- `classif`: verify `classify_rows` ranks above a component with `classif` as a non-prefix substring.
- `train xgboost`: verify the component with the rare token `xgboost` ranks first.
- `train model`: verify the component matching both tokens across fields ranks above one matching only `train`.

Run the test suite with:
Additional Comments
Token weights are computed per-query using a smoothed inverse document frequency:
`1 + log((N+1) / (df+1))`, where `N` is the index size and `df` is the number of entries containing the token.
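
For reviewers who want to sanity-check the weighting, here is a minimal sketch of the smoothed IDF formula above. It is not the code from this PR: the `Entry` type, `idfWeight`, `queryTokenWeights`, and the four-entry sample index are all hypothetical, and the natural logarithm is an assumption, since the description does not state the base.

```typescript
// Hypothetical index entry: a name plus its already-tokenized terms.
type Entry = { name: string; tokens: string[] };

// Smoothed IDF as written in the description: 1 + log((N + 1) / (df + 1)).
// Assumption: natural log (Math.log); the PR does not specify the base.
function idfWeight(indexSize: number, docFreq: number): number {
  return 1 + Math.log((indexSize + 1) / (docFreq + 1));
}

// Hypothetical helper: one weight per query token, computed per query.
function queryTokenWeights(
  queryTokens: string[],
  index: Entry[]
): Record<string, number> {
  const weights: Record<string, number> = {};
  for (const token of queryTokens) {
    // df = number of entries whose token list contains this token.
    const df = index.filter((e) => e.tokens.indexOf(token) !== -1).length;
    weights[token] = idfWeight(index.length, df);
  }
  return weights;
}

// Illustrative data only: "train" is common, "xgboost" is rare.
const sampleIndex: Entry[] = [
  { name: "train_model", tokens: ["train", "model"] },
  { name: "train_xgboost", tokens: ["train", "xgboost"] },
  { name: "train_sklearn", tokens: ["train", "sklearn"] },
  { name: "load_data", tokens: ["load", "data"] },
];

const weights = queryTokenWeights(["train", "xgboost"], sampleIndex);
console.log(weights);
```

With this sample index, `xgboost` (present in one entry) receives a larger weight than `train` (present in three), which is the effect the `train xgboost` test relies on. A token that appears in every entry gets a weight of exactly 1, so the smoothing never zeroes out a match.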