@mart-r (Collaborator) commented Nov 25, 2025

This PR adds a faster linker to the mix.

This faster linker (primary_name_only_linker) is designed to link names only if
a) There's 1 suitable concept
b) There's 1 concept that considers the name a primary name

This results in faster linking, but it's also likely to reduce linking performance (mainly recall) in cases where disambiguation is needed.
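The rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `NameIndex` structure, the function name `link_primary_name_only`, and the placeholder concept IDs are all hypothetical, and the sketch assumes that conditions a) and b) are alternatives (either one is enough to link).

```python
from typing import Dict, Optional

# Hypothetical lookup: name -> {concept ID: True if the name is primary for it}.
# Illustrative structure only, not the library's actual data model.
NameIndex = Dict[str, Dict[str, bool]]


def link_primary_name_only(name: str, index: NameIndex) -> Optional[str]:
    """Return a concept ID only when no disambiguation is needed, else None."""
    candidates = index.get(name, {})
    # Rule (a): exactly one suitable concept for this name.
    if len(candidates) == 1:
        return next(iter(candidates))
    # Rule (b): exactly one concept treats this name as a primary name.
    primary = [cui for cui, is_primary in candidates.items() if is_primary]
    if len(primary) == 1:
        return primary[0]
    # Ambiguous or unknown name: skip it instead of disambiguating by context.
    return None


# Placeholder concept IDs, made up for the example.
index: NameIndex = {
    "aspirin": {"C1": True},             # one candidate -> linked
    "cold": {"C2": True, "C3": False},   # one primary name -> linked
    "ms": {"C4": False, "C5": False},    # ambiguous -> skipped
}

for name in ("aspirin", "cold", "ms", "unknown"):
    print(name, "->", link_primary_name_only(name, index))
```

Skipping ambiguous names avoids the context-based disambiguation step entirely, which is where the speedup would come from, and it is also why recall is expected to drop when disambiguation matters.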

I ran a few performance / speed tests to look at the throughput and performance tradeoffs:

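| Dataset | Configuration | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|
| COMETA | Spacy Vector context | 0.9245 | 0.4521 | 0.6072 | 68.16 |
| COMETA | Spacy Faster linker | 0.9266 | 0.4225 | 0.5804 | 51.64 |
| COMETA | Regex Vector context | 0.9130 | 0.4136 | 0.5693 | 30.54 |
| COMETA | Regex Faster linker | 0.9205 | 0.4108 | 0.5681 | 6.21 |
| 2023 Linking Challenge | Spacy Vector context | 0.5353 | 0.3337 | 0.4112 | 75.40 |
| 2023 Linking Challenge | Spacy Faster linker | 0.5934 | 0.2873 | 0.3871 | 48.05 |
| 2023 Linking Challenge | Regex Vector context | 0.4522 | 0.3162 | 0.3722 | 117.55 |
| 2023 Linking Challenge | Regex Faster linker | 0.5091 | 0.2862 | 0.3664 | 82.61 |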

As we can see, for the COMETA dataset, there's a clear benefit in running the faster components (tested for both the regex tokenizer and this new faster linker). You can improve throughput by an order of magnitude! And the performance cost isn't that big (up to around 10% lower recall, with essentially no change in precision).
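For reference, the "order of magnitude" and "around 10%" figures can be reproduced from the COMETA rows of the table with a few lines of arithmetic. The dictionary keys and the `tradeoff` helper are names made up for this sketch, and the recall drop is computed as a relative change, which is an assumption about how the figure was meant.

```python
# (recall, time in seconds) for COMETA, copied from the results table.
cometa = {
    "spacy_vector": (0.4521, 68.16),
    "spacy_faster": (0.4225, 51.64),
    "regex_vector": (0.4136, 30.54),
    "regex_faster": (0.4108, 6.21),
}


def tradeoff(baseline, candidate, results=cometa):
    """Return (speedup factor, relative recall drop) of candidate vs baseline."""
    base_recall, base_time = results[baseline]
    cand_recall, cand_time = results[candidate]
    speedup = base_time / cand_time
    recall_drop = (base_recall - cand_recall) / base_recall
    return speedup, recall_drop


speedup, recall_drop = tradeoff("spacy_vector", "regex_faster")
print(f"speedup: {speedup:.1f}x, relative recall drop: {recall_drop:.1%}")
# speedup: 11.0x, relative recall drop: 9.1%
```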

However, the Linking Challenge dataset shows that the situation is quite a bit more nuanced. In this case, the regex tokenizer results in slower execution than its spacy counterpart. I'm not entirely sure what the underlying cause is here (because the regex tokenizer creates around 25% fewer tokens across the dataset). But it's a good example of having to tailor the config to the specific use case.

@tomolopolis (Member)

Task linked: CU-869b9h7y6 Add simple/fast linker
