feat(medcat): CU-869b9h7y6 Add faster linker #243
+125
−0
This PR adds a faster linker to the mix.
This faster linker (`primary_name_only_linker`) is designed to link names only if:
a) There's 1 suitable concept
b) There's 1 concept that considers the name a primary name
This results in faster linking. But it's also likely to reduce performance in cases where disambiguation is needed.
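To make the rule concrete, here is a minimal sketch of the decision logic. It is not the code from this PR: the function name, the `name2cuis` and `primary_names` dictionaries, and the example concept IDs are all hypothetical, and it assumes that conditions a) and b) are alternatives checked in that order.

```python
from typing import Dict, List, Optional, Set


def link_primary_name_only(
    name: str,
    name2cuis: Dict[str, List[str]],
    primary_names: Dict[str, Set[str]],
) -> Optional[str]:
    """Return a concept ID for `name`, or None if disambiguation would be needed.

    Hypothetical illustration only; the data structures are assumptions,
    not MedCAT's actual API.
    """
    candidates = name2cuis.get(name, [])
    # a) exactly one suitable concept: link it directly
    if len(candidates) == 1:
        return candidates[0]
    # b) exactly one candidate treats this name as a primary name
    primary = [cui for cui in candidates if name in primary_names.get(cui, set())]
    if len(primary) == 1:
        return primary[0]
    # Otherwise the name is ambiguous (or unknown) and is left unlinked
    return None


# Toy data (made up for illustration)
name2cuis = {
    "kidney failure": ["C1"],
    "cold": ["C2", "C3"],
    "discharge": ["C4", "C5"],
}
primary_names = {
    "C2": {"cold"},
    "C3": {"common cold"},
    "C4": {"discharge"},
    "C5": {"discharge"},
}

linked = {
    n: link_primary_name_only(n, name2cuis, primary_names)
    for n in ["kidney failure", "cold", "discharge", "unknown"]
}
```

In this sketch, no context-based scoring happens at all, which is where the speed-up would come from. It is also why names like "discharge" above, which need disambiguation, end up unlinked.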
I ran a few performance / speed tests to look at the throughput and performance tradeoffs:
As we can see, for the COMETA dataset, there's a clear benefit in running the faster components (tested for both the regex tokenizer and this new faster linker). You can improve throughput by an order of magnitude! And the performance cost isn't that big (up to around 10% in recall, no change in precision).
However, the Linking Challenge dataset shows that the situation is quite a bit more nuanced. In this case, the regex tokenizer results in slower execution than its spacy counterpart. I'm not entirely sure what the underlying cause is here (because the `regex` tokenizer creates around 25% fewer tokens across the dataset). But it's a good example of having to tailor the config to the specific use case.
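For anyone wanting to check this kind of tradeoff on their own data, a rough harness along these lines could compare token counts and wall-clock time per tokenizer. This is only a sketch: `benchmark_tokenizer` is a hypothetical helper, and the two tokenizers below are simple stand-ins, not the regex or spacy tokenizers used in MedCAT.

```python
import re
import time
from typing import Callable, Dict, List


def benchmark_tokenizer(
    tokenize: Callable[[str], List[str]], docs: List[str]
) -> Dict[str, float]:
    """Run `tokenize` over all docs and report doc count, token count, and seconds."""
    total_tokens = 0
    start = time.perf_counter()
    for doc in docs:
        total_tokens += len(tokenize(doc))
    elapsed = time.perf_counter() - start
    return {"docs": len(docs), "tokens": total_tokens, "seconds": elapsed}


# Stand-in tokenizers (assumptions for illustration only)
WORD_RE = re.compile(r"\w+")


def regex_tokenize(text: str) -> List[str]:
    return WORD_RE.findall(text)


def whitespace_tokenize(text: str) -> List[str]:
    return text.split()


docs = ["Patient has type-2 diabetes.", "No signs of the common cold."]

regex_stats = benchmark_tokenizer(regex_tokenize, docs)
whitespace_stats = benchmark_tokenizer(whitespace_tokenize, docs)
```

Token count and elapsed time are reported separately on purpose, since (as seen above) fewer tokens does not necessarily mean faster end-to-end execution. For real measurements, use a representative sample of the target dataset and repeat the run a few times.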