Conversation

@mart-r (Collaborator) commented Nov 25, 2025

This PR adds an option that allows for faster spacy tokenization.

The default behaviour is to run tokenization through the entire spacy pipeline. Much of that pipeline is (generally) already disabled (see `config.general.nlp.disabled_components` and its docs), but a few components still run (`['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']`), and they take a significant amount of time.
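For reference, that existing knob is just a config setting. A minimal sketch, assuming a loaded model pack (the import path and pack name are assumptions; the config path itself comes from the docs mentioned above):

```python
# Hedged sketch of the existing disabled_components knob described above.
from medcat.cat import CAT  # import path is an assumption

cat = CAT.load_model_pack("model_pack.zip")  # placeholder pack name
# Config path taken from the description; component names are illustrative,
# not the actual default list.
cat.config.general.nlp.disabled_components = ["parser", "ner", "senter"]
```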
The results of the remaining components are used within the current medcat pipeline (see the sketch after this list):

  • `tok2vec` is used as an input for `tagger`
  • `tagger` generates `.tag_`, which we use in preprocessing / data cleaning as well as in some normalizing
  • `attribute_ruler` generates `.is_stop`, which we use in the NER process (i.e. to ignore stopwords in multi-token spans) and in the vector context model (i.e. stopwords are not used when calculating context vectors)
  • `lemmatizer` generates `.lemma_`, which we use in preprocessing / data cleaning as well as in some normalizing (similar to `.tag_` above)
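To make those dependencies concrete, below is a minimal sketch in plain spacy (not the MedCAT API); the model name and sample text are placeholders:

```python
# Minimal sketch (plain spacy, not the MedCAT API) of the attributes the
# remaining components populate, versus a tokenizer-only run.
# Assumption: the standard small English model is installed
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The patients were treated for chronic migraines"

# Full pipeline run: tok2vec -> tagger -> attribute_ruler -> lemmatizer
doc = nlp(text)
for tok in doc:
    # .tag_, .is_stop and .lemma_ are the attributes medcat consumes
    print(tok.text, tok.tag_, tok.is_stop, tok.lemma_)

# Tokenizer-only run: no pipeline components execute, so this is much
# faster, but .tag_ and .lemma_ come back empty.
doc_fast = nlp.make_doc(text)
for tok in doc_fast:
    print(tok.text, tok.tag_, tok.is_stop, tok.lemma_)
```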

With that said, I ran some metrics to see how much these components affect our overall performance, and here are the results:

| Dataset | Configuration | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|
| COMETA | Normal spacy | 0.9245 | 0.4521 | 0.6072 | 65.98 |
| COMETA | No pipe | 0.9251 | 0.4388 | 0.5804 | 19.24 |
| 2023 Linking Challenge | Normal spacy | 0.5353 | 0.3337 | 0.4112 | 77.60 |
| 2023 Linking Challenge | No pipe | 0.5290 | 0.3259 | 0.4033 | 35.16 |

As we can see, depending on the specific use case, this can increase throughput 2-3.5 fold. There is a hit to accuracy, but it does not appear to be very large, at least in these specific use cases. An illustrative way to reproduce the timing difference is sketched below.
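For anyone who wants a rough sanity check of the speedup on their own texts, here is an illustrative timing harness; it is not the benchmark behind the table above, and the model and texts are placeholders:

```python
# Illustrative timing harness, not the benchmark used for the table above.
import time
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Patient presents with acute chest pain."] * 2000

# Full pipeline, batched via nlp.pipe
start = time.perf_counter()
for _ in nlp.pipe(texts):
    pass
full = time.perf_counter() - start

# Tokenizer-only via make_doc
start = time.perf_counter()
for text in texts:
    nlp.make_doc(text)
fast = time.perf_counter() - start

print(f"full pipeline: {full:.2f}s, tokenizer-only: {fast:.2f}s, "
      f"speedup: {full / fast:.1f}x")
```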

