Conversation

@mart-r (Collaborator) commented Nov 25, 2025

This PR adds an option that allows for faster spacy tokenization.

The default behaviour is to run tokenization through the entire spacy pipeline. Much of that pipeline is (generally) already disabled (see `config.general.nlp.disabled_components` and its docs), but a few components still run (`['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']`), and they take a significant amount of time.
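For reference, that existing knob is just a config setting. A minimal sketch, assuming a loaded model pack (the import path and pack name are assumptions; the config path itself comes from the docs mentioned above):

```python
# Hedged sketch of the existing disabled_components knob described above.
from medcat.cat import CAT  # import path is an assumption

cat = CAT.load_model_pack("model_pack.zip")  # placeholder pack name
# Config path taken from the description; component names are illustrative,
# not the actual default list.
cat.config.general.nlp.disabled_components = ["parser", "ner", "senter"]
```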
The results of the remaining components are used within the current medcat pipeline (see the sketch after this list):

  • `tok2vec` is used as an input for `tagger`
  • `tagger` generates `.tag_`, which we use in preprocessing / data cleaning as well as in some normalizing
  • `attribute_ruler` generates `.is_stop`, which we use in the NER process (i.e. to ignore stopwords in multi-token spans) and in the vector context model (i.e. stopwords are not used when calculating context vectors)
  • `lemmatizer` generates `.lemma_`, which we use in preprocessing / data cleaning as well as in some normalizing (similar to `.tag_` above)
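To make those dependencies concrete, below is a minimal sketch in plain spacy (not the MedCAT API); the model name and sample text are placeholders:

```python
# Minimal sketch (plain spacy, not the MedCAT API) of the attributes the
# remaining components populate, versus a tokenizer-only run.
# Assumption: the standard small English model is installed
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The patients were treated for chronic migraines"

# Full pipeline run: tok2vec -> tagger -> attribute_ruler -> lemmatizer
doc = nlp(text)
for tok in doc:
    # .tag_, .is_stop and .lemma_ are the attributes medcat consumes
    print(tok.text, tok.tag_, tok.is_stop, tok.lemma_)

# Tokenizer-only run: no pipeline components execute, so this is much
# faster, but .tag_ and .lemma_ come back empty.
doc_fast = nlp.make_doc(text)
for tok in doc_fast:
    print(tok.text, tok.tag_, tok.is_stop, tok.lemma_)
```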

With that said, I ran some metrics to see how much these components affect our overall performance, and here are the results:

| Dataset | Configuration | Precision | Recall | F1 | Time (s) |
|---|---|---|---|---|---|
| COMETA | Normal spacy | 0.9245 | 0.4521 | 0.6072 | 65.98 |
| COMETA | No pipe | 0.9251 | 0.4388 | 0.5804 | 19.24 |
| 2023 Linking Challenge | Normal spacy | 0.5353 | 0.3337 | 0.4112 | 77.60 |
| 2023 Linking Challenge | No pipe | 0.5290 | 0.3259 | 0.4033 | 35.16 |

As we can see, depending on the specific use case, this can increase throughput 2-3.5 fold. There is a hit to accuracy, but it does not appear to be very large, at least in these specific use cases. An illustrative way to reproduce the timing difference is sketched below.
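For anyone who wants a rough sanity check of the speedup on their own texts, here is an illustrative timing harness; it is not the benchmark behind the table above, and the model and texts are placeholders:

```python
# Illustrative timing harness, not the benchmark used for the table above.
import time
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Patient presents with acute chest pain."] * 2000

# Full pipeline, batched via nlp.pipe
start = time.perf_counter()
for _ in nlp.pipe(texts):
    pass
full = time.perf_counter() - start

# Tokenizer-only via make_doc
start = time.perf_counter()
for text in texts:
    nlp.make_doc(text)
fast = time.perf_counter() - start

print(f"full pipeline: {full:.2f}s, tokenizer-only: {fast:.2f}s, "
      f"speedup: {full / fast:.1f}x")
```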

