feat(medcat): CU-869b9n4mq Allow faster spacy tokenization #244
This PR creates the option to allow for faster spacy tokenization.

The default behaviour is to run the tokenization through the entire spacy pipeline. A lot of the pipeline has (generally) already been disabled (see `config.general.nlp.disabled_components` and its docs). But there are still a few components that do run (`['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']`), and they take a significant time to run. The results of these remaining components are used within the current medcat pipeline:
- `tok2vec` is used as an input for `tagger`
- `tagger` is used to generate `.tag_`, which we use in preprocessing / data cleaning as well as in some normalizing
- `attribute_ruler` is used to generate `.is_stop`, which we use in the NER process (i.e. to ignore stop words in multi-token spans) and in the vector context model (i.e. these won't be used for calculating the context vectors)
- `lemmatizer` is used to generate `.lemma_`, which we use in preprocessing / data cleaning as well as in some normalizing (similar to `.tag_` above)
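To make the trade-off concrete, here is a minimal sketch in plain spacy (not MedCAT code; the `en_core_web_md` model and example sentence are just illustrative assumptions) of what tokenizer-only processing skips:

```python
import spacy

nlp = spacy.load("en_core_web_md")
text = "The patients were treated for chronic kidney disease."

# Full pipeline: tok2vec, tagger, attribute_ruler and lemmatizer all run,
# so attributes like .tag_ and .lemma_ are populated (slower).
doc_full = nlp(text)
print([(t.text, t.tag_, t.lemma_, t.is_stop) for t in doc_full])

# Tokenizer-only: no pipeline components run, so it is much faster, but
# attributes produced by the skipped components are left at their defaults
# (e.g. .tag_ and .lemma_ come back empty).
doc_tok = nlp.make_doc(text)
print([(t.text, t.tag_, t.lemma_, t.is_stop) for t in doc_tok])
```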
With that said, I ran some metrics to see how much these things affect our overall performance, and here are the results:

As we can see, depending on the specific use case, we can increase throughput 2-3.5 fold. There can be a hit to performance, but it doesn't seem to be super large, at least in these specific use cases.
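For intuition about where the throughput gain comes from on the spacy side, here is a rough standalone timing sketch (this is not the benchmark behind the numbers above; the documents, counts and model name are made up for illustration):

```python
import time

import spacy

nlp = spacy.load("en_core_web_md")
docs = ["Patient presented with shortness of breath and chest pain."] * 1_000

# Full pipeline: tok2vec, tagger, attribute_ruler and lemmatizer all run.
start = time.perf_counter()
list(nlp.pipe(docs))
full_time = time.perf_counter() - start

# Tokenizer-only: no pipeline components run at all.
start = time.perf_counter()
list(nlp.tokenizer.pipe(docs))
tok_time = time.perf_counter() - start

print(f"full pipeline:  {full_time:.2f}s")
print(f"tokenizer only: {tok_time:.2f}s")
print(f"speed-up:       {full_time / tok_time:.1f}x")
```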