
Fix #13950: tokenizer issue: special spans cause every subsequent span to not be cached #13951

Open

jberg5 wants to merge 2 commits into explosion:master from jberg5:fix-has-special

Conversation


@jberg5 jberg5 commented Mar 31, 2026

Description

Fixes #13950

Once a span with a special case appeared (like "can't" or any nonstandard contraction), has_special would be set and stay set for every subsequent span, even ones that weren't special. Those spans were never cached, and performance suffered as a result. Fortunately the fix is straightforward, and it yields an immediate tokenizer speedup.
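The bug pattern can be sketched with a toy cache. This is a simplified illustration, not spaCy's actual Cython code: the `SPECIALS` table, the `tokenize` function, and its control flow are invented for demonstration; only the sticky `has_special` flag mirrors the real bug.

```python
# Toy illustration of the caching bug: has_special was set when a special
# case appeared and never reset, so every later span skipped the cache.
SPECIALS = {"can't": ["ca", "n't"]}

def tokenize(spans, cache, buggy=True):
    has_special = False
    for span in spans:
        if not buggy:
            has_special = False  # the fix: reset the flag for each span
        if span in SPECIALS:
            has_special = True   # special cases themselves must not be cached
            continue
        if not has_special:
            cache[span] = span.split()

cache = {}
tokenize(["can't", "hello", "world"], cache, buggy=True)
# with the bug, "hello" and "world" never reach the cache
assert cache == {}

cache = {}
tokenize(["can't", "hello", "world"], cache, buggy=False)
assert set(cache) == {"hello", "world"}
```

With the buggy flow, a single special span at the start of a document poisons caching for the entire rest of the text, which is exactly the worst case benchmarked below.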

To illustrate the worst case: if you stick "can't" at the beginning of Huckleberry Finn, tokenizing it takes 691ms on my MacBook Pro, and reusing the same tokenizer on the same text takes another 637ms. After the fix, the first pass takes 230ms, and the repeat pass (everything cached now) takes 87ms: a 3x speedup on the cold run and a 7x speedup on the warm run.

import spacy

text = open("/tmp/huck_finn.txt").read()
tok = spacy.blank("en").tokenizer
doc = tok(text)

Important note: in the average case of tokenizing a diverse, real-world corpus (like a collection of tweets), the impact is more modest, because most words eventually appear in a document where they don't follow a special span and so end up cached anyway.

Types of change

Bug fix.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@jberg5 jberg5 changed the title Expose _cache and test cache state Fix #13950: tokenizer issue: special spans cause every subsequent span to not be cached Mar 31, 2026
 cdef class Tokenizer:
     cdef Pool mem
-    cdef PreshMap _cache
+    cdef readonly PreshMap _cache  # readonly so tests can check state
Author


I don't love exposing private state so that tests can read it, but couldn't think of an easy alternative for testing the behavior I wanted to test. Open to suggestions!



Development

Successfully merging this pull request may close these issues.

Performance/caching issue: tokenizer fails to reset has_special flag after encountering special span, effectively disabling caching
