Fix #13950: tokenizer issue: special spans cause every subsequent span to not be cached by jberg5 · Pull Request #13951 · explosion/spaCy

jberg5 · 2026-03-31T21:31:05Z

Description

Once a span containing a special case appeared (like "can't" or any nonstandard contraction), has_special would be set and stay set for every subsequent span, even spans with no special cases in them. Those spans were never cached, and performance suffered as a result. Fortunately the fix is pretty straightforward, and we immediately see some decent tokenizer speedups.
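
To make the failure mode concrete, here is a minimal pure-Python sketch of a flag that leaks across loop iterations. It is a toy model, not spaCy's Cython code: `tokenize`, `SPECIAL_CASES`, and `reset_flag_per_span` are hypothetical names, and it assumes the fix amounts to giving each span a fresh flag.

```python
# Toy model only: these names are illustrative, not spaCy's internals.
SPECIAL_CASES = {"can't": ["ca", "n't"]}


def tokenize(text, cache, reset_flag_per_span):
    """Tokenize whitespace-separated spans; return how many took the slow path."""
    has_special = False
    slow_path_count = 0
    for span in text.split():
        if reset_flag_per_span:
            # Assumed shape of the fix: each span starts with a clean flag.
            has_special = False
        if span in SPECIAL_CASES:
            # Special cases are resolved by direct lookup and never cached.
            has_special = True
            continue
        if span in cache:
            continue
        slow_path_count += 1  # stands in for the expensive affix splitting
        if not has_special:
            cache[span] = [span]
    return slow_path_count


text = "can't stop the music stop the music"

buggy_cache = {}
buggy_cold = tokenize(text, buggy_cache, reset_flag_per_span=False)
buggy_warm = tokenize(text, buggy_cache, reset_flag_per_span=False)

fixed_cache = {}
fixed_cold = tokenize(text, fixed_cache, reset_flag_per_span=True)
fixed_warm = tokenize(text, fixed_cache, reset_flag_per_span=True)

print(buggy_cold, buggy_warm)  # 6 6: nothing after "can't" is ever cached
print(fixed_cold, fixed_warm)  # 3 0: repeated spans hit the cache
```

With the leaked flag, a single special span at the start keeps the cache empty for the whole text, which matches the worst case described in this PR.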

To illustrate the worst-case scenario: if you stick "can't" at the beginning of Huckleberry Finn, it takes 691ms to tokenize on my MacBook Pro. Reusing the same tokenizer on the same text takes another 637ms. After the fix, the first pass takes 230ms, and the second pass (everything cached now) takes 87ms. That is roughly a 3x speedup on the cold run and a 7x speedup on the warm run.

import spacy

text = open("/tmp/huck_finn.txt").read()
tok = spacy.blank("en").tokenizer
doc = tok(text)  # cold run; calling tok(text) again gives the warm run
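
The cold and warm timings could be collected with a small harness along these lines. This is a sketch, not the script used for the numbers in this PR: `time_cold_and_warm` is a hypothetical helper, and the stand-in workload below only keeps the example runnable without spaCy.

```python
import time


def time_cold_and_warm(fn, arg):
    """Call fn(arg) twice and return (cold_ms, warm_ms)."""
    timings_ms = []
    for _ in range(2):
        start = time.perf_counter()
        fn(arg)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms[0], timings_ms[1]


# Stand-in workload; any callable that takes the text works here.
cold_ms, warm_ms = time_cold_and_warm(str.split, "can't stop the music " * 1000)
print(f"cold: {cold_ms:.2f}ms, warm: {warm_ms:.2f}ms")
```

With spaCy installed, passing `spacy.blank("en").tokenizer` and the book text in place of the stand-in would time the first and second pass over the same tokenizer instance.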

Important note: in the average case of tokenizing a diverse, real-world corpus (like a bunch of tweets or something), the impact is more modest, because most words eventually get cached anyway: sooner or later they appear in a document where they don't follow a special span.

Types of change

Bug fix.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

jberg5 · 2026-03-31T21:35:26Z

spacy/tokenizer.pxd

 cdef class Tokenizer:
    cdef Pool mem
-    cdef PreshMap _cache
+    cdef readonly PreshMap _cache  # readonly so tests can check state


I don't love exposing private state so that tests can read it, but couldn't think of an easy alternative for testing the behavior I wanted to test. Open to suggestions!

Expose _cache and test cache state

39c32bb

jberg5 changed the title ~~Expose _cache and test cache state~~ Fix #13950: tokenizer issue: special spans cause every subsequent span to not be cached Mar 31, 2026

fix

b0834be

jberg5 wants to merge 2 commits into explosion:master from jberg5:fix-has-special
