Fix NlpSentenceChunking destroying sentence order #1911

Closed
SamSi0322 wants to merge 1 commit into unclecode:main from SamSi0322:fix/sentence-chunking-order
@SamSi0322
Summary

NlpSentenceChunking.chunk() used list(set(sens)) to deduplicate sentences, but Python sets are unordered, so this discarded the document's sentence order and returned the chunks in arbitrary order.

Fix: Replaced it with list(dict.fromkeys(sens)), which deduplicates while preserving insertion order (guaranteed for dict since Python 3.7).
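
For illustration, here is a minimal sketch of the difference between the two deduplication approaches. The sample sentences and the helper name `dedupe_preserving_order` are hypothetical and not taken from the crawl4ai codebase; only the two expressions `list(set(sens))` and `list(dict.fromkeys(sens))` come from this PR.

```python
def dedupe_preserving_order(sentences):
    """Drop repeated sentences, keeping the first occurrence of each in place.

    Hypothetical helper for illustration; the PR inlines the expression.
    """
    # dict keys are unique and remember insertion order (Python 3.7+),
    # so the first occurrence of every sentence keeps its position.
    return list(dict.fromkeys(sentences))


# Made-up input with one repeated sentence.
sens = [
    "First sentence.",
    "Second sentence.",
    "First sentence.",
    "Third sentence.",
]

ordered = dedupe_preserving_order(sens)
# ordered == ["First sentence.", "Second sentence.", "Third sentence."]

# The old approach: same unique sentences, but their order is not guaranteed
# and can vary between interpreter runs because of string hash randomization.
unordered = list(set(sens))

print(ordered)
print(unordered)
```

Both approaches return the same set of unique sentences; only the `dict.fromkeys` version guarantees that they come back in document order.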

Closes #1909

list(set(sens)) is unordered, so sentences were returned in arbitrary
order instead of document order. Replaced with dict.fromkeys(), which
deduplicates while preserving insertion order.

Closes unclecode#1909
@ntohidi
Collaborator

ntohidi commented Apr 11, 2026

Thanks for your contribution. While fixing this, I also found that NlpSentenceChunking.__init__() had a broken re-import (from crawl4ai.le.legacy.model_loader import ...) that shadows the working top-level import, so I've opened a new PR, #1913.

@ntohidi ntohidi closed this Apr 11, 2026
Development

Successfully merging this pull request may close these issues.

bug: NlpSentenceChunking.chunk() uses list(set(sens)) which destroys sentence order