Bug Description
In crawl4ai/chunking_strategy.py, the NlpSentenceChunking.chunk() method returns list(set(sens)):
def chunk(self, text: str) -> list:
    from nltk.tokenize import sent_tokenize
    sentences = sent_tokenize(text)
    sens = [sent.strip() for sent in sentences]
    return list(set(sens))  # BUG: set() is unordered!
Problem
A Python set is unordered, so converting the sentences to a set and back to a list:
- Destroys sentence order: sentences are returned in arbitrary order, not document order
- Removes duplicate sentences that may be intentionally present
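
A minimal sketch of both effects. The `sens` list here is a hard-coded stand-in for the stripped output of `sent_tokenize()`, so the example runs without NLTK; the sample sentences are illustrative only.

```python
# Stand-in for the stripped sentences produced inside chunk();
# hard-coded so the sketch does not need NLTK or its tokenizer data.
sens = ["Step one.", "Repeat.", "Step two.", "Repeat."]

via_set = list(set(sens))

# The second "Repeat." has been collapsed, so the result is one sentence short.
print(len(sens), len(via_set))

# The order of via_set is not guaranteed. It can differ between interpreter
# runs because string hashing is randomized per process by default.
print(via_set)
```

Because the ordering varies from run to run, the problem may not reproduce consistently, which makes it easy to miss in a quick manual check.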
Fix
Return sens directly: return sens
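
If de-duplication was the original intent of the `set()` call, one possible alternative is to drop repeats while keeping document order. The helper below is hypothetical (it is not part of crawl4ai) and only sketches the idea, assuming first occurrences should win.

```python
def dedupe_keep_order(sens: list) -> list:
    """Hypothetical helper: drop repeated sentences, keeping first occurrences in order."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7),
    # so this removes duplicates without reordering the remaining sentences.
    return list(dict.fromkeys(sens))


sens = ["Step one.", "Repeat.", "Step two.", "Repeat."]
print(dedupe_keep_order(sens))  # ['Step one.', 'Repeat.', 'Step two.']
```

Returning `sens` unchanged remains the simpler fix when duplicates should be kept.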