bug: NlpSentenceChunking.chunk() uses list(set(sens)) which destroys sentence order #1909

@kuishou68

Bug Description

In crawl4ai/chunking_strategy.py, the NlpSentenceChunking.chunk() method returns list(set(sens)):

def chunk(self, text: str) -> list:
    from nltk.tokenize import sent_tokenize
    sentences = sent_tokenize(text)
    sens = [sent.strip() for sent in sentences]
    return list(set(sens))  # BUG: set() is unordered!

Problem

A Python set is unordered, so converting the sentences to a set and back to a list:

  1. Destroys sentence order: sentences are returned in arbitrary order, not document order
  2. Removes duplicate sentences that may be intentionally present
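
Both effects can be reproduced without nltk. The snippet below is a minimal sketch using a hand-written sentence list (the sample data is an assumption for illustration, not taken from crawl4ai):

```python
# Sample sentences standing in for the output of sent_tokenize (illustrative data).
sens = [
    "First sentence.",
    "Second sentence.",
    "First sentence.",   # intentional repeat
    "Third sentence.",
]

# This mirrors the return expression quoted in the issue.
unordered = list(set(sens))

# The repeated sentence is collapsed: four sentences go in, three come out.
print(len(sens), len(unordered))

# The surviving sentences are the same ones, but their order in the list is
# an implementation detail of set iteration, not the document order.
print(set(unordered) == set(sens))
```

Because string hashing is randomized per process by default, the order of `unordered` can also differ from one run to the next.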

Fix

Return sens directly, which keeps document order and any repeated sentences: return sens
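
A minimal sketch of the proposed behavior follows. The function name `chunk_ordered` and the `dedupe` flag are hypothetical and not part of crawl4ai; the function takes an already tokenized list so that it runs without nltk. The optional branch shows one way to drop duplicates while keeping first-occurrence order, in case deduplication was the original intent:

```python
def chunk_ordered(sentences, dedupe=False):
    """Strip each sentence and return them in document order.

    Hypothetical helper for illustration; `dedupe` is an assumed option,
    not an existing crawl4ai parameter.
    """
    sens = [sent.strip() for sent in sentences]
    if dedupe:
        # dict preserves insertion order (Python 3.7+), so this removes
        # later duplicates while keeping the first occurrence in place.
        return list(dict.fromkeys(sens))
    return sens


sample = [" Intro. ", "Details.", "Intro."]
print(chunk_ordered(sample))               # ['Intro.', 'Details.', 'Intro.']
print(chunk_ordered(sample, dedupe=True))  # ['Intro.', 'Details.']
```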

Metadata

Labels: In-progress, Bug, Root caused
