@VedantMadane VedantMadane commented Jan 17, 2026

Summary

Fixes #8798

This PR expands the functionality of DocumentCleaner with two new parameters:

1. strip_whitespace: bool = False

When True, removes leading and trailing whitespace from document content using Python's str.strip().
Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).
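The difference between edge-only trimming and whitespace collapsing can be sketched in plain Python. This is an illustration only: the `re.sub` line is a hypothetical stand-in for whitespace collapsing in general, not the code DocumentCleaner actually runs, and the sample text is made up.

```python
import re

# Hypothetical markdown-like content with padding at both ends.
text = "  \n# Title\n\n    indented code\n\nBody   text.  \n"

# Edge-only trimming, as str.strip() does: internal newlines,
# indentation, and repeated spaces are left untouched.
stripped = text.strip()

# Assumed stand-in for whitespace collapsing: every run of whitespace
# becomes a single space, which flattens the markdown structure.
collapsed = re.sub(r"\s+", " ", text).strip()

print(repr(stripped))
print(repr(collapsed))
```

In this sketch the stripped result keeps the blank lines and the four-space indent that markdown relies on, while the collapsed result is a single line.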

2. regex_replace: dict[str, str] | None = None

A dictionary mapping regex patterns to replacement strings. This allows custom replacements instead of just removal. For example:

  • {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline
  • {r'\s{2,}': ' '} replaces any run of two or more whitespace characters (spaces, tabs, or newlines) with a single space
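A minimal sketch of how such a mapping could be applied, using only `re.sub` from the standard library. The helper name `apply_regex_replace` is hypothetical, and applying the patterns in dictionary insertion order is an assumption; the PR's `_replace_regex()` method may differ.

```python
import re
from typing import Optional


def apply_regex_replace(text: str, regex_replace: Optional[dict[str, str]]) -> str:
    """Apply each pattern -> replacement pair in turn (hypothetical helper)."""
    if not regex_replace:
        return text
    # Assumption: patterns run in dict insertion order, each on the
    # output of the previous substitution.
    for pattern, replacement in regex_replace.items():
        text = re.sub(pattern, replacement, text)
    return text


sample = "a\n\n\nb   c"

# Newlines are collapsed first, then the run of spaces.
newlines_first = apply_regex_replace(sample, {r"\n\n+": "\n", r"\s{2,}": " "})

# Reversed order: \s{2,} also matches the newline run, so the line
# break is already gone when the second pattern runs.
spaces_first = apply_regex_replace(sample, {r"\s{2,}": " ", r"\n\n+": "\n"})

print(repr(newlines_first))
print(repr(spaces_first))
```

Under these assumptions the order of the dictionary entries changes the result, because `\s` matches newlines as well as spaces.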

Changes

  • Added strip_whitespace parameter to DocumentCleaner.__init__()
  • Added regex_replace parameter to DocumentCleaner.__init__()
  • Added _replace_regex() method for custom regex replacements
  • Added comprehensive unit tests for both new features

Test plan

  • Added unit tests for strip_whitespace
  • Added unit tests for regex_replace with single/multiple patterns
  • Added test for combined usage of both features
  • Added test for initialization with new parameters

Add two new parameters to DocumentCleaner:

1. strip_whitespace - removes leading/trailing whitespace using str.strip()

2. regex_replace - maps regex patterns to replacement strings

Fixes deepset-ai#8798
@VedantMadane VedantMadane requested a review from a team as a code owner January 17, 2026 11:28
@VedantMadane VedantMadane requested review from julian-risch and removed request for a team January 17, 2026 11:28

vercel bot commented Jan 17, 2026

@VedantMadane is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.


CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the topic:tests and type:documentation labels Jan 17, 2026
@VedantMadane VedantMadane marked this pull request as draft January 17, 2026 11:36
@VedantMadane VedantMadane marked this pull request as ready for review January 17, 2026 11:57
Development

Successfully merging this pull request may close these issues.

Expand the functionality of the DocumentCleaner

2 participants