feat: add strip_whitespaces and replace_regexes to DocumentCleaner#10400

Merged
julian-risch merged 7 commits into deepset-ai:main from VedantMadane:feat/document-cleaner-expansion
Feb 2, 2026
@VedantMadane VedantMadane commented Jan 17, 2026

Summary

Fixes #8798

This PR expands the functionality of DocumentCleaner with two new parameters:

1. strip_whitespace: bool = False

When True, removes leading and trailing whitespace from document content using Python's str.strip().
Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).

2. regex_replace: dict[str, str] | None = None

A dictionary mapping regex patterns to replacement strings. This allows custom replacements instead of just removal. For example:

  • {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline
  • {r'\s{2,}': ' '} replaces any run of two or more whitespace characters (including newlines and tabs, since \s matches all whitespace) with a single space
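
The two behaviours described above can be approximated with the standard library alone. The sketch below is only an illustration: the helper names strip_only and apply_replacements and the sample text are hypothetical, and the actual DocumentCleaner implementation may differ.

```python
import re


def strip_only(text: str) -> str:
    """Remove leading/trailing whitespace; internal whitespace is untouched."""
    return text.strip()


def apply_replacements(text: str, patterns: dict[str, str]) -> str:
    """Apply each regex -> replacement pair in dict insertion order."""
    for pattern, replacement in patterns.items():
        text = re.sub(pattern, replacement, text)
    return text


# Hypothetical markdown-like input with padding and nested indentation.
markdown = "\n\n  # Title\n\n\n\nSome   text\n  - nested item\n\n"

# Only the outer whitespace goes away; the indented list item keeps its indent.
stripped = strip_only(markdown)

# Runs of two or more newlines collapse to a single newline.
collapsed = apply_replacements(stripped, {r"\n\n+": "\n"})

# \s also matches newlines, so this pattern joins lines as well as spaces.
joined = apply_replacements("a  \n b", {r"\s{2,}": " "})
```

Because the patterns are applied in insertion order, a later pattern sees the output of an earlier one, which matters when patterns overlap.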

Changes

  • Added strip_whitespace parameter to DocumentCleaner.__init__()
  • Added regex_replace parameter to DocumentCleaner.__init__()
  • Added _replace_regex() method for custom regex replacements
  • Added comprehensive unit tests for both new features

Test plan

  • Added unit tests for strip_whitespace
  • Added unit tests for regex_replace with single/multiple patterns
  • Added test for combined usage of both features
  • Added test for initialization with new parameters

Add two new parameters to DocumentCleaner:

1. strip_whitespace - removes leading/trailing whitespace using str.strip()

2. regex_replace - maps regex patterns to replacement strings

Fixes deepset-ai#8798
@VedantMadane VedantMadane requested a review from a team as a code owner January 17, 2026 11:28
@VedantMadane VedantMadane requested review from julian-risch and removed request for a team January 17, 2026 11:28
CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the topic:tests and type:documentation labels Jan 17, 2026
@VedantMadane VedantMadane marked this pull request as draft January 17, 2026 11:36
@VedantMadane VedantMadane marked this pull request as ready for review January 17, 2026 11:57
@julian-risch julian-risch left a comment

Thank you for opening this pull request @VedantMadane ! The changes look quite good to me already.
My main change request is to keep "\f" unchanged (see the implementation of _remove_regex and my comment).
The only other thing I would suggest changing before we merge the PR is the parameter names. Instead of strip_whitespace, I suggest strip_whitespaces, and instead of regex_replace, I suggest replace_regexes. That way, the naming is more consistent with the other parameter names (remove_extra_whitespaces, remove_regex).
Besides that, I'd like to note that remove_regex is a subset of replace_regexes and in a future version of Haystack, we could decide to deprecate and then later remove remove_regex. However, for now, I'd like to avoid a breaking change. 👍
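
Two points from this review can be sketched in plain Python: a removal is just a replacement with an empty string, and form feed ("\f") page separators survive if the text is split on them before the patterns are applied. This is a minimal sketch under those assumptions, not the merged Haystack code; replace_per_page and remove_per_page are hypothetical names.

```python
import re


def replace_per_page(text: str, patterns: dict[str, str]) -> str:
    """Apply replacements page by page so form feed separators are kept."""
    pages = text.split("\f")
    cleaned = []
    for page in pages:
        for pattern, replacement in patterns.items():
            page = re.sub(pattern, replacement, page)
        cleaned.append(page)
    return "\f".join(cleaned)


def remove_per_page(text: str, pattern: str) -> str:
    """Removal expressed as a replacement with the empty string."""
    return replace_per_page(text, {pattern: ""})


two_pages = "Page 1  body\fPage 2  body"

# \s matches the form feed, so a naive substitution merges the two pages.
naive = re.sub(r"\s+", " ", two_pages)

# Splitting first keeps the page boundary intact.
per_page = replace_per_page(two_pages, {r"\s+": " "})

digits_removed = remove_per_page("foo123\fbar456", r"\d+")
```

The difference between naive and per_page is the reason a whitespace-matching pattern needs the split: without it, page boundaries are silently lost.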

cleaner = DocumentCleaner(
    remove_empty_lines=False,
    remove_extra_whitespaces=False,
    regex_replace={r"\[REDACTED\]": "***", r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1"},
)

Just a note for our team: This is quite powerful and I think we could include such an example in our documentation. That would make it easier for users to understand what the parameter can be used for.
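
To make the effect of the two patterns in the reviewed snippet concrete, here is what they do when applied with plain re.sub. The input sentence is a made-up example, and the loop is only an approximation of how the component might apply the dictionary.

```python
import re

# The same pattern -> replacement pairs as in the reviewed test snippet.
patterns = {
    r"\[REDACTED\]": "***",
    # Capture year, month, day and reorder them via backreferences.
    r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1",
}

# Hypothetical input text, not taken from the PR.
text = "Report [REDACTED] filed on 2026-01-17."

for pattern, replacement in patterns.items():
    text = re.sub(pattern, replacement, text)
```

After the loop, text is "Report *** filed on 01/17/2026.": the literal marker is masked and the ISO date is rewritten as month/day/year.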

Add release notes documenting the new strip_whitespace and regex_replace
parameters added to the DocumentCleaner component.

- strip_whitespace: Removes leading/trailing whitespace while preserving internal formatting
- regex_replace: Allows custom regex-based text transformations with replacement strings

Addresses review feedback requesting release notes for PR deepset-ai#10400.
@julian-risch julian-risch left a comment

Looks good to me! Thank you for addressing the change requests and congratulations on your first contribution to Haystack @VedantMadane ! I will merge this pull request now and it will be part of the Haystack 2.24 release.

@julian-risch julian-risch changed the title feat: add strip_whitespace and regex_replace to DocumentCleaner feat: add strip_whitespaces and replace_regexes to DocumentCleaner Feb 2, 2026
@julian-risch julian-risch enabled auto-merge (squash) February 2, 2026 11:21
@julian-risch julian-risch merged commit f0987de into deepset-ai:main Feb 2, 2026
19 of 20 checks passed
Labels

topic:tests, type:documentation

Development

Successfully merging this pull request may close these issues.

Expand the functionality of the DocumentCleaner

3 participants