feat: add strip_whitespaces and replace_regexes to DocumentCleaner #10400
Add two new parameters to DocumentCleaner:
1. strip_whitespace - removes leading/trailing whitespace using str.strip()
2. regex_replace - maps regex patterns to replacement strings

Fixes deepset-ai#8798
julian-risch
left a comment
Thank you for opening this pull request @VedantMadane ! The changes look quite good to me already.
My main change request is to keep "\f" unchanged (see the implementation of _remove_regex and my inline comment).
The only other thing I would suggest changing before we merge the PR is the parameter names. Instead of strip_whitespace, I suggest strip_whitespaces and instead of regex_replace, I suggest replace_regexes. That way, the naming is more consistent with the other parameter names (remove_extra_whitespaces, remove_regex).
Besides that, I'd like to note that remove_regex is a subset of replace_regexes and in a future version of Haystack, we could decide to deprecate and then later remove remove_regex. However, for now, I'd like to avoid a breaking change. 👍
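To illustrate the "\f" request and the subset relationship, here is a minimal sketch using only Python's `re` module. It is not the Haystack implementation: the helper name `replace_per_page` and the sample text are hypothetical. The idea is to split the text on form feeds, substitute within each page, and rejoin, so a pattern such as `\s+` (which matches "\f") cannot swallow the page breaks.

```python
import re


def replace_per_page(text: str, pattern: str, replacement: str = "") -> str:
    """Apply a regex substitution page by page so form-feed separators survive.

    Hypothetical helper for illustration only.
    """
    pages = text.split("\f")
    return "\f".join(re.sub(pattern, replacement, page) for page in pages)


doc = "page one \n\fpage two"

# Naive substitution: \s+ also matches the form feed, so the page break is lost.
naive = re.sub(r"\s+", " ", doc)
print(repr(naive))  # 'page one page two'

# Per-page substitution keeps the form feed between the two pages.
safe = replace_per_page(doc, r"\s+", " ")
print(repr(safe))  # 'page one \x0cpage two'

# Removal is the special case where the replacement is the empty string.
removed = replace_per_page(doc, r"page ")
print(repr(removed))  # 'one \n\x0ctwo'
```

With an empty replacement string the helper behaves like a removal, which is the sense in which remove_regex is a subset of replace_regexes.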
    cleaner = DocumentCleaner(
        remove_empty_lines=False,
        remove_extra_whitespaces=False,
        regex_replace={r"\[REDACTED\]": "***", r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1"},
Just a note for our team: This is quite powerful and I think we could include such an example in our documentation. That would make it easier for users to understand what the parameter can be used for.
Just let me know if you need help with adding a release note (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes) or with ruff formatting (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#run-code-quality-checks-locally). I understand this is your first Haystack PR and I'll be happy to help!
Add release notes documenting the new strip_whitespace and regex_replace parameters added to the DocumentCleaner component.
- strip_whitespace: Removes leading/trailing whitespace while preserving internal formatting
- regex_replace: Allows custom regex-based text transformations with replacement strings

Addresses review feedback requesting release notes for PR deepset-ai#10400.
julian-risch
left a comment
Looks good to me! Thank you for addressing the change requests and congratulations on your first contribution to Haystack @VedantMadane ! I will merge this pull request now and it will be part of the Haystack 2.24 release.
Summary
Fixes #8798
This PR expands the functionality of DocumentCleaner with two new parameters:
1. strip_whitespace: bool = False
When True, removes leading and trailing whitespace from document content using Python's str.strip().
Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).
2. regex_replace: dict[str, str] | None = None
A dictionary mapping regex patterns to replacement strings. This allows custom replacements instead of just removal. For example:
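The example from the original description is not preserved here. As a stand-in, the following is a minimal sketch of the behaviour described above using only Python's `re` module and `str.strip()`. It is not the DocumentCleaner implementation: the helper name `apply_regex_replace` and the sample text are hypothetical, and the pattern mapping is the one from the diff snippet quoted in the review.

```python
import re


def apply_regex_replace(text: str, regex_replace: dict[str, str]) -> str:
    """Apply each pattern -> replacement pair in insertion order.

    Hypothetical helper for illustration only.
    """
    for pattern, replacement in regex_replace.items():
        text = re.sub(pattern, replacement, text)
    return text


rules = {
    r"\[REDACTED\]": "***",                      # replace a literal marker
    r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1",     # YYYY-MM-DD -> MM/DD/YYYY
}

raw = "  Signed by [REDACTED] on 2024-01-15.\n\n  - nested item  \n"

replaced = apply_regex_replace(raw, rules)
# str.strip() touches only the two ends; the blank line and the
# indentation of the nested item inside the text are left alone.
stripped = replaced.strip()

print(repr(stripped))
# 'Signed by *** on 01/15/2024.\n\n  - nested item'
```

The backreferences in the second replacement reorder the captured groups, which is what distinguishes replacement from plain removal.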
Changes
Test plan