feat: add comprehensive data sanitization to prevent null bytes #17775
Closed
Classic298 wants to merge 7 commits into open-webui:dev
Conversation
This commit introduces a robust, centralized sanitization mechanism to prevent null bytes (`\u0000`) from being stored in the database. These characters were causing `UntranslatableCharacter` errors during search operations in PostgreSQL deployments.

The solution consists of a new centralized utility, `data_sanitizer.py`, which provides a `sanitize_data` function that recursively traverses data structures (dictionaries, lists, and strings) and removes null bytes.

To apply this sanitization broadly and automatically, this change introduces custom SQLAlchemy `TypeDecorator` classes, `SanitizedText` and `SanitizedJSON`. These types wrap the standard `Text` and `JSON` types and apply the sanitization logic before the data is sent to the database.

These new sanitized types have been applied to all relevant columns in the database models, including `Chat`, `Prompt`, `File`, `Message`, `User`, and `Knowledge`, ensuring that all user-generated content and external data are cleaned proactively. This approach provides a comprehensive, long-term fix for the issue by ensuring data integrity at the ORM layer.
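For readers without the diff open, the recursive traversal described above might look roughly like the sketch below. This is not the PR's actual `data_sanitizer.py`. Only the function name `sanitize_data` comes from the description. The `NULL_BYTE` constant, the choice to also clean dictionary keys, and the choice to return a new structure instead of mutating the input are assumptions made for illustration.

```python
from typing import Any

# Assumption: the only character stripped is U+0000, as named in the description.
NULL_BYTE = "\u0000"


def sanitize_data(data: Any) -> Any:
    """Return a copy of `data` with null bytes removed from every string.

    Strings are cleaned directly; dicts and lists are walked recursively.
    Any other type (int, None, bool, ...) is returned unchanged.
    """
    if isinstance(data, str):
        return data.replace(NULL_BYTE, "")
    if isinstance(data, dict):
        # Assumption: string keys are cleaned as well as values.
        return {sanitize_data(key): sanitize_data(value) for key, value in data.items()}
    if isinstance(data, list):
        return [sanitize_data(item) for item in data]
    return data


# Hypothetical payload, not taken from the PR.
payload = {"title": "hello\u0000world", "tags": ["a\u0000", "b"], "count": 3}
cleaned = sanitize_data(payload)
```

In a SQLAlchemy `TypeDecorator`, a function like this would typically be called from the `process_bind_param` hook, which runs on a value before it is sent to the database.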
9 tasks
silentoplayz
approved these changes
Oct 1, 2025
Contributor
|
qq: why does this happen in the first place? |
Collaborator
Author
Collaborator
Author
|
This is related and (in my opinion) should ALSO be merged, if this PR is merged |
Collaborator
Author
|
#18576 does search sanitization, while my PR attempts to prevent the null bytes from even reaching the database. But both PRs should be merged for in-depth protection, both at the level where data is added and at the level where data is fetched. This would provide the best solution. **My PR attempts to prevent the null bytes from existing in the first place, whereas the other PR adds extra protection if null-bytes
|
Contributor
|
@Classic298 Can we first have a PR that filters these characters on insertion? |
Collaborator
Author
|
Check the other PR I mentioned here in the comments, @tjbck |
Contributor
|
#18576? Or am I missing something? |
Collaborator
Author
Closed
11 tasks
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.