feat: add comprehensive data sanitization to prevent null bytes #17775

Closed
Classic298 wants to merge 7 commits into open-webui:dev from Classic298:fix-postgres-null-byte-error

Conversation

@Classic298
Collaborator

This commit introduces a robust, centralized sanitization mechanism to prevent null bytes (\u0000) from being stored in the database. These characters were causing UntranslatableCharacter errors during search operations in PostgreSQL deployments.

The solution consists of a new centralized utility, data_sanitizer.py, which provides a sanitize_data function that recursively traverses data structures (dictionaries, lists, and strings) and removes null bytes.
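
A recursive sanitizer of the kind described above could look roughly like the sketch below. This is not the code from `data_sanitizer.py`: only the function name `sanitize_data` comes from the description, while cleaning dictionary keys as well as values and passing every other type through unchanged are assumptions made for illustration.

```python
from typing import Any

NULL_BYTE = "\x00"  # the character written as \u0000 in JSON


def sanitize_data(value: Any) -> Any:
    """Return a copy of `value` with null bytes removed from every string.

    Assumption: dict keys are cleaned too, and non-container, non-string
    values (numbers, None, booleans, ...) are returned unchanged.
    """
    if isinstance(value, str):
        return value.replace(NULL_BYTE, "")
    if isinstance(value, dict):
        return {sanitize_data(k): sanitize_data(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_data(item) for item in value]
    return value


cleaned = sanitize_data({"title": "re\x00port", "chunks": ["a\x00b", 3, None]})
```

Because the function builds new containers instead of mutating its input, the caller's original object is left untouched.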

To apply this sanitization broadly and automatically, this change introduces custom SQLAlchemy TypeDecorator classes, SanitizedText and SanitizedJSON. These types wrap the standard Text and JSON types and apply the sanitization logic before the data is sent to the database.
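
As a rough illustration of the `TypeDecorator` approach, the sketch below wraps `Text` and `JSON` and cleans values in `process_bind_param`, the hook SQLAlchemy calls before a bound value is handed to the underlying type. The class names come from the description; the bodies and the `_strip_null_bytes` helper (a stand-in for `sanitize_data`) are assumptions, not the PR's actual implementation.

```python
from typing import Any

from sqlalchemy.types import JSON, Text, TypeDecorator


def _strip_null_bytes(value: Any) -> Any:
    """Hypothetical stand-in for sanitize_data: recursively drop null bytes."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {_strip_null_bytes(k): _strip_null_bytes(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_strip_null_bytes(item) for item in value]
    return value  # None, numbers, etc. pass through unchanged


class SanitizedText(TypeDecorator):
    """Text column type that removes null bytes before the value is bound."""

    impl = Text
    cache_ok = True  # the type holds no per-instance state

    def process_bind_param(self, value, dialect):
        return _strip_null_bytes(value)


class SanitizedJSON(TypeDecorator):
    """JSON column type that cleans nested strings before serialization."""

    impl = JSON
    cache_ok = True

    def process_bind_param(self, value, dialect):
        return _strip_null_bytes(value)
```

A model would then declare a column with `Column(SanitizedText)` instead of `Column(Text)`, so the cleaning happens on every insert and update without changes at the call sites.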

These new sanitized types have been applied to all relevant columns in the database models, including Chat, Prompt, File, Message, User, and Knowledge, ensuring that all user-generated content and external data are cleaned proactively. This approach provides a comprehensive, long-term fix for the issue by ensuring data integrity at the ORM layer.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Classic298 marked this pull request as ready for review on September 26, 2025 09:26
Classic298 mentioned this pull request on Sep 26, 2025 (9 tasks)
silentoplayz added the enhancement (New feature or request) and testing wanted (Testing from the community is needed) labels on Oct 1, 2025
@tjbck
Contributor

tjbck commented Oct 27, 2025

qq: why does this happen in the first place?

@Classic298
Collaborator Author

@tjbck Uploaded PDF files or web search answers can contain those null bytes. That's how chats get contaminated, and PostgreSQL then breaks because it cannot work with null bytes.

For more info see #15616

@Classic298
Collaborator Author

This is related and (in my opinion) should ALSO be merged if this PR is merged:

#18576

@Classic298
Collaborator Author

Classic298 commented Oct 27, 2025

#18576 does search sanitization, while my PR attempts to prevent null bytes from ever reaching the database.

Both PRs should be merged for defense in depth: at the level where data is added and also at the level where data is fetched. This would provide the best solution.

**My PR attempts to prevent the null bytes from existing in the first place, whereas the other PR adds extra protection if null bytes:**

  1. already exist (they do on installations such as those of the users who commented on the issue!)
  2. somehow made it past my sanitization (maybe some data inputs I missed somewhere)
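
The two layers described above can be pictured with a small in-memory sketch. It does not reproduce either PR: `store_message`, `search_messages`, and the list standing in for the database are hypothetical, and the only idea carried over from the discussion is that null bytes are stripped once on the way in and again on the way out.

```python
from typing import List

NULL_BYTE = "\x00"


def strip_null_bytes(text: str) -> str:
    """Remove every null byte from a string."""
    return text.replace(NULL_BYTE, "")


def store_message(db: List[str], content: str) -> None:
    """Write path (hypothetical): clean content before it is stored."""
    db.append(strip_null_bytes(content))


def search_messages(db: List[str], query: str) -> List[str]:
    """Read path (hypothetical): clean the query and any legacy rows."""
    needle = strip_null_bytes(query).lower()
    results = []
    for row in db:
        clean_row = strip_null_bytes(row)  # covers rows stored before the fix
        if needle in clean_row.lower():
            results.append(clean_row)
    return results


db = ["legacy\x00 row"]            # written before write-path sanitization
store_message(db, "new\x00 row")   # cleaned on insertion
matches = search_messages(db, "ROW\x00")
```

The legacy row illustrates case 1 from the list: it is still dirty in storage, so only the read-side layer keeps the search from tripping over it.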

@tjbck
Contributor

tjbck commented Nov 5, 2025

@Classic298 Can we first have a PR that filters these characters on insertion?

@Classic298
Collaborator Author

Check the other PR I mentioned here in the comments, @tjbck.

@tjbck
Contributor

tjbck commented Nov 5, 2025

#18576? Or am I missing something?

@Classic298
Collaborator Author

Yes, #18576 and also #18207.
