Formatting: Rework `wp_check_invalid_utf8()`, add `wp_scrub_utf8()`#9498

Closed

dmsnell wants to merge 1 commit intoWordPress:trunkfrom

dmsnell:formatting/fix-wp-check-invalid-utf8

Member

dmsnell commented Aug 16, 2025 •

edited

Loading

Trac Ticket: Core-63837
Trac Ticket: Core-29717
See: ~~#9825~~, ~~#9830~~, (~~#9498~~), ~~#9826~~, ~~#9827~~, #9798, ~~#9828~~, ~~#9829~~

wp_check_invalid_utf8() has been dependent on the runtime configuration of the system running it. For systems without the necessary Unicode support it would simply not perform any validation of the input. Worse, the use of iconv() is not only system-dependent, but it also fails to perform substitution when requested, returning false instead of the substituted string as described in the function docblock.

In this patch a newer spec-compliant fallback is provided while the function itself defers to mb_scrub(). Both of these implementations are spec-compliant (to UTF-8) and perform “Maximal subpart replacement” when encountering invalid byte sequences — an important aspect to maintain secure and quality interoperability between systems.

Also included as part of this work

a new function wp_scrub_utf8() which performs the conversion that wp_check_invalid_utf8() does if the blog_charset is UTF-8 and if $strip = true is passed. this is a more focused function.
a new function _wp_scrub_utf8_fallback() which is a user-space fallback for when the mbstring extension isn’t available

github-actions bot commented Aug 16, 2025 •

edited

Loading

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, shailu25, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions bot commented Aug 16, 2025

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 3 times, most recently from f923dcb to 21469a9 Compare

August 18, 2025 15:21

dmsnell changed the title ~~Formatting: Normalize behavior in wp_check_invalid_utf8()~~ Formatting: Rework wp_check_invalid_utf8(), add wp_scrub_utf8()

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch from 1d5aab4 to 2dcebf1 Compare

August 19, 2025 13:34

sirreal self-requested a review

August 19, 2025 13:45

shail-mehta reviewed

View reviewed changes

src/wp-includes/formatting.php Outdated

+               *
+               *     'Pi�a' === _wp_scrub_utf8_fallback( mb_convert_encoding( 'Piña', 'Windows-1252', 'UTF-8' ) );
+               *
+               * @see wp_scrub_utf8()

Member

shail-mehta Aug 19, 2025 •

edited

Loading

Here also, I think the @see tag should come after the @since tag according to the PHP documentation standards.

Member Author

dmsnell Sep 16, 2025

I have updated the code and the relevant link has now been hidden below the unrelated tag.

shail-mehta reviewed

View reviewed changes

src/wp-includes/formatting.php Outdated

+               *     '.��.' === wp_scrub_utf8( ".\xC1\xBF." );         // Overlong sequence.
+               *     '.���.' === wp_scrub_utf8( ".\xED\xA0\x80." );    // Surrogate half.
+               *
+               * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630

Member

shail-mehta Aug 19, 2025 •

edited

Loading

I’m not sure, but I think the @see tag should come after the @since tag according to the PHP documentation standards.

Ref: https://developer.wordpress.org/coding-standards/inline-documentation-standards/php/

Member Author

dmsnell Sep 16, 2025

I have updated the code and the relevant link has now been hidden below the unrelated tag.

sirreal reviewed

View reviewed changes

Member

sirreal left a comment

This is an early review focusing on language and general structure. I'll follow up with detailed scrutiny of the implementation.

src/wp-includes/formatting.php Show resolved Hide resolved

src/wp-includes/formatting.php Show resolved Hide resolved

src/wp-includes/formatting.php Outdated

Comment on lines 1322 to 1325

+               * While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary
+               * to work only with a valid string, such as when producing XML or feeding data into a
+               * large language model. In these cases, calling this function will produce such a valid
+               * string where the invalid segments are replaced with the Unicode Replacement Character.

Member

sirreal Aug 20, 2025

This tone makes it seem like invalid UTF-8 is usually fine, but I'm reluctant to give that impression. I'd prefer less relaxed language:

Suggested change

      
             * While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary
          
             * to work only with a valid string, such as when producing XML or feeding data into a
          
             * large language model. In these cases, calling this function will produce such a valid
          
             * string where the invalid segments are replaced with the Unicode Replacement Character.
          
             * It's possible to ignore invalid UTF-8, but there are cases where it’s necessary
          
             * to work only with a valid string, such as when producing XML or feeding data into a
          
             * large language model. In these cases, calling this function will produce such a valid
          
             * string where the invalid segments are replaced with the Unicode Replacement Character.

Member Author

dmsnell Sep 16, 2025

After thinking about your comment I have changed the wording considerably. I welcome your thoughts on the new construction. I’m trying to balance passing through invalid bytes for round-tripping value and scrubbing to avoid safety issues.

It’s really hard to come up with cases where we should scrub. The more I look, the fewer places I find. The only places I can think of are where some external interface crashes in the presence of invalid bytes, such as XML. For every other case I think there are strong arguments to be made about leaving the “bad” bytes in place.

phpcs.xml.dist Outdated Show resolved Hide resolved

src/wp-includes/formatting.php Outdated Show resolved Hide resolved

sirreal reviewed

View reviewed changes

src/wp-includes/formatting.php Outdated Show resolved Hide resolved

sirreal previously approved these changes

View reviewed changes

Member

sirreal left a comment

This seems to be correct. I spent most of my time working through the maximal subpart section.

I did leave some suggestions that would be good to address, but they're mostly around the documentation and comment. I don't see any issues with the implementation.

src/wp-includes/formatting.php Outdated

Comment on lines 1385 to 1391

+              	/*
+              	 * In PHP 8.0, `mb_scrub()` only returns a valid string. Once WordPress
+              	 * depends on PHP 8.0.0 or above, this check will not be necessary.
+              	 */
+              	return false === $scrubbed
+              		? _wp_scrub_utf8_fallback( $text )
+              		: $scrubbed;

Member

sirreal Aug 20, 2025

Is this now covered by the use_fallback on PHP 8.1.6 check above? It seems like this should be unreachable.

Member Author

dmsnell Aug 21, 2025

yes, very good catch. you can probably infer that I wrote this before adding the 8.1.6 check above.

sirreal self-requested a review

August 22, 2025 21:20

dmsnell mentioned this pull request

29717 wp check invalid utf8 #6320

Open

sirreal dismissed their stale review

August 26, 2025 09:08

Review is stale.

This was referenced Aug 30, 2025

fix-Replace warning-suppression with error-handler in _wp_can_use_pcre_u #9618

Open

Formatting: Rely on _wp_can_use_pcre_u() to detect Unicode PCRE support. #9678

Closed

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch from 688d44b to 06d6054 Compare

September 10, 2025 21:00

This was referenced Sep 10, 2025

Charset: Introduce wp_utf8_chunks() to iterate through strings. #9826

Closed

Charset: Create compat-utf8.php module with fallback code. #9825

Closed

Charset: Track detection of non-characters when scanning strings. #9827

Closed

HTML API: Replace use of PCRE functions with new UTF-8 pipeline. #9798

Closed

Charset: Rely on new UTF-8 pipeline for mb_strlen() fallback. #9828

Closed

Charset: Rely on new UTF-8 pipeline for mb_substr() fallback. #9829

Closed

Charset: Abstract new UTF-8 scanning pipeline from validation fallback. #9830

Closed

dmsnell marked this pull request as draft

September 11, 2025 03:33

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 3 times, most recently from 657b56b to 9ea268b Compare

September 15, 2025 18:58

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 2 times, most recently from 178b83a to 9352bab Compare

September 16, 2025 12:43

dmsnell marked this pull request as ready for review

September 16, 2025 12:43

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 4 times, most recently from 15a1a37 to 5e326a0 Compare

September 22, 2025 20:41


          Introduce new UTF-8 pipeline in new compat-utf8 module.

6a2cda9

dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch from 5e326a0 to 6a2cda9 Compare

September 22, 2025 20:50

pento pushed a commit that referenced this pull request


          Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipel…

d1e7f56

…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in #9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82

github-actions bot commented Sep 23, 2025

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60793
GitHub commit: d1e7f56

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

github-actions bot closed this

dmsnell deleted the formatting/fix-wp-check-invalid-utf8 branch

September 23, 2025 03:35

markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request


          Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipel…

…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress/wordpress-develop#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Built from https://develop.svn.wordpress.org/trunk@60793


git-svn-id: http://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd

github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request


          Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipel…

424146f

…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress/wordpress-develop#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Built from https://develop.svn.wordpress.org/trunk@60793


git-svn-id: https://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd

jonnynews pushed a commit to spacedmonkey/wordpress-develop that referenced this pull request


          Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipel…

…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet