Skip to content

Conversation

@dmsnell
Copy link
Member

@dmsnell dmsnell commented Aug 16, 2025

Trac Ticket: Core-63837
Trac Ticket: Core-29717
See: #9825, #9830, (#9498), #9826, #9827, #9798, #9828, #9829

wp_check_invalid_utf8() has been dependent on the runtime configuration of the system running it. For systems without the necessary Unicode support it would simply not perform any validation of the input. Worse, the use of iconv() is not only system-dependent, but it also fails to perform substitution when requested, returning false instead of the substituted string as described in the function docblock.

In this patch a newer spec-compliant fallback is provided while the function itself defers to mb_scrub(). Both of these implementations are spec-compliant (to UTF-8) and perform “Maximal subpart replacement” when encountering invalid byte sequences — an important aspect to maintain secure and quality interoperability between systems.

Also included as part of this work

  • a new function wp_scrub_utf8() which performs the conversion that wp_check_invalid_utf8() does if the blog_charset is UTF-8 and if $strip = true is passed. this is a more focused function.
  • a new function _wp_scrub_utf8_fallback() which is a user-space fallback for when the mbstring extension isn’t available

@github-actions
Copy link

github-actions bot commented Aug 16, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, shailu25, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 3 times, most recently from f923dcb to 21469a9 Compare August 18, 2025 15:21
@dmsnell dmsnell changed the title Formatting: Normalize behavior in wp_check_invalid_utf8() Formatting: Rework wp_check_invalid_utf8(), add wp_scrub_utf8() Aug 18, 2025
@dmsnell dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch from 1d5aab4 to 2dcebf1 Compare August 19, 2025 13:34
@sirreal sirreal self-requested a review August 19, 2025 13:45
*
* 'Pi�a' === _wp_scrub_utf8_fallback( mb_convert_encoding( 'Piña', 'Windows-1252', 'UTF-8' ) );
*
* @see wp_scrub_utf8()
Copy link
Member

@shail-mehta shail-mehta Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, I think the @see tag should come after the @since tag according to the PHP documentation standards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code and the relevant link has now been hidden below the unrelated tag.

* '.��.' === wp_scrub_utf8( ".\xC1\xBF." ); // Overlong sequence.
* '.���.' === wp_scrub_utf8( ".\xED\xA0\x80." ); // Surrogate half.
*
* @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630
Copy link
Member

@shail-mehta shail-mehta Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure, but I think the @see tag should come after the @since tag according to the PHP documentation standards.

Ref: https://developer.wordpress.org/coding-standards/inline-documentation-standards/php/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code and the relevant link has now been hidden below the unrelated tag.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an early review focusing on language and general structure. I'll follow up with detailed scrutiny of the implementation.

Comment on lines 1322 to 1325
* While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary
* to work only with a valid string, such as when producing XML or feeding data into a
* large language model. In these cases, calling this function will produce such a valid
* string where the invalid segments are replaced with the Unicode Replacement Character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tone makes it seem like invalid UTF-8 is usually fine, but I'm reluctant to give that impression. I'd prefer less relaxed language:

Suggested change
* While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary
* to work only with a valid string, such as when producing XML or feeding data into a
* large language model. In these cases, calling this function will produce such a valid
* string where the invalid segments are replaced with the Unicode Replacement Character.
* It's possible to ignore invalid UTF-8, but there are cases where it’s necessary
* to work only with a valid string, such as when producing XML or feeding data into a
* large language model. In these cases, calling this function will produce such a valid
* string where the invalid segments are replaced with the Unicode Replacement Character.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about your comment I have changed the wording considerably. I welcome your thoughts on the new construction. I’m trying to balance passing through invalid bytes for round-tripping value and scrubbing to avoid safety issues.

It’s really hard to come up with cases where we should scrub. The more I look, the fewer places I find. The only places I can think of are where some external interface crashes in the presence of invalid bytes, such as XML. For every other case I think there are strong arguments to be made about leaving the “bad” bytes in place.

sirreal
sirreal previously approved these changes Aug 20, 2025
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be correct. I spent most of my time working through the maximal subpart section.

I did leave some suggestions that would be good to address, but they're mostly around the documentation and comment. I don't see any issues with the implementation.

Comment on lines 1385 to 1391
/*
* In PHP 8.0, `mb_scrub()` only returns a valid string. Once WordPress
* depends on PHP 8.0.0 or above, this check will not be necessary.
*/
return false === $scrubbed
? _wp_scrub_utf8_fallback( $text )
: $scrubbed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this now covered by the use_fallback on PHP 8.1.6 check above? It seems like this should be unreachable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, very good catch. you can probably infer that I wrote this before adding the 8.1.6 check above.

@dmsnell dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 2 times, most recently from 178b83a to 9352bab Compare September 16, 2025 12:43
@dmsnell dmsnell marked this pull request as ready for review September 16, 2025 12:43
@dmsnell dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch 4 times, most recently from 15a1a37 to 5e326a0 Compare September 22, 2025 20:41
@dmsnell dmsnell force-pushed the formatting/fix-wp-check-invalid-utf8 branch from 5e326a0 to 6a2cda9 Compare September 22, 2025 20:50
pento pushed a commit that referenced this pull request Sep 23, 2025
…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in #9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82
@github-actions
Copy link

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60793
GitHub commit: d1e7f56

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Sep 23, 2025
@dmsnell dmsnell deleted the formatting/fix-wp-check-invalid-utf8 branch September 23, 2025 03:35
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Sep 23, 2025
…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress/wordpress-develop#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Built from https://develop.svn.wordpress.org/trunk@60793


git-svn-id: http://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Sep 23, 2025
…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress/wordpress-develop#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Built from https://develop.svn.wordpress.org/trunk@60793


git-svn-id: https://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd
jonnynews pushed a commit to spacedmonkey/wordpress-develop that referenced this pull request Sep 24, 2025
…ine.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in WordPress#9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants