-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Formatting: Rework wp_check_invalid_utf8(), add wp_scrub_utf8()
#9498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Formatting: Rework wp_check_invalid_utf8(), add wp_scrub_utf8()
#9498
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
f923dcb to
21469a9
Compare
wp_check_invalid_utf8()wp_check_invalid_utf8(), add wp_scrub_utf8()
1d5aab4 to
2dcebf1
Compare
src/wp-includes/formatting.php
Outdated
| * | ||
| * 'Pi�a' === _wp_scrub_utf8_fallback( mb_convert_encoding( 'Piña', 'Windows-1252', 'UTF-8' ) ); | ||
| * | ||
| * @see wp_scrub_utf8() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also, I think the @see tag should come after the @since tag according to the PHP documentation standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code and the relevant link has now been hidden below the unrelated tag.
src/wp-includes/formatting.php
Outdated
| * '.��.' === wp_scrub_utf8( ".\xC1\xBF." ); // Overlong sequence. | ||
| * '.���.' === wp_scrub_utf8( ".\xED\xA0\x80." ); // Surrogate half. | ||
| * | ||
| * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure, but I think the @see tag should come after the @since tag according to the PHP documentation standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code and the relevant link has now been hidden below the unrelated tag.
sirreal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an early review focusing on language and general structure. I'll follow up with detailed scrutiny of the implementation.
src/wp-includes/formatting.php
Outdated
| * While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary | ||
| * to work only with a valid string, such as when producing XML or feeding data into a | ||
| * large language model. In these cases, calling this function will produce such a valid | ||
| * string where the invalid segments are replaced with the Unicode Replacement Character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tone makes it seem like invalid UTF-8 is usually fine, but I'm reluctant to give that impression. I'd prefer less relaxed language:
| * While it’s usually safe to ignore invalid UTF-8, there are cases where it’s necessary | |
| * to work only with a valid string, such as when producing XML or feeding data into a | |
| * large language model. In these cases, calling this function will produce such a valid | |
| * string where the invalid segments are replaced with the Unicode Replacement Character. | |
| * It's possible to ignore invalid UTF-8, but there are cases where it’s necessary | |
| * to work only with a valid string, such as when producing XML or feeding data into a | |
| * large language model. In these cases, calling this function will produce such a valid | |
| * string where the invalid segments are replaced with the Unicode Replacement Character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about your comment I have changed the wording considerably. I welcome your thoughts on the new construction. I’m trying to balance passing through invalid bytes for round-tripping value and scrubbing to avoid safety issues.
It’s really hard to come up with cases where we should scrub. The more I look, the fewer places I find. The only places I can think of are where some external interface crashes in the presence of invalid bytes, such as XML. For every other case I think there are strong arguments to be made about leaving the “bad” bytes in place.
sirreal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be correct. I spent most of my time working through the maximal subpart section.
I did leave some suggestions that would be good to address, but they're mostly around the documentation and comment. I don't see any issues with the implementation.
src/wp-includes/formatting.php
Outdated
| /* | ||
| * In PHP 8.0, `mb_scrub()` only returns a valid string. Once WordPress | ||
| * depends on PHP 8.0.0 or above, this check will not be necessary. | ||
| */ | ||
| return false === $scrubbed | ||
| ? _wp_scrub_utf8_fallback( $text ) | ||
| : $scrubbed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this now covered by the use_fallback on PHP 8.1.6 check above? It seems like this should be unreachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, very good catch. you can probably infer that I wrote this before adding the 8.1.6 check above.
688d44b to
06d6054
Compare
657b56b to
9ea268b
Compare
178b83a to
9352bab
Compare
15a1a37 to
5e326a0
Compare
5e326a0 to
6a2cda9
Compare
…ine. This is the fourth in a series of patches to modernize and standardize UTF-8 handling. `wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string. This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline. To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees. Developed in #9498 Discussed in https://core.trac.wordpress.org/ticket/63837 Follow-up to: [60768]. Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25. Fixes #63837, #29717. See #63863. git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82
…ine. This is the fourth in a series of patches to modernize and standardize UTF-8 handling. `wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string. This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline. To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees. Developed in WordPress/wordpress-develop#9498 Discussed in https://core.trac.wordpress.org/ticket/63837 Follow-up to: [60768]. Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25. Fixes #63837, #29717. See #63863. Built from https://develop.svn.wordpress.org/trunk@60793 git-svn-id: http://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd
…ine. This is the fourth in a series of patches to modernize and standardize UTF-8 handling. `wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string. This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline. To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees. Developed in WordPress/wordpress-develop#9498 Discussed in https://core.trac.wordpress.org/ticket/63837 Follow-up to: [60768]. Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25. Fixes #63837, #29717. See #63863. Built from https://develop.svn.wordpress.org/trunk@60793 git-svn-id: https://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd
…ine. This is the fourth in a series of patches to modernize and standardize UTF-8 handling. `wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string. This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline. To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees. Developed in WordPress#9498 Discussed in https://core.trac.wordpress.org/ticket/63837 Follow-up to: [60768]. Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25. Fixes #63837, #29717. See #63863. git-svn-id: https://develop.svn.wordpress.org/trunk@60793 602fd350-edb4-49c9-b593-d223f7449a82
Trac Ticket: Core-63837
Trac Ticket: Core-29717
See:
#9825,#9830, (#9498),#9826,#9827, #9798,#9828,#9829wp_check_invalid_utf8()has been dependent on the runtime configuration of the system running it. For systems without the necessary Unicode support it would simply not perform any validation of the input. Worse, the use oficonv()is not only system-dependent, but it also fails to perform substitution when requested, returningfalseinstead of the substituted string as described in the function docblock.In this patch a newer spec-compliant fallback is provided while the function itself defers to
mb_scrub(). Both of these implementations are spec-compliant (to UTF-8) and perform “Maximal subpart replacement” when encountering invalid byte sequences — an important aspect to maintain secure and quality interoperability between systems.Also included as part of this work
wp_scrub_utf8()which performs the conversion thatwp_check_invalid_utf8()does if theblog_charsetisUTF-8and if$strip = trueis passed. this is a more focused function._wp_scrub_utf8_fallback()which is a user-space fallback for when thembstringextension isn’t available