Make WordPress Core

Opened 6 months ago

Closed 4 months ago

#63863 closed enhancement (fixed)

Standardize UTF-8 handling and fallbacks in 6.9

Reported by: dmsnell Owned by: dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version: 6.9
Component: Charset Keywords: has-patch has-unit-tests
Focuses: performance Cc:

Description (last modified by dmsnell)

Core uses a wide variety of methods for working with UTF-8, especially when the running server and process don’t have the mbstring extension loaded. Much of the diversity in implementation and specification in this code is a legacy of the time and context in which it was added. Today, it’s viable to unify and standardize the code handling UTF-8, which is what this ticket proposes.

WordPress should unify handling of UTF-8 data and depend foremost on the mbstring-provided functions which have received significant updates during the 8.x releases of PHP. When unavailable, WordPress should defer to a single and reasonable implementation of the functionality in pure PHP.

Why mbstring?

The mbstring PHP extension has long been highly recommended, and it provides high-performance, spec-compliant tools for working with UTF-8 and other text encodings. The extension has seen active development and integration with PHP throughout the PHP 8.x cycle, including improved Unicode support and a couple of notable enhancements relevant to this work:

  • Since PHP 8.1.6 the mb_scrub() function follows the Unicode recommendation of substituting maximal subparts (see the sketch after this list). This is an important aspect of UTF-8 decoders which ensures safe decoding of untrusted inputs and standardized interoperability with UTF-8 handling in other systems.
  • Since PHP 8.3.0 the UTF-8 validity of a string is cached on the ZSTR value itself, so repeated validity checks are simple property accesses for strings already validated as UTF-8.
  • These functions use high-performance C implementations, including SIMD and vectorized algorithms, whose speed cannot be matched in PHP and which aren’t deployed in every other potential method for working with UTF-8 (for example, in the iconv and PCRE libraries).
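
As an illustration of the maximal-subparts rule (a minimal sketch, assuming PHP 8.1.6+ with mbstring loaded), mb_scrub() substitutes U+FFFD for each maximal ill-formed subsequence instead of deleting bytes:

<?php
// "\xE2" begins a three-byte sequence, but "(" (0x28) is not a continuation byte,
// so the maximal subpart is just "\xE2"; the stray continuation byte "\xA1" forms
// a second ill-formed subpart. Each is replaced with U+FFFD; nothing is deleted.
$input = "abc\xE2\x28\xA1def";
echo mb_scrub( $input, 'UTF-8' );
// Expected on PHP >= 8.1.6: "abc\u{FFFD}(\u{FFFD}def"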

The case for mbstring is strong. Conversely, I believe there are reasons to avoid the other methods:

  • iconv()’s behavior depends on the version of the iconv library compiled into the running PHP build, and it’s hard to establish uniform behavior from PHP code.
  • iconv() cannot “scrub” invalid byte sequences from UTF-8 text. It only supports an //IGNORE mode which removes the byte sequences, and removal is not a sound security practice: it’s far more dangerous than substituting the sequences as mb_scrub() does (see the sketch after this list).
  • PCRE-based approaches are also highly system-dependent.
  • Older versions of the PCRE libraries accepted certain invalid UTF-8 sequences as valid, e.g. with five-byte sequences.
  • PCRE-based approaches are memory-heavy, mostly operating in one of two ways: splitting the string into an array of chunks and then processing the chunks, or using preg_replace_callback() and passing around sub-allocations of the input string.
  • PCRE functions offer no method for zero-allocation operation: a full group must be matched and extracted before it can be operated on. This makes PCRE-based approaches vulnerable to denial-of-service attacks and accidents. (For example, HTML numeric character references may contain an arbitrary number of leading zeros and can choke preg_match(), doubling memory use or more just to make the match.)
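
A minimal sketch of the removal-versus-substitution concern above (the exact iconv() output can vary by the linked iconv implementation, which is itself part of the problem):

<?php
// An ill-formed sequence interrupts the word "script". Deleting the invalid
// bytes fuses the surrounding text into a new token; substitution does not.
$input = "<scr\xE2\x80ipt>";

echo iconv( 'UTF-8', 'UTF-8//IGNORE', $input ); // typically "<script>"
echo mb_scrub( $input, 'UTF-8' );               // "<scr\u{FFFD}ipt>"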

Finally, the plethora of approaches in Core can be dizzying when trying to understand behaviors and bugs. Without a well-defined specification of the behaviors, things only seem to work; the code never states the bounds within which it operates.

Why a single fallback in pure PHP?

While a pure user-space PHP implementation of UTF-8 handling will never approach the performance and efficiency of the underlying native libraries, it can be fast enough for reasonable use. Reportedly, around 0.5% of WordPress installations lack the mbstring extension. For those sites, WordPress can trade some performance for harmony of behaviors.

When functions try a chain of fallback options (e.g. mbstring, iconv, PCRE with Unicode support, PCRE with hard-coded byte patterns, ASCII), WordPress’ reliability fragments and debugging and resolution become very difficult.

A pure PHP fallback gives WordPress the ability to identify and fix bugs in its implementation, and the implementation details are visible for all to inspect. The barrier to diagnosing why a function returns the wrong result is much higher when one needs to scan PHP’s source code, iconv’s source code, one of several PCRE libraries’ source code, and more.


While the up-front effort is high, the HTML API has demonstrated how valuable it can be to have a reliable API in WordPress for interoperating with various web standards. It’s time to modernize WordPress to support UTF-8 universally and remove the existing complexity of ad-hoc handling and runtime dependencies.

Proposal

  • Create a new wp-includes/compat-utf8.php polyfilling basic UTF-8 handling and implementing the current mb_ polyfills from compat.php. Moving this to a UTF-8-specific module keeps the code in that module focused and makes it easier to exclude the WPCS rule rejecting goto statements. goto is a valuable construct when handling decoding errors in a low-level decoder. This module loads before wp-includes/compat.php, which makes it simpler in that file to polyfill things like mb_substr().
  • Create wp-includes/utf8.php containing WordPress-specific functions for handling UTF-8. This abstracts access to text behind a unifying interface and allows WordPress to improve support, performance, and reliability while lifting up all calling code. UTF-8 is universal enough to warrant its own subsystem.
  • String functions are conditionally defined based on the presence of the mbstring extension and any other relevant factors (see the sketch after this list). This moves support checks to PHP initialization instead of every invocation of these functions. A side effect of splitting these functions based on the presence of the extension is the safe removal of mbstring_binary_safe_encoding(). When mbstring is loaded, functions will call the mb_ functions directly; when it’s not available, there can be no mbstring.func_overload.
  • A new UTF-8 decoding pipeline provides zero-allocation, streaming, and re-entrant access to a string so that common operations don’t need to involve any more overhead than they require. In addition to being a versatile fallback mechanism, this low-level scanner can provide access to new abilities not available today such as: count code points within a substring without allocating, split a string into chunks of valid and invalid byte sequences, and combine identification, validation, and transformation of a string into a single pass.
  • Replace existing non-canonical UTF-8 code in Core with the new abstractions. No more static $utf8_pcre checks, no more if ( function_exists( 'mb_substr' ) ) — just unconditional explicit semantics.
  • Remove the single regex from the HTML API.
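
A minimal sketch of the load-time branching described in the list above; the wp_utf8_strlen() name and the fallback body are hypothetical and for illustration only:

<?php
// Define the function once, at load time, based on whether mbstring is present,
// instead of checking function_exists() on every call.
if ( extension_loaded( 'mbstring' ) ) {
	function wp_utf8_strlen( $text ) {
		return mb_strlen( $text, 'UTF-8' );
	}
} else {
	function wp_utf8_strlen( $text ) {
		// Naive pure-PHP fallback: count bytes which are not UTF-8 continuation
		// bytes (0x80-0xBF); assumes the input is valid UTF-8.
		$count = 0;
		for ( $i = 0, $end = strlen( $text ); $i < $end; $i++ ) {
			if ( 0x80 !== ( ord( $text[ $i ] ) & 0xC0 ) ) {
				++$count;
			}
		}
		return $count;
	}
}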

As WordPress builds its own abstractions and polyfills for the mbstring library, it can remove the fallback behaviors as it raises its minimum supported PHP version and if it begins requiring mbstring.

Related tickets

  • #38044: Deprecate seems_utf8() and add wp_is_valid_utf8().
  • #62172: Deprecate non-UTF-8 support.
  • #63837: Overhaul wp_check_invalid_utf8() to remove runtime dependencies.
    • #29717: Optimize and fix wp_check_invalid_utf8().
    • #43224: Remove $pcre_utf8 logic from wp_check_invalid_utf8().
  • #55603: Address deprecation of utf8_decode() and utf8_encode(), discussion of requiring mbstring.

Related PRs

  • #6883 introduce custom UTF-8 decoding pipeline. (this PR was exploratory as part of background research).
  • #9498 update wp_check_invalid_utf8() (currently contains broader updates which will be removed and transferred into a new PR).

Change History (53)

#1 @dmsnell
6 months ago

  • Description modified (diff)

#2 @dmsnell
6 months ago

  • Description modified (diff)

#3 @dmsnell
6 months ago

  • Description modified (diff)

#4 @dmsnell
6 months ago

  • Component changed from Formatting to Charset

#5 @tusharbharti
6 months ago

Hi @dmsnell, I believe https://core.trac.wordpress.org/ticket/63804 is also related to this as it would introduce mb_trim() ( and mb_ltrim() & mb_rtrim() if needed ) in compat for js_trim().

It has a patch for this with compat & unit tests for mb_trim.

This ticket was mentioned in PR #9678 on WordPress/wordpress-develop by @dmsnell.


6 months ago
#6

  • Keywords has-patch added

Trac ticket: Core-63863.

The sanitize_file_name() function attempts its own manual detection of Unicode PCRE support, but WordPress already provides a more robust method of doing so. To avoid re-computation, it also stores the result in a static var inside the function; however, that check is already cached in a static var inside _wp_can_use_pcre_u(), so the extra cache is largely redundant.

This patch leans on the recently-updated _wp_can_use_pcre_u() check and removes the superfluous static var.
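
A simplified before/after sketch of the change; the function names are hypothetical and the “before” only approximates the ad-hoc detection, it is not a copy of Core code:

<?php
// Before (approximate): ad-hoc detection cached in a function-local static.
function example_can_use_pcre_u_before() {
	static $utf8_pcre = null;
	if ( ! isset( $utf8_pcre ) ) {
		$utf8_pcre = @preg_match( '/^./u', 'a' );
	}
	return (bool) $utf8_pcre;
}

// After: lean on the canonical check, which already caches its own result.
function example_can_use_pcre_u_after() {
	return _wp_can_use_pcre_u();
}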

Follow-up to #9576
Follow-up to [60694]

Work extracted from #9498

#7 @dmsnell
6 months ago

@tusharbharti thanks for linking that. I’ll try and think it through and see how to combine it. There’s definitely some limit to polyfilling the mbstring extension that I’m trying to balance, and much or most of that is focused on UTF-8.


otherwise there’s a funny thing about #9678. it’s worth doing, and I considered creating a ticket like “ensure that code calling PCRE functions with the u flag rely on _wp_can_use_pcre_u()” but as I looked around, almost everything in Core just assumes the flag is present and breaks when it isn’t.

This is part of what I mean when I talk at times about WordPress already being broken in non-UTF-8 environments; I think we could have a lot of unreported or under-reported bugs that derive from places where a regex fails. It would be incredibly difficult to pin the reported behavior back onto the regex failure.

A good example of this might be shortcode_parse_atts(), where a call to preg_replace() returns NULL instead of a string when support is missing. I don’t know what to do with these functions when support is lacking other than tackle them one at a time and in unique ways for each one.

#8 @dmsnell
6 months ago

In 60695:

Formatting: Rely on _wp_can_use_pcre_u() to detect UTF-8 PCRE support.

The sanitize_file_name() function attempts to detect UTF-8 PCRE support, but WordPress already provides a more robust method. It then caches its check in a static var, which WordPress already does in the canonical function _wp_can_use_pcre_u().

This patch refactors sanitize_file_name() to call _wp_can_use_pcre_u() directly instead of (mostly) recreating and recaching Core’s detection algorithm.

Developed in https://github.com/WordPress/wordpress-develop/pull/9678
Discussed in https://core.trac.wordpress.org/ticket/63863

Follow-up to [60694].

Props dmsnell.
See #63863.

@dmsnell commented on PR #9678:


6 months ago
#9

Merged in [60695]

#10 @tusharbharti
6 months ago

otherwise there’s a funny thing about #9678. it’s worth doing, and I considered creating a ticket like “ensure that code calling PCRE functions with the u flag rely on _wp_can_use_pcre_u()” but as I looked around, almost everything in Core just assumes the flag is present and breaks when it isn’t.

I think creating a ticket for this would make sense as this ticket is for standardizing the UTF-8 part. If there are a lot of places which need the _wp_can_use_pcre_u() fix then it can be done on the basis of a section like formatting, or even on the basis of files, to make reviewing easier (ofc depending on the files changed). I can help with this.

there’s definitely some limit to polyfilling the mbstring extension that I’m trying to balance, and much or most of that is focused on UTF-8.

For polyfills, https://github.com/symfony/polyfill-mbstring seems like a good reference, though I am also exploring https://github.com/php/php-src/tree/master/ext/mbstring

#11 @dmsnell
6 months ago

@tusharbharti created the tracking ticket in #63913. We can move discussion about those unchecked uses of the PCRE flag there.

This ticket was mentioned in Slack in #hosting by zunaid321. View the logs.


5 months ago

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


5 months ago

This ticket was mentioned in Slack in #hosting by amykamala. View the logs.


5 months ago

This ticket was mentioned in PR #9825 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#15

Trac Ticket: Core-63863
Follow-up to: #9317, [60630]

This patch introduces a new early-loaded compat-utf8.php file which provides the UTF-8 validation fallback. This is part of a broader effort to unify and standardize UTF-8 handling.

This is an intermediate change in order to better facilitate source-code tracking. This code was introduced in formatting.php and originally was intended to be duplicated inside of wp_check_invalid_utf8(), but the difference between the two functions is so minor that the part of the code which scans through the bytes in a string should be abstracted and reused. This reusable function would ideally load early enough to be used to polyfill methods like mb_substr(), so I am moving it early.

While this is not the abstracted code, this change will be helpful by providing continuity between the function as it stands today and as it will transform when reused by wp_check_invalid_utf8(). In other words, this change exists to make sure that source control shows that this function moved first, and then was changed later. To move it and change it in one go is likely to sever its history.

This will turn into _wp_scan_utf8(), the updated iteration of #6883, which provides a fast, spec-compliant, and streamable UTF-8 parser.

This ticket was mentioned in PR #9826 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#16

  • Keywords has-unit-tests added

Trac ticket: Core-63863
See: #9825, #9498

Introduces a new method, wp_utf8_chunks(), for iterating through a string which might contain invalid spans of UTF-8 bytes.

This function is largely a low-level function for working with text data. With it, it’s possible to operate on the valid UTF-8 portions of a string while preserving the invalid parts. For instance, this might be used to operate on improperly-constructed JSON data.

This ticket was mentioned in PR #9827 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#17

Trac ticket: Core-63863
See: #9498, #9825, #9826, #9798

Augments the UTF-8 scanning pipeline to track whether noncharacters have been seen. This operation adds a marginal cost to scanning UTF-8 but provides valuable information for HTML- and XML-related calling-code. Tracking it inline allows skipping a second pass through strings to determine whether or not these characters exist.

This ticket was mentioned in PR #9798 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#18

Trac Ticket: Core-63863
See: #9498, #9825, #9826, #9827

Relies on the new UTF-8 pipeline to eliminate the use of preg_match() in the HTML API.

This ticket was mentioned in PR #9828 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#19

Trac ticket: Core-63863
See: #9498, #9825, #9826, #9827, #9798

Update the polyfill of mb_strlen() to rely on the new UTF-8 pipeline.

This ticket was mentioned in PR #9829 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#20

Trac ticket: Core-63863
See: #9498, #9825, #9826, #9827, #9798, #9828

Update the polyfill of mb_substr() to rely on the new UTF-8 pipeline.

This ticket was mentioned in PR #9830 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#21

Trac ticket: Core-63863
See: #9498, #9825, #9826, #9827, #9798, #9828, #9829

This patch is a second middle-man commit meant to guard the continuity of changes through source-control for the broader efforts to unify and standardize UTF-8 handling.

This change abstracts the user-space UTF-8 scanner to a more reusable and generic form and then uses it in the _wp_is_valid_utf8_fallback() function.

This ticket was mentioned in Slack in #core by dmsnell. View the logs.


5 months ago

#23 @dmsnell
5 months ago

In 60743:

Charset: Create compat-utf8.php module with fallback code.

This is the second in a series of patches to modernize and standardize UTF-8 handling.

When the fallback UTF-8 validation code was added it was placed inside formatting.php; however, that validation logic can be reused for a number of related UTF-8 functions. To facilitate this it should move into a new location and be loaded early. This patch is the first half of doing that, whereby the original fallback function is moved unchanged to the compat-utf8.php module. The follow-up patch will abstract the UTF-8 scanning logic for reuse. Splitting this into a move and a separate change involves an extra step, but facilitates tracking the heritage of the code through the changes.

Developed in https://github.com/WordPress/wordpress-develop/pull/9825
Discussed in https://core.trac.wordpress.org/ticket/63863

Follow-up to: [60630].

See #63863.

@dmsnell commented on PR #9825:


5 months ago
#24

Merged in 31cac36351894e42e4c31054f3e4440ac2df2cd9

#25 @desrosj
5 months ago

In 60764:

Coding Standards: Apply changes from running composer format.

Follow up to [60684], [60743].

See #63840, #63863.

#26 @dmsnell
5 months ago

In 60768:

Charset: Introduce UTF-8 scanning pipeline.

This is the third in a series of patches to modernize and standardize UTF-8 handling.

When the fallback UTF-8 validation code was added it was placed inside formatting.php; however, that validation logic can be reused for a number of related UTF-8 functions. To facilitate this it was moved into a new location and loaded early. This patch is the follow-up to that first half, whereby the UTF-8 scanning logic forms its own new _wp_scan_utf8() function. This new UTF-8 scanner is a low-level function which forms a shared spec-compliant processing core to power multiple fallback functions and some new functionality as well.

Developed in https://github.com/WordPress/wordpress-develop/pull/9830
Discussed in https://core.trac.wordpress.org/ticket/63863

Follow-up to: [60743].

See #63863.

@dmsnell commented on PR #9830:


5 months ago
#27

Merged in c9166919cce1f78fc10de220a92970f9448e03dd
[60768]

This ticket was mentioned in PR #6883 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#28

Trac ticket: Core-63863
See also #9317

## Status

This is an exploratory patch so far.

## Tasks

  • [ ] Test behavior of new functions.
  • [ ] Benchmark performance.

## Motivation

  • _mb_strlen() attempts to split a string on UTF-8 boundaries, falling back to assumed character patterns if it can't run Unicode-supported PCRE patterns.
  • wp_check_invalid_utf8() performs similar PCRE-based counting.
  • If sending HTML in any encoding other than UTF-8, it's important not to perform basic conversion with a function like mb_convert_encoding() or iconv() because these can lead to data loss for characters not representable in the target encoding. Although mb_encode_numericentity() exists with the multi-byte extension (see the sketch after this list), having a streaming UTF-8 decoder would allow WordPress to handle proper conversion to numeric character references natively and universally. E.g. converting to …
  • URL detection and XML name parsing requires detecting sequences of bytes with specific Unicode ranges, like a Unicode-aware strspn(), and should stop as soon as any given character is outside of that range.
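
For reference, a sketch of the mbstring-only path mentioned in the list above, converting every code point above U+007F into a numeric character reference (the convmap is [start, end, offset, mask]):

<?php
// Convert non-ASCII code points to numeric character references so the text
// survives transport in a non-UTF-8 encoding without data loss.
$text = "Caf\u{00E9} na\u{00EF}ve";
echo mb_encode_numericentity(
	$text,
	array( 0x80, 0x10FFFF, 0, 0x1FFFFF ),
	'UTF-8'
);
// "Caf&#233; na&#239;ve"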

## Description

WordPress relies on various extensions, regular expressions, and basic string operations when working with text potentially encoded as UTF-8.

In this patch an efficient UTF-8 decoding pipeline is introduced which can remove these dependencies, normalize all decoding behaviors, and open up new kinds of processing opportunities.

The decoder was taken from [Björn Höhrmann]. While other methods may be more efficient, such as those in the multi-byte extension, this decoder provides a streamable interface useful for more flexible kinds of processing: for example, choosing whether or not to replace invalid byte sequences, zero-memory-overhead code point counting, and partially decoding strings.

[Björn Höhrmann]: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
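
A minimal re-entrant sketch of the kind of interface such a decoder enables; this is not the Höhrmann DFA nor the code in this PR, and the name and return conventions are illustrative only:

<?php
// Decode the next code point at byte offset $at, advancing $at past it.
// Returns the code point, false on an invalid sequence, or null at end of input.
function example_next_codepoint( $text, &$at ) {
	$length = strlen( $text );
	if ( $at >= $length ) {
		return null;
	}
	$byte = ord( $text[ $at ] );
	if ( $byte < 0x80 ) { // Single-byte (ASCII) code point.
		++$at;
		return $byte;
	}
	if ( 0xC0 === ( $byte & 0xE0 ) ) {        // Two-byte lead.
		$need      = 1;
		$codepoint = $byte & 0x1F;
		$min       = 0x80;
	} elseif ( 0xE0 === ( $byte & 0xF0 ) ) {  // Three-byte lead.
		$need      = 2;
		$codepoint = $byte & 0x0F;
		$min       = 0x800;
	} elseif ( 0xF0 === ( $byte & 0xF8 ) ) {  // Four-byte lead.
		$need      = 3;
		$codepoint = $byte & 0x07;
		$min       = 0x10000;
	} else {
		return false; // Stray continuation byte or invalid lead byte.
	}
	if ( $at + $need >= $length ) {
		return false; // Truncated sequence.
	}
	for ( $i = 1; $i <= $need; $i++ ) {
		$next = ord( $text[ $at + $i ] );
		if ( 0x80 !== ( $next & 0xC0 ) ) {
			return false; // Malformed continuation byte.
		}
		$codepoint = ( $codepoint << 6 ) | ( $next & 0x3F );
	}
	// Reject overlong encodings, surrogate halves, and out-of-range values.
	if ( $codepoint < $min || $codepoint > 0x10FFFF || ( $codepoint >= 0xD800 && $codepoint <= 0xDFFF ) ) {
		return false;
	}
	$at += 1 + $need;
	return $codepoint;
}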

#29 @dmsnell
5 months ago

In 60793:

Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipeline.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

wp_check_invalid_utf8() has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the mbstring extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, wp_scrub_utf8() is created with a proper fallback so that the remaining logic inside of wp_check_invalid_utf8() can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in https://github.com/WordPress/wordpress-develop/pull/9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

#30 @dmsnell
5 months ago

In 60794:

Charset: Fix typo in docblock of _wp_is_valid_utf8_fallback().

Quick-fix for typo in docblock.

Follow-up to: [60793].
Props mukesh27.
See #63863.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


5 months ago

This ticket was mentioned in Slack in #core by welcher. View the logs.


4 months ago

#33 @welcher
4 months ago

@dmsnell @desrosj this was discussed in the 6.9 scrub today. Do we think this will be ready in time for the Beta 1 cut off next week?

#34 @dmsnell
4 months ago

In 60949:

Charset: Rely on new UTF-8 pipeline for mb_strlen() fallback.

The existing polyfill for mb_strlen() contains a number of issues leaving plenty of opportunity for improvement. Specifically, the following are all deficiencies: it relies on Unicode PCRE support, assumes input strings are valid UTF-8, splits input strings into an array of characters to count them (1,000 at a time, iterating until complete), and entirely gives up when Unicode support is missing.

This patch provides an updated polyfill which will reliably count code points in a UTF-8 string, even in the presence of sequences of invalid bytes. It scans through the input with zero allocations. Additionally, the underlying fallback extends the behavior of mb_strlen() to provide character counts for substrings within a larger input without extracting the substring (it can count characters within a byte offset and length of a larger string).

This change improves the reliability of UTF-8 string length calculations and removes behavioral variability based on the runtime system.

Developed in https://github.com/WordPress/wordpress-develop/pull/9828
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.
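
A simplified sketch of the substring-aware counting described in [60949] above: count code points within a byte range without extracting a substring. This naive version assumes valid UTF-8 and is illustrative only; the merged fallback also handles invalid byte sequences:

<?php
// Count code points between byte offsets [$offset, $offset + $byte_length)
// without allocating a copy of the substring. Every byte outside the
// continuation range 0x80-0xBF starts a new code point.
function example_codepoint_count( $text, $offset, $byte_length ) {
	$count = 0;
	$end   = min( strlen( $text ), $offset + $byte_length );
	for ( $i = $offset; $i < $end; $i++ ) {
		if ( 0x80 !== ( ord( $text[ $i ] ) & 0xC0 ) ) {
			++$count;
		}
	}
	return $count;
}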

@dmsnell commented on PR #9828:


4 months ago
#35

Merged in 8508427bca9082a0a77ce96165b01117a20b1821
[60949]

#36 @dmsnell
4 months ago

@welcher this is a tracking ticket so I think by default it will “make it” into 6.9. Whatever associated work that isn’t merged will be punted to a new tracking ticket for 7.0.

There are four things I would still like to try and get in for 6.9:

There is also a new utility I find useful that I would like to add, unless there’s opposition:

#37 @dmsnell
4 months ago

In 60950:

Charset: Conditionally polyfill utf8_encode() and utf8_decode().

The utf8_encode() and utf8_decode() functions were deprecated in PHP 8.2.0 and will be removed in PHP 9.0. When that happens, any existing code which calls them will trigger a crash.

This patch introduces polyfills for those functions when they aren’t already present. The polyfill functions maintain backwards compatibility, including a deprecation notice.

Any code calling either of these functions ought to be refactored to avoid using them; there are better options which don’t carry the issues these functions do, and any code calling them is likely calling them inappropriately.

Developed in https://github.com/WordPress/wordpress-develop/pull/10011
Discussed in https://core.trac.wordpress.org/ticket/55603
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

#38 @dmsnell
4 months ago

In 60969:

Charset: Rely on new UTF-8 pipeline for mb_substr() fallback.

The existing polyfill for mb_substr() contains a number of issues leaving plenty of opportunity for improvement. Specifically, the following are all deficiencies: it relies on Unicode PCRE support, assumes input strings are valid UTF-8, splits input strings into an array of characters (1,000 at a time, iterating until complete), and re-joins them at the end.

This patch provides an updated polyfill which will reliably parse UTF-8 strings even in the presence of invalid bytes. It computes boundaries for the substring extraction with zero allocations and then returns the result of a single substr() call at the end.

This change improves the reliability of UTF-8 string handling and removes behavioral variability based on the runtime system.

Developed in https://github.com/WordPress/wordpress-develop/pull/9829
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

#40 @westonruter
4 months ago

  • Owner set to dmsnell
  • Status changed from new to assigned

@dmsnell commented on PR #9826:


4 months ago
#41

Without a clear and compelling need to abstract this in Core I’m closing for now. It’s easy enough to create in an application-specific way. If such a need emerges we can re-open this for later consideration.

#42 @wildworks
4 months ago

The 6.9 Beta1 release is coming soon.

this is a tracking ticket so I think by default it will “make it” into 6.9. Whatever associated work that isn’t merged will be punted to a new tracking ticket for 7.0.

@dmsnell Can we punt this ticket to 7.0 now?

#43 @dmsnell
4 months ago

@wildworks I’m still working on these; I don’t know what the practice is, but if we could wait to close them until the Beta is out that would be a big help to me (keep me from having to address the same comment or closure multiple times).

Either way, if you need to close them I won’t stand in your way, but I will continue to try and merge work from them before the Beta.

#44 @dmsnell
4 months ago

In 61000:

Charset: wp_has_noncharacters() for more-specific Unicode handling.

Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. However, they are valid code points and will not cause a string to return as invalid.

Still, HTML and XML both impose semantic rules on their use and it may be important for code to know whether they are present in a string. This patch introduces a new function, wp_has_noncharacters(), which answers this question.

See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612

Developed in https://github.com/WordPress/wordpress-develop/pull/9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.
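
A compact sketch of the noncharacter test described in [61000] above, expressed per code point; the helper name is hypothetical and the merged wp_has_noncharacters() operates on whole strings:

<?php
// Noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane
// (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF). Assumes $codepoint
// is already a valid Unicode scalar value.
function example_is_noncharacter( $codepoint ) {
	return ( $codepoint >= 0xFDD0 && $codepoint <= 0xFDEF )
		|| ( 0xFFFE === ( $codepoint & 0xFFFE ) );
}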

#46 @dmsnell
4 months ago

In 61003:

HTML API: Replace PCRE in set_attribute() with new UTF-8 utility.

The HTML API has relied upon a single PCRE to determine whether to allow setting certain attribute names. This was because those names aren’t allowed to contain Unicode noncharacters, but detecting noncharacters without a UTF-8 parser is nontrivial.

In this change the direct PCRE has been replaced with a number of strcspn() calls and a call to the newer wp_has_noncharacters() function. Under the hood, this function will still defer to a PCRE if Unicode support is available, but otherwise will fall back to the UTF-8 pipeline in Core.

This change removes the platform variability, making the HTML API more reliable when Unicode support for PCRE is lacking.

Developed in https://github.com/WordPress/wordpress-develop/pull/9798
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

#48 @dmsnell
4 months ago

In 61004:

Charset: Update docblock comment in _wp_scan_utf8().

Rewording of comment to better clarify the purpose of the arguments as written.

Discussed in https://core.trac.wordpress.org/ticket/63863

Follow-up to [60768].

See #63863.

#49 @dmsnell
4 months ago

In 61005:

Charset: Update docblock @param type in _wp_scan_utf8().

Add missing |null to optional argument &$has_noncharacters.

Discussed in https://core.trac.wordpress.org/ticket/63863

Follow-up to [61000].

See #63863.

#50 @dmsnell
4 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

#51 @zieladam
4 months ago

@dmsnell I've been migrating https://github.com/WordPress/php-toolkit/ to those new UTF-8 utilities and I've experienced a huge slowdown to the point where GitHub CI kills unit test runners after an hour. They normally only take a few minutes. At first I thought I had an infinite loop somewhere, but it seems to be actually related to the UTF-8 decoding performance. It seems like _wp_scan_utf8 is a lot slower than utf8_codepoint_at. Here's a small reproduction script I've created:

<?php

if(!isset($argv[1])) {
        echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
        exit(1);
}
if($argv[1] === 'utf8_codepoint_at') {
        $next_codepoint_function = 'next_codepoint_utf8_codepoint_at';
} else if($argv[1] === '_wp_scan_utf8') {
        $next_codepoint_function = 'next_codepoint_wp_scan_utf8';
} else {
        echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
        exit(1);
}

// A 10MB XML file from
// https://raw.githubusercontent.com/WordPress/php-toolkit/ec9187b4a24e1e98c47a185fe9d8114bb09287a3/components/DataLiberation/Tests/wxr/10MB.xml
$string = file_get_contents( './10MB.xml' );

echo 'Parsing a 10MB XML file with ' . $next_codepoint_function . '...' . "\n";
$start_time = microtime(true);
echo 'Starting at: ' . $start_time . "\n";

$max_iterations = 1_000_000;
$iterations = 0;
$at = 0;
while($at < strlen($string) && $iterations < $max_iterations) {
        ++$iterations;

        $matched_bytes = 0;
        $next_codepoint = $next_codepoint_function($string, $at, $matched_bytes);
        if (false === $next_codepoint) {
                break;
        }

        $at += $matched_bytes;
        if($iterations % 100000 === 0) {
                $time_taken = microtime(true) - $start_time;
                echo 'Parsed ' . $iterations . ' codepoints in ' . $time_taken . ' seconds' . "\n";
        }
}

/**
 * This seems to be substantially slower than next_codepoint_utf8_codepoint_at().
 */
function next_codepoint_wp_scan_utf8($string, $at, &$matched_bytes = 0) {
        $new_at = $at;
        $invalid_length = 0;
        if ( 1 !== _wp_scan_utf8( $string, $new_at, $invalid_length, null, 1 ) ) {
                // Byte sequence is not a valid UTF-8 codepoint
                return false;
        }

        $codepoint_byte_length = $new_at - $at;
        $matched_bytes = $codepoint_byte_length;
        return utf8_ord( substr( $string, $at, $codepoint_byte_length ) );
}

function next_codepoint_utf8_codepoint_at($string, $offset, &$matched_bytes = 0) {
        $codepoint = utf8_codepoint_at(
                $string,
                $offset,
                $matched_bytes
        );
        if (
                // Byte sequence is not a valid UTF-8 codepoint.
                ( 0xFFFD === $codepoint && 0 === $matched_bytes ) ||
                // No codepoint at the given offset.
                null === $codepoint
        ) {
                return false;
        }
        return $codepoint;
}

The output is:

> php benchmark_utf8_next_codepoint.php utf8_codepoint_at
Parsing a 10MB XML file with next_codepoint_utf8_codepoint_at...
Starting at: 1762020120.5581
Parsed 100000 codepoints in 0.030929088592529 seconds
Parsed 200000 codepoints in 0.061965942382812 seconds
Parsed 300000 codepoints in 0.092796087265015 seconds
Parsed 400000 codepoints in 0.12393403053284 seconds
Parsed 500000 codepoints in 0.15528607368469 seconds
Parsed 600000 codepoints in 0.18659496307373 seconds
Parsed 700000 codepoints in 0.21793103218079 seconds
Parsed 800000 codepoints in 0.24924111366272 seconds
Parsed 900000 codepoints in 0.28038597106934 seconds
Parsed 1000000 codepoints in 0.3110020160675 seconds

> php benchmark_utf8_next_codepoint.php _wp_scan_utf8
Parsing a 10MB XML file with next_codepoint_wp_scan_utf8...
Starting at: 1762020128.9959
Parsed 100000 codepoints in 0.50441312789917 seconds
Parsed 200000 codepoints in 2.2213699817657 seconds
Parsed 300000 codepoints in 3.5741710662842 seconds
Parsed 400000 codepoints in 3.844514131546 seconds
Parsed 500000 codepoints in 3.9653370380402 seconds
Parsed 600000 codepoints in 4.1616570949554 seconds
Parsed 700000 codepoints in 5.0790500640869 seconds
Parsed 800000 codepoints in 6.4049270153046 seconds
Parsed 900000 codepoints in 7.6122870445251 seconds
Parsed 1000000 codepoints in 7.8051130771637 seconds

So there's a 20x difference in performance. That sounds like a big deal given the upcoming 6.9 release – any ideas where the slowness might come from?

Last edited 4 months ago by zieladam (previous) (diff)

#52 @westonruter
4 months ago

  • Focuses performance added
  • Resolution fixed deleted
  • Status changed from closed to reopened

#53 @dmsnell
4 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

After working with @zieladam on this issue, I don’t believe there’s any cause for concern here, given that he was using this pipeline in a degenerate case it’s not designed for: scanning through a string one code point at a time.

This bypassed a substantial optimization, the quick scan past runs of ASCII bytes, so the slowdown was due mostly to PHP’s function-calling overhead. By reworking the code in the PHP toolkit so that it retains the ASCII fast path, the performance issues seem to have disappeared.

The earlier utf8_codepoint_at() function was a previous incarnation of the UTF-8 decoder which turned out to be slower than what was merged in _wp_scan_utf8() for almost every single case other than scanning one code point at a time.

The performance characteristics of both are complicated and significantly worse for multi-byte characters than for ASCII, but the merged code is significantly faster when dealing with US-ASCII spans of text.
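
For illustration, a hypothetical sketch of retaining the ASCII fast path when walking a string: consume whole runs of ASCII with a single strspn() call and only fall back to per-sequence handling for non-ASCII bytes (none of these names come from Core or the PHP toolkit):

<?php
$text        = "mostly ASCII with some caf\u{00E9}";
$ascii_bytes = implode( '', array_map( 'chr', range( 0x01, 0x7F ) ) );
$at          = 0;
$length      = strlen( $text );
while ( $at < $length ) {
	// Fast path: skip a whole run of ASCII bytes with one call.
	$run = strspn( $text, $ascii_bytes, $at );
	if ( $run > 0 ) {
		// $run ASCII code points start at byte offset $at; handle them in bulk.
		$at += $run;
		continue;
	}
	// Slow path: determine the sequence length from the lead byte and hand the
	// sequence to a real decoder (decoding and validation omitted in this sketch).
	$byte = ord( $text[ $at ] );
	if ( 0xC0 === ( $byte & 0xE0 ) ) {
		$sequence_length = 2;
	} elseif ( 0xE0 === ( $byte & 0xF0 ) ) {
		$sequence_length = 3;
	} elseif ( 0xF0 === ( $byte & 0xF8 ) ) {
		$sequence_length = 4;
	} else {
		$sequence_length = 1; // NUL, stray continuation, or invalid lead byte.
	}
	$at += $sequence_length;
}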

@zieladam can correct me if I have any details wrong here, and @westonruter feel free to re-open again, but I closed this because I think it was a bit of a false alarm based on (a) the testing I performed against realistic inputs and usage for Core and (b) the extreme edge case usage in the first version of its introduction to the PHP toolkit (Update after our discussion in wordpress/php-toolkit#201)
