Opened 6 months ago
Closed 4 months ago
#63863 closed enhancement (fixed)
Standardize UTF-8 handling and fallbacks in 6.9
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 6.9 | Priority: | normal |
| Severity: | normal | Version: | 6.9 |
| Component: | Charset | Keywords: | has-patch has-unit-tests |
| Focuses: | performance | Cc: | |
Description (last modified by )
Core uses a wide variety of methods for working with UTF-8, especially when the running server doesn’t have the mbstring extension loaded. Much of the diversity in implementation and specification in this code is a legacy of the time and context in which it was added. Today it’s viable to unify and standardize the code handling UTF-8, which is what this ticket proposes.
WordPress should unify its handling of UTF-8 data and depend foremost on the mbstring-provided functions, which have received significant updates during the 8.x releases of PHP. When they’re unavailable, WordPress should defer to a single, reasonable implementation of the functionality in pure PHP.
Why mbstring?
The mbstring PHP extension has been highly recommended for a long time now and it provides high-performance and spec-compliant tools for working with UTF-8 and other text encodings. The extension has seen active development and integration with PHP throughout the PHP 8.x cycle, including improved Unicode support and a couple of notable enhancements for this work:
- Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode substitution of maximal subparts. This is an important property for UTF-8 decoders which ensures safe decoding of untrusted inputs and standardized interoperability with UTF-8 handling in other systems.
- Since PHP 8.3.0 the UTF-8 validity of a string is cached on the ZSTR value itself, so repeated checks are property accesses for strings already validated as UTF-8.
- These functions use high-performance implementations in C, including SIMD and vectorized algorithms whose speed cannot be matched in PHP and which aren’t deployed in every other potential method for working with UTF-8 (for example, in the `iconv` and PCRE libraries).
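To make the maximal-subparts behavior concrete, here is a small sketch (an illustration, assuming PHP 8.1.6+ with mbstring loaded; expected output shown in comments):

```php
<?php
// "\xF0\x9F\x92" is a truncated four-byte sequence (the first three
// bytes of U+1F4A9). Under Unicode's "substitution of maximal subparts"
// the whole truncated prefix becomes a single U+FFFD, rather than one
// replacement per byte or silent removal.
echo mb_scrub( "\xF0\x9F\x92" . 'a', 'UTF-8' ), "\n"; // "�a"

// A stray continuation byte is replaced one-for-one: each lone byte is
// its own maximal subpart.
echo mb_scrub( "\x80" . "\x80", 'UTF-8' ), "\n"; // "��"
```

Before PHP 8.1.6, `mb_scrub()` followed a different replacement policy, which is why the version floor matters for interoperability.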
The case for mbstring is strong. On the other side, I believe there are reasons to avoid the alternative methods:
- `iconv()`’s behavior depends on the version of iconv compiled into the running PHP build, and it’s hard to establish uniform behavior from PHP code.
- `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It only supports an `//IGNORE` mode which removes byte sequences, and this is not a sound security practice. It’s far more dangerous than substituting the sequences as `mb_scrub()` does.
- PCRE-based approaches are also highly system-dependent.
  - Older versions of the PCRE libraries accepted certain invalid UTF-8 sequences as valid, e.g. five-byte sequences.
- PCRE-based approaches are memory-heavy, most operating through one of two approaches: split the string into an array of chunks and then process the chunks; or use `preg_replace_callback()` and pass around sub-allocations of the input string.
- PCRE functions give no method for zero-allocation operation. This makes PCRE-based approaches vulnerable to denial-of-service attacks and accidents, because PCRE must match and extract a full group before operating on it. (For example, HTML numeric character references may contain an arbitrary number of leading zeros and can choke `preg_match()`, doubling memory use or more just to make the match.)
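A hedged illustration of the contrast between the two common validity checks (a sketch; the `@` suppression is only for the demo, since PCRE warns on malformed subjects):

```php
<?php
// Comparing two validity checks Core has historically used. The PCRE
// route conflates "invalid subject" with ordinary failure and varies by
// the system's PCRE build; the mbstring route answers directly and,
// since PHP 8.3, is cached on the string value itself.
$valid   = "caf\u{E9}"; // "café", valid UTF-8
$invalid = "caf\xE9";   // raw Latin-1 é byte, invalid UTF-8

// PCRE: an empty `u`-flag pattern matches valid input but fails
// outright on an invalid subject.
var_dump( 1 === @preg_match( '//u', $valid ) );        // bool(true)
var_dump( false === @preg_match( '//u', $invalid ) );  // bool(true)
var_dump( PREG_BAD_UTF8_ERROR === preg_last_error() ); // bool(true)

// mbstring: an explicit, spec-compliant answer.
var_dump( mb_check_encoding( $valid, 'UTF-8' ) );   // bool(true)
var_dump( mb_check_encoding( $invalid, 'UTF-8' ) ); // bool(false)
```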
Finally, the plethora of approaches in Core can be dizzying when trying to understand behaviors and bugs. Without a well-defined specification of the behaviors, things only seem to work because the code does not explain the bounds of operation.
Why a single fallback in pure PHP?
While a pure user-space PHP implementation of UTF-8 handling will never approach the performance and efficiency of the underlying native libraries, it can still be fast enough for practical use. Reportedly, around 0.5% of WordPress installations lack the mbstring extension. For those sites, WordPress can trade some performance for harmony of behaviors.
When functions try a chain of fallback options (e.g. mbstring, iconv, PCRE with Unicode support, PCRE with hard-coded byte patterns, ASCII), it fragments WordPress’ reliability and makes debugging and resolution very difficult.
A pure PHP fallback gives WordPress the ability to identify and fix bugs in its implementation, and the implementation details are visible for all to inspect. The barrier to diagnosing why a function returns the wrong result is much higher when one needs to scan PHP’s source code, iconv’s source code, one of several PCRE libraries’ source code, and more.
While the up-front effort is high, the HTML API has demonstrated how valuable it can be to have a reliable API in WordPress for interoperating with various web standards. It’s time to modernize WordPress to support UTF-8 universally and remove the existing complexity of ad-hoc handling and runtime dependencies.
Proposal
- Create a new `wp-includes/compat-utf8.php` polyfilling basic UTF-8 handling and implementing the current `mb_` polyfills from `compat.php`. Moving this to a UTF-8-specific module keeps the code in that module focused and makes it easier to exclude the WPCS rule for rejecting `goto` statements. `goto` is a valuable construct when handling decoding errors in a low-level decoder. This module loads before `wp-includes/compat.php`, which makes it simpler in that file to polyfill things like `mb_substr()`.
- Create `wp-includes/utf8.php` containing WordPress-specific functions for handling UTF-8. This abstracts access to text behind a unifying interface and allows WordPress to improve support, performance, and reliability while lifting up all calling code. UTF-8 is universal enough to warrant its own subsystem.
- String functions are conditionally defined based on the presence of the `mbstring` extension and any other relevant factors. This moves support checks to PHP initialization instead of repeating them on every invocation of these functions. A side effect of splitting these functions based on the presence of the extension is the safe removal of `mbstring_binary_safe_encoding()`. When `mbstring` is loaded, functions will call the `mb_` functions directly; when it’s not available, there can be no `mbstring.func_overload`.
- A new UTF-8 decoding pipeline provides zero-allocation, streaming, and re-entrant access to a string so that common operations don’t need to involve any more overhead than they require. In addition to being a versatile fallback mechanism, this low-level scanner can provide access to new abilities not available today, such as: counting code points within a substring without allocating, splitting a string into chunks of valid and invalid byte sequences, and combining identification, validation, and transformation of a string into a single pass.
- Replace existing non-canonical UTF-8 code in Core with the new abstractions. No more `static $utf8_pcre` checks, no more `if ( function_exists( 'mb_substr' ) )`, just unconditional, explicit semantics.
- Remove the single regex from the HTML API.
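The conditionally-defined function pattern might look like the following sketch. The function name and the fallback implementation here are illustrative assumptions, not the final Core API; the point is that the support check runs once at load time, so every later call goes straight to one implementation:

```php
<?php
// Illustrative sketch only: `wp_example_is_valid_utf8` is a hypothetical name.
if ( extension_loaded( 'mbstring' ) ) {
	function wp_example_is_valid_utf8( string $text ): bool {
		return mb_check_encoding( $text, 'UTF-8' );
	}
} else {
	function wp_example_is_valid_utf8( string $text ): bool {
		// Pure-PHP fallback: a strict byte-wise scan that rejects stray
		// continuation bytes, overlong forms, surrogates, code points
		// above U+10FFFF, and truncated sequences.
		$length = strlen( $text );
		for ( $i = 0; $i < $length; ) {
			$byte = ord( $text[ $i ] );
			if ( $byte < 0x80 ) {
				++$i;
				continue;
			}
			if ( $byte < 0xC2 || $byte > 0xF4 ) {
				return false; // Stray continuation byte or invalid lead byte.
			}
			if ( $byte < 0xE0 ) {
				$need = 1;
				$lo   = 0x80;
				$hi   = 0xBF;
			} elseif ( $byte < 0xF0 ) {
				$need = 2;
				$lo   = 0xE0 === $byte ? 0xA0 : 0x80; // Reject overlongs.
				$hi   = 0xED === $byte ? 0x9F : 0xBF; // Reject surrogates.
			} else {
				$need = 3;
				$lo   = 0xF0 === $byte ? 0x90 : 0x80; // Reject overlongs.
				$hi   = 0xF4 === $byte ? 0x8F : 0xBF; // Reject > U+10FFFF.
			}
			if ( $i + $need >= $length ) {
				return false; // Truncated sequence at end of input.
			}
			$next = ord( $text[ $i + 1 ] );
			if ( $next < $lo || $next > $hi ) {
				return false;
			}
			for ( $k = 2; $k <= $need; $k++ ) {
				$next = ord( $text[ $i + $k ] );
				if ( $next < 0x80 || $next > 0xBF ) {
					return false;
				}
			}
			$i += $need + 1;
		}
		return true;
	}
}
```

Either branch answers the same question; only the definition chosen at load time differs, which is what removes the per-call `function_exists()` checks.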
As WordPress builds its own abstraction and polyfills for the mbstring library, it can remove the fallback behaviors as it raises its minimum supported PHP version and if it starts requiring mbstring.
Related tickets
- #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
- #62172: Deprecate non-UTF-8 support.
- #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime dependencies.
- #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`; discussion of requiring `mbstring`.
Related PRs
Change History (53)
This ticket was mentioned in PR #9678 on WordPress/wordpress-develop by @dmsnell.
6 months ago
#6
- Keywords has-patch added
Trac ticket: Core-63863.
The sanitize_file_name() function attempts its own manual detection of Unicode PCRE support, but WordPress already provides a more robust method of doing so. To avoid re-computation, it also stores the result in a static var inside the function. That check is already cached in a static var inside _wp_can_use_pcre_u(), however, so the extra cache is largely ineffective.
This patch leans on the recently-updated _wp_can_use_pcre_u() check and removes the superfluous static var.
Follow-up to #9576
Follow-up to [60694]
Work extracted from #9498
#7
@
6 months ago
@tusharbharti thanks for linking that. I’ll try and think it through and see how to combine it. there’s definitely some limit to poly-filling the mbstring extension that I’m trying to balance, and much or most of that is focused on UTF-8.
otherwise there’s a funny thing about #9678. it’s worth doing, and I considered creating a ticket like “ensure that code calling PCRE functions with the u flag rely on _wp_can_use_pcre_u()” but as I looked around, almost everything in Core just assumes the flag is present and breaks when it isn’t.
This is part of what I mean when I talk at times about WordPress already being broken in non-UTF-8 environments and I think we could have a lot of un-reported or under-reported bugs that derive from places where the regex fails. it would be incredibly difficult to pin the reported behavior back onto the regex failure.
A good example of this might be shortcode_parse_atts(), where a call to preg_replace() returns NULL instead of a string when support is missing. I don’t know what to do with these functions when support is lacking other than tackle them one at a time and in unique ways for each one.
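The failure mode can be sketched like this (a hedged illustration; the pattern resembles the whitespace-collapsing regex used for shortcode attributes, and the `@` suppression is only for the demo):

```php
<?php
// With an invalid UTF-8 subject (or a PCRE build lacking Unicode
// support), a `u`-flag preg_replace() yields NULL rather than a string,
// and downstream code that assumed a string silently misbehaves.
$attr_text = "class=\"caf\xE9\""; // invalid UTF-8 byte in the input
$collapsed = @preg_replace( '/[\x{00a0}\x{200b}]+/u', ' ', $attr_text );

var_dump( $collapsed ); // NULL

// A defensive caller must handle the NULL case explicitly, e.g.:
$safe = $collapsed ?? $attr_text;
```

With valid input the same call returns a string, which is why the breakage surfaces only on the affected systems and inputs.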
#10
@
6 months ago
otherwise there’s a funny thing about #9678. it’s worth doing, and I considered creating a ticket like “ensure that code calling PCRE functions with the u flag rely on _wp_can_use_pcre_u()” but as I looked around, almost everything in Core just assumes the flag is present and breaks when it isn’t.
I think creating a ticket for this would make sense, as this ticket is for standardizing the UTF-8 part. If there are a lot of places which need the `_wp_can_use_pcre_u()` fix, then it can be done per section (like formatting) or even per file to make reviewing easier (depending on the files changed, of course). I can help with this.
there’s definitely some limit to poly-filling the mbstring extension that I’m trying to balance, and much or most of that is focused on UTF-8.
For polyfills, https://github.com/symfony/polyfill-mbstring seems like a good reference, though I am also exploring https://github.com/php/php-src/tree/master/ext/mbstring
#11
@
6 months ago
@tusharbharti created the tracking ticket in #63913. We can move discussion about those unchecked uses of the PCRE flag there.
This ticket was mentioned in Slack in #hosting by zunaid321. View the logs.
5 months ago
This ticket was mentioned in Slack in #core by audrasjb. View the logs.
5 months ago
This ticket was mentioned in Slack in #hosting by amykamala. View the logs.
5 months ago
This ticket was mentioned in PR #9825 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#15
Trac Ticket: Core-63863
Follow-up to: #9317, [60630]
This patch introduces a new early-loaded compat-utf8.php file which provides the UTF-8 validation fallback. This is part of a broader effort to unify and standardize UTF-8 handling.
This is an intermediate change in order to better facilitate source-code tracking. This code was introduced in formatting.php and was originally intended to be duplicated inside of wp_check_invalid_utf8(), but the difference between the two functions is so minor that the part of the code which scans through the bytes in a string should be abstracted and reused. This reusable function would ideally load early enough to polyfill methods like mb_substr(), so I am moving it early.
While this is not the abstracted code, this change will be helpful by providing continuity between the function as it stands today and as it will transform when reused by wp_check_invalid_utf8(). In other words, this change exists to make sure that source control shows that this function moved first, and then was changed later. To move it and change it in one go is likely to sever its history.
This will turn into _wp_scan_utf8(), the updated iteration of #6883, which provides a fast, spec-compliant, and streamable UTF-8 parser.
This ticket was mentioned in PR #9826 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#16
- Keywords has-unit-tests added
Trac ticket: Core-63863
See: #9825, #9498
Introduces a new method, wp_utf8_chunks(), for iterating through a string which might contain invalid spans of UTF-8 bytes.
This function is largely a low-level function for working with text data. With it, it’s possible to operate on the valid UTF-8 portions of a string while preserving the invalid parts. For instance, this might be used to operate on improperly-constructed JSON data.
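As an illustration of the chunking concept only (this is not the actual wp_utf8_chunks() implementation or signature), here is a hypothetical pure-PHP sketch. It finds maximal spans of well-formed UTF-8 with a byte-oriented PCRE pattern (no `u` flag needed) and reports everything between them as invalid:

```php
<?php
// Hypothetical names throughout; for illustration of the idea only.
const EXAMPLE_UTF8_SPAN =
	'/(?:[\x00-\x7F]'                  // ASCII
	. '|[\xC2-\xDF][\x80-\xBF]'        // two-byte sequences
	. '|\xE0[\xA0-\xBF][\x80-\xBF]'    // three-byte, no overlongs
	. '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
	. '|\xED[\x80-\x9F][\x80-\xBF]'    // three-byte, no surrogates
	. '|\xF0[\x90-\xBF][\x80-\xBF]{2}' // four-byte, no overlongs
	. '|[\xF1-\xF3][\x80-\xBF]{3}'
	. '|\xF4[\x80-\x8F][\x80-\xBF]{2}' // four-byte, at most U+10FFFF
	. ')+/';

/** Returns a list of [ is_valid_utf8, bytes ] pairs covering $text. */
function example_utf8_chunks( string $text ): array {
	$chunks = array();
	$at     = 0;
	while ( preg_match( EXAMPLE_UTF8_SPAN, $text, $m, PREG_OFFSET_CAPTURE, $at ) ) {
		list( $span, $offset ) = $m[0];
		if ( $offset > $at ) {
			$chunks[] = array( false, substr( $text, $at, $offset - $at ) );
		}
		$chunks[] = array( true, $span );
		$at       = $offset + strlen( $span );
	}
	if ( $at < strlen( $text ) ) {
		$chunks[] = array( false, substr( $text, $at ) );
	}
	return $chunks;
}

// example_utf8_chunks( "ab\xE9cd" ) yields:
// [ [ true, 'ab' ], [ false, "\xE9" ], [ true, 'cd' ] ]
```

The proposed Core function would presumably do this with the streaming scanner rather than PCRE; this sketch only shows the shape of the result.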
This ticket was mentioned in PR #9827 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#17
Trac ticket: Core-63863
See: #9498, #9825, #9826, #9798
Augments the UTF-8 scanning pipeline to track whether noncharacters have been seen. This operation adds a marginal cost to scanning UTF-8 but provides valuable information for HTML- and XML-related calling-code. Tracking it inline allows skipping a second pass through strings to determine whether or not these characters exist.
This ticket was mentioned in PR #9798 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#18
This ticket was mentioned in PR #9828 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#19
This ticket was mentioned in PR #9829 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#20
This ticket was mentioned in PR #9830 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#21
Trac ticket: Core-63863
See: #9498, #9825, (), #9826, #9827, #9798, #9828, #9829
This patch is a second middle-man commit meant to guard the continuity of changes through source-control for the broader efforts to unify and standardize UTF-8 handling.
This change abstracts the user-space UTF-8 scanner to a more reusable and generic form and then uses it in the _wp_is_valid_utf8_fallback() function.
This ticket was mentioned in Slack in #core by dmsnell. View the logs.
5 months ago
This ticket was mentioned in PR #6883 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#28
Trac ticket: Core-63863
See also #9317
## Status
This is an exploratory patch so far.
## Tasks
- [ ] Test behavior of new functions.
- [ ] Benchmark performance.
## Motivation
- `_mb_strlen()` attempts to split a string on UTF-8 boundaries, falling back to assumed character patterns if it can’t run Unicode-supported PCRE patterns.
- `wp_check_invalid_utf8()` performs similar PCRE-based counting.
- If sending HTML in any encoding other than UTF-8, it’s important not to perform basic conversion with a function like `mb_convert_encoding()` or `iconv()`, because these can lead to data loss for characters not representable in the target encoding. Although `mb_encode_numericentity()` exists with the multi-byte extension, having a streaming UTF-8 decoder would allow WordPress to handle proper conversion to numeric character references natively and universally, e.g. converting `…` to its numeric character reference.
- URL detection and XML name parsing require detecting sequences of bytes within specific Unicode ranges, like a Unicode-aware `strspn()`, and should stop as soon as any given character is outside of that range.
## Description
WordPress relies on various extensions, regular expressions, and basic string operations when working with text potentially encoded as UTF-8.
In this patch an efficient UTF-8 decoding pipeline is introduced which can remove these dependencies, normalize all decoding behaviors, and open up new kinds of processing opportunities.
The decoder was taken from Björn Höhrmann. While it may be possible that other methods are more efficient, such as in the multi-byte extension, this decoder provides a streamable interface useful for more flexible kinds of processing: for example, choosing whether or not to replace invalid byte sequences, zero-memory-overhead code point counting, and partially decoding strings.
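For readers unfamiliar with the streaming shape, here is a simplified, branchy sketch of a re-entrant decode step. This is illustrative only, with made-up names, and is not Höhrmann’s table-driven DFA (which is faster); it shows why state carried between calls lets input arrive in arbitrary chunks:

```php
<?php
const EX_UTF8_MORE    = -2; // More continuation bytes expected.
const EX_UTF8_INVALID = -1; // Invalid byte; the caller decides how to resync.

// Feed one byte; returns a completed code point (>= 0), EX_UTF8_MORE
// mid-sequence, or EX_UTF8_INVALID. $st persists across calls.
function ex_utf8_step( int $byte, array &$st ): int {
	if ( 0 === $st['need'] ) {
		if ( $byte < 0x80 ) {
			return $byte; // ASCII completes immediately.
		}
		if ( $byte >= 0xC2 && $byte <= 0xDF ) {
			$st = array( 'need' => 1, 'cp' => $byte & 0x1F, 'min' => 0x80 );
		} elseif ( $byte >= 0xE0 && $byte <= 0xEF ) {
			$st = array( 'need' => 2, 'cp' => $byte & 0x0F, 'min' => 0x800 );
		} elseif ( $byte >= 0xF0 && $byte <= 0xF4 ) {
			$st = array( 'need' => 3, 'cp' => $byte & 0x07, 'min' => 0x10000 );
		} else {
			return EX_UTF8_INVALID; // Stray continuation or invalid lead.
		}
		return EX_UTF8_MORE;
	}
	if ( $byte < 0x80 || $byte > 0xBF ) {
		$st['need'] = 0;
		return EX_UTF8_INVALID; // Expected a continuation byte.
	}
	$st['cp'] = ( $st['cp'] << 6 ) | ( $byte & 0x3F );
	if ( --$st['need'] > 0 ) {
		return EX_UTF8_MORE;
	}
	$cp = $st['cp'];
	if ( $cp < $st['min'] || ( $cp >= 0xD800 && $cp <= 0xDFFF ) || $cp > 0x10FFFF ) {
		return EX_UTF8_INVALID; // Overlong form, surrogate, or out of range.
	}
	return $cp;
}

// Decode "a…b" byte-by-byte; the state survives between calls, so the
// same loop works when the input is streamed in pieces.
$st  = array( 'need' => 0, 'cp' => 0, 'min' => 0 );
$out = array();
foreach ( unpack( 'C*', "a\u{2026}b" ) as $byte ) {
	$result = ex_utf8_step( $byte, $st );
	if ( $result >= 0 ) {
		$out[] = $result;
	}
}
// $out holds [ 0x61, 0x2026, 0x62 ].
```

Unlike this sketch, the real DFA also rejects malformed sequences at the earliest offending byte, which matters for maximal-subpart replacement.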
This ticket was mentioned in Slack in #core by audrasjb. View the logs.
5 months ago
This ticket was mentioned in Slack in #core by welcher. View the logs.
4 months ago
#33
@
4 months ago
@dmsnell @desrosj this was discussed in the 6.9 scrub today. Do we think this will be ready in time for the Beta 1 cut off next week?
#36
@
4 months ago
@welcher this is a tracking ticket so I think by default it will “make it” into 6.9. Whatever associated work that isn’t merged will be punted to a new tracking ticket for 7.0.
There are four things I would still like to try and get in for 6.9:
- New fallback for `mb_substr()` in wordpress/wordpress-develop#9829
- Replace single use of PCRE in HTML API in wordpress/wordpress-develop#9798 and in wordpress/wordpress-develop#9827
- Deprecated polyfills for `utf8_encode()` and `utf8_decode()` in wordpress/wordpress-develop#10011
There is also a new utility I find useful that I would like to add, unless there’s opposition:
- Operate on the valid parts of a UTF-8 string while preserving the invalid bytes in wordpress/wordpress-develop#9826
4 months ago
#41
Without a clear and compelling need to abstract this in Core I’m closing for now. It’s easy enough to create in an application-specific way. If such a need emerges we can re-open this for later consideration.
#42
@
4 months ago
The 6.9 Beta1 release is coming soon.
this is a tracking ticket so I think by default it will “make it” into 6.9. Whatever associated work that isn’t merged will be punted to a new tracking ticket for 7.0.
@dmsnell Can we punt this ticket to 7.0 now?
#43
@
4 months ago
@wildworks I’m still working on these; I don’t know what the practice is, but if we could wait to close them until the Beta is out that would be a big help to me (keep me from having to address the same comment or closure multiple times).
Either way, if you need to close them I won’t stand in your way, but I will continue to try and merge work from them before the Beta.
#51
@
4 months ago
@dmsnell I've been migrating https://github.com/WordPress/php-toolkit/ to those new UTF-8 utilities and I've experienced a huge slowdown to the point where GitHub CI kills unit test runners after an hour. They normally only take a few minutes. At first I thought I had an infinite loop somewhere, but it seems to be actually related to the UTF-8 decoding performance. It seems like _wp_scan_utf8 is a lot slower than utf8_codepoint_at. Here's a small reproduction script I've created:
<?php
if ( ! isset( $argv[1] ) ) {
    echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
    exit( 1 );
}
if ( $argv[1] === 'utf8_codepoint_at' ) {
    $next_codepoint_function = 'next_codepoint_utf8_codepoint_at';
} else if ( $argv[1] === '_wp_scan_utf8' ) {
    $next_codepoint_function = 'next_codepoint_wp_scan_utf8';
} else {
    echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
    exit( 1 );
}

// A 10MB XML file from
// https://raw.githubusercontent.com/WordPress/php-toolkit/ec9187b4a24e1e98c47a185fe9d8114bb09287a3/components/DataLiberation/Tests/wxr/10MB.xml
$string = file_get_contents( './10MB.xml' );

echo 'Parsing a 10MB XML file with ' . $next_codepoint_function . '...' . "\n";
$start_time = microtime( true );
echo 'Starting at: ' . $start_time . "\n";

$max_iterations = 1_000_000;
$iterations     = 0;
$at             = 0;
while ( $at < strlen( $string ) && $iterations < $max_iterations ) {
    ++$iterations;
    $matched_bytes  = 0;
    $next_codepoint = $next_codepoint_function( $string, $at, $matched_bytes );
    if ( false === $next_codepoint ) {
        break;
    }
    $at += $matched_bytes;
    if ( $iterations % 100000 === 0 ) {
        $time_taken = microtime( true ) - $start_time;
        echo 'Parsed ' . $iterations . ' codepoints in ' . $time_taken . ' seconds' . "\n";
    }
}

/**
 * This seems to be substantially slower than next_codepoint_utf8_codepoint_at().
 */
function next_codepoint_wp_scan_utf8( $string, $at, &$matched_bytes = 0 ) {
    $new_at         = $at;
    $invalid_length = 0;
    if ( 1 !== _wp_scan_utf8( $string, $new_at, $invalid_length, null, 1 ) ) {
        // Byte sequence is not a valid UTF-8 codepoint.
        return false;
    }
    $codepoint_byte_length = $new_at - $at;
    $matched_bytes         = $codepoint_byte_length;
    return utf8_ord( substr( $string, $at, $codepoint_byte_length ) );
}

function next_codepoint_utf8_codepoint_at( $string, $offset, &$matched_bytes = 0 ) {
    $codepoint = utf8_codepoint_at( $string, $offset, $matched_bytes );
    if (
        // Byte sequence is not a valid UTF-8 codepoint.
        ( 0xFFFD === $codepoint && 0 === $matched_bytes ) ||
        // No codepoint at the given offset.
        null === $codepoint
    ) {
        return false;
    }
    return $codepoint;
}
The output is:
> php benchmark_utf8_next_codepoint.php utf8_codepoint_at
Parsing a 10MB XML file with next_codepoint_utf8_codepoint_at...
Starting at: 1762020120.5581
Parsed 100000 codepoints in 0.030929088592529 seconds
Parsed 200000 codepoints in 0.061965942382812 seconds
Parsed 300000 codepoints in 0.092796087265015 seconds
Parsed 400000 codepoints in 0.12393403053284 seconds
Parsed 500000 codepoints in 0.15528607368469 seconds
Parsed 600000 codepoints in 0.18659496307373 seconds
Parsed 700000 codepoints in 0.21793103218079 seconds
Parsed 800000 codepoints in 0.24924111366272 seconds
Parsed 900000 codepoints in 0.28038597106934 seconds
Parsed 1000000 codepoints in 0.3110020160675 seconds

> php benchmark_utf8_next_codepoint.php _wp_scan_utf8
Parsing a 10MB XML file with next_codepoint_wp_scan_utf8...
Starting at: 1762020128.9959
Parsed 100000 codepoints in 0.50441312789917 seconds
Parsed 200000 codepoints in 2.2213699817657 seconds
Parsed 300000 codepoints in 3.5741710662842 seconds
Parsed 400000 codepoints in 3.844514131546 seconds
Parsed 500000 codepoints in 3.9653370380402 seconds
Parsed 600000 codepoints in 4.1616570949554 seconds
Parsed 700000 codepoints in 5.0790500640869 seconds
Parsed 800000 codepoints in 6.4049270153046 seconds
Parsed 900000 codepoints in 7.6122870445251 seconds
Parsed 1000000 codepoints in 7.8051130771637 seconds
So there's a 20x difference in performance. That sounds like a big deal given the upcoming 6.9 release – any ideas where the slowness might come from?
#52
@
4 months ago
- Focuses performance added
- Resolution fixed deleted
- Status changed from closed to reopened
#53
@
4 months ago
- Resolution set to fixed
- Status changed from reopened to closed
After working with @zieladam on this issue I don’t believe there’s any cause for concern here: he was using this pipeline in a degenerate case it’s not designed for, scanning through a string one code point at a time.
That usage bypassed a substantial speedup (the quick scan past spans of ASCII bytes), so the slowdown was due mostly to PHP’s function-calling overhead. By reworking the code in the PHP toolkit so that it retains the ASCII fast-path, the performance issues seem to have disappeared.
The utf8_codepoint_at() function was a previous incarnation of the UTF-8 decoder which turned out to be slower than the merged _wp_scan_utf8() in almost every case other than scanning one code point at a time.
The performance characteristics for both are complicated and significantly worse for multi-byte characters than they are for ASCII, but the merged code is significantly faster when dealing with US-ASCII spans of text.
@zieladam can correct me if I have any details wrong here, and @westonruter feel free to re-open again, but I closed this because I think it was a bit of a false alarm based on (a) the testing I performed against realistic inputs and usage for Core and (b) the extreme edge case usage in the first version of its introduction to the PHP toolkit (Update after our discussion in wordpress/php-toolkit#201)
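For reference, the ASCII fast-path idea discussed in this comment can be sketched roughly like this (a simplified assumption for illustration, not the merged _wp_scan_utf8() code):

```php
<?php
// Bytes below 0x80 are always valid UTF-8, so strspn() against the 128
// ASCII characters advances over whole spans in one native call instead
// of one PHP-level loop iteration per byte.
function example_skip_ascii( string $text, int $at ): int {
	static $ascii = null;
	if ( null === $ascii ) {
		$ascii = implode( '', array_map( 'chr', range( 0, 0x7F ) ) );
	}
	return strspn( $text, $ascii, $at );
}

// "hello, " is seven ASCII bytes before the two-byte "é".
echo example_skip_ascii( "hello, \u{E9}!", 0 ), "\n"; // 7
```

Calling a decoder one code point at a time never enters a span, so each ASCII byte pays full function-call overhead instead.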
Hi @dmsnell, I believe https://core.trac.wordpress.org/ticket/63804 is also related to this, as it would introduce `mb_trim()` (and `mb_ltrim()` & `mb_rtrim()` if needed) in compat for `js_trim()`. It has a patch for this with compat & unit tests for `mb_trim`.