Charset: Rely on new UTF-8 pipeline for mb_substr() fallback. #9829

dmsnell · 2025-09-10T22:15:56Z

Trac ticket: Core-63863
See: ~~#9825~~, ~~#9830~~, ~~#9498~~, ~~#9826~~, ~~#9827~~, #9798, ~~#9828~~, (~~#9829~~)

Update the polyfill of mb_substr() to rely on the new UTF-8 pipeline.

github-actions · 2025-09-10T22:31:09Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

- The Plugin and Theme Directories cannot be accessed within Playground.
- All changes will be lost when closing a tab with a Playground instance.
- All changes will be lost when refreshing the page.
- A fresh instance is created each time the link below is clicked.
- Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions · 2025-10-16T21:32:25Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

…ess#9829)

The existing polyfill for `mb_substr()` contains a number of issues, leaving plenty of opportunity for improvement. Specifically, the following are all deficiencies: it relies on Unicode PCRE support, assumes input strings are valid UTF-8, splits input strings into an array of characters (1,000 at a time, iterating until complete), and re-joins them at the end.

This patch provides an updated polyfill which will reliably parse UTF-8 strings even in the presence of invalid bytes. It computes boundaries for the substring extraction with zero allocations and then returns a single `substr()` call at the end.

This change improves the reliability of UTF-8 string handling and removes behavioral variability based on the runtime system.

Github-PR: 9829
Github-PR-URL: WordPress#9829
Trac-Ticket: 63863
Trac-Ticket-URL: https://core.trac.wordpress.org/ticket/63863
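The boundary-scanning idea in the commit message above can be illustrated with a minimal sketch. This is not the code merged in this PR: the names `sketch_utf8_unit_length()` and `sketch_mb_substr()` are hypothetical, the sketch assumes non-negative `$start` and `$length`, and it counts every byte that does not begin a structurally complete sequence as one unit. It also does not reject overlong or surrogate encodings, which a full UTF-8 validator would.

```php
<?php
/**
 * Hypothetical helper: byte length of the unit starting at byte offset $at.
 * Any byte that does not begin a structurally complete UTF-8 sequence
 * is treated as a one-byte unit (an assumption of this sketch).
 */
function sketch_utf8_unit_length( string $text, int $at ): int {
	$end  = strlen( $text );
	$lead = ord( $text[ $at ] );

	if ( $lead < 0x80 ) {
		return 1; // ASCII.
	} elseif ( $lead >= 0xC2 && $lead <= 0xDF ) {
		$need = 1; // Lead byte of a two-byte sequence.
	} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
		$need = 2; // Lead byte of a three-byte sequence.
	} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
		$need = 3; // Lead byte of a four-byte sequence.
	} else {
		return 1; // Stray continuation byte or invalid lead byte.
	}

	if ( $at + $need >= $end ) {
		return 1; // Sequence is truncated by the end of the string.
	}

	for ( $i = 1; $i <= $need; $i++ ) {
		if ( ( ord( $text[ $at + $i ] ) & 0xC0 ) !== 0x80 ) {
			return 1; // Expected a continuation byte and found something else.
		}
	}

	return $need + 1;
}

/**
 * Hypothetical sketch: find the byte boundaries of the requested span by
 * scanning forward, then extract it with a single substr() call.
 * Only non-negative $start and $length are handled here.
 */
function sketch_mb_substr( string $text, int $start, ?int $length = null ): string {
	$end = strlen( $text );
	$at  = 0;

	// Advance past the first $start units to find the starting byte offset.
	for ( $skipped = 0; $skipped < $start && $at < $end; $skipped++ ) {
		$at += sketch_utf8_unit_length( $text, $at );
	}
	$from = $at;

	if ( null === $length ) {
		return substr( $text, $from );
	}

	// Advance past up to $length more units to find the ending byte offset.
	for ( $taken = 0; $taken < $length && $at < $end; $taken++ ) {
		$at += sketch_utf8_unit_length( $text, $at );
	}

	return substr( $text, $from, $at - $from );
}
```

With these assumptions, `sketch_mb_substr( "héllo", 1, 3 )` returns `"éll"`, and `sketch_mb_substr( "a\xFFb€c", 1, 3 )` returns `"\xFFb€"`: the invalid byte counts as one unit instead of making the whole call fail. No intermediate array of characters is built; only two byte offsets are tracked before the final `substr()`.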

Developed in #9829
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

git-svn-id: https://develop.svn.wordpress.org/trunk@60969 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2025-10-18T04:36:30Z

Merged in 8ec91a4
[60969]

Developed in WordPress/wordpress-develop#9829
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@60969

git-svn-id: http://core.svn.wordpress.org/trunk@60305 1a063a9b-81f0-0310-95a4-ce76da25c4cd


This was referenced Sep 10, 2025

Charset: Abstract new UTF-8 scanning pipeline from validation fallback. #9830

Closed

Charset: Introduce wp_utf8_chunks() to iterate through strings. #9826

Closed

dmsnell force-pushed the utf8/update-mb-substr branch 9 times, most recently from 6d20d74 to 6a2908b September 16, 2025 13:27

dmsnell force-pushed the utf8/update-mb-substr branch 13 times, most recently from 20b3871 to 4730c30 September 25, 2025 00:57

dmsnell force-pushed the utf8/update-mb-substr branch 3 times, most recently from 98435d0 to 35ee6cd October 1, 2025 23:14

dmsnell force-pushed the utf8/update-mb-substr branch 6 times, most recently from 45f6697 to f67bb55 October 9, 2025 00:48

dmsnell force-pushed the utf8/update-mb-substr branch 5 times, most recently from 307bbe8 to 8ef9abc October 16, 2025 19:52

dmsnell force-pushed the utf8/update-mb-substr branch from 8ef9abc to 2404133 October 16, 2025 21:12

dmsnell marked this pull request as ready for review October 16, 2025 21:21

dmsnell force-pushed the utf8/update-mb-substr branch 2 times, most recently from 6a2ddd5 to fa7821d October 17, 2025 22:50

dmsnell force-pushed the utf8/update-mb-substr branch from fa7821d to f7fb52b October 17, 2025 23:43

dmsnell closed this Oct 18, 2025

dmsnell deleted the utf8/update-mb-substr branch October 18, 2025 04:36
