Add `wp_word_count()` function #4430

t-hamano · 2023-05-06T15:26:27Z

Trac ticket: https://core.trac.wordpress.org/ticket/57987

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

costdev

Hi @t-hamano, thanks for the PR! I've left some thoughts below 🙂

We may also need some input validation here, but I'll leave that for other reviewers to weigh in on.

costdev

Thanks for the updates @t-hamano! I've noted some corrections that are needed. 🙂

github-actions · 2025-09-25T00:51:03Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props wildworks, costdev, shailu25, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dmsnell

Looks like this has been in the works for a long time!

Since GitHub has collapsed and hidden so many comments, I’m not going to review every one of them, so pardon me if this has already been brought up. I see a few major opportunities here to rely on sounder, more semantic approaches for counting words than the current array of PCRE patterns. These opportunities could bring performance improvements along the way.

HTML API to analyze only text nodes
\IntlBreakIterator to analyze words
Core’s shortcode parser to wipe out the shortcodes

As impressive as all of the regular expressions are, and as configurable as this is, I wonder if a sounder approach would end up simpler and more reliable.

For one, we can start by gathering the plaintext of a post and not have to concern ourselves with character references or tags. It would be ideal if we did not start introducing new incomplete HTML parsers into Core when we already have a reliable one.

$text = '';
$processor = new WP_HTML_Tag_Processor( strip_shortcodes( $html ) );
while ( $processor->next_token() ) {
	switch ( $processor->get_token_name() ) {
		case '#text':
			$text .= $processor->get_modifiable_text();
			break;

		case 'P':
		case 'BR':
			$text .= "\n";
			break;

		// In my own explorations I’ve gone further and
		// removed entire nodes with things like `aria-hidden`
		// skipped TEMPLATE elements, performed better
		// newline conversion (block elements)
	}
}

Next, if we have the intl extension loaded we can count words in a much more robust way. I’ve been working on my own function to count words and only noticed this PR today.

$word_breaker = IntlBreakIterator::createWordInstance( get_locale() );
$word_breaker->setText( $text );
$word_count = 0;

foreach ( $word_breaker as $boundary ) {
	if ( IntlBreakIterator::WORD_NONE !== $word_breaker->getRuleStatus() ) {
		$word_count++;
	}
}

Now I may have gotten some details wrong in this, but apart from extracting the plaintext content from the HTML it’s non-allocating when counting words. We pass over the full input once, then over the plaintext a second time. It counts words in space-separated languages and it counts words in non-space-separated languages. It accounts for complicated Unicode rules for identifying words, characters, and sentences even.

It won’t be perfect either, but it should be as good as any software is at counting words. In the absence of the intl extension, I would argue that it’s worth falling back to a more primitive, known-to-fail mechanism that doesn’t attempt to be as complete as possible; hopefully most sites have the intl extension available. Even so, we can start doing that after extracting the decoded text nodes and removing shortcodes.
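A minimal sketch of the fallback idea described above, offered as an illustration rather than as code from this PR. The function name `example_count_words()` and the fallback pattern are hypothetical; the sketch assumes the input is already plain text (tags, character references, and shortcodes handled beforehand) and takes the locale as a parameter instead of calling `get_locale()` so that it runs outside WordPress.

```php
<?php
/**
 * Hypothetical helper (not part of the PR): count words with the intl
 * extension when it is available, otherwise use a deliberately simple
 * fallback that is known to be wrong for non-space-separated languages.
 */
function example_count_words( string $text, string $locale = 'en_US' ): int {
	// Only touch IntlBreakIterator when the intl extension is loaded.
	$breaker = class_exists( 'IntlBreakIterator' )
		? IntlBreakIterator::createWordInstance( $locale )
		: null;

	if ( null === $breaker ) {
		// Primitive fallback: count runs of letters and digits.
		// preg_match_all() returns false on invalid UTF-8, cast to 0 here.
		return (int) preg_match_all( '/[\p{L}\p{N}]+/u', $text );
	}

	$breaker->setText( $text );
	$count = 0;

	// Each iteration lands on a boundary; the rule status describes the
	// segment that ends there. WORD_NONE marks spaces and punctuation.
	foreach ( $breaker as $boundary ) {
		if ( IntlBreakIterator::WORD_NONE !== $breaker->getRuleStatus() ) {
			$count++;
		}
	}

	return $count;
}
```

The two branches will not always agree: the fallback splits on apostrophes and cannot segment languages written without spaces, which is the "known-to-fail" trade-off mentioned above.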

dmsnell · 2025-10-03T15:40:54Z

src/wp-includes/l10n.php

+		$text = preg_replace( $settings['html_entity_regexp'], 'a', $text );
+
+		// Remove surrogate points.
+		$text = preg_replace( $settings['astral_regexp'], 'a', $text );


Something doesn’t match between the comment and the PCRE pattern.

All surrogate halves are in the Basic Multilingual Plane.

Surrogates are illegal in UTF-8.

The astral planes include an incredible variety of content.

It seems like the goal is to replace content we don’t recognize with an a, hoping that strings of a letters will be counted as a single word.

If we want to consider surrogates, though, we should treat them as invalid UTF-8 and ignore them, or call wp_scrub_utf8()/mb_scrub() beforehand to eliminate invalid text.

One note: although UTF-16 encodes all characters from the astral planes using surrogate pairs, these are not to be expected here, where UTF-8 is the norm.
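To illustrate the distinction drawn above, here is a small sketch that assumes the mbstring extension is loaded. The sample strings and variable names are made up for the example. It checks a surrogate half written out as bytes, an astral-plane character, and the result of scrubbing with `mb_scrub()`.

```php
<?php
// Assumption: the mbstring extension is available.

// U+D83D is a high surrogate. Writing it out as the three bytes
// ED A0 BD does not produce valid UTF-8.
$with_surrogate = "one \xED\xA0\xBD two";

// U+1F600 is an astral-plane character. Its four-byte form is valid UTF-8.
$with_astral = "one \u{1F600} two";

$surrogate_is_valid = mb_check_encoding( $with_surrogate, 'UTF-8' );
$astral_is_valid    = mb_check_encoding( $with_astral, 'UTF-8' );

// mb_scrub() replaces ill-formed byte sequences with the current
// substitute character, so its result is well-formed UTF-8.
$scrubbed          = mb_scrub( $with_surrogate, 'UTF-8' );
$scrubbed_is_valid = mb_check_encoding( $scrubbed, 'UTF-8' );

var_dump( $surrogate_is_valid, $astral_is_valid, $scrubbed_is_valid );
```

Scrubbing first means any later pattern only ever sees well-formed text, so a "remove surrogates" step has nothing left to match.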

t-hamano · 2025-10-03T16:32:27Z

@dmsnell Thanks for the feedback! First, let me explain why this function is implemented the way it is.

The Time to Read block needs to dynamically measure the content length on both the client side and the server side. The values must match exactly. In other words, the wp_word_count function is the PHP version of the count function 😅 To be more precise, the wp_word_count function is the PHP version of wp.utils.wordcounter.

Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

dmsnell · 2025-10-03T18:11:53Z

> The values must match exactly.

Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

> Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

I doubt it would be pragmatic to get a 100% match; I have my doubts that this currently performs a 100% match. In fact, I know that based on the implementation as-is they don’t match, and pointed out one area where they differ.

If we want a 100% match then it would probably be best to request it from the server via an API call. JavaScript strings and RegExp are so different from PHP strings and preg_ functions that fully harmonizing them is almost impossible without something like a shared module compiled to WASM.

There are efforts to bring the HTML API into JavaScript, but the harder part is going to be lower-level string behaviors.

JavaScript has Intl.Segmenter, which should give the same word breaks as IntlBreakIterator::createWordInstance(). However, as new Unicode versions are released, PHP will likely lag behind the browsers if the word-breaking rules change.

At some point I thought I suggested moving the JS word count function to rely on Intl.Segmenter — it’s so fraught to attempt to count words manually, especially for languages which aren’t segmented on spaces.

It’s my aspiration to build something like .innerText using the HTML API because that’s the best property to lean into in JS for counting words. The problem is that .innerText incorporates CSS styling and gets really complicated.

That being said, if we take .innerText in the browser and a reasonable facsimile from the HTML API on the server, then lean on the Intl tools (which all use the same underlying ICU4C library), we’ll have counts that should normally agree and, more importantly, match the actual word counts of the content far more closely than expensive and complicated regex-based approaches do.

One thing I love about relying on the standard functions is that however good or bad it is, it should agree with what you’d get from a browser, from an operating system, from a word processor.

Carry on as you wish. I’m happy to share more about these if you are curious, but I don’t need to hold up your good work. Just a personal side-mission of mine to let people know we can have realistic word counts if we want them 😄

t-hamano · 2025-10-04T00:28:07Z

> Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

It feels odd to see the front end and the editor display different counts, even though the displayed content is the same. My guess is that users will see it as a bug.

As Beta 1 approaches, we need to decide what direction to go in. What do you suggest is the best way to go about it? Would it be better to refactor the wp_word_count function to use the HTML API before the Beta 1 release?

dmsnell · 2025-10-04T02:48:16Z

> users will see it as a bug.

It’s already buggy and strange. For long posts, though, it’s still a reasonable estimate. It could be interesting to compare the counts from a word counter with the counts from a regex.

> What do you suggest is the best way to go about it?

If you want to pursue something closer to the stronger word counts I’ll support you, but if you want to get this in now I support your call there as well. These things can be made better in the future; I just get an allergic reaction when I see new HTML parsers being introduced 🙃.

t-hamano · 2025-10-04T03:05:11Z

We've decided to ship the Time to Read and Word Count blocks in 6.9. These blocks require the wp_word_count() function to work.

It would be nice if we could incorporate the HTML API into this function in the 6.9 release, but to resolve client-side inconsistencies, we would need a new REST API endpoint to calculate the word count from the current editor content.

I'm not very familiar with the HTML API, so I'd like to know what the optimal solution is for the 6.9 release 👀

t-hamano · 2025-10-05T15:16:19Z

Here's what I think is the most prudent approach for the 6.9 release. What do you think?

Close this PR.
Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.
Continually improve that logic using the HTML API.
Once the logic is stable and robust, add it to the core as the wp_word_count function.

dmsnell · 2025-10-06T03:35:16Z

Sounds good @t-hamano — I don’t know if you need to close this, unless you meant by merging. We can iterate on it after you have something out there working for what you need.

t-hamano · 2025-10-06T06:03:28Z

> Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.

I've submitted a Gutenberg PR on this: WordPress/gutenberg#72091

Let's close this PR without committing to core.

Add wp_word_count() function

57d9777

t-hamano marked this pull request as ready for review May 6, 2023 15:39

costdev requested changes May 6, 2023

costdev reviewed May 6, 2023

src/wp-includes/l10n.php (outdated)

t-hamano and others added 25 commits May 8, 2023 20:28

Update src/wp-includes/l10n.php

a9a86fe

Co-authored-by: Colin Stewart <[email protected]>

Update src/wp-includes/l10n.php

6d1b46f

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

c3ef362

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

48e326b

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

4b867d4

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

48ecdd0

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

6deaf99

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

19b25c6

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

24c8658

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

200c8ea

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

e900ad8

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

6abd3c0

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

5854053

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

07d20fc

Co-authored-by: Colin Stewart <[email protected]>

Update tests/phpunit/tests/l10n/wpWordCount.php

923aa73

Co-authored-by: Colin Stewart <[email protected]>

Update src/wp-includes/l10n.php

8b6344a

Co-authored-by: Colin Stewart <[email protected]>

Fix lint errors

90c4be0

Fix PHPUnit tests

9b5085e

Add unit tests for empty text and whitespace text

e47f1a6

Split into three different methods

91fd4a4

Fix comment indentation

b836a3d

Fix wrong ticket number

3eabb9b

Add test on wrong count type

3622916

Add test for non-array or empty shortcodes

dd8654c

Fix lint

05efa88

costdev requested changes May 8, 2023

tests/phpunit/tests/l10n/wpWordCount.php (outdated)

tests/phpunit/tests/l10n/wpWordCount.php (outdated)

tests/phpunit/tests/l10n/wpWordCount.php (outdated)

t-hamano and others added 5 commits May 9, 2023 17:11

Update tests/phpunit/tests/l10n/wpWordCount.php

9a12bac

Co-authored-by: Colin Stewart <[email protected]>

Fix wrong order of arguments

310ed7a

Simplify processing around preg_match_all() function

4fe0576

Fix wrong ticket number

85886e8

Merge branch 'trunk' into 57987

a363c02

aristath mentioned this pull request Oct 15, 2024

Make the word-count method properly count words for all languages ProgressPlanner/progress-planner#87

Merged

t-hamano mentioned this pull request Aug 23, 2025

Add wp_word_count() function to count words and characters in text #9562

Closed

shail-mehta reviewed Aug 23, 2025

src/wp-includes/l10n.php (outdated)

t-hamano mentioned this pull request Sep 10, 2025

Time to Read: Stabilize block WordPress/gutenberg#71588

Merged

Merge branch 'trunk' into 57987

63fda3d

dmsnell reviewed Oct 3, 2025

t-hamano mentioned this pull request Oct 6, 2025

Time to Read: Don't use wp_word_count() function WordPress/gutenberg#72091

Merged

t-hamano closed this Oct 6, 2025

t-hamano mentioned this pull request Nov 10, 2025

Time to Read: Re-introduce wp_word_count WordPress/gutenberg#73100

Closed

6 tasks
