Skip to content

Conversation

@t-hamano
Copy link
Contributor

@t-hamano t-hamano commented May 6, 2023

Trac ticket: https://core.trac.wordpress.org/ticket/57987


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@t-hamano t-hamano marked this pull request as ready for review May 6, 2023 15:39
Copy link
Contributor

@costdev costdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @t-hamano, thanks for the PR! I've left some thoughts below 🙂

We may also need some input validation added here, but I'll leave that for other reviewers to offer their thoughts on that.

t-hamano and others added 25 commits May 8, 2023 20:28
Copy link
Contributor

@costdev costdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @t-hamano! This notes some corrections that are needed. 🙂

@github-actions
Copy link

github-actions bot commented Sep 25, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props wildworks, costdev, shailu25, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this has been in the works for a long time!

since Github has collapsed and hidden so many comments I’m not going to review every one of them, so pardon me if this has already been brought up, but I find there are a few major opportunities here to rely on more sound and semantic approaches for counting words than the current array of PCRE patterns. these opportunities bring potential performance improvements along the way.

  • HTML API to analyze only text nodes
  • \IntlBreakIterator to analyze words
  • Core’s shortcode parser to wipe out the shortcodes

As impressive as all of the regular expressions are, and as configurable as this is, I wonder if a sounder approach would end up simpler and more reliable.

For one, we can start by gathering the plaintext of a post and not have to concern ourselves with character references or tags. It would be ideal if we did not start introducing new incomplete HTML parsers into Core when we already have a reliable one.

$text = '';
$processor = new WP_HTML_Tag_Processor( strip_shortcodes( $html ) );
while ( $processor->next_token() ) {
	switch ( $processor->get_token_name() ) {
		case '#text':
			$text .= $processor->get_modifiable_text();
			break;

		case 'P':
		case 'BR':
			$text .= "\n";
			break;

		// In my own explorations I’ve gone further and
		// removed entire nodes with things like `aria-hidden`
		// skipped TEMPLATE elements, performed better
		// newline conversion (block elements)
	}
}

Next, if we have the intl extension loaded we can count words in a much more robust way. I’ve been working on my own function to count words and only noticed this PR today.

$word_breaker = IntlBreakIterator::createWordInstance( get_locale() );
$word_breaker->setText( $text );
$word_count = 0;

foreach ( $word_breaker as $boundary ) {
	if ( IntlBreakIterator::WORD_NONE !== $word_breaker->getRuleStatus() ) {
		$word_count++;
	}
}

Now I may have gotten some details wrong in this, but apart from extracting the plaintext content from the HTML it’s non-allocating when counting words. We pass over the full input once, then over the plaintext a second time. It counts words in space-separated languages and it counts words in non-space-separated languages. It accounts for complicated Unicode rules for identifying words, characters, and sentences even.

It won’t be perfect either, but it should be as good as any software is at counting words. in the absence of the intl extension I would argue that it’s worth falling back to a more primitive and known-to-fail mechanism, but one which doesn’t attempt to be as complete as possible — hopefully most sites have the intl extension available. even still, we can start doing that after extracting the decoded text nodes and removing shortcodes.

$text = preg_replace( $settings['html_entity_regexp'], 'a', $text );

// Remove surrogate points.
$text = preg_replace( $settings['astral_regexp'], 'a', $text );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something doesn’t match between the comment and the PCRE pattern.

  • all surrogate halves are in the basic multilingual plane.
  • surrogates are illegal in UTF-8.
  • the astral plane includes an incredible variety of content.

it seems like the goal is to replace content we don’t recognize with an a, hoping that strings of a letters will be counted as a single word.

if we want to consider surrogates though, we should treat them as invalid UTF-8 and ignore them, or call wp_scrub_utf8()/mb_scrub() beforehand to eliminate invalid text.

one note: although UTF-16 encodes all characters from the astral plane using surrogate pairs, these are not to be expected here where UTF-8 is the norm.

@t-hamano
Copy link
Contributor Author

t-hamano commented Oct 3, 2025

@dmsnell Thanks for the feedback! First, let me explain why this function is implemented the way it is.

The Time to Read block needs to dynamically measure the content length on both the client side and the server side. The values ​​must match exactly. In other words, the wp_word_count function is the PHP version of the count function 😅 To be more precise, the wp_word_count function is the PHP version of wp.utils.wordcounter.

Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

@dmsnell
Copy link
Member

dmsnell commented Oct 3, 2025

The values ​​must match exactly.

Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

I doubt it would be pragmatic to get a 100% match; I have my doubts that this currently performs a 100% match. In fact, I know that based on the implementation as-is they don’t match, and pointed out one area where they differ.

If we want a 100% match then it would probably be best to request it from the server via API call. JavaScript strings and RegExp are so different than PHP strings and preg_ functions that fully harmonizing is almost impossible without something like a shared module compiled to WASM.

There are efforts to bring the HTML API into JavaScript, but the harder part is going to be lower-level string behaviors.

JavaScript has Intl.Segmenter which should give the same word-breaks as IntlBreakIterator::createWordInstance(), however, it’s likely that as new Unicode versions are released, if the word-breaking rules change, PHP will likely lag behind the browsers.

At some point I thought I suggested moving the JS word count function to rely on Intl.Segmenter — it’s so fraught to attempt to count words manually, especially for languages which aren’t segmented on spaces.

It’s my aspiration to build something like .innerText using the HTML API because that’s the best property to lean into in JS for counting words. The problem is that .innerText incorporates CSS styling and gets really complicated.


That being said, if we take .innerText in the browser and a reasonable facsimile from the HTML API on the server, then lean on the Intl tools (which are all using the same underlying ICU4C library) then we’ll have counts that should normally agree and also, more importantly, far more closely match the actual word counts from the content than with expensive and complicated regex-based approaches.

One thing I love about relying on the standard functions is that however good or bad it is, it should agree with what you’d get from a browser, from an operating system, from a word processor.


Carry on as you wish. I’m happy to share more about these if you are curious, but I don’t need to hold up your good work. Just a personal side-mission of mine to let people know we can have realistic word counts if we want them 😄

@t-hamano
Copy link
Contributor Author

t-hamano commented Oct 4, 2025

Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

It feels odd to see the front-end and editor display differently, even though the displayed content is the same. My guess is that users will see it as a bug.

As Beta1 approaches, we need to decide what direction to go in. What do you suggest is the best way to go about it? Would it be better to refactor the wp_word_count function to use the HTML API before the Beta 1 release?

@dmsnell
Copy link
Member

dmsnell commented Oct 4, 2025

users will see it as a bug.

It’s already buggy and strange. For long posts though it’s still a reasonable estimate. Though it could be interesting to compare the counts using a word counter and the counts using a regex.

What do you suggest is the best way to go about it?

If you want to pursue something closer to the stronger word counts I’ll support you, but if you want to get this in now I support your call there as well. These things can be made better in the future; I just get an allergic reaction when I see new HTML parsers being introduced 🙃.

@t-hamano
Copy link
Contributor Author

t-hamano commented Oct 4, 2025

We've decided to ship the Time to Read and Word Count blocks in 6.9. These blocks require the wp_word_count() function to work.

It would be nice if we could incorporate the HTML API into this function in the 6.9 release, but to resolve client-side inconsistencies, we would need a new REST API endpoint to calculate the word count from the current editor content.

I'm not very familiar with the HTML API, so I'd like to know what the optimal solution is for the 6.9 release 👀

@t-hamano
Copy link
Contributor Author

t-hamano commented Oct 5, 2025

Here's what I think is the most prudent approach for the 6.9 release: What do you think?

  • Close this PR.
  • Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.
  • Continually improve that logic using the HTML API.
  • Once the logic is stable and robust, add it to the core as the wp_word_count function.

@dmsnell
Copy link
Member

dmsnell commented Oct 6, 2025

Sounds good @t-hamano — I don’t know if you need to close this, unless you meant by merging. We can iterate on it after you have something out there working for what you need.

@t-hamano
Copy link
Contributor Author

t-hamano commented Oct 6, 2025

  • Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.

I've submitted a Gutenberg PR on this: WordPress/gutenberg#72091

Let's close this PR without committing to core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants