-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Add wp_word_count() function
#4430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @t-hamano, thanks for the PR! I've left some thoughts below 🙂
We may also need some input validation added here, but I'll leave that for other reviewers to offer their thoughts on that.
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
Co-authored-by: Colin Stewart <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @t-hamano! This notes some corrections that are needed. 🙂
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
dmsnell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this has been in the works for a long time!
since Github has collapsed and hidden so many comments I’m not going to review every one of them, so pardon me if this has already been brought up, but I find there are a few major opportunities here to rely on more sound and semantic approaches for counting words than the current array of PCRE patterns. these opportunities bring potential performance improvements along the way.
- HTML API to analyze only text nodes
\IntlBreakIteratorto analyze words- Core’s shortcode parser to wipe out the shortcodes
As impressive as all of the regular expressions are, and as configurable as this is, I wonder if a sounder approach would end up simpler and more reliable.
For one, we can start by gathering the plaintext of a post and not have to concern ourselves with character references or tags. It would be ideal if we did not start introducing new incomplete HTML parsers into Core when we already have a reliable one.
$text = '';
$processor = new WP_HTML_Tag_Processor( strip_shortcodes( $html ) );
while ( $processor->next_token() ) {
switch ( $processor->get_token_name() ) {
case '#text':
$text .= $processor->get_modifiable_text();
break;
case 'P':
case 'BR':
$text .= "\n";
break;
// In my own explorations I’ve gone further and
// removed entire nodes with things like `aria-hidden`
// skipped TEMPLATE elements, performed better
// newline conversion (block elements)
}
}Next, if we have the intl extension loaded we can count words in a much more robust way. I’ve been working on my own function to count words and only noticed this PR today.
$word_breaker = IntlBreakIterator::createWordInstance( get_locale() );
$word_breaker->setText( $text );
$word_count = 0;
foreach ( $word_breaker as $boundary ) {
if ( IntlBreakIterator::WORD_NONE !== $word_breaker->getRuleStatus() ) {
$word_count++;
}
}Now I may have gotten some details wrong in this, but apart from extracting the plaintext content from the HTML it’s non-allocating when counting words. We pass over the full input once, then over the plaintext a second time. It counts words in space-separated languages and it counts words in non-space-separated languages. It accounts for complicated Unicode rules for identifying words, characters, and sentences even.
It won’t be perfect either, but it should be as good as any software is at counting words. in the absence of the intl extension I would argue that it’s worth falling back to a more primitive and known-to-fail mechanism, but one which doesn’t attempt to be as complete as possible — hopefully most sites have the intl extension available. even still, we can start doing that after extracting the decoded text nodes and removing shortcodes.
| $text = preg_replace( $settings['html_entity_regexp'], 'a', $text ); | ||
|
|
||
| // Remove surrogate points. | ||
| $text = preg_replace( $settings['astral_regexp'], 'a', $text ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something doesn’t match between the comment and the PCRE pattern.
- all surrogate halves are in the basic multilingual plane.
- surrogates are illegal in UTF-8.
- the astral plane includes an incredible variety of content.
it seems like the goal is to replace content we don’t recognize with an a, hoping that strings of a letters will be counted as a single word.
if we want to consider surrogates though, we should treat them as invalid UTF-8 and ignore them, or call wp_scrub_utf8()/mb_scrub() beforehand to eliminate invalid text.
one note: although UTF-16 encodes all characters from the astral plane using surrogate pairs, these are not to be expected here where UTF-8 is the norm.
|
@dmsnell Thanks for the feedback! First, let me explain why this function is implemented the way it is. The Time to Read block needs to dynamically measure the content length on both the client side and the server side. The values must match exactly. In other words, the Is it possible to completely mimic the logic of the |
Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?
I doubt it would be pragmatic to get a 100% match; I have my doubts that this currently performs a 100% match. In fact, I know that based on the implementation as-is they don’t match, and pointed out one area where they differ. If we want a 100% match then it would probably be best to request it from the server via API call. JavaScript strings and There are efforts to bring the HTML API into JavaScript, but the harder part is going to be lower-level string behaviors. JavaScript has At some point I thought I suggested moving the JS word count function to rely on It’s my aspiration to build something like That being said, if we take One thing I love about relying on the standard functions is that however good or bad it is, it should agree with what you’d get from a browser, from an operating system, from a word processor. Carry on as you wish. I’m happy to share more about these if you are curious, but I don’t need to hold up your good work. Just a personal side-mission of mine to let people know we can have realistic word counts if we want them 😄 |
It feels odd to see the front-end and editor display differently, even though the displayed content is the same. My guess is that users will see it as a bug. As Beta1 approaches, we need to decide what direction to go in. What do you suggest is the best way to go about it? Would it be better to refactor the |
It’s already buggy and strange. For long posts though it’s still a reasonable estimate. Though it could be interesting to compare the counts using a word counter and the counts using a regex.
If you want to pursue something closer to the stronger word counts I’ll support you, but if you want to get this in now I support your call there as well. These things can be made better in the future; I just get an allergic reaction when I see new HTML parsers being introduced 🙃. |
|
We've decided to ship the Time to Read and Word Count blocks in 6.9. These blocks require the It would be nice if we could incorporate the HTML API into this function in the 6.9 release, but to resolve client-side inconsistencies, we would need a new REST API endpoint to calculate the word count from the current editor content. I'm not very familiar with the HTML API, so I'd like to know what the optimal solution is for the 6.9 release 👀 |
|
Here's what I think is the most prudent approach for the 6.9 release: What do you think?
|
|
Sounds good @t-hamano — I don’t know if you need to close this, unless you meant by merging. We can iterate on it after you have something out there working for what you need. |
I've submitted a Gutenberg PR on this: WordPress/gutenberg#72091 Let's close this PR without committing to core. |
Trac ticket: https://core.trac.wordpress.org/ticket/57987
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.