Fix sanitization of non-breaking hyphens in sanitize_title_with_dashes #10204
Adds the URL-encoded non-breaking hyphen to the list of characters converted to regular hyphens in sanitize_title_with_dashes(). Fixes ticket #64089.
@ppiwo Could you also add test cases for this as part of https://github.com/WordPress/wordpress-develop/blob/trunk/tests/phpunit/tests/formatting/sanitizeTitleWithDashes.php ?
dmsnell
left a comment
thanks for the ping, @westonruter. we recently had similar work in #9103 (Core-62995).
@ppiwo we could consider an approach similar to that taken over there, which is to rely on a Unicode-supported PCRE to replace everything with the Dash_Punctuation character property, and also the Space_Separator.
if ( _wp_can_use_pcre_u() ) {
	$title = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );
}
Over time I think it’s okay to be more and more restrictive on these, but I hope we push more in the direction of finding ways to ensure the titles and filenames more closely match the content they are associated with.
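A minimal sketch of the Unicode-property approach suggested above, written so that it runs outside WordPress. `example_can_use_pcre_u()` is a hypothetical stand-in for `_wp_can_use_pcre_u()`, and `example_dashify()` is an illustrative name; neither is part of WordPress core, and this is not the code that was committed.

```php
<?php
// Hypothetical stand-in for WordPress's _wp_can_use_pcre_u(): reports whether
// this PCRE build accepts the /u (UTF-8) modifier.
function example_can_use_pcre_u(): bool {
	return 1 === @preg_match( '/^.$/u', "\u{2011}" );
}

// Illustrative helper: turn every dash-punctuation (Pd) and space-separator (Zs)
// code point into an ASCII hyphen.
function example_dashify( string $title ): string {
	if ( ! example_can_use_pcre_u() ) {
		return $title; // Leave the title untouched when Unicode PCRE is unavailable.
	}

	$replaced = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );

	// preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
	return null === $replaced ? $title : $replaced;
}

echo example_dashify( "non\u{2011}breaking\u{00A0}hyphen" ), "\n"; // non-breaking-hyphen
```

Note that `\p{Zs}` also covers the ordinary ASCII space and `\p{Pd}` the ordinary hyphen-minus, so this single pattern handles plain spaces as well as their Unicode variants.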
src/wp-includes/formatting.php
Outdated
if ( function_exists( 'mb_chr' ) ) {
	$replacements[] = rawurlencode( mb_chr( $decimal_codepoint, 'UTF-8' ) );
}
}
I strongly discourage replacements that attempt to match normative character references, or which mix UTF-8 characters and HTML character references. these lead to strange edge cases and can easily lead to situations where we cannot accomplish what should be allowable.
to that end if we want to make these replacements I would encourage backing up to the top of this function and replacing strip_tags() with a run through the HTML API to extract the title as decoded plaintext. once that’s done we can examine raw UTF-8 replacements and not have to concern ourselves if someone wrote &nbsp; or &#160; or &#xA0; or &NonBreakingSpace; since all of these decode into the same U+00A0 code point.
If not wanting to reconsider this function more holistically, this can still be decoded as WP_HTML_Decoder::decode_text_node( $title ) before making these replacements. They can be done rather swiftly with strtr(). Further, since we are creating a static replacements array, we don’t have to use a potentially-missing runtime function to generate them: we can use Unicode string literals like \u{2011} for the patterns/matches.
Also a quick side note: HTML’s named character references are case-sensitive, so while I am guessing the use of str_ireplace() is to catch variations like &NBSP;, if it actually does that it will transform plaintext content and not the placeholder for a no-break space.
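The decode-then-replace idea from this comment can be sketched as follows. This is an assumption-laden illustration, not the committed change: it uses PHP's built-in `html_entity_decode()` as a stand-in for `WP_HTML_Decoder::decode_text_node()` so that it runs outside WordPress (the two may differ on edge cases), and `example_replace_after_decoding()` is a hypothetical name.

```php
<?php
// Illustrative helper, not part of WordPress core.
function example_replace_after_decoding( string $title ): string {
	// Stand-in for WP_HTML_Decoder::decode_text_node( $title ): resolve character
	// references first, so only raw UTF-8 has to be matched afterwards.
	$decoded = html_entity_decode( $title, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );

	// Static map written with Unicode string literals; no mb_chr() needed.
	$replacements = array(
		"\u{00A0}" => '-', // NO-BREAK SPACE
		"\u{2011}" => '-', // NON-BREAKING HYPHEN
	);

	return strtr( $decoded, $replacements );
}

echo example_replace_after_decoding( 'a&nbsp;b&#160;c&#xA0;d&#8209;e' ), "\n"; // a-b-c-d-e
```

Because every spelling of the reference is decoded to the same code point before `strtr()` runs, the replacement map stays small and never has to list character references itself.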
(My bad for using str_ireplace() in my addition.)
it’s fine. that’s why we review each other’s work.
did I guess the purpose right?
Partially. The purpose was so that %c2%a0 and %C2%A0 would both be matched, same as &nbsp; and &NBSP;. I forgot that named entities are case-sensitive in HTML, which is ironic since everything else is case insensitive (although I'm sure I'm not entirely accurate there).
in that case it’d be fine to leave it in after decoding, but we wouldn’t want or need to replace character references, since they will have already been replaced.
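To illustrate why case-insensitive matching still makes sense for percent-encoded sequences, here is a small sketch. `example_replace_encoded()` is a hypothetical name, and the choice of targets is an assumption made for illustration rather than the list used in core.

```php
<?php
// Illustrative helper, not part of WordPress core.
function example_replace_encoded( string $title ): string {
	// rawurlencode() emits uppercase hex, and Unicode string literals avoid
	// the need for a runtime mb_chr() call.
	$targets = array(
		rawurlencode( "\u{00A0}" ), // "%C2%A0"
		rawurlencode( "\u{2011}" ), // "%E2%80%91"
	);

	// Percent-encoding hex digits are case-insensitive, so str_ireplace()
	// matches both %c2%a0 and %C2%A0.
	return str_ireplace( $targets, '-', $title );
}

echo example_replace_encoded( 'a%c2%a0b%C2%A0c%e2%80%91d' ), "\n"; // a-b-c-d
```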
that’s fine @westonruter, though my comment about decoding was intended to keep this moving without blocking it. I think your suggestion was good, but it needed decoding.
somewhere I believe I have an existing branch for this entire thing. like usual, there are complications…
so either the way it is now or the way you had it, but with decoding sounds good. I am going to mark my approval and leave it up to you two. 🙇♂️
Follow-up ticket: Core-64089
This reverts commit ff2d2a7.
…into ppiwo/trunk
* 'trunk' of https://github.com/WordPress/wordpress-develop:
  Twenty Sixteen: Document the `twentysixteen_author_avatar_size` filter.
  Abilities API: Ensure public method is used in the codebase
  General: Add comment explaining use of queried object in `feed_links_extra()` instead of global `$post`.
  Posts, Post Types: Update `get_the_modified_author()` to handle missing global `$post` and add (missing) `$post` arg.
  General: Improve resilience of `feed_links_extra()` when global `$post` is not set.
  Twenty Sixteen: Document the `twentysixteen_content_width` filter.
  Template activation: fix unique slug filtering.
  Coding Standards: Use Yoda conditions consistently in `wp-includes/formatting.php`.
…nitize_title_with_dashes()`. Developed in #10204 Follow-up to [18705], [36775]. Props patpiwo, westonruter, dmsnell. See #31790, #10797. Fixes #64089. git-svn-id: https://develop.svn.wordpress.org/trunk@61061 602fd350-edb4-49c9-b593-d223f7449a82
…nitize_title_with_dashes()`. Developed in WordPress/wordpress-develop#10204 Follow-up to [18705], [36775]. Props patpiwo, westonruter, dmsnell. See #31790, #10797. Fixes #64089. Built from https://develop.svn.wordpress.org/trunk@61061 git-svn-id: http://core.svn.wordpress.org/trunk@60397 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Committed in r61061
Adds non-breaking hyphen to the list of characters converted to regular hyphens in sanitize_title_with_dashes()
Trac ticket: #64089
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.
Drafted commit message
Formatting: Replace non-breaking hyphens with hyphens in sanitize_title_with_dashes().
Developed in #10204
Follow-up to [18705], [36775].
Props patpiwo, westonruter.
See #31790, #10797.
Fixes #64089.