-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Widgets: Decode HTML entities in widget RSS title before escaping #9042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Widgets: Decode HTML entities in widget RSS title before escaping #9042
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
|
Thanks @Infinite-Null, The changes looks good to me. |
|
@Infinite-Null these changes look good, Tested well for me on WP 5.0: |
|
Let's add unit tests that check the updated functionality. |
|
@Infinite-Null Thanks for the PR. Are the |
|
@t-hamano Thanks for the feedback! I've updated the patch to use the simpler Please review at your convenience. |
|
@Infinite-Null can you check the following feedback?
|
|
Sure @mukeshpanchal27 @t-hamano, I will start working on the unit test shortly. |
|
Hi @mukeshpanchal27 and @t-hamano, I have completed the unit test for this PR can you please review the test at your convenience. |
|
Thanks for the update, the unit tests look good to me. |
|
I have merged the latest trunk branch into this branch. Once all CI checks pass, I will commit this pull request. |
|
@dmsnell, I noticed your feedback regarding I was planning to commit this PR before beta3, but do you have any feedback regarding this PR? In this case, I'm wondering whether simply using |
dmsnell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s good we think through these changes, and think very intentionally about what kind of data we have, since we have multiple layers of encoding and escaping.
The RSS 2.0 Spec is very clear that only description should contain encoded HTML, and the linked feed is probably generated improperly. That means this fix has the potential to break other feeds, but in practice will probably fix more of them.
If we are going to apply this change to title I think it would make sense to update all of the item elements in this function together so we don’t get into a situation even more confusing than it already is.
| $link = esc_url( strip_tags( $link ) ); | ||
|
|
||
| $title = esc_html( trim( strip_tags( $item->get_title() ) ) ); | ||
| $title = esc_html( trim( strip_tags( html_entity_decode( $item->get_title() ) ) ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sigh. we have a hard time separating XML and HTML. RSS makes it even more complicated because it’s not required to be a valid XML file and they usually don’t indicate how we should be interpreting the characters inside the different elements.
here is what comes from the linked feed. it’s obvious they have sent encoded content inside the title element.
<dc:title>Oral administration of <em>Lactiplantibacillus plantarum</em> GKK1 ameliorates atopic dermatitis in a mouse model</dc:title>this brings us to an odd spot. if we follow this logic we are missing a second html_entity_decode(), and the first one is wrong. the first level is decoding XML, where HTML named character references are not valid. that produces the HTML of the item’s title. then we want to decode the item’s title, revealing plaintext. but that title itself could have referenced something like drug A is < drug B in which case its HTML would be drug A is < drug B in which case the XML wrapping should escape that a second time into drug A is &lt; drug B.
we have decoded the XML into HTML and then removed tags, potentially leaving those same character references undecoded from the HTML side.
it may help to use additional variables, and I will recommend switching from html_entity_decode() on the XML side into the HTML API on the HTML side. why this function works for XML and breaks for HTML is beyond my imagination, but alas, that’s the way it is.
$item_title_xml = $item->get_title();
$item_title_html = html_entity_decode( $item_title_xml, ENT_XML1 | ENT_SUBSTITUTE );
$processor = new WP_HTML_Tag_Processor( $item_title_html );
$item_title = '';
while ( $processor->next_token() ) {
if ( '#text' === $processor->get_token_name() ) {
$item_title .= $processor->get_modifiable_text();
}
}This function is old and could use a lot of love.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it’s very likely we could encounter other encodings of the title attribute. Fixing this could break others, because RSS is not explicit about the content type of the contained values.
The above code could turn into some function explicitly indicating what it is assuming and used here. Should we decide to enhance the robustness later, we would have an easy way to assess the existing code and swap it out as appropriate.
/**
* Returns the plaintext content of encoded HTML content serialized in XML.
*
* When an RSS tag contains “encoded content” then the decoded XML
* represents HTML. After decoding the XML into HTML, this returns
* the plaintext content of that decoded HTML.
*
* Example:
*
* echo wp_rss_xml_to_html( '<p>&#x1f63c;</p>' );
* // <p>😼</p>
*
* echo wp_rss_xml_to_html_to_text( '<p>&#x1f63c;</p>' );
* // 😼
*
*/
function wp_rss_xml_to_html_to_text( string $raw_xml ): string {
// XML only defines five named entities: & > < ' "
$html = html_entity_decode( $raw_xml, ENT_XML1 | ENT_SUBSTITUTE );
$plaintext = '';
$processor = new WP_HTML_Tag_Processor( $html );
while ( $processor->next_token() ) {
if ( '#text' === $processor->get_token_name() ) {
$plaintext .= $processor->get_modifiable_text();
}
}
return $plaintext;
}and then call it
$escaped_title = esc_html( trim( wp_rss_xml_to_html_to_text( $item->get_title() ) ) );The same should likely apply to the description below as well. Why do we call esc_attr( esc_html( $desc ) ) 🤦♂️ 😭.


63611
What
RSS feeds sometimes contain HTML tags that have been escaped as HTML entities in their titles. The
wp_widget_rss_output()function currently strips HTML tags but doesn't decode HTML entities first, causing escaped tags like<em>to display as literal text instead of being properly removed.Example:
Oral administration of <em> Lactiplantibacillus plantarum </em> GKK1 ameliorates atopic dermatitis<em>as visible textHow
Add
html_entity_decode()beforestrip_tags()in the title processing logic.Testing Instruction
Twenty ThirteenthemeClassic Widgetsplugin to disable block editor widget editorhttps://pubmed.ncbi.nlm.nih.gov/rss/search/16cUU5Jcud0BSYRzHgbqJGm_F6kq07gr9atM8kZoogUmZdX5oj/Screenshot: