Skip to content

Conversation

@Infinite-Null
Copy link

@Infinite-Null Infinite-Null commented Jun 22, 2025

63611

What

RSS feeds sometimes contain HTML tags that have been escaped as HTML entities in their titles. The wp_widget_rss_output() function currently strips HTML tags but doesn't decode HTML entities first, causing escaped tags like <em> to display as literal text instead of being properly removed.

Example:

  • RSS feed title: Oral administration of <em> Lactiplantibacillus plantarum </em> GKK1 ameliorates atopic dermatitis
  • Current display: Shows <em> as visible text
  • Expected display: Clean title without HTML entities or tags

How

Add html_entity_decode() before strip_tags() in the title processing logic.

Testing Instruction

  1. Activate Twenty Thirteen theme
  2. Use Classic Widgets plugin to disable block editor widget editor
  3. Go to widgets and add RSS Widget
  4. Paste this URL: https://pubmed.ncbi.nlm.nih.gov/rss/search/16cUU5Jcud0BSYRzHgbqJGm_F6kq07gr9atM8kZoogUmZdX5oj/

Screenshot:

Before After
image image

@github-actions
Copy link

github-actions bot commented Jun 22, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props ankitkumarshah, wildworks, mukesh27, n8finch.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@mukeshpanchal27
Copy link
Member

Thanks @Infinite-Null, The changes looks good to me.

@n8finch
Copy link

n8finch commented Jun 23, 2025

@Infinite-Null these changes look good, can you also make the change to the same code in the wp-includes/blocks/rss.php file? Nevermind, I see that's taken care of here: WordPress/gutenberg#70491

Tested well for me on WP 5.0:

Before:
before

After:
after

@mukeshpanchal27
Copy link
Member

Let's add unit tests that check the updated functionality.

@t-hamano
Copy link
Contributor

@Infinite-Null Thanks for the PR.

Are the ENT_QUOTES and get_option( 'blog_charset' ) options necessary? Because the title will be escaped by esc_html at the end.

@Infinite-Null
Copy link
Author

@t-hamano Thanks for the feedback! I've updated the patch to use the simpler html_entity_decode( $item->get_title() ) without the additional parameters.

Please review at your convenience.

@t-hamano
Copy link
Contributor

@Infinite-Null can you check the following feedback?

Let's add unit tests that check the updated functionality.

@Infinite-Null
Copy link
Author

Infinite-Null commented Jun 24, 2025

Sure @mukeshpanchal27 @t-hamano, I will start working on the unit test shortly.

@Infinite-Null
Copy link
Author

Hi @mukeshpanchal27 and @t-hamano, I have completed the unit test for this PR can you please review the test at your convenience.

@t-hamano
Copy link
Contributor

Thanks for the update, the unit tests look good to me.

@t-hamano
Copy link
Contributor

t-hamano commented Nov 4, 2025

I have merged the latest trunk branch into this branch. Once all CI checks pass, I will commit this pull request.

@t-hamano
Copy link
Contributor

t-hamano commented Nov 4, 2025

@dmsnell, I noticed your feedback regarding html_entity_decode() on a separate ticket: https://core.trac.wordpress.org/ticket/64177#comment:10

I was planning to commit this PR before beta3, but do you have any feedback regarding this PR? In this case, I'm wondering whether simply using html_entity_decode() is sufficient.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s good we think through these changes, and think very intentionally about what kind of data we have, since we have multiple layers of encoding and escaping.

The RSS 2.0 Spec is very clear that only description should contain encoded HTML, and the linked feed is probably generated improperly. That means this fix has the potential to break other feeds, but in practice will probably fix more of them.

If we are going to apply this change to title I think it would make sense to update all of the item elements in this function together so we don’t get into a situation even more confusing than it already is.

$link = esc_url( strip_tags( $link ) );

$title = esc_html( trim( strip_tags( $item->get_title() ) ) );
$title = esc_html( trim( strip_tags( html_entity_decode( $item->get_title() ) ) ) );
Copy link
Member

@dmsnell dmsnell Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh. we have a hard time separating XML and HTML. RSS makes it even more complicated because it’s not required to be a valid XML file and they usually don’t indicate how we should be interpreting the characters inside the different elements.

here is what comes from the linked feed. it’s obvious they have sent encoded content inside the title element.

<dc:title>Oral administration of &lt;em&gt;Lactiplantibacillus plantarum&lt;/em&gt; GKK1 ameliorates atopic dermatitis in a mouse model</dc:title>

this brings us to an odd spot. if we follow this logic we are missing a second html_entity_decode(), and the first one is wrong. the first level is decoding XML, where HTML named character references are not valid. that produces the HTML of the item’s title. then we want to decode the item’s title, revealing plaintext. but that title itself could have referenced something like drug A is < drug B in which case its HTML would be drug A is &lt; drug B in which case the XML wrapping should escape that a second time into drug A is &amp;lt; drug B.

we have decoded the XML into HTML and then removed tags, potentially leaving those same character references undecoded from the HTML side.


it may help to use additional variables, and I will recommend switching from html_entity_decode() on the XML side into the HTML API on the HTML side. why this function works for XML and breaks for HTML is beyond my imagination, but alas, that’s the way it is.

$item_title_xml  = $item->get_title();
$item_title_html = html_entity_decode( $item_title_xml, ENT_XML1 | ENT_SUBSTITUTE );

$processor  = new WP_HTML_Tag_Processor( $item_title_html );
$item_title = '';
while ( $processor->next_token() ) {
	if ( '#text' === $processor->get_token_name() ) {
		$item_title .= $processor->get_modifiable_text();
	}
}

This function is old and could use a lot of love.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it’s very likely we could encounter other encodings of the title attribute. Fixing this could break others, because RSS is not explicit about the content type of the contained values.

The above code could turn into some function explicitly indicating what it is assuming and used here. Should we decide to enhance the robustness later, we would have an easy way to assess the existing code and swap it out as appropriate.

/**
 * Returns the plaintext content of encoded HTML content serialized in XML.
 *
 * When an RSS tag contains “encoded content” then the decoded XML
 * represents HTML. After decoding the XML into HTML, this returns
 * the plaintext content of that decoded HTML.
 *
 * Example:
 *
 *     echo wp_rss_xml_to_html( '&lt;p&gt;&amp;#x1f63c;&lt;/p&gt;' );
 *     // <p>&#x1f63c;</p>
 *
 *     echo wp_rss_xml_to_html_to_text( '&lt;p&gt;&amp;#x1f63c;&lt;/p&gt;' );
 *     // 😼
 *
 */
function wp_rss_xml_to_html_to_text( string $raw_xml ): string {
	// XML only defines five named entities: &amp; &gt; &lt; &apos; &quot;
	$html = html_entity_decode( $raw_xml, ENT_XML1 | ENT_SUBSTITUTE );

	$plaintext = '';
	$processor = new WP_HTML_Tag_Processor( $html );
	while ( $processor->next_token() ) {
		if ( '#text' === $processor->get_token_name() ) {
			$plaintext .= $processor->get_modifiable_text();
		}
	}

	return $plaintext;
}

and then call it

$escaped_title = esc_html( trim( wp_rss_xml_to_html_to_text( $item->get_title() ) ) );

The same should likely apply to the description below as well. Why do we call esc_attr( esc_html( $desc ) ) 🤦‍♂️ 😭.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants