Skip to content

Conversation

@sirreal
Copy link
Member

@sirreal sirreal commented Oct 3, 2025

Attribute values set with the HTML API method set_attribute() are escaped with esc_attr(). That function avoids "double encoding" things that look like HTML character references.

The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.

By contrast, set_modifiable_text() does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.

The text & appears to be an encoded character reference:

$amp_text = '&';
$p = WP_HTML_Processor::create_fragment( '<p>x</p>' );
$p->next_tag();
$p->set_attribute( 'data-attr', $amp_text );
$p->next_token();
$p->set_modifiable_text( $amp_text );
echo $p->get_updated_html();

This prints the following HTML:

<p data-attr="&amp;">&amp;amp;</p>

Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &. The text node in the P element correctly renders & as expected.

Here's a demo of the difference in behavior between setting attributes and modifiable text.

Trac ticket: https://core.trac.wordpress.org/ticket/64054


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value );
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true )
? esc_url( $value )
: htmlspecialchars( $value, ENT_QUOTES | ENT_HTML5 );
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches how set_modifiable_text() escapes:

if ( self::STATE_TEXT_NODE === $this->parser_state ) {
$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
$this->text_starts_at,
$this->text_length,
htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 )
);

This performs escaping like this:

  • < => &lt;
  • > => &gt;
  • & => &amp;
  • " => &quot;
  • ' => &apos;

This is more escaping than is strictly necessary because " is always used a few lines later, but it aligns with common practices in WordPress and the other HTML API escaping.

$updated_attribute = "{$name}=\"{$escaped_new_value}\"";

Copy link
Member

@dmsnell dmsnell Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t trust the PHP man page descriptions: here is what it says in a special called-out note:

Note that this function does not translate anything beyond what is listed above. For full entity translation, see htmlentities()

but it does more

$s = "z\x00\x95\xC2\x96b\xED\xa0\x80"; // letter, null, invalid byte, C1 control, letter, surrogate half

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 ) );
string(0) ""

// minimal subpart leaves us with four substitute characters
// (what looks like the second substitute character is being display by the browser, not the bytes forming U+FFFD, as it renders a C1 control)
var_dump( mb_scrub( $s, 'UTF-8' ) );
string(17) "z��b���"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE ) );
string(11) "z��b�"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_IGNORE ) );
string(5) "z�b"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED ) );
string(0) ""

given that this is yet-again another PHP function which doesn’t do what it says, and that we generally rely on the browser in the HTML API to handle invalid encodings, maybe we should avoid this here and update set_modifiable_text() to only replace those characters you mention. strtr() and str_replace() should both be fast, I would expect, and possibly faster than htmlspecialchars() anyway, and it doesn’t require deep dives into the PHP source code or creating complex test cases to make the behavior obvious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed both escapes to use strtr() in 1435b6e.

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@sirreal
Copy link
Member Author

sirreal commented Oct 6, 2025

The "double encode" behavior of esc_attr() (which really comes from htmlspecialchars()) is very strange when considering named character references that appear with and without a trailing ;. They're both character references and will be decoded by the browser (it's slightly more complex) but the double-encoded behavior does not treat them the same. Here's a good demo with some explanation.

@sirreal sirreal requested review from dmsnell and westonruter October 6, 2025 10:17
@sirreal

This comment was marked as resolved.

*/
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value );
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true )
? esc_url( $value )
Copy link
Member

@westonruter westonruter Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The esc_url() function also does some stuff with entities, namely:

// Replace ampersands and single quotes only when displaying.
if ( 'display' === $_context ) {
$url = wp_kses_normalize_entities( $url );
$url = str_replace( '&amp;', '&#038;', $url );
$url = str_replace( "'", '&#039;', $url );
}

To avoid that, I think using esc_url_raw() is preferable:

Suggested change
? esc_url( $value )
? htmlspecialchars( esc_url_raw( $value ) )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is only a formatting change, I broke the ternary onto three lines.

Can we consider that suggestion in a separate ticket and PR?

Copy link
Member

@westonruter westonruter Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but it it is related to the double-escaping issue, because esc_url() does wp_kses_normalize_entities() which includes converting all & to &amp; among other things. For example, these:

var_dump( esc_url( 'http://example.com/?foo&amp;bar' ) );
var_dump( esc_url( 'http://example.com/?foo&#038;bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );

All result in:

string(32) "http://example.com/?foo&#038;bar"
string(32) "http://example.com/?foo&#038;bar"
string(32) "http://example.com/?foo&#038;bar"

I don't really see why a URI attribute should be treated differently at the HTML API layer. So this could instead just be:

$escaped_new_value = strtr(
	$value,
	array(
		'<' => '&lt;',
		'>' => '&gt;',
		'&' => '&amp;',
		'"' => '&quot;',
		"'" => '&apos;',
	)
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right that the double-escaping issue also exists in esc_url(), that's unfortunate. esc_url() does a lot of things, it doesn't seem as easily replaced as esc_attr() or esc_html(). I suspect more discussion will be necessary regarding URL-related changes which is why I'm reluctant to include those changes here.

See r58472 and r58473.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sirreal why not use &#38; for ampersand to avoid issues in kses? are you not worried about this content then being corrupted on further passes through the sanitizers?

@sirreal
Copy link
Member Author

sirreal commented Oct 8, 2025

&amp; seemed like the most common, expected, and legible encoding. r60616 addressed the KSES character reference issues I was aware aware of. Do you think we should use the numeric form?

@dmsnell
Copy link
Member

dmsnell commented Oct 8, 2025

Do you think we should use the numeric form?

Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now.

*
* For string attributes, the value is escaped using the `esc_attr` function.
* HTML escaping will be performed for string attribute values. The input value should
* not be unescaped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a confusing double-negative here.
can we rearrange this so it’s a positive indication of what to do, rather than a negative impression of what not to do?

 * This function handles all HTML escaping: send normal and unescaped string values.
 * The HTML API will convey input strings such that a browser will read the decoded
 * string values as verbatim copies to what is provided here.
 *
 * Example:
 *
 *     // Renders “Eggs & Milk” in a browser, encoded as “Eggs &amp; Milk”.
 *     $processor->set_attribute( 'title', 'Eggs & Milk' );
 *
 *     // Renders “Eggs &amp; Milk” in a browser, encoded as “Eggs &amp;amp; Milk”.
 *     $processor->set_attribute( 'title', 'Eggs &amp; Milk' );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated the description in 0db1374, happy to make further adjustments.

@dmsnell
Copy link
Member

dmsnell commented Oct 9, 2025

I’ve expanded the examples even further to try and make it even more obvious what’s going on when character references are sent into the function. Additionally I have copied this update over to the HTML Processor, as the same changes apply there.

@dmsnell
Copy link
Member

dmsnell commented Oct 9, 2025

Also I’ve added a similar note in the docblock for set_modifiable_test(). This method only exists in the Tag Processor so there’s no corresponding update in the HTML Processor.

pento pushed a commit that referenced this pull request Oct 9, 2025
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent.

After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped.

Developed in #10143
Discussed in https://core.trac.wordpress.org/ticket/64054

Props dmsnell, jonsurrell, westonruter.
Fixes #64054.


git-svn-id: https://develop.svn.wordpress.org/trunk@60919 602fd350-edb4-49c9-b593-d223f7449a82
@github-actions
Copy link

github-actions bot commented Oct 9, 2025

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60919
GitHub commit: 4892d46

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Oct 9, 2025
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Oct 9, 2025
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent.

After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped.

Developed in WordPress/wordpress-develop#10143
Discussed in https://core.trac.wordpress.org/ticket/64054

Props dmsnell, jonsurrell, westonruter.
Fixes #64054.

Built from https://develop.svn.wordpress.org/trunk@60919


git-svn-id: http://core.svn.wordpress.org/trunk@60255 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Oct 9, 2025
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent.

After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped.

Developed in WordPress/wordpress-develop#10143
Discussed in https://core.trac.wordpress.org/ticket/64054

Props dmsnell, jonsurrell, westonruter.
Fixes #64054.

Built from https://develop.svn.wordpress.org/trunk@60919


git-svn-id: https://core.svn.wordpress.org/trunk@60255 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@sirreal sirreal deleted the trac-64054 branch November 4, 2025 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants