-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Double escape character refs when setting attribute values #10143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
| $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value ); | ||
| $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) | ||
| ? esc_url( $value ) | ||
| : htmlspecialchars( $value, ENT_QUOTES | ENT_HTML5 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches how set_modifiable_text() escapes:
wordpress-develop/src/wp-includes/html-api/class-wp-html-tag-processor.php
Lines 3756 to 3761 in abc5dd8
| if ( self::STATE_TEXT_NODE === $this->parser_state ) { | |
| $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( | |
| $this->text_starts_at, | |
| $this->text_length, | |
| htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 ) | |
| ); |
This performs escaping like this:
<=><>=>>&=>&"=>"'=>'
This is more escaping than is strictly necessary because " is always used a few lines later, but it aligns with common practices in WordPress and the other HTML API escaping.
| $updated_attribute = "{$name}=\"{$escaped_new_value}\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t trust the PHP man page descriptions: here is what it says in a special called-out note:
Note that this function does not translate anything beyond what is listed above. For full entity translation, see
htmlentities()
but it does more
$s = "z\x00\x95\xC2\x96b\xED\xa0\x80"; // letter, null, invalid byte, C1 control, letter, surrogate half
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 ) );
string(0) ""
// minimal subpart leaves us with four substitute characters
// (what looks like the second substitute character is being display by the browser, not the bytes forming U+FFFD, as it renders a C1 control)
var_dump( mb_scrub( $s, 'UTF-8' ) );
string(17) "z��b���"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE ) );
string(11) "z��b�"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_IGNORE ) );
string(5) "z�b"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED ) );
string(0) ""given that this is yet-again another PHP function which doesn’t do what it says, and that we generally rely on the browser in the HTML API to handle invalid encodings, maybe we should avoid this here and update set_modifiable_text() to only replace those characters you mention. strtr() and str_replace() should both be fast, I would expect, and possibly faster than htmlspecialchars() anyway, and it doesn’t require deep dives into the PHP source code or creating complex test cases to make the behavior obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed both escapes to use strtr() in 1435b6e.
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
The HTML5 encoding uses ' instead of the numeric character reference.
|
The "double encode" behavior of |
This comment was marked as resolved.
This comment was marked as resolved.
| */ | ||
| $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value ); | ||
| $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) | ||
| ? esc_url( $value ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The esc_url() function also does some stuff with entities, namely:
wordpress-develop/src/wp-includes/formatting.php
Lines 4522 to 4527 in fe2b33f
| // Replace ampersands and single quotes only when displaying. | |
| if ( 'display' === $_context ) { | |
| $url = wp_kses_normalize_entities( $url ); | |
| $url = str_replace( '&', '&', $url ); | |
| $url = str_replace( "'", ''', $url ); | |
| } |
To avoid that, I think using esc_url_raw() is preferable:
| ? esc_url( $value ) | |
| ? htmlspecialchars( esc_url_raw( $value ) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is only a formatting change, I broke the ternary onto three lines.
Can we consider that suggestion in a separate ticket and PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, but it it is related to the double-escaping issue, because esc_url() does wp_kses_normalize_entities() which includes converting all & to & among other things. For example, these:
var_dump( esc_url( 'http://example.com/?foo&bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );All result in:
string(32) "http://example.com/?foo&bar"
string(32) "http://example.com/?foo&bar"
string(32) "http://example.com/?foo&bar"I don't really see why a URI attribute should be treated differently at the HTML API layer. So this could instead just be:
$escaped_new_value = strtr(
$value,
array(
'<' => '<',
'>' => '>',
'&' => '&',
'"' => '"',
"'" => ''',
)
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right that the double-escaping issue also exists in esc_url(), that's unfortunate. esc_url() does a lot of things, it doesn't seem as easily replaced as esc_attr() or esc_html(). I suspect more discussion will be necessary regarding URL-related changes which is why I'm reluctant to include those changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds good.
dmsnell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sirreal why not use & for ampersand to avoid issues in kses? are you not worried about this content then being corrupted on further passes through the sanitizers?
|
|
Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now. |
| * | ||
| * For string attributes, the value is escaped using the `esc_attr` function. | ||
| * HTML escaping will be performed for string attribute values. The input value should | ||
| * not be unescaped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a confusing double-negative here.
can we rearrange this so it’s a positive indication of what to do, rather than a negative impression of what not to do?
* This function handles all HTML escaping: send normal and unescaped string values.
* The HTML API will convey input strings such that a browser will read the decoded
* string values as verbatim copies to what is provided here.
*
* Example:
*
* // Renders “Eggs & Milk” in a browser, encoded as “Eggs & Milk”.
* $processor->set_attribute( 'title', 'Eggs & Milk' );
*
* // Renders “Eggs & Milk” in a browser, encoded as “Eggs &amp; Milk”.
* $processor->set_attribute( 'title', 'Eggs & Milk' );There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated the description in 0db1374, happy to make further adjustments.
|
I’ve expanded the examples even further to try and make it even more obvious what’s going on when character references are sent into the function. Additionally I have copied this update over to the HTML Processor, as the same changes apply there. |
|
Also I’ve added a similar note in the docblock for |
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent. After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped. Developed in #10143 Discussed in https://core.trac.wordpress.org/ticket/64054 Props dmsnell, jonsurrell, westonruter. Fixes #64054. git-svn-id: https://develop.svn.wordpress.org/trunk@60919 602fd350-edb4-49c9-b593-d223f7449a82
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent. After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped. Developed in WordPress/wordpress-develop#10143 Discussed in https://core.trac.wordpress.org/ticket/64054 Props dmsnell, jonsurrell, westonruter. Fixes #64054. Built from https://develop.svn.wordpress.org/trunk@60919 git-svn-id: http://core.svn.wordpress.org/trunk@60255 1a063a9b-81f0-0310-95a4-ce76da25c4cd
The HTML API has relied on `esc_attr()` and `esc_html()` when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent. After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped. Developed in WordPress/wordpress-develop#10143 Discussed in https://core.trac.wordpress.org/ticket/64054 Props dmsnell, jonsurrell, westonruter. Fixes #64054. Built from https://develop.svn.wordpress.org/trunk@60919 git-svn-id: https://core.svn.wordpress.org/trunk@60255 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: https://core.trac.wordpress.org/ticket/64054
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.