Skip to content

Conversation

@sirreal
Copy link
Member

@sirreal sirreal commented Nov 21, 2024

This is not ready for final review but is ready for early feedback, especially around the open questions listed below.

Introduce CSS selector based traversal of HTML documents in the HTML API. Add new select_all and select methods to the Tag Processor and HTML Processor.

// With select_all to traverse a document stopping on matching selectors
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
foreach ( $processor->select_all( 'p, [att], em > *' ) as $_ ) {
	assert( $processor->get_attribute( 'match' ) );
}

// With select to move to a matching selector
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
assert( $processor->select( 'p, [att], em > *' ) );
assert( $processor->get_attribute( 'match' ) );
assert( 'P' === $processor->get_tag() );

A subset of the CSS selector grammar is available as described here:

* This class is analogous to <compound-selector-list> in the grammar. The supported grammar is:
*
* <selector-list> = <complex-selector-list>
* <complex-selector-list> = <complex-selector>#
* <compound-selector-list> = <compound-selector>#
* <complex-selector> = [ <type-selector> <combinator>? ]* <compound-selector>
* <compound-selector> = [ <type-selector>? <subclass-selector>* ]!
* <combinator> = '>' | [ '|' '|' ]
* <type-selector> = <ident-token> | '*'
* <subclass-selector> = <id-selector> | <class-selector> | <attribute-selector>
* <id-selector> = <hash-token>
* <class-selector> = '.' <ident-token>
* <attribute-selector> = '[' <ident-token> ']' |
* '[' <ident-token> <attr-matcher> [ <string-token> | <ident-token> ] <attr-modifier>? ']'
* <attr-matcher> = [ '~' | '|' | '^' | '$' | '*' ]? '='
* <attr-modifier> = i | s

Notable variations from selectors specification:

Pseudo-element selectors are not supported. Pseudo elements will not exist in the HTML and it's unclear what benefit they would bring.

Pseudo-class selectors are not supported. Pseudo classes could be useful, but the logic to parse and match pseudo-class selectors would add significant complexity. There's also a lot of variety in pseudo-selectors, and rather than supporting simpler selectors (e.g. :empty) and not supporting more complex selectors (e.g. :nth-child()), pseudo-class selectors are completely unsupported. This is a clear and simple rule that greatly simplifies the implementation.

Complex selectors have limited support. Complex selectors are combined selectors using one of the combinators: (whitespace), >, +, or ~. The Tag Processor does not support complex selectors at all, it has no concept of document structure, and all complex selectors are structural. The HTML Processor does not support the sibling combinators + or ~, it only supports a parent/child or ancestor/descendant relationship. These selectors can be handled without tracking additional state in the document by analyzing breadcrumbs. Finally, only type selectors are allowed in non-final position, again because this allows matching against breadcrumbs without tracking additional state:

  • Supported: body heading > h1.page-title[attribute]
  • Unsupported (class / id selector in non final position): #page > main, .page main
  • Unsupported (sibling selectors not supported): ul li ~ li, ul li + li.

Importantly, the selectors supported by the HTML Processor should be sufficient to support all core block attribute selectors according to this PR:

/**
* Existing Core Selectors
*
* .blocks-gallery-caption
* .blocks-gallery-item
* .blocks-gallery-item__caption
* .book-author
* .message
* a
* a[download]
* audio
* blockquote
* cite
* code
* div
* figcaption
* figure > a
* figure a
* figure img
* figure video,figure img
* h1,h2,h3,h4,h5,h6
* img
* li
* ol,ul
* p
* pre
* tbody tr
* td,th
* tfoot tr
* thead tr
* video
*/

Most of the listed selectors are also supported by the Tag Processor with the exception of the complex selectors:

figure > a
figure a
figure img
figure video,figure img
tbody tr
tfoot tr
thead tr

Open questions

Implementation

The implementation introduces a number of classes. The classes roughly correspond to different parts of the selector grammar. Parsing is handled in the _list classes that represent the top of the grammar. Matching logic is handled by each selector class. The selector classes implement a matches interface.

I'm not happy with how the implementation is distributed. I'd like to move in one of two directions:

Move the parsing logic for selectors into the selector classes, so each selector class is responsible for parsing itself.

I've implemented this change.

OR

Move matching logic into the top level *_list classes to that the lower level selectors are only data containers.

My preference at this time (and the original implementation) would be the former.

Selector traversal APIs

  • select_all is implemented as a generator and (except _doing_it_wrong) has no way to differentiate between "nothing matched" and "the selector was invalid or unsupported". It simply yields nothing in both cases.
  • select uses select_all internally and has the same limitation.
  • select_all expects the document position to remain the same in order to visit all matching tags. Because none of the supported selectors rely on stateful logic, this is not an issue at this time.

Todo:

  • Get and address feedback on open questions.
  • Split into smaller PRs for review. Specifically, the selector lists can be split into sepearate PRs starting with compound and following with complex selectors.

sirreal/html-api-debugger#5 can be used for testing the parsing and matching behavior.

Trac ticket: https://core.trac.wordpress.org/ticket/62653


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@sirreal sirreal force-pushed the html-api/add-css-selector-parser branch 2 times, most recently from 184ad41 to 8d9d9e0 Compare November 25, 2024 17:33
Comment on lines 30 to 52
* - The following combinators:
* - descendant (e.g. `.parent .descendant`)
* - child (`.parent > .child`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will be supported initially. Maybe no combinators at all, only simple selectors.

These two combinators can probably be supported by the HTML API as long as they non-final selector is an element selector:

  • div > .className ✅ supported
  • .className > div ⛔️ not supported

Tags can be "seen" via bookmarks, while things like IDs, classes, attributes etc. would require seeking or more advanced internal tracking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the problem that we'd need to keep track of all the attributes requested by the selector in the entire breadcrumbs trail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. Either breadcrumbs would need to store all of their attributes or we'd need to seek around the document to check them which would be very costly.

All of the supported selectors are things that can be checked from given tag in the document without any seeking as the HTML API is implemented now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we knew the selector upfront, we could always store the attributes needed to match it, e.g. store the class attribute if we know we'll only ever look for div > .className. Only that wouldn't be practical.

However, keeping track of classnames in all the breadcrumbs elements seems okay-ish resource-wise.

Perhaps we could even have a "CSS-enabled" mode where we'd keep track of all the breadcrumbs attributes shorter than, say, 100 bytes? It wouldn't be perfect but should suffice for most attribute-based matches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be ways to support more selectors, one good idea is to set some state to track data for certain selectors as the document is traversed.

One tricky thing with that implementation is that it may be necessary to return to the start of the document depending on the selectors used and scan again from there so that the necessary data is collected during traversal. This probably involves analyzing the selectors, deciding what data needs to be stored, seeking to the beginning if necessary, seeking back to the current location, and then trying to find the selector.

The complexity here is significant. It can be overcome, but I think the set of selectors supported in this PR as it is now is a good starting place with very limited matching complexity. Iteration to improve support can be added in future PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Everything I said I meant for the future, no need to hold up this work :)

On seeking - sounds reasonable, although it won't combine well with streaming. What would an example of a selector that requires backtracking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would an example of a selector that requires backtracking?

This is mostly necessary to support selecting from arbitrary positions in the document. If we knew we were at the start of the document, we could determine what additional state needed to be tracked and start collecting it.

  • The sibling combinators + and ~ select based on preceding elements, so it would be necessary to back up to check for matches.
  • Support for subclass selectors (ID, class, and attribute) in non-final position, like #my-id div would require backing up to parse some attributes along with breadcrumbs.
  • If pseudo-class selectors were added, I'd expect many of them to require more understanding of the document structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in earlier work I did I actually proposed a list-interface for selecting elements and that list took all selectors of interest. that’s not a great solution to mandate, but it does allow for up-front analysis of the set of selectors, and then a crawl to find each one in succession.

I was going to make a comment about select_all() and the generator it yields and how I worry it miscommunicates the ability to chain selection at that point. if we start selecting a different selector from within the foreach, we move the underlying HTML Processor and now it will select a different element on the next iteration than it would have (potentially) if we hadn’t called one from inside.

to that end, if the underlying HTML Processor is moving, can we not get select_all() behavior simply by calling select() in succession in while ( $processor->select() )?

then, one thing is more explicit, that if we call select() from inside that loop, we kind of acknowledge we might change the outer loop because the $processor is shared and stateful.

it doesn’t solve the problem of the unknown necessary context from above.

thinking that the FirTree representation might be able to help us by tracking attribute spans, making it faster to re-scan parent nodes, but also we could say someone can register selectors before traversal, and we maintain the relevant attributes as we go, regardless of which select() is running…

@sirreal sirreal force-pushed the html-api/add-css-selector-parser branch 3 times, most recently from 370539b to 9ecaab5 Compare November 29, 2024 15:25
@sirreal sirreal force-pushed the html-api/add-css-selector-parser branch from 8659898 to e5e4c7e Compare December 4, 2024 20:57
@sirreal sirreal changed the title HTML API: Add CSS selector parsing HTML API: Add CSS selector support Dec 5, 2024
@sirreal sirreal force-pushed the html-api/add-css-selector-parser branch from e5e4c7e to b7e032e Compare December 5, 2024 13:52
Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some leftover thoughts we can get back to later, or in person


if ( null === $this->value ) {
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here it reads as if null is a sentinel value for “as long as the attribute exists” but without reading the code I would have assumed passing null would mean “only match if the attribute doesn’t exist on the tag”

what do you think about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribute selectors have the form [attr] (presence) or one of the matching forms like [att=val] (att attribute is an exact match for the value val).

The [attr] form is represented here with a null matcher (there's no matching logic) as well as a null value (there's no value to match). Those must either both be null or there must be both a matcher and a value.

I believe this is interchangeable here in case it's clearer:

if ( null === $this->matcher ) {
	return true;
}

Another option would be to add another matcher value like MATCH_PRESENCE to indicate that type of matcher.

Either way, the null value seems appropriate to me. The selector does not have a value to match. There is no negative attribute selection.

This may be a question of perspective, for example the HTML API returns null if a an unset attribute is requested, true for a boolean attribute, or a string for an attribute with a value. Attribute selectors don't follow the same logic. There's no way to only match a boolean or to match elements without an attribute, so I don't think we should try to align on those types of values. There may or may not be a string value to match.

This should mostly be an internal detail, it's not something users should really be interacting with.

I'm open to discuss this further. Either way, it's likely a good idea to clarify the null value in this property's documentation block.

return true;
}
}
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this seems fine for now, I suspect that at some point we will prefer to crawl through the attribute value comparing as we go with substr_compare() rather than allocate each substring. we can remember that in some documents we see DoS attacks due to long attribute values.

$this->whitespace_delimited_list() could also return a starting byte offset into the value instead of the substring, which would resolve this without changing much of the calling interface.

* will be updated if the parse is successful.
* @return static|null The selector instance, or null if the parse was unsuccessful.
*/
public static function parse( string $input, int &$offset ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not immediately a fan of mutating the offset passed into the function. did you consider some other examples of passing in something like &$bytes_parsed? for example, the HTML decoder does this which leaves the input variables untouched while still communicating &$match_byte_length

}
++$updated_offset;

self::parse_whitespace( $input, $updated_offset );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above: would be nice if instead of parse_whitespace we had this function return the length of the whitespace bytes and then this line could use the same variable names as the HTML API with

$at += self::skip_whitespace( $input, $at );

$attr_matcher = WP_CSS_Attribute_Selector::MATCH_SUFFIXED_BY;
$updated_offset += 2;
break;
case '*':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I explored this long ago, I actually felt like the symbols in use in CSS provided reasonable literal values in the code vs. the use of consts. it didn’t have the same explanatory power in the code, but it provided a more direct correspondence with the CSS, which was nice, plus involved fewer translations/CPU cycles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to that, what do you have in mind, a single character?

  • =: exact
  • *: contains
  • etc.

@sirreal sirreal force-pushed the html-api/add-css-selector-parser branch from d7e840c to 96c9d09 Compare May 30, 2025 09:45
sirreal and others added 17 commits July 10, 2025 19:27
This is a more appropriate name for the type of match.

> `[att|=val]` Represents an element with the att attribute, its value
> either being exactly "val" or beginning with "val" immediately
> followed by "-" (U+002D). This is primarily intended to allow language
> subcode matches (e.g., the hreflang attribute on the a element in
> HTML) as described in BCP 47 ([BCP47]) or its successor.
Matches the calling interface for the other HTML API classes, avoids
creating the `Generator`, using a static var to avoid re-parsing the
selector string instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants