Make WordPress Core

Opened 18 months ago

Closed 3 weeks ago

#61401 closed feature request (fixed)

Blocks: Efficiently find and traverse blocks in a document.

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version:
Component: Editor Keywords: has-patch has-unit-tests commit dev-reviewed
Focuses: Cc:

Description

The existing block parser reliably parses blocks, but it's also a heavy operation, involving a full parse of the entire document, the construction of a full block tree which includes multiple copies of different spans of the source content, and the JSON-parsing of every span of block attributes.

In many situations it's only necessary to find specific blocks within a document, or find where they start or end, or build a map of document structure without needing the full parse.

WordPress should provide a reliable and efficient way to traverse the Blocks in an HTML document in a streaming manner.

Change History (40)

This ticket was mentioned in PR #6760 on WordPress/wordpress-develop by @dmsnell.


18 months ago
#1

  • Keywords has-patch added

Trac ticket: Core-61401

In this patch two new functions are introduced for the purpose of returning a PCRE pattern that can be used to quickly and efficiently find blocks within an HTML document without having to parse the entire document and without building a full block tree.

These new functions enable more efficient processing for work that only needs to examine document structure or know a few things about a document without knowing everything, including but not limited to:

  • Finding the URL of the first image block in a document.
  • Inserting hooked blocks.
  • Analyzing block counts.

This ticket was mentioned in PR #6760 on WordPress/wordpress-develop by @dmsnell.


6 months ago
#2

Trac ticket: Core-61401.
Theme: _Everything could be streaming, single-pass, reentrant, and low-overhead._

## Todo

  • [ ] While this is developed here, any changes to the Block Parser need to coordinate with the package in the Gutenberg repository.

## Breaking changes

  • When encountering a block delimiter with a closing flag and also a void flag, the existing parser prefers returning as a void block, but this returns the block closer. This is an edge case when things are already erroneous, but it makes more sense to me when writing this that we should prefer closing to introducing a void, as the void flag is more likely to be a mistake, and because if we treat a closer as a void we could lead to deep chains of unclosed blocks. This is something I'd like to re-examine as a whole with the block parsing, taking lessons from HTML's stack machine, but not in this change (for example, treat it as a closer if there's an open block of the given name).

## Related

  • Core-45312 where whitespace-only blocks are surprising with parse_blocks()

## Summary

In this patch two new functions are introduced for the purpose of returning a PCRE pattern that can be used to quickly and efficiently find blocks within an HTML document without having to parse the entire document and without building a full block tree.

These new functions enable more efficient processing for work that only needs to examine document structure or know a few things about a document without knowing everything, including but not limited to:

  • Finding the URL of the first image block in a document.
  • Inserting hooked blocks.
  • Analyzing block counts.

Further, a new class is introduced to further manage the process of finding block comment delimiters, one based on a hand-crafted parser designed for high performance: WP_Parsed_Block_Delimiter_Info.

This class provides a number of conveniences:

  • It performs zero allocations beyond a static set of numeric indices.
  • It holds onto the reference of the text it scanned, but can be detached to release that text. When detaching, it creates a substring of the text containing the full delimiter match.
  • It can indicate if the delimiter is for a given block type without performing any allocations.
  • It returns a lazy JSON parser by default for the attributes (not implemented yet) for more efficient interaction with the block attributes.
  • Inasmuch as is possible, all costs are explicit and only paid when requested by the calling code.

https://github.com/WordPress/wordpress-develop/assets/5431237/aa7b122c-30ad-469e-83a1-c8b32a57bc1f

## Example

// Get the first image in a post with the PCRE pattern.
while ( 1 === preg_match( get_named_block_delimiter_regex( 'image' ), $post_content, $matches, null, $at ) ) {
        if ( '/' === $matches['closer'] ) {
                $at += strlen( $matches[0] );
                continue;
        }
        
        $attrs = json_parse( $matches['attrs'] );
        if ( isset( $attrs['url'] ) ) {
                return $attrs['url'];
        }
}

return null;
// Get the first image in a post with the utility class.
$image = null;
$at    = 0;
while ( ! isset( $image ) ) {
        $image = WP_Parsed_Block_Delimiter_Info::next_delimiter( $post_content, $at, $next_delimiter_at, $next_delimiter_length );
        if (
                'opener' === $image->get_delimiter_type() &&
                $image->is_block_type( 'core/image' )
        ) {
                break;
        }

        $image = null;
        $at    = $next_delimiter_at + $next_delimiter_length;
}

This ticket was mentioned in PR #9105 on WordPress/wordpress-develop by @dmsnell.


6 months ago
#3

  • Keywords has-unit-tests added

Replaces #6760

Trac ticket: Core-61401

The Block Scanner follows the HTML API in providing a streaming,
near-zero-overhead, lazy, re-entrant parser for traversing block
structure. This class provides an alternate interface to
parse_blocks() which is more amenable to a number of common
server-side operations on posts, such as:

  • Generating an excerpt from only the first N blocks in a post.
  • Determining which block types are present in a post.
  • Determining which posts contain a block of a given type.
  • Generating block supports content for a post.

#4 @dmsnell
6 months ago

With over a year in the meantime and significant learning along the way, I am introducing a new proposal for a WP_Block_Scanner class, which achieves the goal of this ticket.

#5 @Mamaduka
4 months ago

Just finished a quick read on unit tests, and I like what I'm seeing. Fantastic work, @dmsnell!

I really liked the examples for the initial PR. Maybe we should include similar examples for WP_Block_Scanner and showcase the real world applications for core/extenders.

This ticket was mentioned in PR #9841 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#6

Trac ticket: Core-61401
See: #9105

Adds a higher-level sibling to the Block Scanner class, providing block semantics over block delimiter traversal.

This ticket was mentioned in PR #9842 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#7

Trac ticket: Core-61401
See: #9105, #9841, (#9842), #9843

This ticket was mentioned in PR #9843 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#8

Trac ticket: Core-61401
See: #9105, #9841, #9842, (#9843)

@gziolo commented on PR #9105:


2 months ago
#10

Can you elaborate on your thinking process for splitting into WP_Block_Scanner and WP_Block_Processor? I'm not sure whether the latter contains anything unique at this point.

@dmsnell commented on PR #9105:


2 months ago
#11

thinking process for splitting into WP_Block_Scanner and WP_Block_Processor?

this is great feedback here because I don’t have a solid feel for splitting or combining them. the core of it is the question of intent: does calling code want to scan the delimiter's or does the calling code want to navigate through the blocks?

this is the same with the HTML API:

  • the Tag Processor only visits real textual tokens, and it visits tokens which may be ignored, ones which may be ignored contextually even, when parsed as HTML.
  • the HTML Processor visits tokens that aren’t in the text itself, implicit openers and closers defined by the parser. it also renames a tag and performs auto-closing.

now with blocks there is far less that the higher-level cousin needs to do:

  • auto-close blocks when parsing errors are present
  • separate innerHTML from freeform HTML content
  • create block structure as parse_blocks() would produce, which _could_ be built out of the class too

the performance cliff between these two is substantially smaller than with HTML tokenization and parsing. is it worth the separation or not? is the intent-cliff shallow enough? right now I think it’s at least working through the new questions that arise from a single class to see if it seems to fall into place once done, but I’m not ruling out separating them again.

@dmsnell commented on PR #9105:


2 months ago
#12

@gziolo something I was thinking about on my walk this morning was an idea I’ve ignored for a while, but maybe is worth it here: by default we skip freeform HTML content. the way we could signal to visit it is not with a boolean attribute, but using a special block type in next_delimiter().

// visit all explicit block delimiters
$scanner->next_delimiter();

// visit all block delimiters plus freeform blocks.
$scanner->next_delimiter( '*' );

// visit all freeform blocks: a possible special token meaning “no explicit block”
$scanner->next_delimiter( '#freeform' );

// visit all HTML spans, including `innerHTML`
$scanner->next_token();

this would collapse the difference between next_token() and next_delimiter() to whether they visit inner HTML. for any performance ideas about visiting freeform blocks, we could indicate that from within next_delimiter() and keep it internal.

I will try and explore this angle.

@gziolo commented on PR #9105:


2 months ago
#13

// visit all freeform blocks: a possible special token meaning “no explicit block”
$scanner->next_delimiter( '#freeform' );

We have this concept of freeform content handler in JavaScript which you can even modify with wp.blocks.setFreeformContentHandlerName and it defaults to core/freeform. I don't think we ever did anything like that on the server, so either #freeform or core/freeform resonates with me. Maybe folks could even be able to set this alias because I noticed that, for example, in the widgets editor, the freeform handler gets set to core/html. It probably is safer to go more meta with #freeform or ::freeform (like CSS pseudo-elements).

@dmsnell commented on PR #9105:


2 months ago
#14

We have this concept of freeform content handler in JavaScript which you can even modify with wp.blocks.setFreeformContentHandlerName and it defaults to core/freeform. I don't think we ever did anything like that on the server

That’s something I wish had been more hard-coded and less extensible, based on my work on the two-stage block parsing and loading inside the block editor, both in WordPress and in Tumblr. But I guess JS is different because we have to load _something_ into the editor, whereas on the server a _non-block_ can remain _not a block_.

Since yesterday I’ve been using * and deferring an approach which visits only the HTML spans. Something that makes me nervous about using #freeform is that we have a relatively short timeline if we want to get this in for 6.9 and properly choosing a name for those seems rushed. It would be an easy thing to add later, I would hope. Something I didn’t mention is why I proposed that syntax:

  • block names cannot start with or have the octothorp character anywhere
  • this mirrors the #text and other _token type_ descriptions in the HTML API

When I left off last night I was working on my extract_block() method, whose tests were failing. I think there is still some ambiguity in the code around the distinction between innerHTML and freeform blocks. What I plan on doing today is to try and do something I’ve been avoiding because of the increased internal state requirements: track HTML spans and pause when we match them. Currently this has all worked by recognizing HTML spans only after matching the next delimiter. With a special state value we can then _visit_ the HTML and move on without re-scanning. There are some complications doing this with incomplete inputs as well, so it’s probably good and time to make this change.

Hopefully that clears up all of the remaining issues.

@dmsnell commented on PR #9105:


2 months ago
#15

This has been updated significantly to clarify the interface and to merge the block scanner and block processor into one.

It would be really helpful to have review over the public methods, their docblocks, and the usage in the test suite.

I plan on going back to update the class/module docblock with more substantial general guidance for the class.

Also I plan on doing a separate type annotation pass, so please disregard any comments currently on type annotations.

@dmsnell commented on PR #9105:


2 months ago
#16

With significant haste in the eleventh hour I have finished making some major changes to this proposal:

  • There is now only a single WP_Block_Processor class (thanks @swissspidy for renaming the PR).
  • This singular class differentiates between freeform content and inner HTML, making it possible for calling code to inquire which is which.
    • A new block type wildcard * can be passed to the next_ functions to ensure that they visit top-level freeform content.
  • The docs have all been updated, names have been changed, and type annotations have been added.
  • The processor treats HTML spans like void blocks; previously it created a separate opener and closer.
  • Almost all references to static:: have been replaced by self::. After consideration, as a default first step I think it would be wise to treat most of a this class as final and discourage subclasses from reimplementing methods like next_token() the way subclasses do in the HTML API. Without _needing_ a higher-level subclass this makes it easier.
    • The exception are a few methods intended to be replaced by subclasses, including the missing lazy JSON parser for static::get_attributes().

Personally I’m quite happy with this revision and I thank you all for your input on this. I think it’s in a much more coherent and usable state now. At this point I believe it’s ready for final review and merge, though I think it would be best only to block merging now for design issues with the class, the methods, the things which must be maintained _ad infinitum_. For other issues, I would prefer if we can address them during the beta phase after merging.

Of course, I invite all feedback and will address things if I can in a timely manner, but right now I would love to see this out there in trunk to get it in use and testing.

#17 @dmsnell
2 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 60939:

Blocks: Introduce WP_Block_Processor for efficiently parsing blocks.

The Block Processor follows the HTML API in providing a streaming, near-zero-overhead, lazy, re-entrant parser for traversing block structure. This class provides an alternate interface to parse_blocks() which is more amenable to a number of common server-side operations on posts, especially those needing to operate on only a part of a full post.

Developed in https://github.com/WordPress/wordpress-develop/pull/9105
Discussed in https://core.trac.wordpress.org/ticket/61401

Props dmsnell, gziolo, jonsurrell, soean, tjnowell, westonruter.
Fixes #61401.

This ticket was mentioned in PR #6760 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#18

Trac ticket: Core-61401.
Theme: _Everything could be streaming, single-pass, reentrant, and low-overhead._

## Todo

  • [ ] While this is developed here, any changes to the Block Parser need to coordinate with the package in the Gutenberg repository.

## Breaking changes

  • When encountering a block delimiter with a closing flag and also a void flag, the existing parser prefers returning as a void block, but this returns the block closer. This is an edge case when things are already erroneous, but it makes more sense to me when writing this that we should prefer closing to introducing a void, as the void flag is more likely to be a mistake, and because if we treat a closer as a void we could lead to deep chains of unclosed blocks. This is something I'd like to re-examine as a whole with the block parsing, taking lessons from HTML's stack machine, but not in this change (for example, treat it as a closer if there's an open block of the given name).

## Related

  • Core-45312 where whitespace-only blocks are surprising with parse_blocks()

## Summary

In this patch two new functions are introduced for the purpose of returning a PCRE pattern that can be used to quickly and efficiently find blocks within an HTML document without having to parse the entire document and without building a full block tree.

These new functions enable more efficient processing for work that only needs to examine document structure or know a few things about a document without knowing everything, including but not limited to:

  • Finding the URL of the first image block in a document.
  • Inserting hooked blocks.
  • Analyzing block counts.

Further, a new class is introduced to further manage the process of finding block comment delimiters, one based on a hand-crafted parser designed for high performance: WP_Parsed_Block_Delimiter_Info.

This class provides a number of conveniences:

  • It performs zero allocations beyond a static set of numeric indices.
  • It holds onto the reference of the text it scanned, but can be detached to release that text. When detaching, it creates a substring of the text containing the full delimiter match.
  • It can indicate if the delimiter is for a given block type without performing any allocations.
  • It returns a lazy JSON parser by default for the attributes (not implemented yet) for more efficient interaction with the block attributes.
  • Inasmuch as is possible, all costs are explicit and only paid when requested by the calling code.

https://github.com/WordPress/wordpress-develop/assets/5431237/aa7b122c-30ad-469e-83a1-c8b32a57bc1f

## Example

// Get the first image in a post with the PCRE pattern.
while ( 1 === preg_match( get_named_block_delimiter_regex( 'image' ), $post_content, $matches, null, $at ) ) {
        if ( '/' === $matches['closer'] ) {
                $at += strlen( $matches[0] );
                continue;
        }
        
        $attrs = json_parse( $matches['attrs'] );
        if ( isset( $attrs['url'] ) ) {
                return $attrs['url'];
        }
}

return null;
// Get the first image in a post with the utility class.
$image = null;
$at    = 0;
while ( ! isset( $image ) ) {
        $image = WP_Parsed_Block_Delimiter_Info::next_delimiter( $post_content, $at, $next_delimiter_at, $next_delimiter_length );
        if (
                'opener' === $image->get_delimiter_type() &&
                $image->is_block_type( 'core/image' )
        ) {
                break;
        }

        $image = null;
        $at    = $next_delimiter_at + $next_delimiter_length;
}

This ticket was mentioned in PR #10291 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#19

Trac ticket: Core-61401
See: ~#9105~, ~#9841~, (~#9842~), ~#9843~

This ticket was mentioned in PR #10292 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#20

Trac ticket: Core-61401
See: ~#9105~, ~#9841~, (~#9842~), ~#9843~

#21 @dmsnell
8 weeks ago

  • Milestone changed from Awaiting Review to 6.9
  • Resolution fixed deleted
  • Status changed from closed to reopened

This ticket was mentioned in Slack in #core by jorbin. View the logs.


8 weeks ago

#23 @jorbin
8 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Since the main work for this was completed with [60939], i am remarking this as fixed. I think it could be worth a follow-up task ticket for the new PRs, but I'll defer to the 6.9 sqaud of if that is appropriate.

This ticket was mentioned in Slack in #core by dmsnell. View the logs.


3 weeks ago

#25 @dmsnell
3 weeks ago

@dlh raised a question about the naming of extract_block(), how it surprised him that it advanced. after some brief discussion in Slack we considered it might be rather urgent to address this before WordPress 6.9 ships, at which point it will be much more difficult to rename the method later.

One suggestion is extract_full_block_and_advance() as a way of communicating that the Block Processor steps forward entirely past the block being extracted.

If anyone has thoughts on this (including whether a change should be made right now or what name it should be changed to) please share ASAP. Thank you everyone!

This ticket was mentioned in Slack in #core by dmsnell. View the logs.


3 weeks ago

This ticket was mentioned in PR #10538 on WordPress/wordpress-develop by @dmsnell.


3 weeks ago
#27

Trac ticket: Core-61401

In testing during the release candidacy for WordPress 6.9 it was found that the extract_block() method may do more work than is expected based off of its name.

This change renames the method to extract_full_block_and_advance() to communicate that it does move the Block Processor forward and to hint at the fact that it also encompasses all inner blocks during that advance.

#28 @dmsnell
3 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

This ticket was mentioned in Slack in #core by wildworks. View the logs.


3 weeks ago

#30 @wildworks
3 weeks ago

This ticket was featured in today's 6.9 Bug Scrub.

@dlh raised a question about the naming of extract_block(), how it surprised him that it advanced. after some brief discussion in Slack we considered it might be rather urgent to address this before WordPress 6.9 ships, at which point it will be much more difficult to rename the method later.

I think we need to know what was actually discussed in this regard, so I'd like to mention two Slack links:

@jonsurrell commented on PR #10538:


3 weeks ago
#31

I've approved this PR. I should note that this assumes it will be backported to 6.9.

If, for some reason, it's not included in 6.9 then the previous method would need to be restored as a deprecated method.

This ticket was mentioned in Slack in #core by welcher. View the logs.


3 weeks ago

#33 @welcher
3 weeks ago

  • Keywords commit added

This was discussed in the 6.9 bug scrub. It's ready for commit and there will need to be a backport.

#34 @dmsnell
3 weeks ago

In 61294:

Block Processor: Rename extract_block() method for clearer documentation.

In testing during the release candidacy for WordPress 6.9 it was found
that the extract_block() method may do more work than is expected
based off of its name.

This change renames the method to extract_full_block_and_advance() to
communicate that it does move the Block Processor forward and to hint at
the fact that it also encompasses all inner blocks during that advance.

Developed in https://github.com/WordPress/wordpress-develop/pull/10538
Discussed in https://core.trac.wordpress.org/ticket/61401

Follow-up to [60939].

Props dlh, dmsnell, jonsurrell, westonruter.

See #61401.

#35 @dmsnell
3 weeks ago

  • Keywords dev-feedback added; commit removed

Resetting for back porting [61294] into the 6.9 branch.

This ticket was mentioned in Slack in #core by dmsnell. View the logs.


3 weeks ago

#38 @desrosj
3 weeks ago

  • Keywords commit dev-reviewed added; dev-feedback removed

[61294] looks good to backport. Thanks everyone!

#39 @dmsnell
3 weeks ago

In 61296:

Block Processor: Rename extract_block() method for clearer documentation.

In testing during the release candidacy for WordPress 6.9 it was found
that the extract_block() method may do more work than is expected
based off of its name.

This change renames the method to extract_full_block_and_advance() to
communicate that it does move the Block Processor forward and to hint at
the fact that it also encompasses all inner blocks during that advance.

Developed in https://github.com/WordPress/wordpress-develop/pull/10538
Discussed in https://core.trac.wordpress.org/ticket/61401

Follow-up to [60939].

Reviewed by desrosj, jonsurrell.
Merges [61294] to the 6.9 branch.

Props dlh, dmsnell, jonsurrell, westonruter.

See #61401.

#40 @dmsnell
3 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.