Use the real tokenizer and tree builder for meta prescan

Gecko bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1701828

At present, Gecko implements the `meta` encoding declaration prescan per spec. However, WebKit and Blink don't. AFAICT, this is because they did not re-implement this part when implementing the tokenizer and tree builder from the spec, and have retained the pre-spec WebKit `meta` prescan behavior since before the Blink fork.

For both Web compat reasons (to avoid script side effects that differ from WebKit and Blink) and performance reasons (to avoid late `meta` reloads), I intend to change Gecko to align with WebKit and Blink. However, for performance and implementation simplicity reasons, I'd like to make it so that when the encoding is UTF-8, the work done for the prescan is directly usable as a speculation that can be taken into use as part of the actual parse.

AFAICT, WebKit and Blink use the real tokenizer for their `meta` prescan and don't use the real tree builder, but the behavior could be explained as if the real tree builder were run in the scripting disabled mode. However, since the Web generally runs with scripting enabled, running the tree builder with scripting disabled for the prescan would prevent implementations from speculatively making use of the prescan as part of the real parse when the result is the most likely one (UTF-8).

How do WebKit and Blink developers feel about specifying the prescan in terms of the real tree builder but in the scripting enabled mode? As noted, AFAICT, the only change that WebKit and Blink would need to make to align as-if would be ignoring an encoding `meta` inside `noscript`.

Demos: https://hsivonen.com/test/moz/meta/

Specifically, I suggest the following:

----

Let _stop extended prescan_ be false.

For the first 1024 bytes of the stream:

Start parsing the stream with UTF-8 as the encoding and without exposing the resulting tree. If the parsing algorithm inserts a `template` element, set _stop extended prescan_ to true. If the `head` element is popped off the tree builder's stack, set _stop extended prescan_ to true. If a `meta` element that constitutes an internal encoding declaration is inserted, start the real parse with the encoding that was declared (after substitutions from _change the encoding_) and abort these steps.

After the first 1024 bytes if _stop extended prescan_ is false:

If the parsing algorithm inserts a `template` element, abort these steps. If the `head` element is popped off the tree builder's stack, abort these steps. If a `meta` element that constitutes an internal encoding declaration is inserted, start the real parse with the encoding that was declared (after substitutions from _change the encoding_) and abort these steps.

If the above steps didn't start the real parse, proceed to bogo XML declaration handling and, if that fails, to encoding detection from byte patterns. Remove the spec text about late `meta` triggering _change the encoding_.
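
To make the control flow of the steps above easier to review, here is a minimal Python sketch. It is not Gecko's implementation. It assumes a hypothetical `Event` record that the tree builder would emit (in stream order) whenever it inserts a `template`, pops `head`, or inserts a `meta` that constitutes an internal encoding declaration, along with the byte offset that triggered it. The names `Event`, `prescan`, and `substitute` are made up for illustration, and the sketch assumes the encoding label has already been resolved to an encoding name. The substitutions shown (UTF-16BE/LE to UTF-8, x-user-defined to windows-1252) are the ones I understand _change the encoding_ to perform.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

PRESCAN_LIMIT = 1024  # bytes; the threshold named in the proposal


@dataclass
class Event:
    """Hypothetical tree builder notification (not a real Gecko type)."""
    kind: str                       # "template", "head_popped" or "meta_encoding"
    offset: int                     # byte offset that caused the event (assumed)
    encoding: Optional[str] = None  # set only for "meta_encoding"


def substitute(encoding: str) -> str:
    """Substitutions assumed to match _change the encoding_."""
    name = encoding.lower()
    if name in ("utf-16be", "utf-16le"):
        return "UTF-8"
    if name == "x-user-defined":
        return "windows-1252"
    return encoding


def prescan(events: Iterable[Event]) -> Optional[str]:
    """Return the encoding to start the real parse with, or None to fall
    through to bogo XML declaration handling and byte-pattern detection."""
    stop_extended_prescan = False
    for event in events:  # events are assumed to arrive in stream order
        within_limit = event.offset < PRESCAN_LIMIT
        if not within_limit and stop_extended_prescan:
            return None  # past 1024 bytes and the extension was ruled out
        if event.kind == "meta_encoding":
            assert event.encoding is not None
            return substitute(event.encoding)
        if event.kind in ("template", "head_popped"):
            if within_limit:
                stop_extended_prescan = True  # keep scanning up to the limit
            else:
                return None  # abort the extended prescan
    return None
```

Note that within the first 1024 bytes a `template` or a popped `head` only sets the flag; a `meta` seen later but still within the limit is honored. Past the limit, the same events end the prescan outright.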

----

Open question: What should happen in the case where the real parse is in the scripting disabled mode? (E.g. XHR.)

Tagging @mfreed7 from Blink due to previous bogo XML declaration involvement in this area, @gsnedders from WebKit due to discussion on Matrix, and @zorpan for parsing details in general.

Use the real tokenizer and tree builder for meta prescan #6962
