Use the real tokenizer and tree builder for meta prescan #6962
Description
At present, Gecko implements the meta encoding declaration prescan per spec. However, WebKit and Blink don't. AFAICT, this is because they did not re-implement this part when they implemented the tokenizer and tree builder from the spec, and they have retained the pre-spec WebKit meta prescan behavior since before the Blink fork.
For both Web compat (to avoid different script side effects compared to WebKit and Blink) and performance reasons (to avoid late meta reloads), I intend to change Gecko to align with WebKit and Blink. Except, for performance and implementation simplicity reasons, I'd like to make it so that when the encoding is UTF-8, the work done for the prescan is directly usable as a speculation that can be taken into use as part of the actual parse.
AFAICT, WebKit and Blink use the real tokenizer for their meta prescan and don't use the real tree builder, but their behavior can be explained as if they ran the real tree builder in the scripting disabled mode. However, since the Web generally runs with scripting enabled, running the tree builder with scripting disabled for the prescan would prevent implementations from speculatively making use of the prescan as part of the real parse when the result is the most likely one (UTF-8).
How do WebKit and Blink developers feel about specifying the prescan in terms of the real tree builder but in the scripting enabled mode? As noted, AFAICT, the only change that WebKit and Blink would need to as-if align would be ignoring encoding meta inside noscript.
Demos: https://hsivonen.com/test/moz/meta/
Specifically, I suggest the following:
Let stop extended prescan be false.
For the first 1024 bytes of the stream:
Start parsing the stream with UTF-8 as the encoding and without exposing the resulting tree. If the parsing algorithm inserts a template element, set stop extended prescan to true. If the head element is popped off the tree builder's stack, set stop extended prescan to true. If a meta element that constitutes an internal encoding declaration is inserted, start the real parse with the encoding that was declared (after substitutions from change the encoding) and abort these steps.
After the first 1024 bytes if stop extended prescan is false:
If the parsing algorithm inserts a template element, abort these steps. If the head element is popped off the tree builder's stack, abort these steps. If a meta element that constitutes an internal encoding declaration is inserted, start the real parse with the encoding that was declared (after substitutions from change the encoding) and abort these steps.
If the above steps didn't start the real parse, proceed to bogo XML declaration handling and, if that fails, to encoding detection from byte patterns. Remove the spec text about late meta triggering change the encoding.
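The control flow of the steps above can be sketched as follows. This is only an illustration of the decision logic, not an implementation of the proposal: it assumes a hypothetical tree builder that has already been run over the stream as UTF-8 and reports a byte-offset-ordered list of events. The names `Event`, `Kind`, `substitute`, and `prescan` are made up for this sketch. The `substitute` helper reflects my reading of the "change the encoding" substitutions (UTF-16BE/LE to UTF-8, x-user-defined to windows-1252). The sketch also reads the steps literally, so a meta within the first 1024 bytes is honored even after the flag has been set.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, Optional

PRESCAN_LIMIT = 1024  # byte limit taken from the proposal


class Kind(Enum):
    # Hypothetical events reported by the tree builder during the prescan.
    TEMPLATE_INSERTED = auto()
    HEAD_POPPED = auto()
    META_ENCODING = auto()


@dataclass
class Event:
    offset: int         # byte offset at which the event occurred
    kind: Kind
    encoding: str = ""  # declared label, set only for META_ENCODING


def substitute(label: str) -> str:
    """Apply the substitutions assumed to come from 'change the encoding'."""
    lowered = label.lower()
    if lowered in ("utf-16be", "utf-16le"):
        return "UTF-8"
    if lowered == "x-user-defined":
        return "windows-1252"
    return label


def prescan(events: Iterable[Event]) -> Optional[str]:
    """Return the encoding to start the real parse with, or None.

    None means the caller falls through to bogo XML declaration handling
    and then to encoding detection from byte patterns.
    Events are assumed to be ordered by byte offset.
    """
    stop_extended_prescan = False
    for ev in events:
        in_first_chunk = ev.offset < PRESCAN_LIMIT
        if not in_first_chunk and stop_extended_prescan:
            # The extended prescan is disabled, so nothing past 1024 bytes counts.
            return None
        if ev.kind is Kind.META_ENCODING:
            return substitute(ev.encoding)
        if in_first_chunk:
            # Template inserted or head popped within the first 1024 bytes.
            stop_extended_prescan = True
        else:
            # Template inserted or head popped during the extended prescan.
            return None
    return None
```

For example, `prescan([Event(500, Kind.HEAD_POPPED), Event(2000, Kind.META_ENCODING, "windows-1251")])` returns `None`, because the head element was popped within the first 1024 bytes and the meta comes after them, whereas a lone `Event(2000, Kind.META_ENCODING, "windows-1251")` yields `"windows-1251"`.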
Open question: What should happen in the case where the real parse is in the scripting disabled mode? (E.g. XHR.)
Tagging @mfreed7 from Blink due to previous bogo XML declaration involvement in this area, @gsnedders from WebKit due to discussion on Matrix, and @zcorpan for parsing details in general.