Use the real tokenizer and tree builder for meta scan#7205
Use the real tokenizer and tree builder for meta scan#7205hsivonen wants to merge 9 commits intowhatwg:mainfrom
Conversation
|
I expect the PR to require more cross-reference checking and editiorial polish. |
55510e2 to
0f5887d
Compare
annevk
left a comment
There was a problem hiding this comment.
I gave this a relatively quick look and left some high-level feedback. This probably needs to be reviewed by someone more familiar with the parser section.
I also hope that at some point we can make the parser a more concrete class as the current setup is rather vague about state and parameters and such.
source
Outdated
| <span data-x="concept-encoding-confidence">confidence</span> <i>tentative</i>.</p></li> | ||
| <li><p>Let <var>templatePushedOrHeadPopped</var> be false</p>.</li> | ||
|
|
||
| <li><p>Let <var>metaCharset</var> be null</p>.</li> |
There was a problem hiding this comment.
The way this variable works is quite strange. It seems that metaParser somehow ends up setting it, but how is it in scope?
There was a problem hiding this comment.
The handling of a start tag token named "meta" sets it if that step runs inside a tree builder belonging to metaParser. Not sure what the variable visibility rules for spec English are.
There was a problem hiding this comment.
There was a problem hiding this comment.
How should I make the metaCharset writable from the tree builder? How should I make the tree builder check whether it belongs to metaParser?
There was a problem hiding this comment.
Oh, and templatePushedOrHeadPopped needs to be writable from the tree builder push/pop operations.
source
Outdated
| <p>If the user agent supports <span>autodetecting the character encoding from byte pattern | ||
| analysis</span>, the user agent should perform the detection on the start of the stream up | ||
| to and including the byte pointed to by <var>sniffingLimit</var>, unless the stream is from | ||
| a local file in which case the stream may be examined up to a further well-defined limit. |
There was a problem hiding this comment.
I suspect you want <span>implementation-defined</span> in a couple places here.
There was a problem hiding this comment.
Fixed one instance. What's the other?
f683499 to
31dd13d
Compare
|
The corresponding change landed in Gecko, so considering the trunks of the major engines, this PR strictly moves the spec closer to the implementation reality. I've rebased the changes. I've added a changeset to adjust the conformance requirements per discussion on the issue to
This PR still needs a solution for the variable updates in a way that's permissible in spec English. WPTs landed in web-platform-tests/wpt#31927 . The tests don't test the details where WebKit and Blink differ from the spec text ( |
|
Thanks. I don't think we should merge this if it intentionally aligns with 1/3 engines instead of 2/3. If you could work on updating it to match 2/3 engines that'd be great and something we could accept per the working mode. Otherwise if you don't feel that's a good use of your time (which is understandable given that you work on Gecko) we can try to make time as the editors to align it with the majority, probably after the holidays. |
|
I think it would be useful to hear from WebKit and Blink implementors about their thoughts on whether they want to lock the In Gecko, using the real tree builder has the benefit of having the machinery for processing speculative loads and preloads before the Considering non-browser implementations, reusing the real tree builder means not having to also create a different mode that turns off foreign content and means being able not to retokenize in the happy case (even if speculative loads and prescan considerations don't apply). AFAICT, the main benefit of speccing different tree builder behavior for meta prescan and for the real parse would be to write the spec so that WebKit and Blink wouldn't need to change anything to comply perfectly. I think it somewhat relevant that they'd already be way closer to the text in the PR right now than to the text in the spec right now. (As noted, I left the differing edge cases out of the WPTs.) So far, there's no evidence of a Web compat argument either way. |
Ok, so I've read over this PR and the associated issue twice now. From my point of view, I'm not sure we should make this change to the spec, at least as-is. From the HTTPArchive analysis of the encoding-declaration-inside- |
What's your opinion on making the points of difference implementation-defined? That is:
|
Assuming these truly are far-corner case situations, I think I'm ok with this. It sounds like an accurate description of the state of browsers, and won't practically impact real sites. |
Thanks. I added two "may" paragraphs. |
|
I'm not comfortable with implementation-defined behaviors, when we have a clear path toward interoperability that 2/3 engines are already implementing. |
|
I think the problem here is that were Chromium and WebKit to reimplement this code, they would likely want this to be changed as well. I thought that's what we discussed in the call. |
|
Perhaps, although my recollection was Mason had compat concerns that might influence such a rewrite. Regardless, such a rewrite is not forthcoming, so I don't think speccing as if one is happening makes sense. |
|
So far there aren't known real-world cases of the points of divergence mattering for Web compat, and @zcorpan's httparchive check suggested that one of them at least doesn't matter (and the other one is very corner casey). Given the lack of real-world Web compat motivation for any of Gecko, WebKit, and Blink to change, changing would be a matter of complying with the spec on principle. For Gecko to match WebKit and Blink or for WebKit and Blink to match Gecko on the two points that I've written as implementation-defined in the PR, the amount of engineering work required is unreasonably large for an "on principle" change. Therefore, based on the current understanding of compat impact, it's unrealistic to expect any of WebKit, Blink, or Gecko to change. In that situation, leaving the points that I've made implementation-defined in the PR would best document the implementation reality. Considering that the WebKit way (inherited to Blink) now deviates more from reusing the parts of the main parse than the Gecko way, I think it would be courteous to non-browser implementations to acknowledge that reusing the parts of the main parse with fewer deviations is viable/permitted. And it would be good to document in the spec which corners are intentionally not tested one way or the other by WPTs. |
|
Sorry, I understand that's your perspective but I disagree. Our goal is interop, not courteously acknowledging that implementation choices may make it hard to achieve interop. Again, I think it's fine if Gecko doesn't want to change to match the (desired) spec. But that choice to diverge should not be codified into the spec. We should spec the 2/3 behavior, and write web platform tests for it. |
By "courteous" I meant giving non-browser implementations useful information. I think it's useful information that it's (as far is is currently known) workable to reuse the real tree builder with fewer exceptions for the
I think "desired" is highly subjective here, and "2/3" is more like "1/2" in terms of what implementations have been independently implemented. From an implementability perspective, it's arguably desirable for the It seems we are in agreement that none of Gecko, WebKit, or Blink is going to change based on whether the two points are specified implementation-defined or not. Consider that someone is writing a fourth implementation: Is it useful to let them know that in practice they have the option to make the
I still think it's somewhat relevant that what the spec currently says is very far from what WebKit and Blink do, and the general area has a much more substantial implementation-defined point of divergence both before and after this PR (how the lack of an encoding declaration is handled). In that sense, I find it somewhat odd that this area of the spec can have a huge "implementation-defined" hole (what happens if there's no declaration) and have no real relation to what WebKit and Blink do for years, but then the gap can't be narrowed by documenting what implementations do. |
|
Since web compat doesn't dictate any particular behavior for the edge cases identified here, I think if WebKit or Blink one day were to rewrite their meta prescan (for whatever purpose, maybe switching to Rust or so), it would be nice to at least inform in the spec that the current behavior of WebKit and Blink doesn't match the real parse, and that it can be changed so that it does (with a link to a PR or a delta spec). Or if those implementers want the spec to require the ideal behavior and be non-conforming for now, we can make the spec require that behavior despite not matching 2/3. The shortest path to interop is often a useful strategy but we don't have to apply it blindly. |
|
Sure, adding such a note seems reasonable. I'm not sure how much it generalizes (i.e., is the principle to add such a note for every non-interoperable part of the platform?). But if it's important to Gecko it doesn't cost much.
This is explicitly not the case, per the earlier discussions in this thread and in the triage meetings. |
If there's a desire by at least one vendor to change the spec and it has been researched and deemed possible to change in terms of web compat. |
What practical effect would you like to achieve by writing WPTs for the two details at issue ( |
|
Clearly delineating how browsers are non-interoperable. |
It's unclear to me what practical effect is sought by "delineating". Do you think it's of practical importance that implementations that aren't Gecko, WebKit, or Blink implement the What I'm trying to get at is: Is this about a principle of always going with "2/3" (even when, for historical reasons, it's really 1/2 as far as independent implementations go) and avoiding creating a precedent of "2/3" being situationally debatable or about a practical matter with the specific behaviors at issue here? |
That's not the ask. What we want is to allow for one to happen and not render implementations that do it non-compliant. If @mfreed7 can substantiate the compatibility risk that would prevent such an approach that would make it easier to accept the Chromium/WebKit implementation, but failing that I don't think that should be allowed the only way forward here. |
|
The approach described in this PR has been in Firefox release for 3.5 years now. During that time, there have been no problems reported regarding the CDATA in A Web compat issue arising from the Despite that one counter example, I think it's safe to say that the approach documented in this PR is sufficiently Web-compatible. |
Closes #6962.
(See WHATWG Working Mode: Changes for more details.)
/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/references.html ( diff )
/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/references.html ( diff )
/semantics.html ( diff )