Skip to content

Use the real tokenizer and tree builder for meta scan#7205

Open
hsivonen wants to merge 9 commits intowhatwg:mainfrom
hsivonen:metacharset
Open

Use the real tokenizer and tree builder for meta scan#7205
hsivonen wants to merge 9 commits intowhatwg:mainfrom
hsivonen:metacharset

Conversation

@hsivonen
Copy link
Copy Markdown
Member

@hsivonen hsivonen commented Oct 12, 2021

Closes #6962.

(See WHATWG Working Mode: Changes for more details.)


/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/references.html ( diff )


/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/references.html ( diff )
/semantics.html ( diff )

@hsivonen
Copy link
Copy Markdown
Member Author

I expect the PR to require more cross-reference checking and editiorial polish.

Copy link
Copy Markdown
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a relatively quick look and left some high-level feedback. This probably needs to be reviewed by someone more familiar with the parser section.

I also hope that at some point we can make the parser a more concrete class as the current setup is rather vague about state and parameters and such.

source Outdated
<span data-x="concept-encoding-confidence">confidence</span> <i>tentative</i>.</p></li>
<li><p>Let <var>templatePushedOrHeadPopped</var> be false</p>.</li>

<li><p>Let <var>metaCharset</var> be null</p>.</li>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this variable works is quite strange. It seems that metaParser somehow ends up setting it, but how is it in scope?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of a start tag token named "meta" sets it if that step runs inside a tree builder belonging to metaParser. Not sure what the variable visibility rules for spec English are.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should I make the metaCharset writable from the tree builder? How should I make the tree builder check whether it belongs to metaParser?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and templatePushedOrHeadPopped needs to be writable from the tree builder push/pop operations.

source Outdated
<p>If the user agent supports <span>autodetecting the character encoding from byte pattern
analysis</span>, the user agent should perform the detection on the start of the stream up
to and including the byte pointed to by <var>sniffingLimit</var>, unless the stream is from
a local file in which case the stream may be examined up to a further well-defined limit.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you want <span>implementation-defined</span> in a couple places here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed one instance. What's the other?

@hsivonen
Copy link
Copy Markdown
Member Author

The corresponding change landed in Gecko, so considering the trunks of the major engines, this PR strictly moves the spec closer to the implementation reality.

I've rebased the changes.

I've added a changeset to adjust the conformance requirements per discussion on the issue to

  1. Require first-child position
  2. Waive the 1024-byte requirement
  3. Waive the non-NCR requirement

This PR still needs a solution for the variable updates in a way that's permissible in spec English.

WPTs landed in web-platform-tests/wpt#31927 . The tests don't test the details where WebKit and Blink differ from the spec text (meta in noscript and end of head followed by svg or math followed by <![CDATA[ followed by > followed by a meta starting within the first 1024 bytes).

@domenic
Copy link
Copy Markdown
Member

domenic commented Dec 23, 2021

Thanks. I don't think we should merge this if it intentionally aligns with 1/3 engines instead of 2/3. If you could work on updating it to match 2/3 engines that'd be great and something we could accept per the working mode. Otherwise if you don't feel that's a good use of your time (which is understandable given that you work on Gecko) we can try to make time as the editors to align it with the majority, probably after the holidays.

@hsivonen hsivonen added the agenda+ To be discussed at a triage meeting label Jan 7, 2022
@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Jan 7, 2022

I think it would be useful to hear from WebKit and Blink implementors about their thoughts on whether they want to lock the meta prescan to different tree builder behavior than the actual parse or if they see potential benefit from making the prescan match the real parser for noscript and CDATA in foreign content.

In Gecko, using the real tree builder has the benefit of having the machinery for processing speculative loads and preloads before the meta. Also, it means being able to use the prescan work as the real parse in the happy case, although one might argue that the prescan is little enough computation that it's OK to retokenize even when not logically necessary.

Considering non-browser implementations, reusing the real tree builder means not having to also create a different mode that turns off foreign content and means being able not to retokenize in the happy case (even if speculative loads and prescan considerations don't apply).

AFAICT, the main benefit of speccing different tree builder behavior for meta prescan and for the real parse would be to write the spec so that WebKit and Blink wouldn't need to change anything to comply perfectly. I think it somewhat relevant that they'd already be way closer to the text in the PR right now than to the text in the spec right now. (As noted, I left the differing edge cases out of the WPTs.) So far, there's no evidence of a Web compat argument either way.

@past past removed the agenda+ To be discussed at a triage meeting label Jan 14, 2022
@mfreed7
Copy link
Copy Markdown
Collaborator

mfreed7 commented Feb 2, 2022

AFAICT, the main benefit of speccing different tree builder behavior for meta prescan and for the real parse would be to write the spec so that WebKit and Blink wouldn't need to change anything to comply perfectly. I think it somewhat relevant that they'd already be way closer to the text in the PR right now than to the text in the spec right now. (As noted, I left the differing edge cases out of the WPTs.) So far, there's no evidence of a Web compat argument either way.

Ok, so I've read over this PR and the associated issue twice now. From my point of view, I'm not sure we should make this change to the spec, at least as-is.

From the HTTPArchive analysis of the encoding-declaration-inside-<noscript> use case, this is a corner case that never gets hit on the open Web. And from your comment about the "meta in CDATA in SVG with > before the meta within the CDATA section" corner case, this is even less likely to get hit. If both of those statements are correct, then this seems like more of a philosophical issue than a practical one. In that case, I'd really prefer not trying to require a change to two engines in order to allow an optimization in the third. Let's spec what 2 of 3 engines currently do, and (from the analysis above) it doesn't practically matter that the third is faster (e.g. re-uses the tokenization) but would break (i.e. not match the spec) on corner cases that don't exist in the wild. We don't even need to have a WPT that examines these corner cases, if they're not ones that ever really occur in the wild.

@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Feb 3, 2022

Let's spec what 2 of 3 engines currently do,

What's your opinion on making the points of difference implementation-defined?

That is:

  1. Saying that the scripting flag for the meta parser is either the same as for the real parser or false, depending on implementation.
  2. Saying that the meta parser may choose to make svg and math not transition to foreign content.

@mfreed7
Copy link
Copy Markdown
Collaborator

mfreed7 commented Feb 3, 2022

Let's spec what 2 of 3 engines currently do,

What's your opinion on making the points of difference implementation-defined?

That is:

  1. Saying that the scripting flag for the meta parser is either the same as for the real parser or false, depending on implementation.
  2. Saying that the meta parser may choose to make svg and math not transition to foreign content.

Assuming these truly are far-corner case situations, I think I'm ok with this. It sounds like an accurate description of the state of browsers, and won't practically impact real sites.

@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Feb 3, 2022

Assuming these truly are far-corner case situations, I think I'm ok with this. It sounds like an accurate description of the state of browsers, and won't practically impact real sites.

Thanks. I added two "may" paragraphs.

@domenic
Copy link
Copy Markdown
Member

domenic commented Feb 3, 2022

I'm not comfortable with implementation-defined behaviors, when we have a clear path toward interoperability that 2/3 engines are already implementing.

@annevk
Copy link
Copy Markdown
Member

annevk commented Mar 2, 2022

I think the problem here is that were Chromium and WebKit to reimplement this code, they would likely want this to be changed as well. I thought that's what we discussed in the call.

@domenic
Copy link
Copy Markdown
Member

domenic commented Mar 2, 2022

Perhaps, although my recollection was Mason had compat concerns that might influence such a rewrite. Regardless, such a rewrite is not forthcoming, so I don't think speccing as if one is happening makes sense.

@hsivonen
Copy link
Copy Markdown
Member Author

So far there aren't known real-world cases of the points of divergence mattering for Web compat, and @zcorpan's httparchive check suggested that one of them at least doesn't matter (and the other one is very corner casey).

Given the lack of real-world Web compat motivation for any of Gecko, WebKit, and Blink to change, changing would be a matter of complying with the spec on principle. For Gecko to match WebKit and Blink or for WebKit and Blink to match Gecko on the two points that I've written as implementation-defined in the PR, the amount of engineering work required is unreasonably large for an "on principle" change. Therefore, based on the current understanding of compat impact, it's unrealistic to expect any of WebKit, Blink, or Gecko to change.

In that situation, leaving the points that I've made implementation-defined in the PR would best document the implementation reality. Considering that the WebKit way (inherited to Blink) now deviates more from reusing the parts of the main parse than the Gecko way, I think it would be courteous to non-browser implementations to acknowledge that reusing the parts of the main parse with fewer deviations is viable/permitted. And it would be good to document in the spec which corners are intentionally not tested one way or the other by WPTs.

@domenic
Copy link
Copy Markdown
Member

domenic commented Mar 28, 2022

Sorry, I understand that's your perspective but I disagree. Our goal is interop, not courteously acknowledging that implementation choices may make it hard to achieve interop. Again, I think it's fine if Gecko doesn't want to change to match the (desired) spec. But that choice to diverge should not be codified into the spec.

We should spec the 2/3 behavior, and write web platform tests for it.

@hsivonen
Copy link
Copy Markdown
Member Author

Our goal is interop, not courteously acknowledging that implementation choices may make it hard to achieve interop.

By "courteous" I meant giving non-browser implementations useful information. I think it's useful information that it's (as far is is currently known) workable to reuse the real tree builder with fewer exceptions for the meta scan.

Again, I think it's fine if Gecko doesn't want to change to match the (desired) spec.

I think "desired" is highly subjective here, and "2/3" is more like "1/2" in terms of what implementations have been independently implemented. From an implementability perspective, it's arguably desirable for the meta scan to reuse algorithms of the real parse with fewer than more deviations. The thing that makes more deviations "desired" is that one implementation got copied as two and, therefore, can be counted as "2/3" majority.

It seems we are in agreement that none of Gecko, WebKit, or Blink is going to change based on whether the two points are specified implementation-defined or not. Consider that someone is writing a fourth implementation: Is it useful to let them know that in practice they have the option to make the meta phase have fewer deviations from the main parse or is it useful to tell them that they must have more deviations between those two parts?

But that choice to diverge should not be codified into the spec.

I still think it's somewhat relevant that what the spec currently says is very far from what WebKit and Blink do, and the general area has a much more substantial implementation-defined point of divergence both before and after this PR (how the lack of an encoding declaration is handled). In that sense, I find it somewhat odd that this area of the spec can have a huge "implementation-defined" hole (what happens if there's no declaration) and have no real relation to what WebKit and Blink do for years, but then the gap can't be narrowed by documenting what implementations do.

@zcorpan
Copy link
Copy Markdown
Member

zcorpan commented Mar 29, 2022

Since web compat doesn't dictate any particular behavior for the edge cases identified here, I think if WebKit or Blink one day were to rewrite their meta prescan (for whatever purpose, maybe switching to Rust or so), it would be nice to at least inform in the spec that the current behavior of WebKit and Blink doesn't match the real parse, and that it can be changed so that it does (with a link to a PR or a delta spec). Or if those implementers want the spec to require the ideal behavior and be non-conforming for now, we can make the spec require that behavior despite not matching 2/3.

The shortest path to interop is often a useful strategy but we don't have to apply it blindly.

@domenic
Copy link
Copy Markdown
Member

domenic commented Mar 29, 2022

Sure, adding such a note seems reasonable. I'm not sure how much it generalizes (i.e., is the principle to add such a note for every non-interoperable part of the platform?). But if it's important to Gecko it doesn't cost much.

Or if those implementers want the spec to require the ideal behavior and be non-conforming for now,

This is explicitly not the case, per the earlier discussions in this thread and in the triage meetings.

@zcorpan
Copy link
Copy Markdown
Member

zcorpan commented Mar 29, 2022

I'm not sure how much it generalizes (i.e., is the principle to add such a note for every non-interoperable part of the platform?).

If there's a desire by at least one vendor to change the spec and it has been researched and deemed possible to change in terms of web compat.

@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Apr 1, 2022

We should spec the 2/3 behavior, and write web platform tests for it.

What practical effect would you like to achieve by writing WPTs for the two details at issue (noscript behavior and svg/math plus CDATA behavior)?

@domenic
Copy link
Copy Markdown
Member

domenic commented Apr 1, 2022

Clearly delineating how browsers are non-interoperable.

@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Apr 4, 2022

Clearly delineating how browsers are non-interoperable.

It's unclear to me what practical effect is sought by "delineating".

Do you think it's of practical importance that implementations that aren't Gecko, WebKit, or Blink implement the meta scan in a way that diverges from the real parse for 1) noscript and/or 2) CDATA in svg/math?

What I'm trying to get at is: Is this about a principle of always going with "2/3" (even when, for historical reasons, it's really 1/2 as far as independent implementations go) and avoiding creating a precedent of "2/3" being situationally debatable or about a practical matter with the specific behaviors at issue here?

@annevk
Copy link
Copy Markdown
Member

annevk commented Apr 5, 2022

Regardless, such a rewrite is not forthcoming, so I don't think speccing as if one is happening makes sense.

That's not the ask. What we want is to allow for one to happen and not render implementations that do it non-compliant. If @mfreed7 can substantiate the compatibility risk that would prevent such an approach that would make it easier to accept the Chromium/WebKit implementation, but failing that I don't think that should be allowed the only way forward here.

@hsivonen
Copy link
Copy Markdown
Member Author

hsivonen commented Sep 26, 2025

The approach described in this PR has been in Firefox release for 3.5 years now.

During that time, there have been no problems reported regarding the CDATA in svg/math difference (nothing reported in the sense that no such report has been triaged to reach me).

A Web compat issue arising from the noscript difference has been reported once and not on a major site. It was not an issue of meta in noscript, which is what @mfreed7 looked for. Instead, it was a case of div inside noscript such that the div affected where head ends, so that an incorrect meta after 1024 bytes and outside head if scripting disabled and inside head if scripting enabled was ignored by Chromium so that the detector took effect and not ignored by Firefox so that the incorrect explicit declaration took effect. In Safari, whether the page worked or not depended on the fallback encoding pref.

Despite that one counter example, I think it's safe to say that the approach documented in this PR is sufficiently Web-compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Use the real tokenizer and tree builder for meta prescan

6 participants