Page MenuHomePhabricator

Bug 1701828 - meta charset rewrite.
ClosedPublic

Authored by hsivonen on Sep 16 2021, 8:19 AM.
Referenced Files
Unknown Object (File)
Fri, Apr 10, 12:32 PM
Unknown Object (File)
Fri, Apr 10, 8:27 AM
Unknown Object (File)
Thu, Apr 9, 11:02 PM
Unknown Object (File)
Mon, Mar 23, 10:57 AM
Unknown Object (File)
Fri, Mar 20, 12:44 PM
Unknown Object (File)
Fri, Mar 20, 6:42 AM
Unknown Object (File)
Thu, Mar 19, 6:33 PM
Unknown Object (File)
Feb 10 2026, 11:39 PM

Details

Summary

Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.

The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.

Differences from WebKit and Blink:

  • WebKit and Blink honor <meta charset> in <noscript>. This implementation does not.
  • WebKit and Blink look for meta as if the tree builder was unaware of foreign content. This implementation is foreign content-aware. This makes a difference for CDATA sections that contain a > before the meta as well as style and script elements within foreign content. This could happen if the CDATA section that has mysteriously been introduced around a what looks like a meta tag also contains another prior tag-looking run of text.
  • This implementation processes rel=preload and speculative loads that are seen before <meta charset> has been seen. WebKit and Blink instead first look for the meta and rewind before starting speculative parsing.
  • Unlike WebKit, if there is neither an honored meta nor syntax resembling an XML declaration, detection from content takes place (as in Blink).
  • Unlike Blink, if there is neither an honored meta nor syntax resembling an XML declaration, the detection from content is not dependent of network buffer boundaries.
  • Unlike Blink, detection from content can trigger a reload at the end of the stream if the guess made at that point differs from the first guess. (See below for the definition of the input to the first guess.)

Differences from the old spec and Gecko previously:

  • Meta inside script and RCDATA elements is no longer honored.
  • Late meta is now ignored and no longer triggers a reload.
  • Later meta counts as early enough meta: In addition to the previous meta within the first 1024 bytes, now a meta that started within the first 1024 bytes counts as early enough. Additionally, if by then there hasn't been a template start tag and head hasn't ended, meta occurring before the earlier of the end of the head or a template start tag counts as early enough.
  • Meta now counts as not-late even if the encoding label has numeric character reference escapes.
  • Syntax resembling an XML declaration longer than a kilobyte is honored if there is no honored meta.
  • If there is neither an honored meta nor syntax resembling an XML declaration, the initial chardetng scan is potentially longer than before: the first 1024 bytes, the token spanning the 1024-byte boundary if there is such a token, and, if by then head hasn't ended and there hasn't been a template start tag until the end of the template start tag or the end of the token that causes head to end, ever comes first. However, if the token implying the end of the head is a text token, bytes only to the end of the previous non-text token is considered. (This definition avoids depending on network buffer boundaries.)
  • XML View Source now uses the code for syntax resembling an XML declaration instead of expat for extracting the internal encoding label.

Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.

An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

"Bogo XML declarations" isn't a spec term. I wonder if the comment could use something from the spec (or the spec pr).

There isn't a defined spec term. A note says "syntax resembling an XML declaration".

hsivonen edited the summary of this revision. (Show Details)

"Bogo XML declarations" isn't a spec term. I wonder if the comment could use something from the spec (or the spec pr).

Used "syntax resembling an XML declaration" for consistency with the spec.

Given that the commit message is anyhow rather long, "This makes a difference for CDATA sections that contain a > before the meta as well as style and script elements within foreign content." could use some concrete example. That would make it easier for the reader to understand what case behaves differently.

Mentioned that this can happen if the CDATA section wraps not only what looks like a meta tag but also another prior tag-looking run of text.

NOTE: Documentation was modified in diff 506588

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 506589

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 507810

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 508802

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 512163

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 513804

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

smaug requested changes to this revision.Dec 1 2021, 9:10 PM

Could you update some comments (and that would explain me some cases).
I still want to re-read ProcessLookingForMetaCharset() (I can do that in parallel when the patch is being updated)

parser/html/javasrc/Tokenizer.java
3184

This suspendIfRequestedAfterCurrentNonTextToken call could use a comment.

6198

This suspendIfRequestedAfterCurrentNonTextToken call could use a comment.

6997

It might be good to explain the relationship of
suspendIfRequestedAfterCurrentNonTextToken() and
suspendAfterCurrentTokenIfNotInText().

Looks like in practice the latter one is called only in C++ code when we're at kb boundary and then if the flag is set there, the first one sets
shouldSuspend to true.

parser/html/nsHtml5Highlighter.cpp
144–146

How is that existing code? FlushOps() is added in this patch.

parser/html/nsHtml5Highlighter.h
339

The ownership model of mOpSink isn't very easy to understand. Could you perhaps add a comment about it.
I guess saying something like that nsHtml5TreeOpExecutor owns the sink or so.

(Nothing seems to clear mOpSink, so I hope one can't access the deleted object)

parser/html/nsHtml5SpeculativeLoad.cpp
147

I hope there are tests exercising all this high/low handling.

parser/html/nsHtml5StreamParser.cpp
826

So this lock is protecting what all? Could you expand the comment a bit to explain which all method calls need to be protected?
And is the lock around the minimal possible set of things?

1937

"bogo xml" isn't very clear to random reader. Same also below

2065

Oh, this hints something about the buffers.

parser/html/nsHtml5StreamParser.h
325

Just wondering why this couldn't take Encoding&, if the param is guaranteed to be non-null.

361

This is hard to understand. What prefix where?
I tried and failed to understand why mBufferedBytes[1] is special.

393

aPrefix isn't documented and it is hard to guess what it is supposed to be.
Is that the sniffing buffer?
The caller also passes something called prefix, but then there is comment "Buffer for replaying past bytes based on state machine state."
so I guess it is the sniffing buffer.

Please document what aPrefix is.

425–426

(in another patch I reviewed recently the change from tuple to real struct made the code immediately easier to read and use. But fine, I can live with the tuple here. )

564

What is the ownership model for mGtBuffer, please add some comment.
I guess it is at least initially set to some buffer which is kept alive by mFirstBuffer, but mFirstBuffer gets a new value in many places, so what guarantees mGtBuffer doesn't point to a deleted object?
(or perhaps make this RefPtr, that would be less error prone.)

This revision now requires changes to proceed.Dec 1 2021, 9:10 PM
hsivonen added inline comments.
parser/html/nsHtml5Highlighter.cpp
144–146

I was thinking of StartLayout as soon as body opens as pre-existing code. Anyway, we don't need to optimize View Source that much but we do need to make sure the there isn't stuff in the queue as we switch from generating ops on the main thread to generating ops on the parser thread.

hsivonen added inline comments.
parser/html/nsHtml5SpeculativeLoad.cpp
147

parser/htmlparser/tests/mochitest/test_bug672453.html checks for line numbers, yes.

hsivonen added inline comments.
parser/html/nsHtml5StreamParser.h
325

Statics like UTF_8_ENCODING need to be NotNull<const Encoding*> instead of const Encoding& to be legal. For consistency, NotNull<const Encoding*> is used for known-not-to-be-null Encoding throughout. This way, it's not necessary to think if a given argument needs to take a static like UTF_8_ENCODING in some cases.

smaug added a project: testing-approved.

This is a bit scary. I wonder if we should ask fuzzer folks to test this.

parser/html/nsHtml5StreamParser.cpp
1981

So we can't ever get here because of network package boundary?
The caller somehow ensures that I guess ?

1982

AtKilobyteBoundary is called only here. Is it guaranteed that when we have buffered exactly UNCONDITIONAL_META_SCAN_BOUNDARY, this gets called?
I guess it is, if I'm follow the code in DoDataAvailableBuffer correctly. First max 1024 is buffered and then some tail is added.

1988

I wonder if there are tests for really long xml declarations . Something where there is for example tons of whitespace before encoding. Just to ensure that doesn't cause crashes.

2249

I don't pretend to understand how mAtEOF and the true param here are related.
I guess we somehow may re-enter this method when the first call sets mAtEOF to true and then the next time we just return early.

This revision is now accepted and ready to land.Dec 2 2021, 3:49 PM
parser/html/nsHtml5StreamParser.cpp
826

This is a great question. mSpeculationMutex is supposed to protect mSpeculations only, but in practice it's systematically held more broadly.

I suggest landing this patch with the same mutex holding pattern as in the existing pieces of code and then as a follow-up, making a consistent change across all uses of mSpeculationMutex--not just the one here. Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1744194

NOTE: Documentation was modified in diff 515246

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

NOTE: Documentation was modified in diff 515256

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

Thank you for the r+!

parser/html/javasrc/Tokenizer.java
3184

Added a comment.

6198

Added a comment.

6988

Fixed typo.

6997

Added an explanation.

parser/html/nsHtml5Highlighter.h
339

Documented the ownership model.

hsivonen added inline comments.
parser/html/nsHtml5StreamParser.cpp
1937

Reworded.

1981

Added comment.

1982

The added comment covers this.

1988

There's a test for a 1025-byte bogo XML decl. This is the deletion of xmldecl-2.html.ini under WPT.

2065

Added more text to the comment.

2249

Added a comment about what ensures that ProcessLookingForMetaCharset(true) isn't called twice.

parser/html/nsHtml5StreamParser.h
361

Added comments both here and on the call site.

393

Documented.

564

Documented.

hsivonen marked 2 inline comments as done.
hsivonen added inline comments.
dom/locales/en-US/chrome/layout/htmlparser.properties
9

Renamed.

This revision is now accepted and ready to land.Dec 7 2021, 10:12 AM
NOTE: Documentation was modified in diff 516833

It can be previewed here for one week.


If you see a problem in this automated review, please report it here.

hsivonen added a commit: Restricted Diffusion Commit.Feb 18 2022, 7:42 PM
hsivonen added a commit: Restricted Diffusion Commit.Feb 18 2022, 7:47 PM