-
Notifications
You must be signed in to change notification settings - Fork 6k
[web] Line break algorithm using unicode properties #18795
Conversation
| } | ||
|
|
||
| /// Normalizes properties that behave the same way into one common property. | ||
| LineCharProperty _normalizeLineProperty(LineCharProperty prop) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this substitution be made in the _packedLineBreakProperties so we never need to convert at run time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This normalization is part of the algorithm. I want to keep codegen independent.
The other reason is that in some cases we still need to know the original property (pre-normalization). e.g. rawCurr == LineCharProperty.LF.
There are a few optimizations that I can think of. I expect them to have marginal perf improvements but nothing drastic.
Some ideas for my future reference:
- Normalize properties during codegen.
- Reduce the amount of binary searches by either caching at runtime, or codegen'ing a hash map for the most common characters, or both.
- For paragraphs that can fit in a single line, we can avoid a whole bunch of computations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about inside UnicodePropertyLookup.fromPackedData?
I'm just throwing ideas out. The PR LGTM. This comment is non-blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnicodePropertyLookup.fromPackedData is used by both line breaker and word breaker. They have different properties and different rules for normalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your general point though. And I agree we could do some optimizations there.
| // TODO: Use the 2d table now. See https://www.unicode.org/reports/tr14/tr14-22.html#ExampleTable | ||
|
|
||
| // TODO: After using the 2d table, do: | ||
| // hasSpaces = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these todos be github issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are coming in the next PR.
3b5dcb7 to
b2b172c
Compare
b2b172c to
db7f8d5
Compare
|
It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat. Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing. |
Switch to using the unicode data instead of checking manually for new lines and whitespace.
Note: this PR uses minimal parts of the unicode data just to cover the main things (e.g. whitespace, NL, BK, LF, CR, etc). There's another PR coming that will implement the full algorithm that'll take advantage of the entire unicode data.
Didn't add any tests because I'm relying on existing tests to verify that the switch to unicode data doesn't break existing behavior.
TODO: