-
Notifications
You must be signed in to change notification settings - Fork 29.7k
Characters Package #53381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Characters Package #53381
Conversation
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any performance concern with nextCharacter and previousCharacter? They iterate the entire string in order to avoid bugs where the extent is in the middle of an extended grapheme cluster. If we can assume that won't happen, or we are ok with these methods not finding the end/beginning of the grapheme cluster when it does happen, then we can optimize by starting our search from extent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we kept track a flag around that indicated if the string contained anything aside from UTF-16 characters, then we could opt to use simple algorithms.
For strings that contained surrogate pairs or extended grapheme clusters: if we could scan backwards from an arbitrary index and discover if we were in the middle of such a character, then finding the next character (or previous character) index would be cheap. So long as there was a small limit to how far we had to scan backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a proof-of-concept and it seemed like something like this would work and would be potentially much faster. I'll follow up after this is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not possible to improve performance using the characters package as expected. It can find characters by iterating from the start or end of a string, but at some offset into a string, it doesn't know where the character boundaries are. Performance is likely to be worse iterating from the start/end of a long string rather than using the existing logarithmic algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below
|
I'm not sure if the failures are due to something special I'm missing because I added a new package, or if it's the typical versioning failure. I'm going to squash all the commits and see if it goes away. |
d5c2db6 to
470569c
Compare
|
|
|
I'm going to split this into 2 PRs, so I'm putting this back into WIP. The other PR will simply add the characters package to the repo, and I'll update this one once it's merged. |
β¦ up-to-date master
|
#59267 was merged and I updated this PR, so we're ready for review again. |
| // checks if the value represents a UTF16 glyph by itself or is a 'surrogate'. | ||
| bool _isUtf16Surrogate(int value) { | ||
| static bool _isUtf16Surrogate(int value) { | ||
| return value & 0xF800 == 0xD800; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really sufficient to detect a half of a UTF-16 surrogate pair? According to Wikipedia, the first byte of surrogate pair values fall into one of two ranges:
high surrogates (0xD800β0xDBFF), low surrogates (0xDC00β0xDFFF
Not sure if this test is sufficient? Note also: I realize that this code hasn't really changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened a separate issue for this and assigned myself. I'll look into it and merge a small PR separately if it is indeed wrong. #59513
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code does detect surrogates values among UTF-16 code units.
The documentation talks about "two or more UTF-16 codeunits", which is confusing because glyphs can indeed be comprised of more than one code point, but that's unrelated to surrogates.
I'd say that the documentation is almost entirely wrong, but the function does what the name says it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. I'll update the documentation on #59513.
| /// characters. | ||
| @visibleForTesting | ||
| static int nextCharacter(int extent, String string, [bool includeWhitespace = true]) { | ||
| if (extent >= string.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably string.length is assumed to be beyond the end of the string, even though string.characters.length might have a smaller value than string.length: what are callers supposed to do with the return value here? In other words, what return-value is expected if the next character past extent is beyond the end of the string?
We should have tests for nextCharacter(s.length, s), nextCharacter(s.length -1, s), previousCharacter(0, s), previousCharacter(s.length, s). Or maybe invalid extents should cause an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this more straightforward in b3cdde2.
- I assert that the input must be between 0 and string.length, inclusive.
- Asking for the previous character from index 0 returns 0. Asking for the next character after string.length returns string.length.
- I clarified that the input and output of this method are all positions in the string (not character indices).
- I tested all missing edge cases, including cases that assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we kept track a flag around that indicated if the string contained anything aside from UTF-16 characters, then we could opt to use simple algorithms.
For strings that contained surrogate pairs or extended grapheme clusters: if we could scan backwards from an arbitrary index and discover if we were in the middle of such a character, then finding the next character (or previous character) index would be cheap. So long as there was a small limit to how far we had to scan backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below
| /// Setting includeWhitespace to false will only return the index of non-space | ||
| /// characters. | ||
| @visibleForTesting | ||
| static int previousCharacter(int extent, String string, [bool includeWhitespace = true]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method and the nextCharacter method should make meaning of their extent parameter and return value really clear, i.e. are they string indices or string.character indices. They should make how they handle edge cases clear as well.
HansMuller
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| if (includeWhitespace) { | ||
| return false; | ||
| } | ||
| return _isWhitespace(currentString.characters.first.toString().codeUnitAt(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line is equivalent to:
return _isWhitespace(currentString.codeUnitAt(0));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks, good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be fixed when I reland this PR in #59778.
This reverts commit a99d146.
This reverts commit e0ed12c.
This reverts commit e0ed12c.
Description
Flutter currently has poor support for grapheme clusters like π¨βπ©βπ¦, which it considers to be multiple characters even though it appears to the user as one. This PR is a proposal for adding support via the characters package.
The design doc has more details.
Changes
Related Issues
Engine PR (independent):
flutter/engine#17420
Closes #32240
Closes #55670
Closes #54240
Tests
Breaking Change
Currently I don't think this is a breaking change. There may be broken visual diff tests in the rare case where they include a grapheme cluster.