Skip to content

Conversation

@justinmc
Copy link
Contributor

@justinmc justinmc commented Mar 27, 2020

Description

Flutter currently has poor support for grapheme clusters like πŸ‘¨β€πŸ‘©β€πŸ‘¦, which it considers to be multiple characters even though it appears to the user as one. This PR is a proposal for adding support via the characters package.

Screen Shot 2020-03-26 at 2 26 01 PM

The design doc has more details.

Changes

  • Character count.
  • Error state based on character count.
  • Pasting excessive characters places the cursor correctly.

Related Issues

Engine PR (independent):
flutter/engine#17420

Closes #32240
Closes #55670
Closes #54240

Tests

  • Character count for maxLength works with/without grapheme clusters and surrogate pairs.
  • Enforced character limit for maxLength works with/without grapheme clusters and surrogate pairs.
  • Error state for maxLength shows at the right length with/without grapheme clusters and surrogate pairs.

Breaking Change

Currently I don't think this is a breaking change. There may be broken visual diff tests in the rare case where they include a grapheme cluster.

@justinmc justinmc added a: text input Entering text in a text field or keyboard related problems framework flutter/packages/flutter repository. See also f: labels. work in progress; do not review labels Mar 27, 2020
@justinmc justinmc self-assigned this Mar 27, 2020
@fluttergithubbot fluttergithubbot added f: material design flutter/packages/flutter/material repository. c: contributor-productivity Team-specific productivity, code health, technical debt. labels Mar 27, 2020
@justinmc
Copy link
Contributor Author

justinmc commented Apr 22, 2020

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any performance concern with nextCharacter and previousCharacter? They iterate the entire string in order to avoid bugs where the extent is in the middle of an extended grapheme cluster. If we can assume that won't happen, or we are ok with these methods not finding the end/beginning of the grapheme cluster when it does happen, then we can optimize by starting our search from extent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we kept track a flag around that indicated if the string contained anything aside from UTF-16 characters, then we could opt to use simple algorithms.

For strings that contained surrogate pairs or extended grapheme clusters: if we could scan backwards from an arbitrary index and discover if we were in the middle of such a character, then finding the next character (or previous character) index would be cheap. So long as there was a small limit to how far we had to scan backwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a proof-of-concept and it seemed like something like this would work and would be potentially much faster. I'll follow up after this is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to improve performance using the characters package as expected. It can find characters by iterating from the start or end of a string, but at some offset into a string, it doesn't know where the character boundaries are. Performance is likely to be worse iterating from the start/end of a long string rather than using the existing logarithmic algorithm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below

@justinmc justinmc changed the title WIP Characters Proposal Characters Proposal Jun 8, 2020
@justinmc justinmc requested review from GaryQian and HansMuller June 8, 2020 23:19
@justinmc
Copy link
Contributor Author

justinmc commented Jun 9, 2020

I'm not sure if the failures are due to something special I'm missing because I added a new package, or if it's the typical versioning failure. I'm going to squash all the commits and see if it goes away.

@justinmc justinmc force-pushed the characters branch 2 times, most recently from d5c2db6 to 470569c Compare June 9, 2020 19:51
@justinmc justinmc changed the title Characters Proposal Characters Package Jun 9, 2020
@fluttergithubbot fluttergithubbot added the tool Affects the "flutter" command-line tool. See also t: labels. label Jun 9, 2020
@justinmc
Copy link
Contributor Author

justinmc commented Jun 9, 2020

I could use advice on the test failure if anyone has any ideas. Stack frames parsed incorrectly https://github.com/flutter/flutter/pull/53381/checks?check_run_id=755670351

@justinmc justinmc changed the title Characters Package WIP Characters Package Jun 11, 2020
@justinmc
Copy link
Contributor Author

I'm going to split this into 2 PRs, so I'm putting this back into WIP. The other PR will simply add the characters package to the repo, and I'll update this one once it's merged.

@justinmc justinmc changed the title WIP Characters Package Characters Package Jun 15, 2020
@justinmc
Copy link
Contributor Author

#59267 was merged and I updated this PR, so we're ready for review again.

// checks if the value represents a UTF16 glyph by itself or is a 'surrogate'.
bool _isUtf16Surrogate(int value) {
static bool _isUtf16Surrogate(int value) {
return value & 0xF800 == 0xD800;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really sufficient to detect a half of a UTF-16 surrogate pair? According to Wikipedia, the first byte of surrogate pair values fall into one of two ranges:

high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF

Not sure if this test is sufficient? Note also: I realize that this code hasn't really changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a separate issue for this and assigned myself. I'll look into it and merge a small PR separately if it is indeed wrong. #59513

Copy link
Contributor

@lrhn lrhn Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code does detect surrogates values among UTF-16 code units.

The documentation talks about "two or more UTF-16 codeunits", which is confusing because glyphs can indeed be comprised of more than one code point, but that's unrelated to surrogates.
I'd say that the documentation is almost entirely wrong, but the function does what the name says it does.

Copy link
Contributor Author

@justinmc justinmc Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I'll update the documentation on #59513.

/// characters.
@visibleForTesting
static int nextCharacter(int extent, String string, [bool includeWhitespace = true]) {
if (extent >= string.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably string.length is assumed to be beyond the end of the string, even though string.characters.length might have a smaller value than string.length: what are callers supposed to do with the return value here? In other words, what return-value is expected if the next character past extent is beyond the end of the string?

We should have tests for nextCharacter(s.length, s), nextCharacter(s.length -1, s), previousCharacter(0, s), previousCharacter(s.length, s). Or maybe invalid extents should cause an assert?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this more straightforward in b3cdde2.

  1. I assert that the input must be between 0 and string.length, inclusive.
  2. Asking for the previous character from index 0 returns 0. Asking for the next character after string.length returns string.length.
  3. I clarified that the input and output of this method are all positions in the string (not character indices).
  4. I tested all missing edge cases, including cases that assert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we kept track a flag around that indicated if the string contained anything aside from UTF-16 characters, then we could opt to use simple algorithms.

For strings that contained surrogate pairs or extended grapheme clusters: if we could scan backwards from an arbitrary index and discover if we were in the middle of such a character, then finding the next character (or previous character) index would be cheap. So long as there was a small limit to how far we had to scan backwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below

/// Setting includeWhitespace to false will only return the index of non-space
/// characters.
@visibleForTesting
static int previousCharacter(int extent, String string, [bool includeWhitespace = true]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method and the nextCharacter method should make meaning of their extent parameter and return value really clear, i.e. are they string indices or string.character indices. They should make how they handle edge cases clear as well.

Copy link
Contributor

@HansMuller HansMuller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fluttergithubbot fluttergithubbot merged commit e0ed12c into flutter:master Jun 16, 2020
@justinmc justinmc deleted the characters branch June 16, 2020 23:56
renyou added a commit that referenced this pull request Jun 17, 2020
if (includeWhitespace) {
return false;
}
return _isWhitespace(currentString.characters.first.toString().codeUnitAt(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That line is equivalent to:

   return _isWhitespace(currentString.codeUnitAt(0));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, good call!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be fixed when I reland this PR in #59778.

renyou added a commit that referenced this pull request Jun 17, 2020
@justinmc justinmc mentioned this pull request Jun 17, 2020
justinmc added a commit to justinmc/flutter that referenced this pull request Jun 18, 2020
zljj0818 pushed a commit to zljj0818/flutter that referenced this pull request Jun 22, 2020
zljj0818 pushed a commit to zljj0818/flutter that referenced this pull request Jun 22, 2020
mingwandroid pushed a commit to mingwandroid/flutter that referenced this pull request Sep 6, 2020
mingwandroid pushed a commit to mingwandroid/flutter that referenced this pull request Sep 6, 2020
@justinmc justinmc mentioned this pull request Oct 5, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

a: text input Entering text in a text field or keyboard related problems c: contributor-productivity Team-specific productivity, code health, technical debt. f: material design flutter/packages/flutter/material repository. framework flutter/packages/flutter repository. See also f: labels. tool Affects the "flutter" command-line tool. See also t: labels.

Projects

None yet

5 participants