Summary
Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that the encoder will never emit invalid UTF-8, the decoder will accept only valid UTF-8, and both will be compatible with the TextEncoder and TextDecoder classes in JavaScript.
Related issues: #7046, #22330, #28832, #31370, #31954
What is changing:
- When decoding UTF-8 data with the Utf8Codec or Utf8Decoder class, the input is considered malformed if it contains an encoded surrogate character (code point in the range U+D800-U+DFFF, encoded in UTF-8 as a 3-byte character encoding where the first byte is 0xED and the second byte is in the range 0xA0-0xBF).
- When encoding a string as UTF-8 with the Utf8Codec or Utf8Encoder class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (U+FFFD, encoded in UTF-8 as 0xEF, 0xBF, 0xBD) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
- When decoding malformed UTF-8 data with allowMalformed set to true, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
  - a prefix of a valid encoding of a character, or
  - a single byte that is not a prefix of a valid encoding of a character.
- When decoding malformed UTF-8 data with allowMalformed set to false, the offset in the resulting FormatException will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the message of the FormatException will sometimes be different from what it is currently. If the input contains more than one error, the FormatException may point to a different error than before.
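The following is a minimal sketch of the behavior described in the list above, assuming an SDK in which the change has landed. The helper tryDecode is a hypothetical name introduced only for this illustration. The exact offset and message of the FormatException are deliberately not asserted, since those are among the things that change.

```dart
import 'dart:convert';

/// Hypothetical helper: returns the decoded string, or null if the
/// decoder rejects the input as malformed.
String? tryDecode(List<int> bytes) {
  try {
    return utf8.decode(bytes);
  } on FormatException catch (e) {
    // Printed for information only; exact values are not guaranteed.
    print('rejected: ${e.message} (offset ${e.offset})');
    return null;
  }
}

void main() {
  // 0xED 0xA0 0x80 would be the 3-byte encoding of the surrogate U+D800.
  // Under the new rules this is malformed input.
  print(tryDecode([0xED, 0xA0, 0x80])); // null

  // An unpaired surrogate is encoded as U+FFFD (0xEF 0xBF 0xBD).
  print(utf8.encode('\uD800')); // [239, 191, 189]

  // 0xE2 0x82 is a prefix of a valid encoding (U+20AC is 0xE2 0x82 0xAC),
  // so the two bytes together produce a single replacement character.
  final truncated = utf8.decode([0xE2, 0x82, 0x41], allowMalformed: true);
  print(truncated == '\uFFFDA'); // true

  // 0xFF can never start a valid encoding, so each such byte produces
  // its own replacement character.
  final invalid = utf8.decode([0xFF, 0xFF], allowMalformed: true);
  print(invalid == '\uFFFD\uFFFD'); // true
}
```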
Why is this changing?
Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.
This strategy has a number of drawbacks:
- The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
- When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather than the invalid encoding being caught up front.
- The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS TextEncoder and TextDecoder classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
- Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.
The purpose of the change is thus to:
- ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
- enable faster UTF-8 encoding and decoding for both JS and VM targets.
Expected impact
Programs manipulating strings through usual string operations are unlikely to be affected.
A program may be affected by this change if it does any of the following:
- Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
- Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
- Relies on the exact offsets and/or error messages from decoding invalid UTF-8 data or which error is reported in case of multiple errors.
- Relies on the number of replacement characters produced by decoding invalid UTF-8 data.
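As an illustration of the second scenario, the sketch below splits a string between the two halves of a surrogate pair and encodes each piece independently. The helper roundTripPieces is a hypothetical name used only here, and the result assumes the new encoder behavior.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes each piece separately, concatenates the
/// bytes, and decodes the result.
String roundTripPieces(List<String> pieces) {
  final bytes = <int>[];
  for (final piece in pieces) {
    bytes.addAll(utf8.encode(piece));
  }
  return utf8.decode(bytes);
}

void main() {
  // Four UTF-16 code units: 'a', high surrogate, low surrogate, 'b'.
  const original = 'a\u{1F600}b';

  // Cut between the two halves of the surrogate pair.
  final left = original.substring(0, 2);
  final right = original.substring(2);

  // Each half is now an unpaired surrogate in its own string, so each
  // is encoded as U+FFFD and the original character is lost.
  final result = roundTripPieces([left, right]);
  print(result == original); // false
  print(result == 'a\uFFFD\uFFFDb'); // true

  // Encoding the whole string in one call is unaffected.
  print(roundTripPieces([original]) == original); // true
}
```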
Mitigation
For the scenarios listed above:
- UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
- For encoding in multiple chunks, use the chunked conversion API.
- Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
- Same as the previous item.
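The first two mitigations might look like the sketch below. It is one possible approach, not a prescribed one: encodeChunked is a hypothetical helper name, and storing the raw UTF-16 code units is only one example of a different representation.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes the pieces as a single UTF-8 stream using
/// the chunked conversion API, so a surrogate pair split across two
/// pieces is still combined.
List<int> encodeChunked(Iterable<String> pieces) {
  var result = <int>[];
  final byteSink =
      ByteConversionSink.withCallback((bytes) => result = bytes);
  final stringSink = utf8.encoder.startChunkedConversion(byteSink);
  for (final piece in pieces) {
    stringSink.add(piece);
  }
  // Closing flushes the encoder and invokes the callback above.
  stringSink.close();
  return result;
}

void main() {
  // Mitigation for chunked encoding: the pair split across the two
  // chunks is recombined by the encoder.
  const original = 'a\u{1F600}b';
  final bytes =
      encodeChunked([original.substring(0, 2), original.substring(2)]);
  print(utf8.decode(bytes) == original); // true

  // Mitigation for strings with unpaired surrogates: keep the UTF-16
  // code units instead of going through UTF-8.
  const broken = 'x\uD800y';
  final units = broken.codeUnits;
  print(String.fromCharCodes(units) == broken); // true
}
```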
Variations
An optional allowSurrogates parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to false. This could introduce further breakage for programs implementing the Utf8Codec interface (unless we only put the flag on the constructors).
If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when allowMalformed is set to true.