Fix and improve Unicode escape sequence info (C#) #13162
Merged
BillWagner merged 1 commit into dotnet:master on Jul 1, 2019
1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
* Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
* Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/deoylQ)
* In creating the test noted above, I found a bug in the Mono C\# compiler, so I submitted that here:
["\U" Unicode escape sequence for strings accepts invalid value instead of raising error dotnet#15456](mono/mono#15456)
* Runnable example code showing that invalid code point (U+110000) raises an exception, on [IDE One](https://ideone.com/jpVxL4)
2. Correctly indicated that `\U` is for a 4-byte UTF-32 value, and `\u` is for a 2-byte UTF-16 value.
3. Show the pattern _and_ an example to be more readable / helpful. Please note that `\U00nnnnnn` has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion.
4. Properly formatted escape sequences as being inline-code
5. Added warning about using the `\x` escape with fewer than 4 hex digits. For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.wordpress.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)
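
The behavior described in items 1 through 5 can be sketched in a few lines of C#. This is only an illustrative sketch, not the sample added to the docs: the class name `UnicodeEscapeDemo` is made up for this example, and the two rejected literals are shown in comments because, per the description above, a compliant compiler refuses to compile them.

```csharp
using System;

// Hypothetical demo class; the name is not taken from the PR or the docs.
class UnicodeEscapeDemo
{
    static void Main()
    {
        // \U takes exactly 8 hex digits that name ONE code point (a UTF-32 value).
        string alien = "\U0001F47E";   // U+1F47E, a supplementary character

        // .NET strings are UTF-16, so the code point is stored as a surrogate pair.
        Console.WriteLine(alien.Length);                                 // 2
        Console.WriteLine(alien == "\uD83D\uDC7E");                      // True
        Console.WriteLine(char.ConvertToUtf32(alien, 0).ToString("X"));  // 1F47E

        // \U is not limited to supplementary characters; BMP code points work too.
        Console.WriteLine("\U00000041" == "\u0041");                     // True

        // \x is variable length (1 to 4 hex digits), so it consumes as many
        // hex digits as it can find, up to four.
        Console.WriteLine("\x41BC".Length);    // 1: a single character, U+41BC
        Console.WriteLine("\x0041BC".Length);  // 3: 'A' followed by "BC"

        // Left as comments because they should not compile:
        //   "\UD83DDC7E"  (a surrogate pair packed into \U, not a valid code point)
        //   "\U00110000"  (beyond U+10FFFF, the last Unicode code point)
    }
}
```

The `\x41BC` versus `\x0041BC` pair is the reason for the warning in item 5: padding `\x` to four hex digits (or using `\u`) keeps following text from being absorbed into the escape.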
BillWagner
approved these changes
Jul 1, 2019
Member
BillWagner
left a comment
Thank you for adding these clarifying comments @srutzky
We appreciate it.
I’ve reviewed the changes, and I’ll
now.
Thanks again!
Contributor
Author
@BillWagner You are welcome. I forgot to mention that this update has a companion F# update: #13168