Fix warnings and errors caused by unicode characters. by ZgblKylin · Pull Request #549 · microsoft/terminal

ZgblKylin · 2019-05-08T02:00:36Z

Encoding of modified files is switched from UTF-8 to UTF-8 BOM.

msftclas · 2019-05-08T02:00:49Z

All CLA requirements met.

fghzxm · 2019-05-08T06:14:12Z

Though effective here, I don't think playing with source encoding is a good idea — UTF-8 with BOM is not recommended according to the Unicode standard. I would rather not like to see this branch getting merged into master (in fact we should probably do the opposite — scan and prune all source files of BOM — as Git does such a bad job at dealing with BOM when checking in files :)), even though it has been very helpful for us all. (I have an open PR #458 that should address the same issue, though mine has its own unresolved issues as well.)

ZgblKylin · 2019-05-08T06:36:27Z

Though effective here, I don't think playing with source encoding is a good idea — UTF-8 with BOM is not recommended according to the Unicode standard. I would rather not like to see this branch getting merged into master (in fact we should probably do the opposite — scan and prune all source files of BOM — as Git does such a bad job at dealing with BOM when checking in files :)), even though it has been very helpful for us all. (I have an open PR #458 that should address the same issue, though mine has its own unresolved issues as well.)

Yes, you gave a better solution. I'm not familiar with VS project files, so just provide some workaround.

In fact, change the file encoding into UTF-16 LE also works, and it satisfied the standard, but it's not friendly to some editors and static-analyzer(clang).

FrankHB · 2019-05-10T18:14:38Z

FYI, here is the upstream issue.

However, technically, treating a file without special magic (here, the BOM) with guesses can be considered as a feature rather than a bug for a conforming implementation of C++, albeit codepages are almost always annoying. This is actually the choice of VC++ at current. (See the remarks section here.) So, it is still reasonable to make changes in this repo.

The canonical way to work around is to specify the source character set. This is supported by passing /source-charset:utf-8 option to the comiler driver. Or futher, use /utf-8 to set the execution character set as well.

This method could be sufficient for this repo. But in general, it has some inconvenience because it adds some extra dependencies on project configurations (e.g. .vcxproj files here). It occasionally works in reality because most other implementations just treat plain text (without BOM) as "text encoded in UTF-8", so the configurations are not relied on. In other word, "plain text" with unspecified encoding for C++ source is essentially buggy and not portable by default (when there is no idea about which implementations are supported). I'm afraid there is no other way to make the source files themselves (rather than source files plus project configurations) portable unless standardized magics like BOM to the files are enforced.

@fghzxm It is true that Unicode standard recommends no BOM should used for UTF-8. However, the SO answer is wrong on the topic.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

True.

According to the Unicode standard, BOM for UTF-8 files is not recommended

False. The referenced Unicode standard actually does not talk about UTF-8 files on this topic.

There is nothing else than a sequence of code units assumed as UTF-8 data when the BOM is excluded from a file. There is also no integrity guarantees at the level of the file system to ensure such a sequence is indeed well-formed UTF-8.

So, if you do only get a file without a signature like BOM, you don't have knowledge that it must work with UTF-8 before the data is actually checked (again).

Once the check is done, there is something to indicate the encoding scheme is known. When saved externally after serialization, the BOM used as a signature is a simple way for Unicode encoding schemes, and I don't find a better alternative, because other forms of metadata is even less portable.

(If this still sounds obscure to you, think twice why we need type safety and how strong typing help. BOM here works exactly as the metadata used in the external representation or the implementation of some type systems.)

dlong11 · 2019-05-11T04:04:58Z

src/tools/vtpipeterm/main.cpp

        switch(lang)
        {
            case TEST_LANG_CYRILLIC:
-                str = "Лорем ипсум долор сит амет, пер цлита поссит ех, ат мунере фабулас петентиум сит.";


Another alternative is to manually encode the hard coded strings as UTF-8 and use them in the source. This is how I would handle this in a test that doesn't have a resource file. Look at TEST_LANG_GOOD_POUND case below as an example. In my day job we follow the Google Coding Standards and here is what they recommend in this scenario. (They require all of their source code be encoded as UTF-8 - which is common in cross-platform code)

fghzxm · 2019-05-11T04:37:17Z

@FrankHB You are right in that marking UTF-8 source files with BOM makes MSVC happier and won't infuriate other compilers after all. (In reality however, most UTF-8 text files in the world are not marked with BOM, so if we accept the idea that a UTF-8 file should self-describe this fact, most existing UTF-8 files are already non-portable.)

I do not like UTF-8 BOM because

It is not straightforward to control BOM behaviors when saving files with most text editors (I'm saying most mainstream editors other than Code), and
Git does not strip BOMs when checking in files, resulting in spurious file changes; moreover,
UTF-8 without BOM is the de facto lingua franca in the world of encodings, nearly universally supported, and it's easier to tame MSVC with a single /source-charset:utf-8 option in a .props file than to mess up the Git database.

MouriNaruto · 2019-05-11T11:04:35Z

@fghzxm
If we are developing an app which supports multi platforms, your advice is good.
But I think we should use UTF-8 with BOM is better because this is a Windows-specific project and use some msvc-specific features.
It can make plenty of apps in Windows happier, especially the old ones.

Mouri.

FrankHB · 2019-05-15T08:58:28Z

@fghzxm

In reality, files solely containing UTF-8 data streams are often (inappropriately) treated as "UTF-8 files". I think it inappropriate because I don't find any industrial standards guarantee such assumption must work in general, and there do exist legitimate use of UTF-8 BOM in files (despite that the name of "BOM" itself is somewhat misleading).

Note that the treatment of "text = data" is only a traditional Unixy choice. (Although POSIX does specify the text mode as same as the binary mode, it is merely the convention involved with some I/O API. And whether insisting on this convention or not, one will still meet CR+LF compatibility problems with various protocols sooner or later.) I don't think it can work seamlessly whenever any additional metadata (like encoding) is needed (because there is no room inside the data stream), even though I agree UTF-8 is the de facto lingua franca of text files in the UNIX world.

If there are any really portable notions of UTF-8 files, they must be in some more restrictive forms, e.g. as the container to be a UTF-8 stream. As most other file formats in general, there are no guarantees the "concatenate" operation is closed in the set of UTF-8 files (namely, you cannot expect cat two UTF-8 files will result in one UTF-8 file; OTOH, UTF-8 data can be concatenated to form new UTF-8 streams arbitrarily). UTF-8 + BOM as the prefix is probably the simplest well-known and straightforward way to implement this idea, which is also compatible with the intentional use needing no changes to implementations in the current POSIX systems.

IMO, making wrong assumptions does not ease anyone in essential. I agree UTF-8 BOM in the beginning of the files used as C++ source make things complicated and uncomfortable, but merely ignoring the BOM is not the fix. It is only a workaround until the Unicode Consortium drops the specification of BOM as the prefix signature completely. Alternatively, you will have additional conventions, which also makes things complicated.

As of the editors, it is the issue of tooling. It does not reveal the defects of the idea to separate UTF-8 files from plain data streams explicitly.

That said, specific to repos like this, adding things to .props (as the additional convention) still makes life easier in some sorts. I'm not strongly against to the change to avoid BOM for this reason.

FrankHB · 2019-05-15T09:28:43Z

@MouriNaruto They depend.

Note that .props is also "some msvc-specific features" as Windows codepages. There is trade-off. (But anyway, UTF-8 + BOM is actually not Microsoft-specific. OTOH, missing handling of it is more or less UNIX-or-some-other-non-MS-specific.)

Also cmd does not accept batch files with UTF-8 BOM as prefixes.

Well BTW, I think the latter is a defect of cmd, not of the idea to use BOM in the scripts. UNIX shebang is more heavyweight (so more flexible) than BOM, except it is printable text. (Or perhaps binfmt magics are more obvious to illustrate the intention of BOM in this sense?)

miniksa · 2019-07-11T20:13:53Z

I took care of the UTF-8 problem in #1929 so I'll close this.

Fix warnings and errors caused by CJK characters.

7c78123

ZgblKylin mentioned this pull request May 8, 2019

Guide for build and installation #489

Closed

szylover approved these changes May 8, 2019

View reviewed changes

dlong11 reviewed May 11, 2019

View reviewed changes

miniksa mentioned this pull request Jul 11, 2019

Set utf-8 for the entire project #1929

Merged

5 tasks

miniksa closed this Jul 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix warnings and errors caused by unicode characters.#549

Fix warnings and errors caused by unicode characters.#549
ZgblKylin wants to merge 1 commit intomicrosoft:masterfrom
ZgblKylin:master

ZgblKylin commented May 8, 2019 •

edited

Loading

Uh oh!

msftclas commented May 8, 2019 •

edited

Loading

Uh oh!

fghzxm commented May 8, 2019 •

edited

Loading

Uh oh!

ZgblKylin commented May 8, 2019

Uh oh!

FrankHB commented May 10, 2019 •

edited

Loading

Uh oh!

dlong11 May 11, 2019

Uh oh!

fghzxm commented May 11, 2019

Uh oh!

MouriNaruto commented May 11, 2019 •

edited

Loading

Uh oh!

FrankHB commented May 15, 2019 •

edited

Loading

Uh oh!

FrankHB commented May 15, 2019 •

edited

Loading

Uh oh!

miniksa commented Jul 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

ZgblKylin commented May 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

msftclas commented May 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fghzxm commented May 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZgblKylin commented May 8, 2019

Uh oh!

FrankHB commented May 10, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dlong11 May 11, 2019

Choose a reason for hiding this comment

Uh oh!

fghzxm commented May 11, 2019

Uh oh!

MouriNaruto commented May 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FrankHB commented May 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FrankHB commented May 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miniksa commented Jul 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ZgblKylin commented May 8, 2019 •

edited

Loading

msftclas commented May 8, 2019 •

edited

Loading

fghzxm commented May 8, 2019 •

edited

Loading

FrankHB commented May 10, 2019 •

edited

Loading

MouriNaruto commented May 11, 2019 •

edited

Loading

FrankHB commented May 15, 2019 •

edited

Loading

FrankHB commented May 15, 2019 •

edited

Loading