Skip to content

Fix warnings and errors caused by unicode characters.#549

Closed
ZgblKylin wants to merge 1 commit intomicrosoft:masterfrom
ZgblKylin:master
Closed

Fix warnings and errors caused by unicode characters.#549
ZgblKylin wants to merge 1 commit intomicrosoft:masterfrom
ZgblKylin:master

Conversation

@ZgblKylin
Copy link

@ZgblKylin ZgblKylin commented May 8, 2019

Encoding of modified files is switched from UTF-8 to UTF-8 BOM.

@msftclas
Copy link

msftclas commented May 8, 2019

CLA assistant check
All CLA requirements met.

@fghzxm
Copy link
Contributor

fghzxm commented May 8, 2019

Though effective here, I don't think playing with source encoding is a good idea — UTF-8 with BOM is not recommended according to the Unicode standard. I would rather not like to see this branch getting merged into master (in fact we should probably do the opposite — scan and prune all source files of BOM — as Git does such a bad job at dealing with BOM when checking in files :)), even though it has been very helpful for us all. (I have an open PR #458 that should address the same issue, though mine has its own unresolved issues as well.)

@ZgblKylin
Copy link
Author

Though effective here, I don't think playing with source encoding is a good idea — UTF-8 with BOM is not recommended according to the Unicode standard. I would rather not like to see this branch getting merged into master (in fact we should probably do the opposite — scan and prune all source files of BOM — as Git does such a bad job at dealing with BOM when checking in files :)), even though it has been very helpful for us all. (I have an open PR #458 that should address the same issue, though mine has its own unresolved issues as well.)

Yes, you gave a better solution. I'm not familiar with VS project files, so just provide some workaround.

In fact, change the file encoding into UTF-16 LE also works, and it satisfied the standard, but it's not friendly to some editors and static-analyzer(clang).

@FrankHB
Copy link

FrankHB commented May 10, 2019

FYI, here is the upstream issue.

However, technically, treating a file without special magic (here, the BOM) with guesses can be considered as a feature rather than a bug for a conforming implementation of C++, albeit codepages are almost always annoying. This is actually the choice of VC++ at current. (See the remarks section here.) So, it is still reasonable to make changes in this repo.

The canonical way to work around is to specify the source character set. This is supported by passing /source-charset:utf-8 option to the comiler driver. Or futher, use /utf-8 to set the execution character set as well.

This method could be sufficient for this repo. But in general, it has some inconvenience because it adds some extra dependencies on project configurations (e.g. .vcxproj files here). It occasionally works in reality because most other implementations just treat plain text (without BOM) as "text encoded in UTF-8", so the configurations are not relied on. In other word, "plain text" with unspecified encoding for C++ source is essentially buggy and not portable by default (when there is no idea about which implementations are supported). I'm afraid there is no other way to make the source files themselves (rather than source files plus project configurations) portable unless standardized magics like BOM to the files are enforced.

@fghzxm It is true that Unicode standard recommends no BOM should used for UTF-8. However, the SO answer is wrong on the topic.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

True.

According to the Unicode standard, BOM for UTF-8 files is not recommended

False. The referenced Unicode standard actually does not talk about UTF-8 files on this topic.

There is nothing else than a sequence of code units assumed as UTF-8 data when the BOM is excluded from a file. There is also no integrity guarantees at the level of the file system to ensure such a sequence is indeed well-formed UTF-8.

So, if you do only get a file without a signature like BOM, you don't have knowledge that it must work with UTF-8 before the data is actually checked (again).

Once the check is done, there is something to indicate the encoding scheme is known. When saved externally after serialization, the BOM used as a signature is a simple way for Unicode encoding schemes, and I don't find a better alternative, because other forms of metadata is even less portable.

(If this still sounds obscure to you, think twice why we need type safety and how strong typing help. BOM here works exactly as the metadata used in the external representation or the implementation of some type systems.)

switch(lang)
{
case TEST_LANG_CYRILLIC:
str = "Лорем ипсум долор сит амет, пер цлита поссит ех, ат мунере фабулас петентиум сит.";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to manually encode the hard coded strings as UTF-8 and use them in the source. This is how I would handle this in a test that doesn't have a resource file. Look at TEST_LANG_GOOD_POUND case below as an example. In my day job we follow the Google Coding Standards and here is what they recommend in this scenario. (They require all of their source code be encoded as UTF-8 - which is common in cross-platform code)

@fghzxm
Copy link
Contributor

fghzxm commented May 11, 2019

@FrankHB You are right in that marking UTF-8 source files with BOM makes MSVC happier and won't infuriate other compilers after all. (In reality however, most UTF-8 text files in the world are not marked with BOM, so if we accept the idea that a UTF-8 file should self-describe this fact, most existing UTF-8 files are already non-portable.)

I do not like UTF-8 BOM because

  1. It is not straightforward to control BOM behaviors when saving files with most text editors (I'm saying most mainstream editors other than Code), and
  2. Git does not strip BOMs when checking in files, resulting in spurious file changes; moreover,
  3. UTF-8 without BOM is the de facto lingua franca in the world of encodings, nearly universally supported, and it's easier to tame MSVC with a single /source-charset:utf-8 option in a .props file than to mess up the Git database.

@MouriNaruto
Copy link

MouriNaruto commented May 11, 2019

@fghzxm
If we are developing an app which supports multi platforms, your advice is good.
But I think we should use UTF-8 with BOM is better because this is a Windows-specific project and use some msvc-specific features.
It can make plenty of apps in Windows happier, especially the old ones.

Mouri.

@FrankHB
Copy link

FrankHB commented May 15, 2019

@fghzxm

In reality, files solely containing UTF-8 data streams are often (inappropriately) treated as "UTF-8 files". I think it inappropriate because I don't find any industrial standards guarantee such assumption must work in general, and there do exist legitimate use of UTF-8 BOM in files (despite that the name of "BOM" itself is somewhat misleading).

Note that the treatment of "text = data" is only a traditional Unixy choice. (Although POSIX does specify the text mode as same as the binary mode, it is merely the convention involved with some I/O API. And whether insisting on this convention or not, one will still meet CR+LF compatibility problems with various protocols sooner or later.) I don't think it can work seamlessly whenever any additional metadata (like encoding) is needed (because there is no room inside the data stream), even though I agree UTF-8 is the de facto lingua franca of text files in the UNIX world.

If there are any really portable notions of UTF-8 files, they must be in some more restrictive forms, e.g. as the container to be a UTF-8 stream. As most other file formats in general, there are no guarantees the "concatenate" operation is closed in the set of UTF-8 files (namely, you cannot expect cat two UTF-8 files will result in one UTF-8 file; OTOH, UTF-8 data can be concatenated to form new UTF-8 streams arbitrarily). UTF-8 + BOM as the prefix is probably the simplest well-known and straightforward way to implement this idea, which is also compatible with the intentional use needing no changes to implementations in the current POSIX systems.

IMO, making wrong assumptions does not ease anyone in essential. I agree UTF-8 BOM in the beginning of the files used as C++ source make things complicated and uncomfortable, but merely ignoring the BOM is not the fix. It is only a workaround until the Unicode Consortium drops the specification of BOM as the prefix signature completely. Alternatively, you will have additional conventions, which also makes things complicated.

As of the editors, it is the issue of tooling. It does not reveal the defects of the idea to separate UTF-8 files from plain data streams explicitly.

That said, specific to repos like this, adding things to .props (as the additional convention) still makes life easier in some sorts. I'm not strongly against to the change to avoid BOM for this reason.

@FrankHB
Copy link

FrankHB commented May 15, 2019

@MouriNaruto They depend.

Note that .props is also "some msvc-specific features" as Windows codepages. There is trade-off. (But anyway, UTF-8 + BOM is actually not Microsoft-specific. OTOH, missing handling of it is more or less UNIX-or-some-other-non-MS-specific.)

Also cmd does not accept batch files with UTF-8 BOM as prefixes.

Well BTW, I think the latter is a defect of cmd, not of the idea to use BOM in the scripts. UNIX shebang is more heavyweight (so more flexible) than BOM, except it is printable text. (Or perhaps binfmt magics are more obvious to illustrate the intention of BOM in this sense?)

@miniksa miniksa mentioned this pull request Jul 11, 2019
5 tasks
@miniksa
Copy link
Member

miniksa commented Jul 11, 2019

I took care of the UTF-8 problem in #1929 so I'll close this.

@miniksa miniksa closed this Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants