Fix warnings and errors caused by unicode characters.#549
ZgblKylin wants to merge 1 commit into microsoft:master from ZgblKylin:master
Conversation
Though effective here, I don't think playing with source encoding is a good idea — UTF-8 with BOM is not recommended according to the Unicode standard. I would rather not see this branch merged into master (in fact we should probably do the opposite — scan all source files and prune their BOMs — as Git does such a bad job at dealing with BOMs when checking in files :)), even though it has been very helpful for us all. (I have an open PR #458 that should address the same issue, though mine has its own unresolved issues as well.)
Yes, you gave a better solution. I'm not familiar with VS project files, so I just provided a workaround. In fact, changing the file encoding to UTF-8 with BOM is itself just a workaround.
FYI, here is the upstream issue. However, technically, treating a file without a special magic (here, the BOM) by guessing can be considered a feature rather than a bug for a conforming implementation of C++, although codepages are almost always annoying. This is actually the current choice of VC++. (See the remarks section here.) So, it is still reasonable to make changes in this repo.

The canonical workaround is to specify the source character set. This is supported by passing `/source-charset:utf-8` to the compiler. This method could be sufficient for this repo, but in general it has some inconvenience because it adds extra dependencies on project configurations.

@fghzxm It is true that the Unicode standard recommends that no BOM be used for UTF-8. However, the SO answer is wrong on the topic.
True.
False. The referenced Unicode standard does not actually talk about UTF-8 files on this topic. Without the BOM, a file is nothing more than a sequence of code units assumed to be UTF-8 data. There are also no integrity guarantees at the level of the file system to ensure such a sequence is indeed well-formed UTF-8. So, if you only get a file without a signature like the BOM, you have no knowledge that it will work as UTF-8 before the data is actually checked (again). Once the check is done, there is something to indicate that the encoding scheme is known. When saved externally after serialization, the BOM used as a signature is a simple way to mark Unicode encoding schemes, and I don't find a better alternative, because other forms of metadata are even less portable. (If this still sounds obscure to you, think twice about why we need type safety and how strong typing helps. The BOM here works exactly like the metadata used in the external representation or the implementation of some type systems.)
```cpp
switch(lang)
{
case TEST_LANG_CYRILLIC:
    str = "Лорем ипсум долор сит амет, пер цлита поссит ех, ат мунере фабулас петентиум сит.";
```
Another alternative is to manually encode the hard-coded strings as UTF-8 and use them in the source. This is how I would handle this in a test that doesn't have a resource file; look at the TEST_LANG_GOOD_POUND case below as an example. In my day job we follow the Google coding standards, and here is what they recommend in this scenario. (They require all of their source code to be encoded as UTF-8, which is common in cross-platform code.)
@FrankHB You are right in that marking UTF-8 source files with a BOM makes MSVC happier and won't infuriate other compilers after all. (In reality, however, most UTF-8 text files in the world are not marked with a BOM, so if we accept the idea that a UTF-8 file should self-describe this fact, most existing UTF-8 files are already non-portable.) I do not like the UTF-8 BOM because
@fghzxm Mouri.
In reality, files solely containing UTF-8 data streams are often (inappropriately) treated as "UTF-8 files". I think this inappropriate because I don't find any industrial standard that guarantees such an assumption must work in general, and there do exist legitimate uses of the UTF-8 BOM in files (despite the fact that the name "BOM" itself is somewhat misleading). Note that the treatment "text = data" is only a traditional Unixy choice. (Although POSIX does specify the text mode to be the same as the binary mode, that is merely a convention involved with some I/O APIs. And whether one insists on this convention or not, one will still meet CR+LF compatibility problems with various protocols sooner or later.) I don't think it can work seamlessly whenever any additional metadata (like the encoding) is needed (because there is no room inside the data stream), even though I agree UTF-8 is the de facto lingua franca of text files in the UNIX world. If there are any really portable notions of UTF-8 files, they must be in some more restrictive forms, e.g. with the container required to be a UTF-8 stream. As with most other file formats in general, there is no guarantee that the "concatenate" operation is closed in the set of UTF-8 files (namely, you cannot expect that concatenating two such files always yields another one).

IMO, making wrong assumptions does not ease anyone in essence. I agree that a UTF-8 BOM at the beginning of files used as C++ source makes things complicated and uncomfortable, but merely ignoring the BOM is not the fix. It is only a workaround until the Unicode Consortium drops the specification of the BOM as a prefix signature completely. Alternatively, you will have additional conventions, which also makes things complicated.

As for the editors, that is an issue of tooling. It does not reveal defects in the idea of separating UTF-8 files from plain data streams explicitly. That said, specific to repos like this, adding things to
@MouriNaruto They depend. Note that Also Well BTW, I think the latter is a defect of
I took care of the UTF-8 problem in #1929 so I'll close this.
Encoding of modified files is switched from UTF-8 to UTF-8 with BOM.