Add lint to detect wrongly encoded diacritics due to UTF-8 mistaken for Latin-1 #931
Merged
christopher-henderson merged 37 commits into zmap:master on Mar 23, 2025
Added //nolint:all to the comment block to stop golangci-lint from complaining about duplicate words in the comment
Fixed import block
Fine to me. Co-authored-by: Christopher Henderson
As per Chris Henderson's suggestion, to "improve readability".
As per Chris Henderson's suggestion.
Added CABFEV_Sec9_2_8_Date
christopher-henderson approved these changes on Mar 23, 2025
christopher-henderson (Member) left a comment:
Naivety regarding encoding is perhaps right up there with off-by-one errors. The ASCII notion that 1 byte == 1 character is both elegant and a death trap.
This lint addresses the error where a UTF8String field in a certificate (e.g. Subject:organizationName) contains, instead of a certain diacritic, the Latin-1 (or Windows-1252) equivalent of the individual bytes that make up the UTF-8 encoding of that diacritic. For example, instead of the single letter "ü" whose UTF-8 encoding is 0xC3 0xBC, we find the two characters "Ã" and "¼" (which correspond to 0xC3 and 0xBC in the Latin-1 character set, respectively).
This mix-up is likely caused by processing UTF-8 strings with obsolete and/or buggy software that mistakenly treats them as Latin-1 or Windows-1252 strings (or, at any rate, assumes that every character is exactly one byte long).
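To make the failure mode concrete, here is a minimal Go sketch that reproduces the mix-up. The function name `misdecodeAsLatin1` is hypothetical and is not part of this PR or of zlint. It handles only the pure Latin-1 case, where each byte value equals its Unicode code point; Windows-1252 maps bytes 0x80 to 0x9F differently and is not modeled here.

```go
package main

import "fmt"

// misdecodeAsLatin1 (hypothetical helper) simulates the bug: it treats
// every byte of a UTF-8 string as one Latin-1 character and re-encodes
// the result as UTF-8.
func misdecodeAsLatin1(s string) string {
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		// Latin-1 byte values coincide with Unicode U+0000..U+00FF,
		// so each byte can be widened to a rune directly.
		runes = append(runes, rune(s[i]))
	}
	return string(runes)
}

func main() {
	// "ü" is 0xC3 0xBC in UTF-8; read byte by byte it becomes "Ã" and "¼".
	fmt.Printf("%q\n", misdecodeAsLatin1("Müller")) // prints "MÃ¼ller"
}
```

Pure ASCII input passes through unchanged, which is presumably why the bug surfaces only in names that contain diacritics.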
Numerous "ever-trusted" certificates affected by this error can be found on Censys, most quite old but some issued as late as December 2023.
Even though it's a problem that can hardly go unnoticed, I think it is useful to introduce a lint for it.
This lint is based on the fact that the two-character sequences resulting from the wrong encoding are highly unlikely in the real names of organizations, localities, persons, etc., so their occurrence in some field of the Subject is an almost certain signal that the mix-up has occurred.
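The detection idea can be sketched roughly as follows. This is a simplified illustration, not the lint's actual implementation: the function name `looksMisdecoded` and the exact ranges are assumptions. It flags any character in U+00C2..U+00DF (the possible lead bytes of a two-byte UTF-8 sequence, read as Latin-1) immediately followed by a character in U+0080..U+00BF (a continuation byte, read as Latin-1). The real lint may use a narrower or curated list of sequences.

```go
package main

import "fmt"

// looksMisdecoded (hypothetical helper) reports whether s contains a
// two-character sequence that looks like a two-byte UTF-8 encoding
// misread as Latin-1: a lead-byte character followed by a
// continuation-byte character. The ranges are an assumption made for
// this sketch.
func looksMisdecoded(s string) bool {
	var prev rune
	for _, r := range s {
		leadLike := prev >= 0x00C2 && prev <= 0x00DF
		contLike := r >= 0x0080 && r <= 0x00BF
		if leadLike && contLike {
			return true
		}
		prev = r
	}
	return false
}

func main() {
	fmt.Println(looksMisdecoded("MÃ¼ller GmbH")) // true
	fmt.Println(looksMisdecoded("Müller GmbH"))  // false
}
```

A correctly encoded "ü" is the single code point U+00FC, which falls outside the lead-byte range, so genuine diacritics are not flagged by this check.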