
Add lint to detect wrongly encoded diacritics due to UTF-8 mistaken for Latin-1 #931

Merged
christopher-henderson merged 37 commits into zmap:master from defacto64:utf8_latin1_mixup
Mar 23, 2025


@defacto64
Contributor

This lint addresses the error where a UTF8String field in a certificate (e.g. Subject:organizationName) contains, instead of a certain diacritic, the Latin-1 (or Windows-1252) equivalent of the individual bytes that make up the UTF-8 encoding of that diacritic. For example, instead of the single letter "ü" whose UTF-8 encoding is 0xC3 0xBC, we find the two characters "Ã" and "¼" (which correspond to 0xC3 and 0xBC in the Latin-1 character set, respectively).
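
To make the mechanism concrete, here is a minimal Go sketch that reproduces the mix-up. It is not taken from the lint itself; the function name `latin1Misread` is hypothetical and only illustrates what happens when each byte of a UTF-8 string is treated as one Latin-1 character.

```go
package main

import "fmt"

// latin1Misread (hypothetical helper) simulates software that treats every
// byte of a UTF-8 string as a single Latin-1 character and then stores the
// result as UTF-8 again.
func latin1Misread(s string) string {
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		// In Latin-1, the byte value equals the Unicode code point.
		runes = append(runes, rune(s[i]))
	}
	return string(runes)
}

func main() {
	// "ü" is 0xC3 0xBC in UTF-8, so it becomes U+00C3 followed by U+00BC.
	fmt.Printf("%q\n", latin1Misread("Zürich"))
}
```

With this sketch, the single letter "ü" in the input comes out as the two characters "Ã" and "¼", which is the pattern described above.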

This mix-up is likely caused by processing UTF-8 strings with obsolete and/or buggy software that mistakenly assumes them to be Latin-1 or Windows-1252 strings (or, at any rate, that they are made up of 1-byte characters).

Numerous "ever-trusted" certificates affected by this error can be found on Censys, most quite old but some issued as late as December 2023.

Even though it's a problem that can hardly go unnoticed, I think it is useful to introduce a lint for it.

This lint is based on the fact that the two-character sequences resulting from the wrong encoding are highly unlikely in the real names of organizations, localities, persons, etc., so their occurrence in some field of the Subject is an almost certain signal that the mix-up has occurred.
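
The heuristic could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name `looksLikeUTF8ReadAsLatin1` is hypothetical, it only covers two-byte UTF-8 sequences, and it assumes a plain Latin-1 misreading (Windows-1252 maps the bytes 0x80 to 0x9F to different characters, which this sketch ignores).

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// looksLikeUTF8ReadAsLatin1 (hypothetical helper) reports whether s contains
// two adjacent characters whose Latin-1 byte values form a valid two-byte
// UTF-8 encoding of a letter.
func looksLikeUTF8ReadAsLatin1(s string) bool {
	runes := []rune(s)
	for i := 0; i+1 < len(runes); i++ {
		lead, cont := runes[i], runes[i+1]
		// 0xC2..0xDF are two-byte UTF-8 lead bytes, 0x80..0xBF continuation bytes.
		if lead < 0xC2 || lead > 0xDF || cont < 0x80 || cont > 0xBF {
			continue
		}
		r, size := utf8.DecodeRune([]byte{byte(lead), byte(cont)})
		if r != utf8.RuneError && size == 2 && unicode.IsLetter(r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeUTF8ReadAsLatin1("M\u00c3\u00bcller GmbH")) // mis-encoded "ü"
	fmt.Println(looksLikeUTF8ReadAsLatin1("Müller GmbH"))            // correctly encoded
}
```

A correctly encoded "ü" is the single code point U+00FC, which falls outside the lead-byte range checked here, so genuine diacritics on their own do not trigger the check.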

defacto64 and others added 30 commits March 8, 2024 16:07
Added //nolint:all to comment block to keep golangci-lint from complaining about duplicate words in the comment
Fine to me.

Co-authored-by: Christopher Henderson <[email protected]>
As per Chris Henderson's suggestion, to "improve readability".
As per Chris Henderson's suggestion.
Added CABFEV_Sec9_2_8_Date
Member

@christopher-henderson christopher-henderson left a comment

Naivety regarding encoding is perhaps right up there with off-by-one errors. The ASCII notion that 1 byte == 1 character is both elegant and a death trap.

@christopher-henderson christopher-henderson merged commit 7a0479c into zmap:master Mar 23, 2025
5 checks passed