SP5: improve international font detection #3220

@swharden

Maybe the solution is to replace MatchCharacter(String, Char) with MatchCharacter(String, Int32) because some "characters" don't fit in the 16-bit char type

EDIT: Tests below indicate this is not the case

Additional context from Discord (Thanks @prime167 and ChrisL)

WpfPlot1.Plot.Axes.Left.Label.FontName = Fonts.Detect("测试");   does not work

"that string is a mix of UTF16 and UTF32 encodings": If we're talking about the Chinese characters "测 试 时 间", they all fit in 16 bits each. Not sure how much you already know about Unicode, so sorry if I'm overexplaining, but maybe it will be interesting to other people in that case. No guarantees that it is perfectly accurate, let me know if I got any details wrong.

"Character" is confusing. To users it probably means a single "character" on the screen. To a programmer it might mean uint8, uint16, C# 'char', UTF-8 / UTF-16 / UTF-32.

What you see on the screen can be called "grapheme cluster" (C# "text element") instead to be precise.

A "grapheme cluster" consists of one or more Unicode code points. Code points can be combined in various ways to create complex grapheme clusters. In theory, a grapheme cluster can require an unlimited number of code points to represent.
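To make the difference between code units and grapheme clusters concrete, here is a minimal sketch using only the .NET base class library. It is written as top-level statements (C# 9 or later), and the variable names are just for illustration.

```csharp
using System;
using System.Globalization;

// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single "é".
string accented = "e\u0301";

// Length counts UTF-16 code units (C# 'char' values), not what the user sees.
int codeUnits = accented.Length;

// StringInfo counts text elements (grapheme clusters).
int textElements = new StringInfo(accented).LengthInTextElements;

Console.WriteLine($"code units: {codeUnits}, text elements: {textElements}");
// prints "code units: 2, text elements: 1"
```

For emoji sequences joined with U+200D, the text element count depends on the runtime: as far as I know, .NET 5 and later follow the Unicode extended grapheme cluster rules, while older runtimes split such sequences into several text elements.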

A Unicode code point is a logical value in the range U+0000 to U+10FFFF. It needs at most 21 bits, so it is usually handled as a 32-bit integer. It can be physically encoded using UTF-8 / UTF-16 / UTF-32.

UTF-8 and UTF-16 are not uint8 / uint16. Instead, they are variable length encodings of a logical 32-bit Unicode code point. They only require one uint8 or uint16 for more common code points, but can require 2-4x uint8 or 2x uint16 in some cases.
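A rough sketch of the variable-length behaviour described above, counting the code units each encoding needs for three sample code points. `CodeUnitCounts` is a hypothetical helper written for this illustration, not part of any library.

```csharp
using System;
using System.Text;

// Hypothetical helper: number of code units a string needs in each encoding.
static (int Utf8, int Utf16, int Utf32) CodeUnitCounts(string s) => (
    Encoding.UTF8.GetByteCount(s),        // UTF-8 code units are single bytes
    s.Length,                             // a C# string is a sequence of UTF-16 code units
    Encoding.UTF32.GetByteCount(s) / 4);  // UTF-32 code units are 4 bytes each

var ascii = CodeUnitCounts("A");           // U+0041
var cjk = CodeUnitCounts("\u6D4B");        // U+6D4B, the first character of "测试"
var emoji = CodeUnitCounts("\U0001F692");  // U+1F692 FIRE ENGINE, outside the BMP

Console.WriteLine($"U+0041:  {ascii.Utf8} / {ascii.Utf16} / {ascii.Utf32}");  // 1 / 1 / 1
Console.WriteLine($"U+6D4B:  {cjk.Utf8} / {cjk.Utf16} / {cjk.Utf32}");        // 3 / 1 / 1
Console.WriteLine($"U+1F692: {emoji.Utf8} / {emoji.Utf16} / {emoji.Utf32}");  // 4 / 2 / 1
```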

The individual uint8 / uint16 values are called code units (not code points). One or two UTF-16 code units (=uint16) are required to represent a Unicode code point.

A C# 'char' is always 16 bits. It represents a single UTF-16 code unit. If a code point requires two UTF-16 code units (a surrogate pair), then it cannot be placed in a single C# 'char'.
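This is the case where a Char parameter is not enough and an Int32 code point would be needed. A small sketch with the standard `char` helpers, assuming nothing beyond the .NET base class library:

```csharp
using System;

// A BMP code point such as U+6D4B fits in a single 'char'.
char bmp = '\u6D4B';
bool bmpIsSurrogate = char.IsSurrogate(bmp);

// U+1F692 FIRE ENGINE is outside the BMP, so it only exists as a string of two 'char' values.
string fireEngine = char.ConvertFromUtf32(0x1F692);
char high = fireEngine[0];
char low = fireEngine[1];

// The surrogate pair decodes back to the original code point, which fits in an int.
int codePoint = char.ConvertToUtf32(high, low);

Console.WriteLine($"U+{(int)bmp:X4}: surrogate = {bmpIsSurrogate}");
// prints "U+6D4B: surrogate = False"
Console.WriteLine($"U+{codePoint:X}: {fireEngine.Length} chars, 0x{(int)high:X4} 0x{(int)low:X4}");
// prints "U+1F692: 2 chars, 0xD83D 0xDE92"
```

The Chinese characters from the report are all in the BMP, so each one fits in a single 'char', which is consistent with the EDIT note at the top.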

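Example: A code point outside the Basic Multilingual Plane (BMP, the first 64k code points), such as "💀" (U+1F480 SKULL), is a single Unicode code point = 1x UTF-32 code unit, but it requires two UTF-16 code units. Note that "☠️" is not such a case: it is normally U+2620 SKULL AND CROSSBONES followed by U+FE0F VARIATION SELECTOR-16, two code points that are both in the BMP, so it takes two UTF-16 code units for a different reason.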

Example: The string "👩🏽‍🚒" is represented by four Unicode code points and contains seven C# 'char' instances:

U+1F469 WOMAN
U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
U+200D ZERO WIDTH JOINER
U+1F692 FIRE ENGINE
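The breakdown above can be reproduced with a short loop that decodes one code point at a time. This is only a sketch using `char.ConvertToUtf32` and `char.IsSurrogatePair`; on .NET Core 3.0 or later, `string.EnumerateRunes()` should give the same result.

```csharp
using System;
using System.Collections.Generic;

// The firefighter sequence from the example, written with escapes to avoid source encoding issues.
string firefighter = "\U0001F469\U0001F3FD\u200D\U0001F692";

var codePoints = new List<int>();
for (int i = 0; i < firefighter.Length; )
{
    // Decode the code point starting at index i (one or two 'char' values).
    codePoints.Add(char.ConvertToUtf32(firefighter, i));
    // Skip two 'char' values for a surrogate pair, otherwise one.
    i += char.IsSurrogatePair(firefighter, i) ? 2 : 1;
}

Console.WriteLine($"chars: {firefighter.Length}, code points: {codePoints.Count}");
// prints "chars: 7, code points: 4"
foreach (int cp in codePoints)
    Console.WriteLine($"U+{cp:X4}");
// prints U+1F469, U+1F3FD, U+200D, U+1F692 (one per line)
```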

--ChrisL
