[Epic] Text Buffer rewrite  #8000

@DHowett

This is the issue tracking the great buffer rewrite of 202x.

Aims

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure UTF-16 code points in aggregate, rather than as individual code units.

I have done a small amount of work in WriteCharsLegacy. It is slow going.
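To make "in aggregate" concrete, here is a minimal sketch of walking UTF-16 text one code point at a time instead of one code unit at a time. `DecodeUtf16` and the two predicates are hypothetical helpers written for illustration; they are not existing functions in WriteCharsLegacy, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Leading (high) surrogates occupy U+D800..U+DBFF.
constexpr bool IsLeadSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
// Trailing (low) surrogates occupy U+DC00..U+DFFF.
constexpr bool IsTrailSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: yields one entry per code point rather than per code unit.
// An unpaired surrogate is replaced with U+FFFD (an assumption of this sketch).
std::vector<char32_t> DecodeUtf16(std::u16string_view text)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t u = text[i];
        if (IsLeadSurrogate(u) && i + 1 < text.size() && IsTrailSurrogate(text[i + 1]))
        {
            // Combine the pair into a single supplementary-plane code point.
            const char32_t hi = static_cast<char32_t>(u) - 0xD800;
            const char32_t lo = static_cast<char32_t>(text[i + 1]) - 0xDC00;
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i; // the trailing unit has been consumed
        }
        else if (IsLeadSurrogate(u) || IsTrailSurrogate(u))
        {
            out.push_back(0xFFFD);
        }
        else
        {
            out.push_back(u);
        }
    }
    return out;
}
```

Any width measurement would then run once per decoded code point, so both halves of a surrogate pair are measured together.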

Motivation

#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)

In the common case (all narrow text), this is terribly wasteful.
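As a rough illustration of the per-cell cost, here is a sketch with simplified stand-in types. `DbcsAttributeSketch` and `CharRowCellSketch` are hypothetical and much smaller than the real DbcsAttribute and CharRowCell; `char16_t` stands in for the 2-byte `wchar_t` on Windows so the sketch behaves the same on other platforms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for DbcsAttribute: one byte of per-cell width state.
enum class DbcsAttributeSketch : std::uint8_t { Single, Leading, Trailing };

// Hypothetical stand-in for CharRowCell: 2 bytes of text plus 1 byte of attribute.
// The compiler may add padding to keep the 2-byte member aligned.
struct CharRowCellSketch
{
    char16_t ch;
    DbcsAttributeSketch attr;
};

// Bytes needed to store one row of `columns` cells in the per-cell layout.
constexpr std::size_t OldRowBytes(std::size_t columns)
{
    return columns * sizeof(CharRowCellSketch);
}
```

For a row of all-narrow text, the attribute byte (and any padding) is paid for every cell even though it carries the same value each time.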

To better support characters that are represented by one or more code units, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).
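A minimal sketch of that layout, assuming a plain vector of (width, run length) pairs as a stand-in for til::rle, whose actual interface is not shown here. `RowSketch`, `Run`, and `Append` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One run: (column width, number of consecutive code units with that width).
// This is a stand-in for til::rle, not its real API.
using Run = std::pair<std::uint8_t, std::size_t>;

// Hypothetical row layout: one UTF-16 string plus run-length-encoded column counts.
struct RowSketch
{
    std::u16string units;
    std::vector<Run> cols;
};

// Appends one code unit and its column width, extending the last run when the
// width repeats and starting a new run otherwise.
void Append(RowSketch& row, char16_t unit, std::uint8_t width)
{
    row.units.push_back(unit);
    if (!row.cols.empty() && row.cols.back().first == width)
    {
        ++row.cols.back().second;
    }
    else
    {
        row.cols.emplace_back(width, 1);
    }
}
```

With this encoding, a row of all-narrow text collapses to a single run regardless of its length, while a surrogate pair contributes a nonzero width for its first unit and a zero width for its second.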

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR    ﷽
UNITS fdfd
 COLS   12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR        👍🏼
UNITS d83d dc4d d83c dffc
 COLS    2    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.
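The summation can be sketched as follows, working on an uncompressed column count table for clarity (with a run-length-encoded table the same sums would be taken run by run). `IndexForColumn` and `ColumnForIndex` are hypothetical helpers, not functions from the codebase.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the index of the code unit that begins the glyph covering `column`,
// or std::nullopt when `column` lies past the end of the row.
std::optional<std::size_t> IndexForColumn(const std::vector<std::uint8_t>& cols, std::size_t column)
{
    std::size_t start = 0; // first column covered by the glyph at index i
    for (std::size_t i = 0; i < cols.size(); ++i)
    {
        if (cols[i] == 0)
        {
            continue; // continuation unit: it occupies no columns of its own
        }
        if (column < start + cols[i])
        {
            return i;
        }
        start += cols[i];
    }
    return std::nullopt;
}

// Returns the column at which the glyph containing code unit `index` begins:
// the sum of all column counts before the start of that glyph.
std::size_t ColumnForIndex(const std::vector<std::uint8_t>& cols, std::size_t index)
{
    // Back up from a continuation unit to the unit that starts its glyph.
    while (index > 0 && index < cols.size() && cols[index] == 0)
    {
        --index;
    }
    std::size_t column = 0;
    for (std::size_t i = 0; i < index && i < cols.size(); ++i)
    {
        column += cols[i];
    }
    return column;
}
```

Using the column counts from the representative complicated case (2 0 1 1 1 0), columns 0 and 1 both map to code unit 0, column 2 maps to code unit 2, and code unit 4 starts at column 4.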

Work Log

Other issues that might just be fixed by this

Metadata

    Labels

    Area-Output: Related to output processing (inserting text into buffer, retrieving buffer text, etc.)
    Area-Server: Down in the muck of API call servicing, interprocess communication, eventing, etc.
    Issue-Scenario
    Needs-Tag-Fix: Doesn't match tag requirements
    Product-Conhost: For issues in the Console codebase
    Product-Terminal: The new Windows Terminal.
