Commit ef6e079

Let string splitters respect East_Asian_Width property (#3445)
This patch changes the preview style so that string splitters respect the Unicode East Asian Width[^1] property. If you are not familiar with CJK languages, the motivation may not be immediately clear, so let me elaborate with some examples.

Traditionally, East Asian characters (including punctuation) have taken up twice as much space as European letters and stops when rendered in a monospace typeface. Compare the following characters:

```
abcdefg.
글、字。
```

The characters on the first line are half-width, and those on the second line are full-width. (Also note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent those full-width characters from exceeding the maximum columns per line, we need to count their *width* rather than the number of characters. Again, consider the following characters:

```
글、字。
```

These are just 4 characters, but their total width is 8.

Suppose we want to maintain up to 4 columns per line with the following text:

```
abcdefg.
글、字。
```

How should it be split then? We want it to look like:

```
abcd
efg.
글、
字。
```

However, Black currently turns it into this:

```
abcd
efg.
글、字。
```

That is because Black currently counts the number of characters in the line instead of measuring their width.

So, how can we measure the width? How can we tell whether a character is full- or half-width? What if half-width and full-width characters are mixed in a line? That is why Unicode defines a property named `East_Asian_Width`, which classifies every character according to its width in fixed-width typesetting.

This partially addresses #1197, but only for string splitters. The other parts need to be fixed as well in future patches.

This was implemented by copying rich's own approach to handling wide characters: generate a table using wcwidth, check it into source control, and use it to drive helper functions in Black's logic. This gets us the best of both worlds: accuracy and performance (and lets us update as per our stability policy too!).

Co-authored-by: Jelle Zijlstra <[email protected]>
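The difference between counting characters and counting columns can be sketched with the standard library alone. This is only an illustration of the idea in the commit message, not the code added by this patch: it uses `unicodedata.east_asian_width` instead of the wcwidth-derived table, and `display_width` and `wrap_by_width` are hypothetical helper names.

```python
import unicodedata
from typing import List


def display_width(text: str) -> int:
    """Approximate the number of monospace columns that text occupies."""
    total = 0
    for ch in text:
        # "W" (Wide) and "F" (Fullwidth) characters take two columns.
        # Everything else is counted as one column in this simplified sketch.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def wrap_by_width(text: str, max_width: int) -> List[str]:
    """Greedily cut text into chunks of at most max_width columns."""
    lines: List[str] = []
    current = ""
    for ch in text:
        if current and display_width(current + ch) > max_width:
            # Adding this character would overflow the line, so start a new one.
            lines.append(current)
            current = ch
        else:
            current += ch
    if current:
        lines.append(current)
    return lines
```

With these helpers, `len("글、字。")` is 4 while `display_width("글、字。")` is 8, and `wrap_by_width("글、字。", 4)` gives `["글、", "字。"]`, which matches the desired output described above.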
1 parent 5c064a9 commit ef6e079

7 files changed, +678 -21 lines changed

CHANGES.md

+5
@@ -23,6 +23,11 @@
   compared to their non-async version. (#3609)
 - `with` statements that contain two context managers will be consistently wrapped in
   parentheses (#3589)
+- Let string splitters respect [East Asian Width](https://www.unicode.org/reports/tr11/)
+  (#3445)
+- Now long string literals can be split after East Asian commas and periods (`、` U+3001
+  IDEOGRAPHIC COMMA, `。` U+3002 IDEOGRAPHIC FULL STOP, & `，` U+FF0C FULLWIDTH COMMA)
+  besides before spaces (#3445)
 - For stubs, enforce one blank line after a nested class with a body other than just
   `...` (#3564)
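
The second changelog entry can be read as a rule for where a long string may be cut: before a space, or after one of the three listed punctuation marks. The following is a minimal sketch of that rule only. `EAST_ASIAN_STOPS` and `candidate_split_indices` are hypothetical names, and the actual splitter in Black applies further constraints that are not shown in this diff.

```python
from typing import List

# The three characters named in the changelog entry: 、 。 ，
EAST_ASIAN_STOPS = "\u3001\u3002\uff0c"


def candidate_split_indices(text: str) -> List[int]:
    """Return indices i where text could be cut into text[:i] and text[i:]."""
    indices = set()
    for i, ch in enumerate(text):
        if ch == " " and i > 0:
            # Split before a space: the space begins the next chunk.
            indices.add(i)
        elif ch in EAST_ASIAN_STOPS and i + 1 < len(text):
            # Split after an East Asian comma or period.
            indices.add(i + 1)
    # A set avoids duplicates when a stop is immediately followed by a space.
    return sorted(indices)
```

For example, `candidate_split_indices("글、字。")` returns `[2]`, so the text can be cut into `"글、"` and `"字。"` even though it contains no spaces.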

scripts/make_width_table.py

+73
@@ -0,0 +1,73 @@
+"""Generates a width table for Unicode characters.
+
+This script generates a width table for Unicode characters that are not
+narrow (width 1). The table is written to src/black/_width_table.py (note
+that although this file is generated, it is checked into Git) and is used
+by the char_width() function in src/black/strings.py.
+
+You should run this script when you upgrade wcwidth, which is expected to
+happen when a new Unicode version is released. The generated table contains
+the version of wcwidth and Unicode that it was generated for.
+
+In order to run this script, you need to install the latest version of wcwidth.
+You can do this by running:
+
+    pip install -U wcwidth
+
+"""
+import sys
+from os.path import basename, dirname, join
+from typing import Iterable, Tuple
+
+import wcwidth
+
+
+def make_width_table() -> Iterable[Tuple[int, int, int]]:
+    start_codepoint = -1
+    end_codepoint = -1
+    range_width = -2
+    for codepoint in range(0, sys.maxunicode + 1):
+        width = wcwidth.wcwidth(chr(codepoint))
+        if width <= 1:
+            # Ignore narrow characters along with zero-width characters so that
+            # they are treated as single-width.  Note that treating zero-width
+            # characters as single-width is consistent with the heuristics built
+            # on top of str.isascii() in the str_width() function in strings.py.
+            continue
+        if start_codepoint < 0:
+            start_codepoint = codepoint
+            range_width = width
+        elif width != range_width or codepoint != end_codepoint + 1:
+            yield (start_codepoint, end_codepoint, range_width)
+            start_codepoint = codepoint
+            range_width = width
+        end_codepoint = codepoint
+    if start_codepoint >= 0:
+        yield (start_codepoint, end_codepoint, range_width)
+
+
+def main() -> None:
+    table_path = join(dirname(__file__), "..", "src", "black", "_width_table.py")
+    with open(table_path, "w") as f:
+        f.write(
+            f"""# Generated by {basename(__file__)}
+# wcwidth {wcwidth.__version__}
+# Unicode {wcwidth.list_versions()[-1]}
+import sys
+from typing import List, Tuple
+
+if sys.version_info < (3, 8):
+    from typing_extensions import Final
+else:
+    from typing import Final
+
+WIDTH_TABLE: Final[List[Tuple[int, int, int]]] = [
+"""
+        )
+        for triple in make_width_table():
+            f.write(f"    {triple!r},\n")
+        f.write("]\n")
+
+
+if __name__ == "__main__":
+    main()
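
The script emits sorted, non-overlapping `(start, end, width)` triples and leaves out every character of width 1 or less. One plausible way to consume such a table is a binary search over the range starts, falling back to width 1 for anything not listed. The sketch below assumes that approach; it is not the `char_width()` implementation from `src/black/strings.py`, which is not shown in this part of the diff. `SAMPLE_WIDTH_TABLE` is a small hand-written stand-in rather than real generated output, and `lookup_width` is a hypothetical name.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hand-written sample in the (start, end, width) shape that
# make_width_table() yields. Illustrative only, NOT the generated table.
SAMPLE_WIDTH_TABLE: List[Tuple[int, int, int]] = [
    (0x3001, 0x3002, 2),
    (0x4E00, 0x9FFF, 2),
    (0xAC00, 0xD7A3, 2),
]
# Range starts, kept separately so bisect can search them directly.
_STARTS = [start for start, _, _ in SAMPLE_WIDTH_TABLE]


def lookup_width(char: str) -> int:
    """Return the column width of a single character using the sample table."""
    codepoint = ord(char)
    # Index of the last range whose start is <= codepoint, or -1 if none.
    i = bisect_right(_STARTS, codepoint) - 1
    if i >= 0:
        _, end, width = SAMPLE_WIDTH_TABLE[i]
        if codepoint <= end:
            return width
    # Characters absent from the table (narrow or zero-width) count as one column.
    return 1
```

With this sample table, `sum(lookup_width(c) for c in "글、字。")` is 8 and `lookup_width("a")` is 1. Storing ranges instead of one entry per code point keeps the checked-in file small, and the lookup cost grows only logarithmically with the number of ranges.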
