Replace printables table with unicode_data.rs tables #155527

Open
Jules-Bertholet wants to merge 4 commits into rust-lang:main from Jules-Bertholet:riir-printable

@Jules-Bertholet
Contributor

This gets rid of the printable.py script, ensuring that unicode-table-generator handles all our Unicode data table generation needs.

There are also some drive-by documentation improvements in library/core/char/methods.rs.

There is one change in behavior: we now consider all characters with the Default_Ignorable_Code_Point property to be unprintable. These characters can be hidden/invisible otherwise.
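
One way to observe this change locally is to compare a character with its `escape_debug()` output. The following is a minimal sketch using only the stable `char::escape_debug` method; the helper names `is_debug_escaped` and `describe` are hypothetical and not part of this PR. The result for a default-ignorable character such as U+3164 HANGUL FILLER depends on whether the toolchain includes this change, so the sketch deliberately asserts nothing about it.

```rust
/// Hypothetical helper: returns true if `escape_debug` turns `c` into an
/// escape sequence instead of yielding the character unchanged.
fn is_debug_escaped(c: char) -> bool {
    let escaped: String = c.escape_debug().collect();
    escaped != c.to_string()
}

/// Hypothetical helper: one-line report for a character, suitable for
/// comparing toolchains before and after the change.
fn describe(c: char) -> String {
    format!(
        "U+{:04X}: escape_debug = {}, escaped = {}",
        c as u32,
        c.escape_debug(),
        is_debug_escaped(c)
    )
}
```

Calling `describe('\u{3164}')` on two toolchains and comparing the reports should show whether the character is escaped by each.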

I've chosen to give each Unicode property its own table, instead of merging them all into one. This is slightly less efficient in terms of space, but should allow us to expose these tables in the future with public methods on char.
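
To illustrate what a per-property table with its own lookup could look like, here is a rough sketch. It is an assumption for illustration only: the table is a small hand-picked excerpt of Default_Ignorable_Code_Point ranges, the names `DEFAULT_IGNORABLE_EXCERPT`, `in_ranges`, and `is_default_ignorable_excerpt` are hypothetical, and the plain binary search over ranges is not the encoding that `unicode-table-generator` actually emits.

```rust
use std::cmp::Ordering;

/// Hypothetical excerpt of Default_Ignorable_Code_Point, as inclusive
/// (start, end) ranges sorted by start. This is NOT the full property.
const DEFAULT_IGNORABLE_EXCERPT: &[(u32, u32)] = &[
    (0x00AD, 0x00AD), // SOFT HYPHEN
    (0x034F, 0x034F), // COMBINING GRAPHEME JOINER
    (0x115F, 0x1160), // HANGUL CHOSEONG FILLER, HANGUL JUNGSEONG FILLER
    (0x200B, 0x200F), // ZERO WIDTH SPACE through RIGHT-TO-LEFT MARK
    (0x3164, 0x3164), // HANGUL FILLER
    (0xFE00, 0xFE0F), // VARIATION SELECTOR-1 through VARIATION SELECTOR-16
];

/// Binary search for `c` in a sorted table of inclusive ranges.
fn in_ranges(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the code point
            } else if lo > cp {
                Ordering::Greater // range lies entirely above the code point
            } else {
                Ordering::Equal // code point falls inside this range
            }
        })
        .is_ok()
}

/// Hypothetical per-property query, mirroring the shape a future public
/// method on `char` might take.
fn is_default_ignorable_excerpt(c: char) -> bool {
    in_ranges(DEFAULT_IGNORABLE_EXCERPT, c)
}
```

Keeping one table per property means each query like this maps to exactly one table, at the cost of some duplicated ranges across properties.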

@rustbot label A-Unicode

@rustbot
Collaborator

rustbot commented Apr 19, 2026

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 19, 2026
@rustbot
Collaborator

rustbot commented Apr 19, 2026

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: @scottmcm, libs
  • @scottmcm, libs expanded to 7 candidates
  • Random selection from Mark-Simulacrum, jhpratt, scottmcm

@rustbot rustbot added the A-Unicode Area: Unicode label Apr 19, 2026
@Jules-Bertholet Jules-Bertholet force-pushed the riir-printable branch 2 times, most recently from ab09b17 to a799ecf Compare April 20, 2026 00:59
@Mark-Simulacrum Mark-Simulacrum added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Apr 26, 2026
@Mark-Simulacrum
Member

> There is one change in behavior: we now consider all characters with the Default_Ignorable_Code_Point property to be unprintable. These characters can be hidden/invisible otherwise.

Nominating for libs-api to FCP this. @Jules-Bertholet can you write up how that affects the public API of std? i.e., where is that unprintability used (only in Debug impls of str)?

Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Can you split out the re-ordering and renaming in char/methods.rs? It's very hard for me to review the diff when methods are moved around in the file. It also seems entirely unrelated to the core change here, and I'd rather have separate commits at least.

The changes look broadly reasonable though, I'd be happy to accept them if separated out (including maybe from the libs-api facing change).

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 26, 2026
@rustbot
Collaborator

rustbot commented Apr 26, 2026

Reminder, once the PR becomes ready for a review, use @rustbot ready.

And rename a struct field.
This gets rid of the `printable.py` script,
ensuring that `unicode-table-generator` handles all our
Unicode data table generation needs.

I've elected to give each Unicode property its own table,
instead of merging them all into one.
This is slightly less efficient in terms of space,
but should allow us to expose these tables in the future
with public methods on `char`.
These characters may be hidden/invisible otherwise.
@Jules-Bertholet
Contributor Author

Jules-Bertholet commented Apr 26, 2026

@rustbot ready

6079a98 is the libs-API-relevant change. It affects the Debug impls for char and the various string types, as well as the escape_debug() methods on char and str. The following characters are changed to be always escaped: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BDefault_Ignorable_Code_Point%7D-%5Cp%7BCf%7D-%5Cp%7BCn%7D

Note that we may also wish to stop escaping format control characters which are not default-ignorable. The list of characters this would affect: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BCf%7D-%5Cp%7BDefault_Ignorable_Code_Point%7D
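
For readers unfamiliar with the affected APIs, the sketch below shows how the `Debug` output and `escape_debug()` relate for inputs whose behavior does not change here. It assumes only stable standard-library methods; the helper name `debug_forms` is hypothetical. Note that the two forms are not guaranteed to match character for character in general (for example, quote handling differs), so the sketch only compares simple inputs.

```rust
/// Hypothetical helper: returns the `Debug` rendering of `s` and the
/// result of `str::escape_debug`, so the two can be compared side by side.
fn debug_forms(s: &str) -> (String, String) {
    let debug = format!("{s:?}"); // includes the surrounding double quotes
    let escaped = s.escape_debug().to_string(); // no surrounding quotes
    (debug, escaped)
}

/// The same comparison for a single `char`.
fn debug_forms_char(c: char) -> (String, String) {
    let debug = format!("{c:?}"); // includes the surrounding single quotes
    let escaped = c.escape_debug().to_string();
    (debug, escaped)
}
```

Both paths rely on the same printability data, which is why a change to that data shows up in all of them.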

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 26, 2026
@nia-e
Member

nia-e commented Apr 28, 2026

We discussed this in today's @rust-lang/libs-api meeting. It's a +1 from us, but we'd like someone with more Unicode knowledge to weigh in to be safe, so cc @Manishearth. (Off-topic: it would be nice to have an @rust-lang/unicode-knowers ping group, since these issues arise pretty often.)

@Manishearth
Member

I didn't look too closely, but this seems fine. From a quick look the printability concern is for debug output, and yes, being more conservative there makes sense.

@Manishearth
Member

While you shouldn't depend on ICU4X in the stdlib, it may be worth using ICU4X to get your Unicode properties instead of fetching them yourself. This does mean you are beholden to ICU4X for Unicode updates, though.

@Amanieu
Member

Amanieu commented May 5, 2026

@rfcbot merge libs-api

@Amanieu Amanieu removed the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label May 5, 2026
@rust-rfcbot
Collaborator

rust-rfcbot commented May 5, 2026

Team member @Amanieu has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rust-rfcbot rust-rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels May 5, 2026
Labels

A-Unicode Area: Unicode disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.

8 participants