Skip to content

starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909

Closed
H5-O5 wants to merge 1 commit intobazelbuild:masterfrom
H5-O5:master
Closed

starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909
H5-O5 wants to merge 1 commit intobazelbuild:masterfrom
H5-O5:master

Conversation

@H5-O5
Copy link
Copy Markdown
Contributor

@H5-O5 H5-O5 commented Mar 5, 2026

Description

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.

Motivation

Fixes #28885

Build API Changes

  • Does this PR affect the Build API? (e.g. Starlark API, providers, command-line flags, native rules)

  • Is the change backward compatible?

    • No.
  • If it's a breaking change, what is the migration plan?

    • This specific case could be an exception where no migration plan is needed. Since:
        1. previous behavior on a UTF-8 file is clearly a bug.
        1. it should be pretty rare for this to affect other encodings as well.

Checklist

  • I have added tests for the new use cases (if any).
  • I have updated the documentation (if applicable).

Release Notes

RELNOTES[INC]: string.splitlines() no longer treat u+0085 as a newline character when internal_starlark_utf_8_byte_strings is set (which defaults to true)

@H5-O5 H5-O5 requested review from brandjon and tetromino as code owners March 5, 2026 23:47
@github-actions github-actions Bot added the awaiting-review PR is awaiting review from an assigned reviewer label Mar 5, 2026
# "包" is U+5305. UTF-8 is E5 8C 85.
# If Starlark treats each byte as a Latin1 char, then 85 is \u0085 (NEL).
# Java's Pattern.compile(".*") stops at \u0085.
assert_eq("abc包def".splitlines(), ["abc包def"])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a guarded test that verifies the proper splitting behavior with UTF-16 strings (see

for the pattern)? In that mode NEL should probably still be split at.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

if _utf8_byte_strings, then we got [\u0061 (a), (b) (c) \u00E5, \u008C, \u0085, (d), (e), (f)], and we should not split on \u0085 because it is part of a UTF-8 code
if not _utf8_byte_strings, then we have [(a) (b) (c) (包) (d) (e) (f)] and still we should not split on any codepoint.

Currently both bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1 and bazel test //src/test/java/net/starlark/java/eval:ScriptTest pass. I don't think we should add _utf8_byte_strings here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed that splitlines is explicitly specced to split on \n, \r, \r\n only. I thought that we would need to continue to split on an actual \u0085 character if not _utf8_byte_strings, but that's not true.

@fmeum
Copy link
Copy Markdown
Collaborator

fmeum commented Mar 6, 2026

@bazel-io fork 9.1.0

@fmeum
Copy link
Copy Markdown
Collaborator

fmeum commented Mar 6, 2026

@bazel-io fork 8.7.0

Copy link
Copy Markdown
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the core problem is that . in Java regex doesn't match NEL?

Thank you for the fix!

@tetromino tetromino added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Mar 6, 2026
@tetromino
Copy link
Copy Markdown
Contributor

I've imported the patch for internal review

@tetromino
Copy link
Copy Markdown
Contributor

This PR appears to break tests for some non-Bazel users of Starlark (e.g. Copybara specifically). Looking into it...

@iancha1992 iancha1992 added team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel and removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally labels Mar 6, 2026
@tetromino
Copy link
Copy Markdown
Contributor

It looks like some Copybara-based scripts internally at Google have come to depend on the incorrect old splitlines() behavior; please wait a bit, I'll need to update them

@copybara-service copybara-service Bot closed this in 77acc77 Mar 9, 2026
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Mar 9, 2026
…TF8 characters (bazelbuild#28909)

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.

Closes bazelbuild#28909.

RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a
newline character

PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Mar 9, 2026
…TF8 characters (bazelbuild#28909)

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.

Closes bazelbuild#28909.

RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a
newline character

PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
github-merge-queue Bot pushed a commit that referenced this pull request Mar 9, 2026
…itting UTF8 characters (#28909) (#28931)

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte
array mode.

Closes #28909.

RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085
(NEL) as a
newline character

PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5

Commit
77acc77

Co-authored-by: H5-O5 <[email protected]>
github-merge-queue Bot pushed a commit that referenced this pull request Mar 9, 2026
…itting UTF8 characters (#28909) (#28932)

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte
array mode.

Closes #28909.

RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085
(NEL) as a
newline character

PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5

Commit
77acc77

Co-authored-by: H5-O5 <[email protected]>
eukia Bot pushed a commit to albertocavalcante/starlark-java that referenced this pull request Mar 22, 2026
…TF8 characters (bazelbuild/bazel#28909)

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.

Closes #28909.

RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a
newline character

PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
Bazel-Commit=77acc77e4a0de240b193954acdfb657f6a01afbe
Upstream-Source=bazelbuild/bazel
GitOrigin-RevId: 77acc77e4a0de240b193954acdfb657f6a01afbe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode

4 participants