starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters by H5-O5 · Pull Request #28909 · bazelbuild/bazel

H5-O5 · 2026-03-05T23:47:16Z

Description

string.splitlines() should not split on U+0085 (NEL) in UTF-8 byte-string mode, where each UTF-8 byte is stored as a Latin-1 character.

Motivation

Fixes #28885

Build API Changes

Does this PR affect the Build API? (e.g. Starlark API, providers, command-line flags, native rules)
- Yes. The Starlark API string.splitlines() has changed. See "splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode" (#28885).
Is the change backward compatible?
- No.
If it's a breaking change, what is the migration plan?
- This specific case could be an exception where no migration plan is needed, since:
  - 1. The previous behavior on a UTF-8 file is clearly a bug.
  - 2. It should be pretty rare for this to affect other encodings as well.

Checklist

I have added tests for the new use cases (if any).
I have updated the documentation (if applicable).

Release Notes

RELNOTES[INC]: string.splitlines() no longer treats U+0085 as a newline character when internal_starlark_utf_8_byte_strings is set (which defaults to true)


fmeum · 2026-03-06T08:06:43Z

+# "包" is U+5305. UTF-8 is E5 8C 85.
+# If Starlark treats each byte as a Latin1 char, then 85 is \u0085 (NEL).
+# Java's Pattern.compile(".*") stops at \u0085.
+assert_eq("abc包def".splitlines(), ["abc包def"])
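
The byte arithmetic in the test comment can be checked directly. The following Java sketch is not part of the PR, and the class and method names are made up for illustration. It encodes "包" (written as the escape \u5305) as UTF-8 and then reads those bytes back as Latin-1, one char per byte, which mimics the byte-string representation the comment describes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Latin1ViewDemo {
    /** Re-reads the UTF-8 bytes of s as Latin-1, producing one char per byte. */
    public static String utf8BytesAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+5305 encodes to the three bytes E5 8C 85.
        String view = utf8BytesAsLatin1("\u5305");
        for (char c : view.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // Prints U+00E5, U+008C, U+0085; the last char is NEL.
    }
}
```

The third char of the Latin-1 view is U+0085, which is why a line splitter that treats NEL as a terminator cuts the character in half.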


Could you add a guarded test that verifies the proper splitting behavior with UTF-16 strings? See line 33 of bazel/src/test/java/net/starlark/java/eval/testdata/json.star in 9b2964a (if _utf8_byte_strings else) for the pattern. In that mode we should probably still split at NEL.

I don't understand.

If _utf8_byte_strings is set, then we get [\u0061 (a), (b), (c), \u00E5, \u008C, \u0085, (d), (e), (f)], and we should not split on \u0085 because it is part of a UTF-8 encoded code point.
If _utf8_byte_strings is not set, then we have [(a), (b), (c), (包), (d), (e), (f)], and we still should not split on any code point.

Currently both bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1 and bazel test //src/test/java/net/starlark/java/eval:ScriptTest pass. I don't think we should add _utf8_byte_strings here.

Sorry, I missed that splitlines is explicitly specced to split on \n, \r, \r\n only. I thought that we would need to continue to split on an actual \u0085 character if not _utf8_byte_strings, but that's not true.
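
To make the specced behavior concrete, here is a minimal Java sketch of a splitter that breaks only on \n, \r, and \r\n. It is an illustration under that assumption, not the interpreter's actual implementation. SplitLinesSketch and splitLines are hypothetical names, and the keepends option is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the code changed by the PR.
final class SplitLinesSketch {
    /** Splits s at \n, \r, and \r\n only; terminators are dropped. */
    public static List<String> splitLines(String s) {
        List<String> lines = new ArrayList<>();
        int n = s.length();
        int start = 0; // index where the current line begins
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r') {
                int next = i + 1;
                // Treat \r\n as a single terminator.
                if (c == '\r' && next < n && s.charAt(next) == '\n') {
                    next++;
                }
                lines.add(s.substring(start, i));
                start = next;
                i = next;
            } else {
                i++;
            }
        }
        // A trailing segment without a terminator is still a line.
        if (start < n) {
            lines.add(s.substring(start));
        }
        return lines;
    }
}
```

Because the loop compares only against \n and \r, a \u0085 char passes through untouched in either string representation.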

fmeum · 2026-03-06T08:07:01Z

@bazel-io fork 9.1.0

fmeum · 2026-03-06T08:07:09Z

@bazel-io fork 8.7.0

tetromino

So the core problem is that . in Java regex doesn't match NEL?
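
For reference, java.util.regex.Pattern documents that . matches any character except a line terminator by default, and that NEL (U+0085) counts as a line terminator unless UNIX_LINES is set (DOTALL makes . match terminators too). A small sketch to confirm this follows. It is illustrative only: DotNelDemo and dotMatches are hypothetical names, and this is not the code changed by the PR.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class DotNelDemo {
    public static final String NEL = String.valueOf((char) 0x85);

    /** Returns true if the regex "." matches all of s under the given flags. */
    public static boolean dotMatches(String s, int flags) {
        return Pattern.compile(".", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatches(NEL, 0));                   // false: NEL is a line terminator
        System.out.println(dotMatches(NEL, Pattern.UNIX_LINES));  // true: only \n terminates lines
        System.out.println(dotMatches(NEL, Pattern.DOTALL));      // true: . matches everything
        System.out.println(dotMatches("\n", Pattern.UNIX_LINES)); // false: \n still terminates
    }
}
```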

Thank you for the fix!

tetromino · 2026-03-06T17:57:39Z

I've imported the patch for internal review

tetromino · 2026-03-06T18:36:49Z

This PR appears to break tests for some non-Bazel users of Starlark (e.g. Copybara specifically). Looking into it...

tetromino · 2026-03-06T19:43:09Z

It looks like some Copybara-based scripts internally at Google have come to depend on the incorrect old splitlines() behavior. Please wait a bit; I'll need to update them.

…TF8 characters (bazelbuild#28909)
string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.
Closes bazelbuild#28909.
RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a newline character
PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5

…itting UTF8 characters (#28909) (#28931)
string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.
Closes #28909.
RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a newline character
PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
Commit 77acc77
Co-authored-by: H5-O5 <[email protected]>

…itting UTF8 characters (#28909) (#28932)
string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.
Closes #28909.
RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a newline character
PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
Commit 77acc77
Co-authored-by: H5-O5 <[email protected]>

…TF8 characters (bazelbuild/bazel#28909)
string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.
Closes #28909.
RELNOTES[INC]: string.splitlines() no longer incorrectly treats u+0085 (NEL) as a newline character
PiperOrigin-RevId: 880902483
Change-Id: Id7c26c84eb50259c576036f333b0ccdb83f681d5
Bazel-Commit=77acc77e4a0de240b193954acdfb657f6a01afbe
Upstream-Source=bazelbuild/bazel
GitOrigin-RevId: 77acc77e4a0de240b193954acdfb657f6a01afbe

starlark interpreter: Fix string.splitlines() incorrectly splitting U…

981404e

…TF-8 characters

H5-O5 requested review from brandjon and tetromino as code owners March 5, 2026 23:47

github-actions Bot added the awaiting-review PR is awaiting review from an assigned reviewer label Mar 5, 2026

fmeum reviewed Mar 6, 2026


This was referenced Mar 6, 2026

[9.1.0] starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters #28912

Closed

[8.7.0] starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters #28913

Closed

fmeum approved these changes Mar 6, 2026


tetromino approved these changes Mar 6, 2026


tetromino added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Mar 6, 2026

iancha1992 added team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel and removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally labels Mar 6, 2026

copybara-service Bot closed this in 77acc77 Mar 9, 2026

bazel-io mentioned this pull request Mar 9, 2026

[9.1.0] starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters (https://github.com/bazelbuild/bazel/pull/28909) #28931

Merged

bazel-io mentioned this pull request Mar 9, 2026

[8.7.0] starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters (https://github.com/bazelbuild/bazel/pull/28909) #28932

Merged

H5-O5 wants to merge 1 commit into bazelbuild:master from H5-O5:master

4 participants
