Description of the bug:
When using the built-in Starlark function string.splitlines(), strings containing certain non-ASCII UTF-8 characters (those whose encoding contains the byte 0x85) are incorrectly split into multiple lines, even when no actual line breaks (\n, \r) are present. This happens in Gazelle's Bzlmod extensions, where Starlark files are processed using Bazel's internal "Latin1 byte-string" representation.
# A UTF-8 character whose encoding contains the byte 0x85
# "包" is U+5305. UTF-8 is E5 8C 85.
# If Starlark treats each byte as a Latin1 char, then 85 is \u0085 (NEL).
# Java's Pattern.compile(".*") stops at \u0085.
assert_eq("abc包def".splitlines(), ["abc包def"])
# Error: assert_eq: ["abc�"] != ["abc包def"]
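The byte-level reasoning in the comments above can be checked outside Bazel with a small Java sketch. It only re-reads the UTF-8 bytes of U+5305 as ISO-8859-1, which is my assumption for what the "Latin1 byte-string" representation amounts to. The class and method names (Latin1View, toLatin1, codePoints) are made up for illustration and are not Bazel APIs.

```java
import java.nio.charset.StandardCharsets;

public class Latin1View {
  /** Re-reads the UTF-8 bytes of s as ISO-8859-1, giving one char per byte. */
  public static String toLatin1(String s) {
    return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
  }

  /** Formats each char of s as U+XXXX, separated by spaces. */
  public static String codePoints(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("U+%04X", (int) c));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // U+5305 encodes to the bytes E5 8C 85; the last one becomes U+0085 (NEL).
    System.out.println(codePoints(toLatin1("\u5305"))); // prints "U+00E5 U+008C U+0085"
  }
}
```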
The bug is located in StringModule.java. It uses the regex pattern (?<line>.*)(?<break>(\r\n|\r|\n)?) to identify lines.
- Java's regex engine treats \u0085 (NEL) as a line terminator by default. Because . matches any character except a line terminator, .* stops matching when it reaches 0x85.
- This causes splitlines() to see a line break where none exists in the original UTF-8 text.
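The regex behavior described above can be reproduced in isolation. The sketch below applies the pattern quoted in this report to the Latin1 view of "abc包def" and compares it with a pattern whose line body excludes only \r and \n. Everything except the quoted pattern is an assumption: the class and member names are hypothetical, the loop is a simplified stand-in rather than a copy of Bazel's code (its stop condition was chosen to be consistent with the single-element output shown in the report), and [^\r\n]* is just one possible alternative, not necessarily the fix Bazel should adopt.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitLinesDemo {
  // Pattern quoted in this report: '.' excludes every default Java line
  // terminator, including U+0085.
  public static final Pattern REPORTED =
      Pattern.compile("(?<line>.*)(?<break>(\\r\\n|\\r|\\n)?)");

  // Hypothetical alternative: the line body excludes only CR and LF.
  public static final Pattern CANDIDATE =
      Pattern.compile("(?<line>[^\\r\\n]*)(?<break>(\\r\\n|\\r|\\n)?)");

  /** Re-reads the UTF-8 bytes of s as ISO-8859-1 (assumed Latin1 view). */
  public static String toLatin1(String s) {
    return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
  }

  /** Simplified splitting loop; stops at the first completely empty match. */
  public static List<String> split(Pattern pattern, String text) {
    List<String> lines = new ArrayList<>();
    Matcher m = pattern.matcher(text);
    while (m.find()) {
      String line = m.group("line");
      String lineBreak = m.group("break");
      if (line.isEmpty() && lineBreak.isEmpty()) {
        break; // assumed stop condition, see the note above
      }
      lines.add(line);
    }
    return lines;
  }

  public static void main(String[] args) {
    String input = toLatin1("abc\u5305def"); // 9 chars, one per UTF-8 byte
    // The reported pattern cuts the line off just before U+0085.
    System.out.println(split(REPORTED, input).get(0).length()); // prints 5
    // The candidate pattern keeps the whole input as one line.
    System.out.println(split(CANDIDATE, input).get(0).length()); // prints 9
  }
}
```

Both patterns agree on input that contains only real \n, \r, or \r\n breaks, so the difference is limited to the extra terminators that . excludes.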
From the context around this comment in StringModule.java:
// This is used instead of LATIN1_WHITESPACE when strings are represented as raw UTF-8 byte
I think that the current behavior is a bug.
Which category does this issue belong to?
Starlark Interpreter
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Add
assert_eq("abc包def".splitlines(), ["abc包def"])
to src/test/java/net/starlark/java/eval/testdata/string_splitlines.star then run
bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1
Which operating system are you running Bazel on?
No response
What is the output of bazel info release?
No response
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
Referenced code: bazel/src/main/java/net/starlark/java/eval/StringModule.java, line 210 in 1c4cb2c