splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode #28885

@H5-O5

Description of the bug:

When the built-in Starlark method string.splitlines() is called on a string containing certain non-ASCII UTF-8 characters, the string is incorrectly split into multiple lines even though it contains no actual line break (\n, \r). This happens in Gazelle's Bzlmod extensions, where Starlark files are processed using Bazel's internal "Latin1 byte-string" representation.

# A character whose UTF-8 encoding contains the byte 0x85
# "包" is U+5305. Its UTF-8 encoding is E5 8C 85.
# If Starlark treats each byte as a Latin1 char, then 0x85 becomes \u0085 (NEL).
# Java's Pattern.compile(".*") stops at \u0085.
assert_eq("abc包def".splitlines(), ["abc包def"])
# Error: assert_eq: ["abc�"] != ["abc包def"]
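
To make the byte arithmetic in the comments above concrete, here is a minimal standalone Java sketch. The class name Latin1View and its helpers are hypothetical and are not part of Bazel; the sketch only assumes that the "Latin1 byte-string" representation behaves like decoding UTF-8 bytes one byte per char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not Bazel code.
class Latin1View {
    // Encode to UTF-8, then decode every byte as one Latin-1 char.
    static String utf8AsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Render each char of the string as a two-digit hex code.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String view = utf8AsLatin1("\u5305"); // the character from the report, U+5305
        System.out.println(hex(view));                      // E5 8C 85
        System.out.println(view.charAt(2) == (char) 0x85);  // true: the third char is NEL
    }
}
```

The third char of the Latin-1 view is U+0085, which is why a regex engine that recognizes NEL as a line terminator sees a break inside a single CJK character.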

The bug is in StringModule.java, which uses the regex pattern (?<line>.*)(?<break>(\r\n|\r|\n)?) to identify lines.

  1. Java's regex engine treats \u0085 (NEL) as a line terminator. By default, the . in .* matches any character except a line terminator, so the match stops when it hits 0x85.
  2. This causes splitlines() to see a line break where none exists in the original UTF-8 text.
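
The two points above can be reproduced outside Bazel with a short Java sketch. The class NelRegexDemo and its members are hypothetical names used only for illustration. DOT_PATTERN has the same shape as the pattern quoted in this report. CLASS_PATTERN, which replaces .* with the negated class [^\r\n]*, is one possible workaround under the assumption that only \r and \n should end a line; it is not necessarily the fix Bazel would adopt.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not Bazel code.
class NelRegexDemo {
    // Same shape as the pattern quoted in the issue.
    static final Pattern DOT_PATTERN =
        Pattern.compile("(?<line>.*)(?<break>(\\r\\n|\\r|\\n)?)");

    // Assumed workaround: only CR and LF terminate the line group.
    static final Pattern CLASS_PATTERN =
        Pattern.compile("(?<line>[^\\r\\n]*)(?<break>(\\r\\n|\\r|\\n)?)");

    // Encode to UTF-8, then decode every byte as one Latin-1 char.
    static String utf8AsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Return the "line" group of the first match, or null if nothing matches.
    static String firstLine(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group("line") : null;
    }

    public static void main(String[] args) {
        String s = utf8AsLatin1("abc\u5305def"); // 9 Latin-1 chars, the 6th is U+0085
        System.out.println(s.length());                           // 9
        System.out.println(firstLine(DOT_PATTERN, s).length());   // 5: stops before U+0085
        System.out.println(firstLine(CLASS_PATTERN, s).length()); // 9: whole string is one line
        System.out.println(firstLine(CLASS_PATTERN, "a\r\nb"));   // a: real breaks still split
    }
}
```

Note that Pattern.UNIX_LINES would not be a drop-in alternative here: in that mode only \n is treated as a line terminator, so .* would also consume \r and the \r and \r\n alternatives in the break group would no longer split lines.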

Judging from the context around this comment in the source:

// This is used instead of LATIN1_WHITESPACE when strings are represented as raw UTF-8 byte

I think that the current behavior is a bug.

Which category does this issue belong to?

Starlark Interpreter

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Add

assert_eq("abc包def".splitlines(), ["abc包def"])

to src/test/java/net/starlark/java/eval/testdata/string_splitlines.star then run

bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response
