splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode

### Description of the bug:


When using the built-in Starlark method `string.splitlines()`, strings containing certain non-ASCII characters are incorrectly split, even though they contain no actual line breaks (`\n`, `\r`). The affected characters are those whose UTF-8 encoding includes the byte `0x85`. This happens in [Gazelle's Bzlmod extensions](https://github.com/bazel-contrib/bazel-gazelle/blob/a1055c5e56fd985bdc23f387107dcd5be833c14b/internal/bzlmod/go_mod.bzl#L70), where Starlark files are processed using Bazel's internal "Latin1 byte-string" representation.


``` python
# A character whose UTF-8 encoding contains the byte 0x85.
# "包" is U+5305; its UTF-8 encoding is E5 8C 85.
# If Starlark treats each byte as a Latin1 character, the byte 85 becomes \u0085 (NEL).
# In Java, the `.*` in Pattern.compile(".*") stops at \u0085.
assert_eq("abc包def".splitlines(), ["abc包def"])
# Error: assert_eq: ["abc�"] != ["abc包def"]
```
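The byte-level reasoning in the comments above can be checked outside Bazel. The following is a minimal Python sketch, assuming that decoding UTF-8 bytes as Latin-1 is a fair model of Bazel's internal representation. The helper name `as_latin1_view` is hypothetical and does not exist in Bazel.

```python
def as_latin1_view(text: str) -> str:
    """Reinterpret the UTF-8 bytes of `text` as one Latin-1 character per byte.

    Assumption: this models Bazel's "Latin1 byte-string" representation.
    """
    return text.encode("utf-8").decode("latin-1")


# "包" (U+5305) occupies three bytes in UTF-8, the last of which is 0x85.
utf8_bytes = "包".encode("utf-8")

# Under the Latin-1 view, the 7-character string becomes 9 characters,
# and the byte 0x85 shows up as the character U+0085 (NEL).
latin1_view = as_latin1_view("abc包def")
nel_index = latin1_view.find("\x85")
```

Here `nel_index` is the position at which a regex engine that treats U+0085 as a line terminator would stop matching.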

The bug is located in `StringModule.java`, which uses the regex pattern `(?<line>.*)(?<break>(\r\n|\r|\n)?)` to identify lines.

1. Java's regex engine treats `\u0085` as a line terminator. The `.*` quantifier, which matches everything except line terminators, therefore stops matching when it reaches `0x85`.
2. As a result, `splitlines()` sees a line break where none exists in the original UTF-8 text.
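The two steps above can be approximated in Python. Python's `.` excludes only `\n`, so this sketch spells out Java's default line terminators as an explicit character class (`JAVA_DOT`) and compares it with a class that excludes only `\r` and `\n` (`CR_LF_ONLY`). All names are hypothetical, and the loop that consumes matches and stops at the first empty match is an assumption about how the pattern is used. This is not the code in `StringModule.java`.

```python
import re

# Models Java's `.` without DOTALL or UNIX_LINES: it refuses to match
# \n, \r, \u0085, \u2028 and \u2029.
JAVA_DOT = "[^\n\r\x85\u2028\u2029]"
# Hypothetical alternative: only \r and \n count as line boundaries.
CR_LF_ONLY = "[^\r\n]"


def split_lines(text, any_char):
    """Split `text` with the issue's pattern, using `any_char` in place of `.`."""
    pattern = re.compile("(?P<line>" + any_char + "*)(?P<break>(\r\n|\r|\n)?)")
    lines = []
    for match in pattern.finditer(text):
        line = match.group("line")
        line_break = match.group("break")
        # Assumed stop condition: an entirely empty match ends the scan.
        if not line and not line_break:
            break
        lines.append(line)
    return lines


# UTF-8 bytes of the test string, viewed as one Latin-1 character per byte.
latin1_view = "abc包def".encode("utf-8").decode("latin-1")

java_like = split_lines(latin1_view, JAVA_DOT)
cr_lf_only = split_lines(latin1_view, CR_LF_ONLY)
```

Under these assumptions, `java_like` is cut off just before the `0x85` byte, which is consistent with the truncated `["abc�"]` in the error above, while `cr_lf_only` keeps the string intact as a single line.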

From the context around https://github.com/bazelbuild/bazel/blob/1c4cb2c8862149e6aa78871a68f869f0763dd6ff/src/main/java/net/starlark/java/eval/StringModule.java#L210, I think the current behavior is a bug.

### Which category does this issue belong to?

Starlark Interpreter

### What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.


Add

`assert_eq("abc包def".splitlines(), ["abc包def"])`

to `src/test/java/net/starlark/java/eval/testdata/string_splitlines.star` then run

`bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1`

### Which operating system are you running Bazel on?


_No response_

### What is the output of `bazel info release`?


_No response_

### If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.


_No response_

### What's the output of `git remote get-url origin; git rev-parse HEAD` ?


```text

```

### If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.


_No response_

### Have you found anything relevant by searching the web?

_No response_

### Any other information, logs, or outputs that you want to share?

_No response_

splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode #28885
