[8.0.0] Force the JVM to use UTF-8 on Windows #24231

Merged: iancha1992 merged 1 commit into bazelbuild:release-8.0.0 from bazel-io:cp24172-8.0.0 on Nov 7, 2024

bazel-io commented Nov 6, 2024

This change patches the app manifest of the `java.exe` launcher in the embedded JDK to always use the UTF-8 code page on Windows 10 version 1903 and later.

This is necessary because the launcher sets `sun.jnu.encoding` to the system code page, which by default is a legacy code page such as Cp1252 on Windows. As a result, the JVM cannot interact with files whose paths contain Unicode characters that are not representable in the system code page, nor can it handle command-line arguments and environment variables containing such characters.
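
To illustrate what "not representable in the system code page" means, here is a minimal Python sketch that is not part of this change. It uses Python's `cp1252` codec as a stand-in for the legacy Windows code page; the helper name `representable_in` and the sample paths are hypothetical.

```python
def representable_in(text: str, codepage: str = "cp1252") -> bool:
    """Return True if every character of `text` can be encoded in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample paths, for illustration only.
samples = ["C:\\src\\café\\BUILD", "C:\\src\\日本語\\BUILD"]
for path in samples:
    print(path, representable_in(path), representable_in(path, "utf-8"))
```

The second path cannot be encoded in Cp1252 but can be encoded in UTF-8, which is the kind of path this change is meant to make usable.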

The Windows VMs in CI are not yet running Windows 1903 or later, so this change can currently only be tested locally by running `bazel info character-encoding` and verifying that it prints `sun.jnu.encoding = UTF-8`.
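
The local check could be scripted roughly as follows. This is a sketch under assumptions: the helper name `jnu_encoding` is hypothetical, and the only output fragment taken from the description above is `sun.jnu.encoding = UTF-8`; the rest of the command's output format is not assumed.

```python
import re
from typing import Optional


def jnu_encoding(info_output: str) -> Optional[str]:
    """Extract the value reported for sun.jnu.encoding, or None if it is absent."""
    match = re.search(r"sun\.jnu\.encoding\s*=\s*([\w.-]+)", info_output)
    return match.group(1) if match else None


# Sample text: only this fragment is taken from the PR description.
sample_output = "sun.jnu.encoding = UTF-8"
print(jnu_encoding(sample_output) == "UTF-8")
```

On a real machine, the text passed to `jnu_encoding` could be obtained with `subprocess.run(["bazel", "info", "character-encoding"], capture_output=True, text=True).stdout`.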

Work towards #374
Work towards #18293
Work towards #23859

Closes #24172.

PiperOrigin-RevId: 693466466
Change-Id: I4914c21e846493a8880ac8c6f5e1afa9fae87366

Commit 7bb8d2b

bazel-io requested a review from a team as a code owner November 6, 2024 21:32
bazel-io added the team-OSS and awaiting-review labels Nov 6, 2024
bazel-io requested a review from meteorcloudy November 6, 2024 21:32
iancha1992 enabled auto-merge November 6, 2024 21:35
iancha1992 added this pull request to the merge queue Nov 7, 2024
Merged via the queue into bazelbuild:release-8.0.0 with commit 9e1b87f Nov 7, 2024
github-actions bot removed the awaiting-review label Nov 7, 2024
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 8, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to the following failing tests being disabled in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, its first character is ASCII, so the fix
consists of checking whether the second byte is `0`. In that case the default
UTF-8 decoding would fail, and we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows the `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests to be enabled in Windows CI.
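
The second-byte check described in this commit message could look roughly like the sketch below. It is not the actual rules_pkg code: the function names `guess_manifest_encoding` and `load_manifest` and the sample entry are hypothetical, and the sketch assumes the file has no byte-order mark.

```python
import json


def guess_manifest_encoding(raw: bytes) -> str:
    """Pick a codec for a JSON manifest from its first two bytes."""
    # JSON starts with an ASCII character, so in UTF-16LE the second byte is 0.
    if len(raw) >= 2 and raw[1] == 0:
        return "utf-16-le"
    return "utf-8"


def load_manifest(raw: bytes):
    """Decode the raw bytes with the guessed codec and parse them as JSON."""
    return json.loads(raw.decode(guess_manifest_encoding(raw)))


# Hypothetical manifest entry with a non-ASCII name, for illustration only.
entries = [{"dest": "héllo.txt"}]
text = json.dumps(entries, ensure_ascii=False)
print(load_manifest(text.encode("utf-8")) == entries)
print(load_manifest(text.encode("utf-16-le")) == entries)
```

Keeping the encoding choice in its own function mirrors the "factor out the encoding selection" remark above, so that every reader of the manifest can share the same rule.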
tonyaiuto added a commit to bazelbuild/rules_pkg that referenced this pull request Feb 19, 2026