[8.0.0] Fix more Unicode bugs #24403
Merged
meteorcloudy merged 2 commits into bazelbuild:release-8.0.0 from Nov 20, 2024
* Use Latin-1 in many native file write rules for consistency with the internal encoding.
* Use Latin-1 for the resolved repository file and the JSON profile.
* Fix `unused_input_list` handling of non-ASCII characters in file names.
* Flip the `legacy_utf8` parameter of `repository_ctx.file` to `False` and make it a no-op. With the previous default, any non-ASCII characters would be written out as double-encoded UTF-8, which is not a useful choice.
* Change `repository_ctx.template` to operate on raw bytes for consistency with `repository_ctx.read` and to fix substitution with non-ASCII keys/values.
* Move some usages of `UTF_8` closer to their usage site to clarify why they are correct.
* Fix parsing of dependency files with Unicode character contents (`/showIncludes` and `.d` files).

Closes bazelbuild#24182.

PiperOrigin-RevId: 698111811
Change-Id: Ie43bab9eb5963bf81690dd8985d358f544a711c9

(cherry picked from commit 3fdec93)
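To make the "double-encoded UTF-8" problem concrete, here is a small Python sketch. It is not Bazel code. It assumes, as the description suggests, that the internal string keeps a file's raw UTF-8 bytes one byte per character (that is, the bytes decoded as Latin-1). The helper names `to_internal`, `write_legacy_utf8`, and `write_latin1` are hypothetical and exist only for this illustration.

```python
# Illustration only, not Bazel's implementation.
# Assumption: the internal string stores raw UTF-8 bytes, one byte per
# character, which is what decoding those bytes as Latin-1 produces.

def to_internal(text: str) -> str:
    """Hypothetical helper: build the Latin-1 'byte string' form of text."""
    return text.encode("utf-8").decode("latin-1")

def write_legacy_utf8(internal: str) -> bytes:
    """Encoding the internal string as UTF-8 encodes every original byte again."""
    return internal.encode("utf-8")

def write_latin1(internal: str) -> bytes:
    """Encoding as Latin-1 maps each character back to its original byte."""
    return internal.encode("latin-1")

name = "é.txt"
internal = to_internal(name)

double_encoded = write_legacy_utf8(internal)  # bytes are no longer valid for "é"
round_tripped = write_latin1(internal)        # identical to name.encode("utf-8")
```

Under this assumption, writing with Latin-1 reproduces the original UTF-8 bytes, while writing with UTF-8 turns the two bytes of `é` into four.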
meteorcloudy approved these changes on Nov 20, 2024
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request on Feb 8, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to the failing tests being disabled in CI:

`//tests/zip:unicode_test`:
```
  File "pkg\private\manifest.py", line 59, in read_entries_from
    raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
  File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
    got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists of checking whether the second byte is `0`, a case where the default UTF-8 decoding would fail, and assuming the file is UTF-16LE-encoded if so. The code is slightly reorganized to factor out the encoding selection.

This allows enabling the `//tests/mappings:utf8_manifest_test` and `//tests/zip:unicode_test` tests in Windows CI.
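The second-byte check described in this commit message could look roughly like the sketch below. This is not the actual rules_pkg change: the function names `guess_manifest_encoding` and `load_manifest` are hypothetical, and the sketch assumes the manifest has no byte order mark and begins with an ASCII character such as `[` or `{`.

```python
import json

def guess_manifest_encoding(raw: bytes) -> str:
    """Pick an encoding for a JSON manifest (hypothetical helper).

    JSON text starts with an ASCII character. In UTF-16LE that character is
    followed by a zero byte, which never happens in a UTF-8 JSON document.
    """
    if len(raw) >= 2 and raw[1] == 0:
        return "utf-16-le"
    return "utf-8"

def load_manifest(raw: bytes):
    """Decode the manifest bytes with the guessed encoding, then parse JSON."""
    return json.loads(raw.decode(guess_manifest_encoding(raw)))

# Usage: the same entries load from either encoding.
entries = [{"dest": "日本語.txt"}]
text = json.dumps(entries, ensure_ascii=False)
from_utf8 = load_manifest(text.encode("utf-8"))
from_utf16 = load_manifest(text.encode("utf-16-le"))
```

Keeping the encoding choice in its own function mirrors the commit's note about factoring out the encoding selection, so both the reader and the test helper can share it.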
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request on Feb 10, 2026
tonyaiuto added a commit to bazelbuild/rules_pkg that referenced this pull request on Feb 19, 2026
Fixes #24242