Description
export LC_ALL=C is a reasonably popular way to "disable" anything locale-related. When dealing with sort, it is even recommended. It's what some sysadmins do to just "get rid of" locales. I use it to avoid seeing German-localized error messages when comparing GNU coreutils and uutils. Therefore, as long as we don't support locales, we should probably behave the same as GNU coreutils does under LC_ALL=C. That's how I ran into this bug by accident.
GNU coreutils ls treats any byte with its eighth bit set in a special way (not as a "control character", but as … something else again). Just try ä, \u00a0, €, 😄, 🫛, \uFFFD, or bytes.fromhex("f49fbfbf") (a theoretically valid four-byte UTF-8 sequence that encodes a number outside the Unicode range).
Example:
$ export LC_ALL=C
$ ~/workspace/gnu/src/ls -R dir_*_0xc3a4 dir_*_0xf09f9884
'dir_'$'\303\244''_0xc3a4':
'file_'$'\303\244''_0xc3a4'
'dir_'$'\360\237\230\204''_0xf09f9884':
'file_'$'\360\237\230\204''_0xf09f9884'
$ cargo run -q ls -NR dir_*_0xc3a4 dir_*_0xf09f9884
dir_ä_0xc3a4:
file_ä_0xc3a4
dir_😄_0xf09f9884:
file_😄_0xf09f9884
Here's how to create the above "evil" files and many more:
#!/usr/bin/env python3
# How to run:
# $ rm -rf evil/ && mkdir evil && ./make_evil.py
import os
def evil_byte_sequences():
    # Skip NUL, as it is forbidden by Linux
    for i in range(0x01, 0x30 + 1):
        if i == 0x2F:
            # Skip slash, as it is forbidden by Linux
            continue
        yield bytes([i])
    # Skip most digits, as they are uninteresting.
    for i in range(0x3A, 0x41 + 1):
        yield bytes([i])
    # Skip most uppercase letters, as they are uninteresting.
    for i in range(0x5A, 0x61 + 1):
        yield bytes([i])
    # Skip most lowercase letters, as they are uninteresting.
    for i in range(0x7A, 0x80):
        yield bytes([i])
    for i in range(0x80, 0x85 + 1):
        yield bytes([i])
    # Skip most invalid UTF8-sequences, as they are uninteresting.
    yield bytes([0xFE])
    yield bytes([0xFF])
    yield "ä".encode()  # two-byte
    yield "\u00a0".encode()  # special space
    yield "€".encode()  # three-byte
    yield "😄".encode()  # four-byte
    yield "🫛".encode()  # four-byte, very recent emoji
    yield "\uFFFD".encode()  # three-byte replacement character (U+FFFD)
    yield bytes.fromhex("f49fbfbf")  # theoretically-valid four-byte sequence
def run():
    for b in evil_byte_sequences():
        dirname = b"dir_" + b + b"_0x" + b.hex().encode()
        filename = b"file_" + b + b"_0x" + b.hex().encode()
        print(f"{dirname=}, {filename=}")
        os.mkdir(b"evil/" + dirname)
        # Create an empty file inside the new directory.
        with open(b"evil/" + dirname + b"/" + filename, "w") as fp:
            pass
if __name__ == "__main__":
    run()
Be careful, these files can have funny effects and can sometimes be quite annoying to deal with!
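For reference, the GNU output in the example above can be roughly approximated with a few lines of Python. This is only a sketch of the observed behavior, not GNU's actual quoting code: the names `is_plain` and `approx_shell_escape` are made up for illustration, and it deliberately ignores everything except the split between printable ASCII and all other bytes (GNU handles single quotes, spaces, and control characters such as newlines differently).

```python
from itertools import groupby


def is_plain(byte: int) -> bool:
    # Assumption: under LC_ALL=C only printable ASCII counts as "plain".
    return 0x20 <= byte <= 0x7E


def approx_shell_escape(name: bytes) -> str:
    """Hypothetical helper: mimic the quoting seen in the GNU ls output above."""
    if all(is_plain(b) for b in name):
        # Simplification: real GNU ls still quotes some plain names (e.g. with spaces).
        return name.decode("ascii")
    parts = []
    for plain, group in groupby(name, key=is_plain):
        chunk = bytes(group)
        if plain:
            # Runs of printable ASCII are wrapped in ordinary single quotes.
            parts.append("'" + chunk.decode("ascii") + "'")
        else:
            # Every other byte becomes a three-digit octal escape inside $'...'.
            parts.append("$'" + "".join(f"\\{b:03o}" for b in chunk) + "'")
    return "".join(parts)


print(approx_shell_escape("dir_ä_0xc3a4".encode()))
# prints: 'dir_'$'\303\244''_0xc3a4'
```

Matching this for arbitrary names would need the rest of GNU's quoting rules, which this sketch does not attempt.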
This probably affects most utilities and not just ls, but I'll mark only ls because that's where I found it.
Found while reviewing #6559.