How are Linux shells and filesystem Unicode-aware?

Question

I understand that Linux filesystem stores file names as byte sequences which is meant to be Unicode-encoding-independent.

But, encodings other than UTF-8 or Enhanced UTF-8 may very well use 0 byte as part of a multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code you terminate strings with 0 byte. So how does Linux filesystem support Unicode? Does it assume all applications that create filenames use UTF-8 only? But that is not true, is it?

Similarly, the shells (such as bash) use * in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for * and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with *, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.

But other encodings do not have that property, do they? So again, do shells assume UTF-8?

Neither I nor Google has heard of "Enhanced UTF-8", what is it? — zwol
@zwol modified UTF-8 differs from UTF-8 in how the null character is encoded - in modified, it is not the zero byte, but a specific two-byte sequence, this way, you can have zero-byte terminated strings of characters. — Mark Galeck

zwol zwol · Accepted Answer · 2016-08-15T01:29:52

It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).

It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:

The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 127 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)

UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.

Some programs may also require an additional property:

The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.

UTF-8 has this property, but (I think) EUC-JP doesn't.

It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)

How are Linux shells and filesystem Unicode-aware?

2 Answers