Is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1?

Question

I came across the following text on the "Details of the String Type" page of the PHP Manual:

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. String will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.

So my question is: is it true that string literals in PHP can only be encoded in an encoding that is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, and not in one that is not?

Is it possible to encode string literals in PHP in a non-ASCII-compatible encoding such as UTF-16 or UTF-32? If yes, will string literals encoded that way work with the mb_* (mbstring) functions? If no, what is the reason?

Suppose Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, or to some non-ASCII-compatible encoding. Can I then declare an encoding that is not a compatible superset of ASCII, such as UTF-16 or UTF-32, in the script file?

If yes, in which encoding would the string literals end up? If no, what is the reason?

Also, please explain how encoding works for string literals when Zend Multibyte is enabled.

How do I enable Zend Multibyte? What is the main purpose of turning it on, and when is it required?

It would be helpful if the answers came with suitable examples.

Thank You.

deceze · Accepted Answer · 2018-10-11T13:42:44

String literals in PHP source code files are taken literally as the raw bytes which are present in the source code file. If you have bytes in your source code which represent a UTF-16 string or anything else really, then you can use them directly:

$ echo -n '<?php echo "' > test.php
$ echo -n 日本語 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php 
<?php echo "??e?g,??";
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 65e5  <?php echo "..e.
00000010: 672c 8a9e 223b 0a                        g,..";.
$ php test.php 
??e?g,??$ 
$ php test.php | iconv -f UTF-16
日本語

This demonstrates a source code file ostensibly written in ASCII, but containing a UTF-16 string literal in the middle, which is output as is.

The bigger problem with this kind of source code is that it's difficult to work with. It's somewhere between a pain in the neck and impossible to get a text editor to treat the PHP code in one encoding and string literals in another. So typically, you want to keep the entire source code, including string literals, in one and the same encoding throughout.

You can also easily get into trouble:

$ echo -n '<?php echo "' > test.php
$ echo -n 漢字 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 6f22  <?php echo "..o"
00000010: 5b57 223b 0a                             [W";.

"漢字" here is encoded to feff 6f22 5b57, which contains the byte 0x22, i.e. ", the string literal terminator. The literal therefore ends early and you now have a syntax error.

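The same collision can be checked from PHP itself. This is only a minimal sketch, assuming PHP 7.0+ (for the \u{...} escapes) and the iconv extension; it asks for UTF-16BE explicitly so that no byte-order mark is added, and the variable names are purely illustrative.

```php
<?php
// "\u{6F22}\u{5B57}" is 漢字, written with escapes so this file stays pure ASCII.
$utf8  = "\u{6F22}\u{5B57}";
// UTF-16BE: big-endian, no byte-order mark prepended.
$utf16 = iconv('UTF-8', 'UTF-16BE', $utf8);

echo bin2hex($utf16), "\n";                   // 6f225b57
// The second byte, 0x22, is the ASCII double quote.
var_dump(strpos($utf16, '"'));                // int(1)
// addcslashes() shows where a backslash would be needed inside a "..." literal.
echo bin2hex(addcslashes($utf16, '"')), "\n"; // 6f5c225b57
```

The last line only illustrates the escaping idea for this particular byte sequence; other bytes that are special inside double quotes, such as a backslash or a dollar sign, would need the same care.
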
By default the PHP interpreter expects PHP code to be ASCII-compatible, so if you want to keep your string literals and the rest of the source code in the same encoding, you're pretty much limited to ASCII-compatible encodings. However, the Zend Multibyte feature (the zend.multibyte setting) allows you to use other encodings if you declare the encoding used (in php.ini if it's not ASCII-compatible). So you could write your source code in, say, Shift-JIS throughout; probably even with string literals in some other encoding*.

* (At which point I'll quit going into details because what is wrong with you?!)
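
As a rough sketch of how to see what a given installation is doing: zend.multibyte and zend.script_encoding are the php.ini directives involved, and a script can report their current values with ini_get(). The output depends entirely on the local configuration, so none is shown here.

```php
<?php
// Whether the engine will parse scripts written in multibyte encodings.
// An empty string or "0" means the feature is off (the default).
$multibyte = ini_get('zend.multibyte');
// Encoding assumed for scripts that do not declare one; empty if unset.
$scriptEncoding = ini_get('zend.script_encoding');

printf("zend.multibyte       = %s\n", var_export($multibyte, true));
printf("zend.script_encoding = %s\n", var_export($scriptEncoding, true));
```

Turning the feature on means setting zend.multibyte = On in php.ini, with zend.script_encoding naming the source encoding when it is not ASCII-compatible.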

Summary:

PHP must understand all the PHP code; by default it understands ASCII, with Zend Multibyte it can understand other encodings as well.
The string literals in your source code can contain any bytes you want, as long as PHP doesn't interpret them as special characters in the string literal (e.g. the 0x22 byte in the example above), in which case you need to escape them (with a backslash in the encoding of the general source code).
The string value at runtime will be the raw byte sequence PHP read from the string literal.

Having said all this, it is typically a pain in the neck to diverge from ASCII-compatible encodings. It's a pain in text editors and easily leads to mojibake if some tool in your workflow treats the file incorrectly. At most I'd advise using ASCII-compatible encodings, e.g.:

echo "日本語";  // UTF-8 encoded (let's hope)

If you must have a non-ASCII-compatible string literal, you should use byte notation:

echo "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";

Or conversion:

echo iconv('UTF-8', 'UTF-16', '日本語');
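
To see that byte notation and conversion produce the same string, here is a small sketch, assuming PHP 7.0+ and the iconv extension. It requests UTF-16BE and prepends the byte-order mark by hand, because the byte order and BOM chosen for a plain 'UTF-16' target may differ between iconv implementations.

```php
<?php
// UTF-16BE bytes for 日本語, preceded by the byte-order mark FE FF.
$fromBytes = "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";
// The same text as UTF-8, written with escapes so the file encoding does not matter.
$utf8 = "\u{65E5}\u{672C}\u{8A9E}";
// Convert explicitly to big-endian and prepend the BOM ourselves.
$fromIconv = "\xfe\xff" . iconv('UTF-8', 'UTF-16BE', $utf8);

var_dump($fromBytes === $fromIconv);   // bool(true)
var_dump(strlen($fromBytes));          // int(8): 2 BOM bytes + 3 * 2 bytes
echo bin2hex($utf8), "\n";             // e697a5e69cace8aa9e
```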

[..] will string literals encoded in such a non-ASCII-compatible encoding work with the mb_* functions?

Sure, strings in PHP are raw byte arrays for all intents and purposes. It doesn't matter how you obtained that string. If you have a UTF-16 string obtained with any of the methods demonstrated above, including by hardcoding it in UTF-16 into the source code, you have a UTF-16 encoded string and you can put that through any and all string functions that know how to deal with it.
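
A short sketch of what that looks like in practice, assuming the mbstring extension is loaded. The encoding is passed explicitly on every call, since the functions cannot know what the bytes are meant to be.

```php
<?php
// 日本語 in UTF-16BE, without a BOM: 3 characters, 6 bytes.
$utf16 = "\x65\xe5\x67\x2c\x8a\x9e";

var_dump(strlen($utf16));                 // int(6): strlen() counts bytes
var_dump(mb_strlen($utf16, 'UTF-16BE'));  // int(3): characters, given the encoding
// Take the middle character; the result is still UTF-16BE.
$middle = mb_substr($utf16, 1, 1, 'UTF-16BE');
echo bin2hex($middle), "\n";              // 672c
// Convert to UTF-8 only when something downstream needs it.
echo bin2hex(mb_convert_encoding($middle, 'UTF-8', 'UTF-16BE')), "\n"; // e69cac
```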
