I came across the following text on the Details of the String Type page of the PHP Manual:

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. String will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicity declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.

So my question is: is it true that string literals in PHP can only be encoded in an encoding that is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, and not in one that is not?

Is it possible to encode string literals in PHP in a non-ASCII-compatible encoding such as UTF-16 or UTF-32? If yes, will string literals in such an encoding work with the mb_* (mbstring) functions? If not, what's the reason?

Suppose Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, or to some non-ASCII-compatible encoding. Can I then declare an encoding that is not a compatible superset of ASCII, such as UTF-16 or UTF-32, for the script file?

If yes, what encoding would the string literals end up in? If not, what's the reason?

Also, how does encoding work for string literals when Zend Multibyte is enabled?

How do I enable Zend Multibyte? What is the main purpose of turning it on, and when is it required?

It would help if the answers came with suitable examples.

Thank You.

3 Answers

6
votes

String literals in PHP source code files are taken literally as the raw bytes which are present in the source code file. If you have bytes in your source code which represent a UTF-16 string or anything else really, then you can use them directly:

$ echo -n '<?php echo "' > test.php
$ echo -n 日本語 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php 
<?php echo "??e?g,??";
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 65e5  <?php echo "..e.
00000010: 672c 8a9e 223b 0a                        g,..";.
$ php test.php 
??e?g,??$ 
$ php test.php | iconv -f UTF-16
日本語

This demonstrates a source code file ostensibly written in ASCII, but containing a UTF-16 string literal in the middle, which is output as is.

The bigger problem with this kind of source code is that it's difficult to work with. It's somewhere between a pain in the neck and impossible to get a text editor to treat the PHP code in one encoding and string literals in another. So typically, you want to keep the entire source code, including string literals, in one and the same encoding throughout.

You can also easily get into trouble:

$ echo -n '<?php echo "' > test.php
$ echo -n 漢字 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 6f22  <?php echo "..o"
00000010: 5b57 223b 0a                             [W";.

"漢字" here is encoded to feff 6f22 5b57. That sequence contains the byte 0x22, which is the ASCII double quote ("), the string literal terminator. PHP ends the literal at that byte, so you now have a syntax error.

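The collision can be checked outside PHP. Below is a minimal sketch in Python, used here only to inspect the bytes and not part of the original answer. It encodes both sample strings as big-endian UTF-16 (the byte order seen in the dumps above) and looks for bytes that are special inside a double-quoted PHP literal. The helper name `special_bytes` is made up for illustration.

```python
# Bytes that are significant inside a double-quoted PHP string literal:
# the terminator ("), the escape character (\) and the variable sigil ($).
SPECIAL = {0x22: '"', 0x5C: "\\", 0x24: "$"}


def special_bytes(text, encoding="utf-16-be"):
    """List the special ASCII characters that show up as raw bytes once text is encoded."""
    return [SPECIAL[b] for b in text.encode(encoding) if b in SPECIAL]


for sample in ("日本語", "漢字"):
    raw = sample.encode("utf-16-be")  # same code units as in the xxd dumps, minus the BOM
    print(sample, raw.hex(), special_bytes(sample))
```

The first sample contains none of these bytes, which is why it happened to work; the second contains 0x22.
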
By default the PHP interpreter expects the PHP code to be ASCII compatible, so if you want to keep your string literals and the rest of the source code in the same encoding, you're pretty much limited to ASCII compatible encodings. However, the Zend Multibyte extension allows you to use other encodings if you declare the used encoding accordingly (in php.ini if it's not ASCII compatible). So you could write your source code in, say, Shift-JIS throughout; probably even with string literals in some other encoding*.

* (At which point I'll quit going into details because what is wrong with you?!)

Summary:

  • PHP must understand all the PHP code; by default it understands ASCII, with Zend Multibyte it can understand other encodings as well.
  • The string literals in your source code can contain any bytes you want, as long as PHP doesn't interpret them as special characters in the string literal (e.g. the 22 example above), in which case you need to escape them (with a backslash in the encoding of the general source code).
  • The string value at runtime will be the raw byte sequence PHP read from the string literal.

Having said all this, it is typically a pain in the neck to diverge from ASCII-compatible encodings. It's a pain in text editors and easily leads to mojibake if some tool in your workflow treats the file incorrectly. I'd advise sticking to ASCII-compatible encodings, e.g.:

echo "日本語";  // UTF-8 encoded (let's hope)

If you must have a non-ASCII-compatible string literal, you should use byte notation:

echo "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";

Or conversion:

echo iconv('UTF-8', 'UTF-16', '日本語');

[..] will string literals in such an encoding work with the mb_* (mbstring) functions?

Sure, strings in PHP are raw byte arrays for all intents and purposes. It doesn't matter how you obtained that string. If you have a UTF-16 string obtained with any of the methods demonstrated above, including by hardcoding it in UTF-16 into the source code, you have a UTF-16 encoded string and you can put that through any and all string functions that know how to deal with it.
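
To illustrate the "raw byte array" point, here is a small Python sketch (not from the original answer) that takes the exact bytes of the `\xfe\xff...` literal above and compares the byte count with the character count after decoding. In PHP terms this roughly corresponds to the difference between a byte-oriented function such as strlen() and an encoding-aware mbstring function that is told the string is UTF-16.

```python
# The bytes of the UTF-16 literal shown above (BOM fe ff, then three code units).
raw = bytes.fromhex("feff65e5672c8a9e")

byte_length = len(raw)        # what a byte-oriented function sees
text = raw.decode("utf-16")   # the BOM selects big-endian and is consumed
char_count = len(text)        # what an encoding-aware function sees

print(byte_length, text, char_count)
```

The bytes themselves carry no encoding label; the caller has to know that they are UTF-16 and say so.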

2
votes

So my question is: is it true that string literals in PHP can only be encoded in an encoding that is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, and not in one that is not?

It's not true.

Is it possible to encode string literals in PHP in a non-ASCII-compatible encoding such as UTF-16 or UTF-32? If yes, will string literals in such an encoding work with the mb_* (mbstring) functions? If not, what's the reason?

As @deceze says, you can easily convert a string to the encoding you want with mb_convert_encoding or iconv.

According to the Details of the String Type page in the PHP Manual, a string is encoded in whatever fashion it is encoded in the script file. PHP built with Zend Multibyte support and with the mbstring extension loaded can parse and run PHP files encoded in a non-ASCII-compatible encoding such as UTF-16; see the tests in Zend/tests/multibyte.

Zend/tests/multibyte/multibyte_encoding_003.phpt runs a source file encoded in UTF-16 LE and shows that it outputs Hello World correctly.

Zend/tests/multibyte/multibyte_encoding_003.phpt

--TEST--
Zend Multibyte and UTF-16 BOM
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
  die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
  die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
mbstring.internal_encoding=iso-8859-1
--FILE--
<?php
print "Hello World\n";
?>
===DONE===

--EXPECT--
Hello World
===DONE===

$ run-tests.php --keep-php --show-out --show-php Zend/tests/multibyte/multibyte_encoding_003.phpt

 ... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_003.phpt]
========TEST========
<?php
print "Hello World\n";
?>
===DONE===
========DONE========

========OUT========
Hello World
===DONE===
========DONE========
PASS Zend Multibyte and UTF-16 BOM [multibyte_encoding_003.phpt]
=====================================================================
Number of tests :    1                 1
Tests skipped   :    0 (  0.0%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :    0 (  0.0%) (  0.0%)
Tests passed    :    1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken      :    0 seconds
=====================================================================

$ file multibyte_encoding_003.php

multibyte_encoding_003.php: PHP script text, Little-endian UTF-16 Unicode text

Another example is Zend/tests/multibyte/multibyte_encoding_004.phpt, which runs a source file encoded in Shift_JIS.

Zend/tests/multibyte/multibyte_encoding_004.phpt (note: some Japanese characters are not displayed correctly below because the file mixes encodings and LC_MESSAGES is set to UTF-8)

--TEST--
test for mbstring script_encoding for flex unsafe encoding (Shift_JIS)
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
  die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
  die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
zend.script_encoding=Shift_JIS
mbstring.internal_encoding=Shift_JIS
--FILE--
<?php
        function \\\($)
        {
                echo $;
        }

        \\\("h~t@\");
?>
--EXPECT--
h~t@\

$ run-tests.php --keep-php --show-out --show-php
./multibyte_encoding_004.phpt

 ... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_004.phpt]
========TEST========
<?php
        function \\\($)
        {
                echo $;
        }

        \\\("h~t@\");
?>
========DONE========

========OUT========
h~t@\
========DONE========
PASS test for mbstring script_encoding for flex unsafe encoding (Shift_JIS) [multibyte_encoding_004.phpt]
=====================================================================
Number of tests :    1                 1
Tests skipped   :    0 (  0.0%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :    0 (  0.0%) (  0.0%)
Tests passed    :    1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken      :    0 seconds
=====================================================================

$ file Zend/tests/multibyte/multibyte_encoding_004.php

multibyte_encoding_004.php: PHP script text, Non-ISO extended-ASCII text

$ cat Zend/tests/multibyte/multibyte_encoding_004.php | iconv -f SJIS -t utf-8

<?php
        function 予蚕能($引数)
        {
                echo $引数;
        }

        予蚕能("ドレミファソ");
?>
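
The test title calls Shift_JIS "flex unsafe". One plausible reading (my interpretation, not stated in the test itself) is that the second byte of some Shift_JIS characters is 0x5C, the ASCII backslash, so a lexer that scans byte by byte would misread them. The Python sketch below checks this for the characters used in the test and also reproduces the garbled `h~t@\` display. The helper name `trailing_bytes` is made up for illustration.

```python
def trailing_bytes(text, encoding="shift_jis"):
    """Return the last byte of each character once encoded."""
    return [ch.encode(encoding)[-1] for ch in text]


function_name = "予蚕能"
literal = "ドレミファソ"

# Every character of the function name ends in 0x5c, hence the "\\\" in the listing.
print([hex(b) for b in trailing_bytes(function_name)])

# Dropping the non-ASCII bytes shows what an ASCII-only viewer displays.
ascii_view = literal.encode("shift_jis").decode("ascii", errors="ignore")
print(ascii_view)
```

This is why the expected output of the test looks like `h~t@\` when the file is viewed without decoding it as Shift_JIS.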

Is it possible to encode string literals in PHP in a non-ASCII-compatible encoding such as UTF-16 or UTF-32? If yes, will string literals in such an encoding work with the mb_* (mbstring) functions? If not, what's the reason?

The answer to the first question is yes; the Zend Multibyte tests above demonstrate it convincingly. The answer to the second question is also yes, provided you pass the correct encoding to the mb_* functions.

Suppose Zend Multibyte is enabled and I've set the internal encoding to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, or to some non-ASCII-compatible encoding. Can I then declare an encoding that is not a compatible superset of ASCII, such as UTF-16 or UTF-32, for the script file?

If yes, what encoding would the string literals end up in? If not, what's the reason?

Yes. The output of the second command below is UTF-32 encoded (each character is represented by 4 bytes), which is the configured internal encoding.

$ echo -e '<?php\necho "Hello 中文";' | php  | hexdump -C
00000000  48 65 6c 6c 6f 20 e4 b8  ad e6 96 87              |Hello ......|
0000000c

$ echo -e '<?php\necho "Hello 中文";' | iconv -t utf-16 | php -d zend.multibyte=1 -d zend.script_encoding=UTF-16 -d mbstring.internal_encoding=UTF-32 | hexdump -C
00000000  00 00 00 48 00 00 00 65  00 00 00 6c 00 00 00 6c  |...H...e...l...l|
00000010  00 00 00 6f 00 00 00 20  00 00 4e 2d 00 00 65 87  |...o... ..N-..e.|
00000020
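
The second hexdump can be reproduced by hand. The following Python sketch (an illustration, not how PHP implements it) takes the literal as it would be stored in a UTF-16 script and re-encodes it in UTF-32. Big-endian byte order is assumed for both encodings because that is what the hexdump shows; the byte order that `iconv -t utf-16` actually emits depends on the platform.

```python
# The literal as it would appear in a script saved as UTF-16 (big-endian assumed).
script_bytes = "Hello 中文".encode("utf-16-be")

# Decode from the script encoding, then encode in the internal encoding.
internal = script_bytes.decode("utf-16-be").encode("utf-32-be")

print(len(script_bytes), len(internal))
print(internal.hex())
```

Eight characters at four bytes each give the 32 (0x20) bytes seen in the dump.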

Also, how does encoding work for string literals when Zend Multibyte is enabled?

The Zend Multibyte feature is implemented in Zend/zend_multibyte.c. It lets the Zend engine handle encodings other than ASCII and UTF-8, but it is only an interface for encoding handling: the default implementation consists of dummy functions, and the real implementation is provided by the mbstring extension. The mbstring extension must therefore be loaded to get multibyte support.

$ php -m | grep mbstring
mbstring
$ php -n -m | grep mbstring # -n: no configuration (ini) files are used, so mbstring is not loaded
$ echo -e '<?php\n echo "Hello 中文\n"; ' | iconv -t utf-16 | php -n -d zend.multibyte=1

Fatal error: Could not convert the script from the detected encoding "UTF-32LE" to a compatible encoding in Unknown on line 0

How do I enable Zend Multibyte? What is the main purpose of turning it on, and when is it required?

Setting zend.multibyte=1 in php.ini enables parsing of source files in multibyte encodings. You can also pass -d zend.multibyte=1 to the PHP CLI executable, as in the examples above, to enable multibyte support in the Zend engine.

0
votes

How do I enable Zend Multibyte?

Compile PHP using the --enable-zend-multibyte flag (before PHP 5.4) and activate the zend.multibyte setting in php.ini.

Cf. https://secure.php.net/manual/en/ini.core.php#ini.zend.multibyte and https://secure.php.net/manual/en/configure.about.php#configure.options.php