So my doubt is, is it true that string literals in PHP can only be
encoded in an encoding which is a compatible superset of ASCII, such
as UTF-8 or ISO-8859-1 and not in an encoding which is not a
compatible superset of ASCII?
It's not true.
Is it possible to encode string literals in PHP in some non-ASCII
compatible encoding like UTF-16, UTF-32 or some other such non-ASCII
compatible encoding? If yes then will the strings literals encoded in
such one of the non-ASCII compatible encoding work with mb_string_*
functions? If no, then what's the reason?
As @deceze says, you can easily convert a string to whatever encoding you want with mb_convert_encoding or iconv.
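As a minimal sketch of such a conversion, here is the command-line iconv tool turning a short string into UTF-16LE and back. The variable names are only illustrative; on the PHP side the equivalent call would be mb_convert_encoding($str, 'UTF-16LE', 'UTF-8').

```shell
# Convert a UTF-8 string to UTF-16LE and count the bytes:
# each of the 5 ASCII characters takes 2 bytes in UTF-16.
utf16_len=$(printf 'Hello' | iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' ')

# Convert back to UTF-8 to confirm the round trip is lossless.
roundtrip=$(printf 'Hello' | iconv -f UTF-8 -t UTF-16LE | iconv -f UTF-16LE -t UTF-8)

echo "$utf16_len $roundtrip"
```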
According to the "Details of the String Type" section of the PHP Manual, a string literal is encoded in whatever encoding the script file itself uses. A PHP build with Zend Multibyte support and the mbstring extension can parse and run PHP files written in a non-ASCII-compatible encoding such as UTF-16; see the tests in Zend/tests/multibyte.
Zend/tests/multibyte/multibyte_encoding_003.phpt demonstrates running a source file encoded in UTF-16 LE that prints Hello World correctly.
Zend/tests/multibyte/multibyte_encoding_003.phpt
--TEST--
Zend Multibyte and UTF-16 BOM
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
mbstring.internal_encoding=iso-8859-1
--FILE--
<?php
print "Hello World\n";
?>
===DONE===
--EXPECT--
Hello World
===DONE===
$ run-tests.php --keep-php --show-out --show-php Zend/tests/multibyte/multibyte_encoding_003.phpt
... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_003.phpt]
========TEST========
<?php
print "Hello World\n";
?>
===DONE===
========DONE========
========OUT========
Hello World
===DONE===
========DONE========
PASS Zend Multibyte and UTF-16 BOM [multibyte_encoding_003.phpt]
=====================================================================
Number of tests : 1 1
Tests skipped : 0 ( 0.0%) --------
Tests warned : 0 ( 0.0%) ( 0.0%)
Tests failed : 0 ( 0.0%) ( 0.0%)
Expected fail : 0 ( 0.0%) ( 0.0%)
Tests passed : 1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken : 0 seconds
=====================================================================
$ file multibyte_encoding_003.php
multibyte_encoding_003.php: PHP script text, Little-endian UTF-16 Unicode text
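The test name mentions a UTF-16 BOM. As a rough illustration (built by hand here, not taken from the PHP source tree), the first bytes of such a file can be reproduced with printf and iconv: the byte order mark FF FE, then "<?php" with a 00 byte after every ASCII character, so the plain ASCII byte sequence "<?php" never appears in the file.

```shell
# Hex dump of: UTF-16 LE byte order mark + "<?php" encoded as UTF-16LE.
utf16_head=$(
  {
    printf '\377\376'                             # BOM bytes FF FE, as octal escapes
    printf '<?php' | iconv -f UTF-8 -t UTF-16LE   # payload, iconv adds no BOM for UTF-16LE
  } | od -An -v -tx1 | tr -d ' \n'
)
echo "$utf16_head"
```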
Another example is Zend/tests/multibyte/multibyte_encoding_004.phpt, which runs a source file encoded in Shift_JIS.
Zend/tests/multibyte/multibyte_encoding_004.phpt (Note: some Japanese characters are not displayed correctly below, because the file mixes encodings and the locale (LC_MESSAGES) is set to UTF-8.)
--TEST--
test for mbstring script_encoding for flex unsafe encoding (Shift_JIS)
--SKIPIF--
<?php
if (!in_array("zend.detect_unicode", array_keys(ini_get_all()))) {
die("skip Requires configure --enable-zend-multibyte option");
}
if (!extension_loaded("mbstring")) {
die("skip Requires mbstring extension");
}
?>
--INI--
zend.multibyte=1
zend.script_encoding=Shift_JIS
mbstring.internal_encoding=Shift_JIS
--FILE--
<?php
function \\\($)
{
echo $;
}
\\\("h~t@\");
?>
--EXPECT--
h~t@\
$ run-tests.php --keep-php --show-out --show-php ./multibyte_encoding_004.phpt
... skip some trivial message ...
Running selected tests.
TEST 1/1 [multibyte_encoding_004.phpt]
========TEST========
<?php
function \\\($)
{
echo $;
}
\\\("h~t@\");
?>
========DONE========
========OUT========
h~t@\
========DONE========
PASS test for mbstring script_encoding for flex unsafe encoding (Shift_JIS) [multibyte_encoding_004.phpt]
=====================================================================
Number of tests : 1 1
Tests skipped : 0 ( 0.0%) --------
Tests warned : 0 ( 0.0%) ( 0.0%)
Tests failed : 0 ( 0.0%) ( 0.0%)
Expected fail : 0 ( 0.0%) ( 0.0%)
Tests passed : 1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken : 0 seconds
=====================================================================
$ file Zend/tests/multibyte/multibyte_encoding_004.php
multibyte_encoding_004.php: PHP script text, Non-ISO extended-ASCII text
$ cat Zend/tests/multibyte/multibyte_encoding_004.php | iconv -f SJIS -t utf-8
<?php
function 予蚕能($引数)
{
echo $引数;
}
予蚕能("ドレミファソ");
?>
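The test title calls Shift_JIS a "flex unsafe" encoding. A likely reason (my reading, not something the test file states) is that some Shift_JIS characters have 0x5C, the ASCII backslash, as their second byte, which is also why the raw listings above show backslashes where Japanese characters should be. A small sketch to check this for ソ, assuming your iconv build knows the SJIS name:

```shell
# UTF-8 bytes of ソ (U+30BD), written as octal escapes so the script stays pure ASCII.
so_utf8='\343\202\275'

# Convert to Shift_JIS and dump the resulting bytes as hex.
so_sjis=$(printf "$so_utf8" | iconv -f UTF-8 -t SJIS | od -An -v -tx1 | tr -d ' \n')

echo "$so_sjis"   # the second byte is 5c, the ASCII code of a backslash
```

A lexer that works byte by byte and is not aware of the encoding would see that trailing 5c as an escape character inside a double-quoted string.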
Is it possible to encode string literals in PHP in some non-ASCII
compatible encoding like UTF-16, UTF-32 or some other such non-ASCII
compatible encoding? If yes then will the strings literals encoded in
such one of the non-ASCII compatible encoding work with mb_string_*
functions? If no, then what's the reason?
The answer to the first question is yes; the Zend Multibyte tests above demonstrate it convincingly. The answer to the second question is also yes, provided you pass the correct encoding hints to the mbstring (mb_*) functions.
Suppose, Zend Multibyte is enabled and I've set the internal encoding
to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1 or some
other non-ASCII compatible encoding. Now, can I declare the encoding
which is not a compatible superset of ASCII, such as UTF-16 or UTF-32
in the script file?
If yes, then in this case what encoding the string literals would get
encoded in? If no, then what's the reason?
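Yes. In the example below, the script is declared as UTF-16 and the internal encoding is set to UTF-32, and the output of the second command is encoded in UTF-32, which represents every character as 4 bytes: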
$ echo -e '<?php\necho "Hello 中文";' | php | hexdump -C
00000000 48 65 6c 6c 6f 20 e4 b8 ad e6 96 87 |Hello ......|
0000000c
$ echo -e '<?php\necho "Hello 中文";' | iconv -t utf-16 | php -d zend.multibyte=1 -d zend.script_encoding=UTF-16 -d mbstring.internal_encoding=UTF-32 | hexdump -C
00000000 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c |...H...e...l...l|
00000010 00 00 00 6f 00 00 00 20 00 00 4e 2d 00 00 65 87 |...o... ..N-..e.|
00000020
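To cross-check the second dump without PHP, iconv can produce the UTF-32 bytes (big-endian, no BOM) of the same text directly. This is only a sanity check on the expected bytes, not a reproduction of what PHP does internally; the variable names are illustrative.

```shell
# "Hello 中文" as UTF-8; the two CJK characters are given as octal escapes
# (中 = E4 B8 AD, 文 = E6 96 87) so the script itself stays pure ASCII.
text='Hello \344\270\255\346\226\207'

# Encode as UTF-32BE and dump as one continuous hex string (8 hex digits per character).
utf32_hex=$(printf "$text" | iconv -f UTF-8 -t UTF-32BE | od -An -v -tx1 | tr -d ' \n')

echo "$utf32_hex"
```

The result has eight groups of 4 bytes, ending in 00004e2d and 00006587, which matches the last two characters in the hexdump above.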
Also, explain me how does this encoding thing work for string literals
if Zend Multibyte is enabled?
The Zend Multibyte feature is implemented in Zend/zend_multibyte.c. It lets the Zend engine handle encodings other than ASCII and UTF-8, but it is only an interface for encoding operations: the default implementation consists of dummy functions, and the real implementation is provided by the mbstring extension. The mbstring extension must therefore be loaded to get multibyte support.
$ php -m | grep mbstring
mbstring
$ php -n -m | grep mbstring # -n: no configuration (ini) files are used, so mbstring is not loaded
$ echo -e '<?php\n echo "Hello 中文\n"; ' | iconv -t utf-16 | php -n -d zend.multibyte=1
Fatal error: Could not convert the script from the detected encoding "UTF-32LE" to a compatible encoding in Unknown on line 0
How to enable the Zend Multibyte? What's the main intention behind
turning it On? When it is required to turn it On?
Setting zend.multibyte=1 in php.ini enables parsing of source files in multibyte encodings. You can also pass -d zend.multibyte=1 to the PHP CLI executable, as in the examples above, to enable multibyte support in the Zend engine.