I'm playing around with transliteration in PHP using iconv. In particular, I want to normalise accented characters and romanize other scripts, converting UTF-8 to plain ASCII.
While many characters work (such as Ž -> Z), others give odd results or raise errors.
For example, E ACUTE é (U+00E9) transliterates to ASCII with a single quote (U+0027) preceding the e, as if it's trying to represent the diacritic mark I'm trying to get rid of:
$utf_8 = "\xC3\xA9"; // <- é
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// returns "'e", not "e"
Non-Latin scripts are worse. For example, Greek sigma Σ (U+03A3), which should transliterate to Latin S, is not recognised at all and raises an error:
$utf_8 = "\xCE\xA3"; // <- Σ
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// Raises notice: iconv(): Detected an illegal character in input string
I can just about cope with the first one, but how can I transliterate "Σ" to "S", and do this reliably across other scripts that have equivalent characters?
I don't mind generating my own tables if there is a good source that works for most European languages.
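As a rough sketch of what a hand-built table could look like: the array form of strtr() replaces whole byte sequences rather than single bytes, so it can map multibyte UTF-8 characters to ASCII. The table entries and the helper name transliterate_with_table are hypothetical and only cover the three characters mentioned in this question.

```php
<?php
// Hypothetical hand-built table: UTF-8 byte sequence => ASCII replacement.
// Illustrative only; a real table would need many more entries.
$table = [
    "\xC3\xA9" => 'e', // é (U+00E9)
    "\xC5\xBD" => 'Z', // Ž (U+017D)
    "\xCE\xA3" => 'S', // Σ (U+03A3)
];

// Hypothetical helper: applies the table with the array form of strtr(),
// which matches whole substrings (longest first), not individual bytes.
function transliterate_with_table(string $utf8, array $table): string
{
    return strtr($utf8, $table);
}

echo transliterate_with_table("\xC3\xA9\xC5\xBD\xCE\xA3", $table), "\n"; // prints "eZS"
```

Characters that are missing from the table pass through unchanged, so the result is only plain ASCII if the table covers everything in the input.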
Note that I've tried various collation tables, which are useful for normalising accented Latin characters, but they don't work for transliterating between scripts.
The three-argument form of strtr wouldn't work for the multibyte characters in my example anyway. – Tim