How to transliterate non-Latin scripts?

Question

I'm playing around with transliteration in PHP using iconv. In particular, I want to normalise accented characters and romanize other scripts, converting UTF-8 to plain ASCII.

While many characters work (such as Ž->Z), others give odd results or raise errors.

For example, E ACUTE é (U+00E9) transliterates to ASCII with a single quote (U+0027) preceding the e, as if it's trying to represent the diacritic mark I'm trying to get rid of.

$utf_8 = "\xC3\xA9"; // <- é
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// returns "'e", not "e"

Non-Latin scripts are worse. For example, Greek capital sigma Σ (U+03A3), which should transliterate to Latin S, is not recognised at all and raises an error:

$utf_8 = "\xCE\xA3"; // <- Σ
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// Raises notice: iconv(): Detected an illegal character in input string

I can just about cope with the first one, but how can I transliterate "Σ" to "S", and do this reliably across other scripts that have equivalent characters?

I don't mind generating my own tables if there is a good source that works for most European languages.

Note that I've tried various collation tables, which are useful for normalising accented Latin characters, but they don't work for transliterating between scripts.

You may be able to get some love from strtr. You would just have to supply a custom map from one character to another. See here for example: stackoverflow.com/questions/17850603/… - Orangepill
It's getting the custom maps that I'm worried about; coding it isn't the problem. strtr wouldn't work for the multibyte characters in my example anyway. - Tim

Halcyon · Accepted Answer · 2013-07-25T16:44:35

I've not had much luck using iconv. It always manages to throw a bunch of notices.

The best luck I've had is with using a custom transliteration table. It's far from perfect but at least you'll feel like you have some solid ground.
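A minimal sketch of what such a table could look like, assuming valid UTF-8 input. The map `$translit_map` and the helper `transliterate_with_map()` are hypothetical names made up for illustration, and the five entries are only a sample, not a usable table. It relies on the array form of `strtr`, which replaces whole keys as byte sequences, so multibyte UTF-8 characters can be used as keys:

```php
<?php
// Hypothetical, deliberately tiny table. A real one needs an entry for
// every character you expect to meet. Keys are raw UTF-8 byte sequences.
$translit_map = array(
    "\xC3\xA9" => 'e',  // é U+00E9
    "\xC5\xBD" => 'Z',  // Ž U+017D
    "\xCE\xA3" => 'S',  // Σ U+03A3
    "\xCF\x83" => 's',  // σ U+03C3
    "\xC3\x9F" => 'ss', // ß U+00DF (one character may map to several)
);

// Hypothetical helper: apply the table, then flag whatever it missed.
function transliterate_with_map($utf_8, array $map)
{
    // The array form of strtr matches the longest key first and never
    // re-scans text it has already replaced.
    $ascii = strtr($utf_8, $map);

    // Anything still outside ASCII was not in the table; replace each
    // such character with '?' so the gaps are easy to spot.
    return preg_replace('/[^\x00-\x7F]/u', '?', $ascii);
}

echo transliterate_with_map("\xCE\xA3", $translit_map), "\n";          // S
echo transliterate_with_map("\xC5\xBDlut\xC3\xBD", $translit_map), "\n"; // Zlut?
```

The `?` fallback is a design choice for this sketch: it keeps the output pure ASCII and shows which characters the table still lacks (ý in the second call), rather than raising a notice as iconv does.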

I've not found a good single source for transliteration tables. My unfamiliarity with anything but the Latin script isn't helping.
