How to remove accents and turn letters into “plain” ASCII characters? [duplicate]

46

votes

What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?

Is there a simple, built in way that I'm missing or a regular expression?

@Peeps: telling users to search google is against Stack Overflow's etiquette. If the question doesn't exist on the website it's better for everyone if it is asked, even if the OP already knows the answer, since it will increase our number of non-duplicate questions. So maybe next time if someone searches it with google they will find this very question, and we will have one more user. – Thomas Bonini

@Andreas good point. However, this is most certainly a SO duplicate, so Peeps kind of has a small point :) I'm too lazy to search for it right now, though. – Pekka

55

votes

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))

54

votes

I found a solution, that worked in all my test-cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. ﬁ ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is, that even non-latin characters are not ignored.

EDIT2: I discovered, that there are some characters, that are not covered by the transliteration I posted originally. Any-Latin translates the cyrillic character ь to a character, that doesn't fit into a latin character-set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-latin characters. I also added a test to the text ;)

I suggest, that they mean the latin alphabet and not one of the latin character-sets by Latin here. But anyways - in my opinion, they should transliterate it to something ASCII then in Latin-ASCII ...

EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.

20

votes

Reposting this on request of @palantir ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...

function toASCII( $str )
{
    return strtr(utf8_decode($str), 
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}

13

votes

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))

12

votes

Note: I'm reposting this from another similar question in the hope that it's helpful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.

How to remove accents and turn letters into “plain” ASCII characters? [duplicate]

5 Answers