Detect language from string in PHP

votes

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

You want to test if a string has non-English characters? Can you define what "English" is? – strager

"the problem with the French is they have no word for entrepreneur" – Pete Kirkham

Basically, what I want to do is this: I have an array of incoming user comments, each of which could be in a different language. On the PHP backend, I want to set a flag if a comment is not in English (for example, French or Japanese), and the frontend will show a translate button if the flag is set. – Beier

What you want to do is possible completely with javascript and google. You don't need to do anything more than an include. – Esteban Küber

you might want to try google's cld2! – Steel Brain

17 Answers

votes

You cannot detect the language from the character type alone, and there is no foolproof way to do this.

With any method, you are only making an educated guess. There are some math-related articles on the topic available.

votes

I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it has a modest 52-language database. The downside is that it does not detect East Asian languages.

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage();
} else {
    print_r($result);
}

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)

votes

I know this is an old post, but here is what I developed after not finding any viable solution.

Other suggestions are all too heavy and too cumbersome for my situation.
I support a finite number of languages on my website (at the moment two: 'en' and 'de' - but solution is generalised for more).
I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
So I want a solution with minimal false positives - but don't care so much about false negatives.

The solution uses the 20 most common words in a language, counts the occurrences of those in the haystack. Then it just compares the counts of the first and second most counted languages. If the runner-up number is less than 10% of the winner, the winner takes it all.

Code - Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // French word list
      // from https://1000mostcommonwords.com/1000-most-common-french-words/
      $wordList['fr'] = array ('comme', 'que',  'tait',  'pour',  'sur',  'sont',  'avec',
                         'tre',  'un',  'ce',  'par',  'mais',  'que',  'est',
                         'il',  'eu',  'la', 'et', 'dans');

      // Spanish word list
      // from https://spanishforyourjob.com/commonwords/
      $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                         'en', 'lo', 'un', 'por', 'qu', 'si', 'una',
                         'los', 'con', 'para', 'est', 'eso', 'las');
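      // clean out the input string - note that every non-ASCII character
      // becomes a space, so a word-list entry with an accent or umlaut
      // (such as 'für') can never match; keep the lists ASCII-only or
      // change this regex for your languages
      // pad with spaces so words at the very start or end are counted too
      $text = ' ' . preg_replace("/[^A-Za-z]/", ' ', $text) . ' ';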
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language]=0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] + 
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' .$wordList[$language][$i] . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there are two winners - fall back to default!
      if (count($maxs) == 1) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language <> $winner) {
            if ($counter[$language]>$second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%
        if (($second / $max) < 0.1) {
          return $winner;
        } 
      }
      return $default;
    }

votes

You could do this entirely client side with ~~Google's AJAX Language API~~ (now defunct).

With the AJAX Language API, you can translate and detect the language of blocks of text within a webpage using only Javascript. In addition, you can enable transliteration on any textfield or textarea in your web page. For example, if you were transliterating to Hindi, this API will allow users to phonetically spell out Hindi words using English and have them appear in the Hindi script.

You can automatically detect a string's language:

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And translate any string written in one of the ~~supported languages~~ (also defunct)

google.language.translate("Hello world", "en", "es", function(result) {
  if (!result.error) {
    var container = document.getElementById("translation");
    container.innerHTML = result.translation;
  }
});

votes

As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for the Google Translate API:

http://detectlanguage.com

votes

I tried the Text_LanguageDetect library and the results I got were not very good (for instance, the text "test" was identified as Estonian and not English).

I can recommend trying the Yandex Translate API, which is free for up to 1 million characters per 24 hours and up to 10 million characters per month. According to the documentation, it supports over 60 languages.

<?php
function identifyLanguage($text)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/detect?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (strlen($outputJson->lang) > 0)
            {
                return $outputJson->lang;
            }
        }
    }

    return "unknown";
}

function translateText($text, $targetLang)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/translate?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text) . "&lang=" . urlencode($targetLang);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (count($outputJson->text) > 0 && strlen($outputJson->text[0]) > 0)
            {
                return $outputJson->text[0];
            }
        }
    }

    return $text;
}

header("content-type: text/html; charset=UTF-8");

echo identifyLanguage("エクスペリエンス");
echo "<br>";
echo translateText("エクスペリエンス", "en");
echo "<br>";
echo translateText("エクスペリエンス", "es");
echo "<br>";
echo translateText("エクスペリエンス", "zh");
echo "<br>";
echo translateText("エクスペリエンス", "he");
echo "<br>";
echo translateText("エクスペリエンス", "ja");
echo "<br>";
?>

votes

Text_LanguageDetect pear package produced terrible results: "luxury apartments downtown" is detected as Portuguese...

The Google API is still the best solution. They give $300 of free credit and warn you before charging anything.

Below is a super simple function that uses file_get_contents to download the lang detected by the API, so no need to download or install libraries etc.

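function guess_lang($str) {

    // URL-encode the whole query value, not just the spaces
    $str = rawurlencode($str);

    $content = file_get_contents("https://translation.googleapis.com/language/translate/v2/detect?key=YOUR_API_KEY&q=".$str);

    // file_get_contents returns false when the request fails
    if ($content === false)
        return null;

    $lang = json_decode($content, true);

    if (isset($lang["data"]["detections"][0][0]["language"]))
        return $lang["data"]["detections"][0][0]["language"];

    return null;
}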

Execute:

echo guess_lang("luxury apartments downtown montreal"); // returns "en"

You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/

This is a simple example for short phrases to get you going. For more complex applications you'll want to restrict your API key and use the library obviously.

votes

You can probably use the Google Translate API to detect the language and translate it if necessary.

votes

You can detect the language of a string in PHP using the Text_LanguageDetect PEAR package, or download it and use it separately like a regular PHP library.

votes

One approach might be to break the input string into words and then look up those words in an English dictionary to see how many of them are present. This approach has a few limitations:

proper nouns may not be handled well
spelling errors can disrupt your lookups
abbreviations like "lol" or "b4" won't necessarily be in the dictionary
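
The dictionary lookup described above could be sketched roughly as follows. The function name `englishWordRatio`, the tiny stand-in dictionary and the 0.5 threshold are all assumptions made for illustration; a real dictionary would be loaded from a word list file.

```php
<?php
// Hypothetical helper: fraction of the words in $text that appear in $dictionary.
// $dictionary is assumed to be an array whose keys are lowercase English words.
function englishWordRatio(string $text, array $dictionary): float
{
    // Split on anything that is not a letter (Unicode-aware).
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return 0.0; // empty input or invalid UTF-8
    }
    $hits = 0;
    foreach ($words as $word) {
        if (isset($dictionary[$word])) {
            $hits++;
        }
    }
    return $hits / count($words);
}

// Tiny stand-in dictionary, for illustration only.
$dictionary = array_flip(['the', 'is', 'a', 'this', 'comment', 'in', 'english', 'nice', 'day']);

// The 0.5 threshold is arbitrary; tune it against your own data.
$isEnglish = englishWordRatio('This is a nice comment', $dictionary) >= 0.5;
var_dump($isEnglish);
```

Using a ratio rather than requiring every word to match leaves some room for the proper nouns, typos and abbreviations listed above.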

votes

Perhaps submit the string to this language guesser:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser

votes

I would take documents in various languages and tabulate which Unicode characters they use. You could then use some Bayesian reasoning to determine the language just from the Unicode characters in the string. This would separate French from English or Russian.

I am not sure what else could be done, except looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
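
One way to read the character-based idea above is as a very small naive Bayes classifier over single letters. The sketch below assumes that reading; the function names and the two one-sentence training samples are made up for illustration, and real use would need far more training text per language.

```php
<?php
// Count how often each letter occurs in $text (lowercased, Unicode-aware).
function letterCounts(string $text): array
{
    preg_match_all('/\p{L}/u', mb_strtolower($text, 'UTF-8'), $matches);
    return array_count_values($matches[0] ?? []);
}

// Naive Bayes over single letters with add-one smoothing.
// $samples maps a language code to reference text in that language.
function guessByLetters(string $text, array $samples): string
{
    $textCounts = letterCounts($text);
    if (!$textCounts) {
        return ''; // no letters to judge by
    }
    $best = '';
    $bestScore = -INF;
    foreach ($samples as $lang => $sample) {
        $model = letterCounts($sample);
        $total = array_sum($model);
        $vocab = count($model) + 1; // one extra slot for unseen letters
        $score = 0.0;
        foreach ($textCounts as $letter => $n) {
            $count = $model[$letter] ?? 0;
            // Sum log-probabilities so long texts do not underflow.
            $score += $n * log(($count + 1) / ($total + $vocab));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best;
}

// Assumed toy training data: one sentence per language.
$samples = [
    'en' => 'the quick brown fox jumps over the lazy dog and runs away',
    'ru' => 'съешь ещё этих мягких французских булок да выпей чаю',
];
echo guessByLetters('привет мир', $samples), "\n";
```

Single letters separate scripts well, but for languages that share an alphabet the signal is weak, which is where the dictionary or word-frequency approaches come in.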

votes

Try using the byte (ASCII) codes of the characters. I use this code to tell Russian from English in my social bot project:

function language($string) {
    // Remove HTML entities and stray markup characters
    $htmlcharacters = array("<", ">", "&amp;", "&lt;", "&gt;", "&");
    $string = str_replace($htmlcharacters, "", $string);
    // Strip out the slashes
    $string = stripslashes($string);
    $badthings = array("=", "#", "~", "!", "?", ".", ",", "<", ">", "/", ";", ":", '"', "'", "[", "]", "{", "}", "@", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "|", "`");
    $string = str_replace($badthings, "", $string);
    $string = mb_strtolower(trim($string), 'UTF-8');
    $msgarray = explode(" ", $string);
    // Only the first byte of the first word is inspected
    $first = ($msgarray[0] !== '') ? ord($msgarray[0][0]) : 0;
    if ($first === 208 || $first === 209) {
        // 208 and 209 (0xD0, 0xD1) are the UTF-8 lead bytes of the basic Cyrillic letters
        $result = 'Русский'; // Russian
    } elseif ($first >= 97 && $first <= 122) {
        // 97-122 are the ASCII codes of a-z
        $result = 'Английский'; // English
    } else {
        $result = 'ошибка' . $first; // error
    }
    return $result;
}

votes

I have had good results with https://github.com/patrickschur/language-detection and am using it in production:

It uses ngrams in languages to detect the most likely language (the longer your string / the more words, the more accurate it will be), which seems like a solid proven method.
110 languages are supported, but you can also limit the number of languages to only those you are interested in.
Trainer and Language detector can easily be improved / customized. It uses the Universal Declaration of Human Rights in each of the languages as the foundation to detect a language, but if you know what type of sentences you experience you can easily extend or replace the used texts in each language and get better results fast. "Training" this library to become better is easy.
I would suggest to increase setMaxNgrams (I set it to 9000) in the Trainer and run it once, and then also use that setting in the Language detector class. Changing the ngrams number is a bit unintuitive (I had to look through the code to find out how it works), which is a drawback, and the default (310) is always too low in my opinion. More ngrams makes the guessing a lot better.
Because the library is very small, it was relatively easy to understand what is happening and how to tweak it.

My usage: I am analyzing emails for a CRM system to know what language an email was written in, so sending the text to a third party service was not an option. Even though the Universal Declaration of Human Rights is probably not the best basis to categorize the language of emails (as emails often have formulaic parts like greetings, which are not part of the Human Rights Declaration) it identifies the correct language in like 99% of cases, if there are at least 5 words in it.

Update: I managed to improve language recognition in emails to basically 100% when using the language-detection library with the following methods:

Add additional common phrases to the (relevant) language samples, like "Greetings", "Best regards", "Sincerely". These kinds of expressions are not used in the Universal Declaration of Human Rights. Commonly used phrases help the language recognition a lot, especially formulaic ones often used by humans ("Hello", "Have a nice day") if you are analyzing human communication.
Set the maximum ngram length to 4 (instead of the default 3).
Keep the maxNgrams at 9000 as before.

These do make the library a bit slower, so I would suggest to use them in an async way if possible and measure the performance. In my case it is more than fast enough and much more accurate.
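
To give a feel for the n-gram method, here is a self-contained sketch of ranked n-gram profiles compared with an "out-of-place" distance. It is not the library's code or API. The function names are hypothetical, and the `$maxLength` and `$maxNgrams` parameters only loosely mirror the settings discussed above. The training samples are the first sentence of Article 1 of the Universal Declaration of Human Rights.

```php
<?php
// Build a ranked profile (n-gram => rank) of the most frequent character n-grams.
function ngramProfile(string $text, int $maxLength = 3, int $maxNgrams = 300): array
{
    $counts = [];
    preg_match_all('/\p{L}+/u', mb_strtolower($text, 'UTF-8'), $matches);
    foreach ($matches[0] ?? [] as $word) {
        // Pad each word so that word boundaries become part of the n-grams.
        $chars = preg_split('//u', '_' . $word . '_', -1, PREG_SPLIT_NO_EMPTY);
        $length = count($chars);
        for ($n = 1; $n <= $maxLength; $n++) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = implode('', array_slice($chars, $i, $n));
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent first
    $top = array_keys(array_slice($counts, 0, $maxNgrams, true));
    return array_flip($top); // n-gram => rank (0 = most frequent)
}

// Sum of rank differences; n-grams missing from the language profile cost $penalty.
function profileDistance(array $doc, array $lang, int $penalty = 300): int
{
    $distance = 0;
    foreach ($doc as $gram => $rank) {
        $distance += isset($lang[$gram]) ? abs($rank - $lang[$gram]) : $penalty;
    }
    return $distance;
}

// Return the key of the sample whose profile is closest to the text's profile.
function detectByNgrams(string $text, array $samples): string
{
    $doc = ngramProfile($text);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($samples as $lang => $sample) {
        $distance = profileDistance($doc, ngramProfile($sample));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $lang;
        }
    }
    return $best;
}

$samples = [
    'en' => 'All human beings are born free and equal in dignity and rights.',
    'de' => 'Alle Menschen sind frei und gleich an Würde und Rechten geboren.',
    'ru' => 'Все люди рождаются свободными и равными в своем достоинстве и правах.',
];
echo detectByNgrams('все люди', $samples), "\n";
```

In this sketch the language profiles are rebuilt on every call. A trained detector would compute them once and store them, which is why raising the profile size costs mostly training time and memory.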

votes

You could implement an Apache Tika module in Java, write the results to a text file, a database, etc., and then read them from PHP. If you don't have that much content, you could use Google's API, although keep in mind that your calls will be limited and you can only send a restricted number of characters to the API. At the time of writing, I had finished testing version 1 of the API (which turned out to be not so accurate) and the labs version 2 (which I ditched after I read that there is a cap of 100,000 characters per day).

votes

Additional words for French and Spanish to Swiss Mister's answer:

    // French word list
    // from https://1000mostcommonwords.com/1000-most-common-french-words/
    $wordList['fr'] = array ('comme', 'que',  'était',  'pour',  'sur',  'sont',  'avec',
                             'être',  'à',  'un',  'ce',  'par',  'mais',  'que',  'est',
                             'il',  'eu',  'la', 'et', 'dans');

    // Spanish word list
    // from https://spanishforyourjob.com/commonwords/
    $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                             'en', 'lo', 'un', 'por', 'qué', 'si', 'una',
                             'los', 'con', 'para', 'está', 'eso', 'las');

votes

My answer is for a specific case. Here is what I wrote to find out whether a string is in a specific language, on one condition: the languages involved have different alphabets. In my case, the word(s) can be in 3 languages: English, Bulgarian and Greek (each with a different alphabet). I need to find out whether a text is in Bulgarian, so that I can later translate it to Greek.

class Language {
        protected $bgSymbols = array(
            'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ъ', 'ь', 'ч', 'щ', 'ш', 'ю', 'я',
            'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ъ', 'Ь', 'Ч', 'Щ', 'Ш', 'Ю', 'Я'
        );
        
        public function checkIfForTranslate($string) {
            $result = false;
            $stringArray = array();
            preg_match_all('/./u', $string, $matches);
            if(isset($matches[0])) {
                $stringArray = $matches[0];
            }
            foreach($this->bgSymbols as $symbol) {
                $found = array_search($symbol, $stringArray);
                if($found !== false) {
                    $result = true;
                    break;
                }
            }
            return $result;
        }
    }

Hope this helps someone with a similar case to mine.
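
As a possible alternative to a hand-written alphabet table, the same alphabet check can be sketched with PCRE's Unicode script properties. The function name `dominantScript` and the list of scripts are assumptions for illustration. Like the class above, this only tells alphabets apart, so it cannot distinguish Bulgarian from other languages written in Cyrillic.

```php
<?php
// Return the script with the most letters in $text, or 'Unknown' if none match.
function dominantScript(string $text): string
{
    $scripts = ['Latin', 'Cyrillic', 'Greek']; // extend as needed
    $best = 'Unknown';
    $bestCount = 0;
    foreach ($scripts as $script) {
        // \p{Cyrillic} etc. match any character belonging to that Unicode script.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $script;
        }
    }
    return $best;
}

echo dominantScript('Здравей свят'), "\n";
```

Counting matches per script, instead of stopping at the first matching character, keeps a single borrowed word in another alphabet from deciding the result.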