Machine learning with Naive Bayes on non-English words

Question

I use the TextBlob library for Python and its Naive Bayes classifier, which I have learned is built on NLTK's Naive Bayes classifier. Here is the question: my input sentences are not in English (they are Turkish). Will that work? I tried it with 10 training examples and it seems to work, but I don't understand how NLTK's Naive Bayes classifier handles non-English data. What are the disadvantages?

The classifier should work the same for most non-English data. Of course, how well this classification algorithm works depends in part on the language (among other things). For example, is the language highly inflected? Can you accurately tokenize the language? (Some languages, such as those that lack spaces, are notoriously difficult to tokenize.) — Justin O Barber
Turkish has rich morphology but uses the Latin alphabet and separates words with spaces. — alexis

alexis · Accepted Answer · 2015-12-05T21:26:09

Although a classifier trained for English is unlikely to work on other languages, it sounds like you are using textblob to train a classifier for your text domain. Nothing rules out using data from another language, so the real question is whether you are getting acceptable performance. The first thing you should do is test your classifier on a few hundred new sentences (not the ones you trained it on!). If you're happy, that's the end of the story. If not, read on.
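A minimal sketch of such a held-out test, using NLTK directly. The Turkish sentences, the labels, and the `word_features` helper are made up for illustration; a real check needs a few hundred labeled sentences rather than eight.

```python
import random

import nltk


def word_features(sentence):
    """Bag-of-words features: one boolean feature per whitespace token.

    Note: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı").
    """
    return {f"contains({tok})": True for tok in sentence.lower().split()}


# Toy, invented examples; far too few for a meaningful accuracy estimate.
labeled = [
    ("bu film çok güzel", "pos"),
    ("harika bir gün", "pos"),
    ("yemek çok lezzetli", "pos"),
    ("bunu çok sevdim", "pos"),
    ("bu film çok kötü", "neg"),
    ("berbat bir gün", "neg"),
    ("yemek hiç güzel değil", "neg"),
    ("bunu hiç sevmedim", "neg"),
]

# Shuffle, then keep the last quarter aside; the classifier never sees it.
random.seed(0)
random.shuffle(labeled)
split = int(0.75 * len(labeled))
train_sents, test_sents = labeled[:split], labeled[split:]

train_set = [(word_features(s), label) for s, label in train_sents]
test_set = [(word_features(s), label) for s, label in test_sents]

classifier = nltk.NaiveBayesClassifier.train(train_set)
held_out_accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

textblob classifiers expose the same idea through an `accuracy(test_set)` method, so the split is the part that matters, not the library.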

What makes or breaks any classifier is the selection of features to train it with. NLTK's classifiers require a "feature extraction" function that converts a sentence into a dictionary of features. According to its tutorial, textblob provides some kind of "bag of words" feature function by default. Presumably that's the one you're using, but you can easily plug in your own feature function.
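As a rough illustration, a feature function is just a plain function from a sentence to a dictionary. The name `pair_features` and the choice of features (single words plus adjacent word pairs) are assumptions for this sketch, not something textblob or NLTK prescribes.

```python
def pair_features(document):
    """Map a sentence to a feature dict: single words plus adjacent word pairs.

    Hypothetical feature set, chosen only to show the shape of the function.
    """
    tokens = document.lower().split()
    features = {}
    for tok in tokens:
        features[f"contains({tok})"] = True
    # Adjacent pairs keep a little word-order information that a pure
    # bag of words throws away.
    for left, right in zip(tokens, tokens[1:]):
        features[f"pair({left} {right})"] = True
    return features


print(pair_features("bu film çok güzel"))
```

With textblob, a function like this can be passed as `NaiveBayesClassifier(train, feature_extractor=pair_features)`. With NLTK, you apply it to each sentence yourself and train on the resulting `(features, label)` pairs.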

This is where language-specific resources come in: Many classifiers use a "stopword list" to discard common words like "and" and "the". Obviously, this list must be language-specific. And as @JustinBarber wrote in a comment, languages with lots of morphology (like Turkish) have more word forms, which may limit the effectiveness of word-based classification. You may see improvement if you "stem" or lemmatize your words; both procedures transform different inflected word forms to a common form.
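A toy sketch of both ideas inside a feature function. The stopword set and the suffix list below are tiny, hand-picked assumptions, and `crude_stem` only strips the plural endings "-lar"/"-ler"; a real Turkish stemmer or morphological analyzer would take its place.

```python
# Hand-picked illustration only, not a real Turkish stopword list.
TOY_STOPWORDS = {"ve", "bir", "bu", "da", "de", "ile", "için"}

# Plural markers only; Turkish has many more suffixes than this.
TOY_SUFFIXES = ("lar", "ler")


def crude_stem(token, min_stem=2):
    """Strip one known suffix if enough of the word remains (toy stemmer)."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def normalized_features(document):
    """Bag-of-words features after stopword removal and crude stemming."""
    tokens = [t for t in document.lower().split() if t not in TOY_STOPWORDS]
    return {f"contains({crude_stem(t)})": True for t in tokens}


# "kitap" and "kitaplar" now produce the same feature.
print(normalized_features("bu kitaplar ve evler"))
print(normalized_features("bir kitap"))
```

The point of the normalization is that inflected forms share one feature, so the classifier sees more evidence per feature from the same amount of training data.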

Going further afield, you didn't say what your classifier is for, but you could write custom recognizers for some text properties and plug their output in as features. E.g., if you're doing sentiment analysis, some languages (including English) have grammatical constructions that indicate high emotion.
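One way this might look, assuming a sentiment task. The recognizers here are simple surface cues (exclamation marks, stretched-out letters, all-caps words) rather than true grammatical constructions, and all names are hypothetical; they only show how extra recognizers merge into the feature dictionary.

```python
import re


def emphasis_features(document):
    """Hypothetical surface-level recognizers for emphatic text."""
    return {
        "has_exclamation": "!" in document,
        # Same letter three or more times in a row, as in "çoook".
        "has_elongation": bool(re.search(r"(\w)\1{2,}", document)),
        "has_all_caps_word": any(
            w.isupper() and len(w) > 1 for w in document.split()
        ),
    }


def combined_features(document):
    """Word-presence features plus the custom recognizers above."""
    features = {f"contains({t})": True for t in document.lower().split()}
    features.update(emphasis_features(document))
    return features


print(combined_features("çoook güzel!"))
```

Whether such cues help is an empirical question; comparing held-out accuracy with and without them is the way to find out.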

For more, read a few chapters of the NLTK book, especially the chapter on classification.
