I'm facing a machine learning problem. Basically, I'm trying to classify some text into categories (labels), so this is a supervised classification problem. I have training data, with texts and their corresponding labels. Using a bag-of-words method, I've managed to transform each text into a list of its most frequently occurring words, just like in this image: [image: bag of words]

As you can see, the lists have different sizes, because some of the input texts are very short.

So now I have a training data frame with these lists of words and their corresponding labels. However, I'm quite confused about how to proceed from here to implement my machine learning algorithm. How should I transform the lists so that I can feed them to a classifier?

I've looked at one-hot encoding, but the problems here are:

  • the lists have different sizes, and each word can appear at any position inside its list
  • how to encode one list so that it contains 0s for the words that only appear in another list

Example:

INPUT:

L1 = ['cat', 'dog', 'home', 'house']
L2 = ['fish', 'cat', 'dog']

OUTPUT:

Vector1 = [1, 1, 1, 1, 0]
Vector2 = [1, 1, 0, 0, 1]
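
To make the target encoding concrete, here is a minimal pure-Python sketch of what I have in mind. The helper names build_vocabulary and encode are hypothetical (my own), and I'm assuming the column order simply follows the first appearance of each word across the lists.

```python
# Word lists from the example above
L1 = ['cat', 'dog', 'home', 'house']
L2 = ['fish', 'cat', 'dog']

def build_vocabulary(word_lists):
    # Hypothetical helper: collect every distinct word, in order of first appearance
    vocabulary = []
    for words in word_lists:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def encode(words, vocabulary):
    # Hypothetical helper: 1 if the vocabulary word occurs in the list, else 0
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary([L1, L2])
Vector1 = encode(L1, vocabulary)
Vector2 = encode(L2, vocabulary)

print(vocabulary)
print(Vector1)
print(Vector2)
```

Every vector has the length of the vocabulary, regardless of how long the original list was or where a word sat inside it.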

Also, just from this example, I imagine that even if I did that, the resulting vectors might be very large.

I hope this makes sense; I'm quite new to machine learning. I'm not even sure the bag-of-words method I've used is really helping, so don't hesitate to tell me if you think I'm going in the wrong direction.

I'm using pandas and scikit-learn, and this is the first time I've had to deal with a text classification problem.

Thanks for your help.

What you have in your example looks pretty much like what you want. The problem with BOW models is they are extremely sparse. You may have a vector of size 50,000 to represent a sentence with only 5 words in it. That is one of the motivations for word2vec - dantiston
Have a look at MultiLabelBinarizer and CountVectorizer from scikit-learn. Here is my answer describing the use case you want to apply: stackoverflow.com/a/42392689/3374996 - Vivek Kumar
Indeed, your example describes exactly what I wanted to do. I've now managed to turn my lists of words into vectors, and I will try to work with these new vectors. Thank you. The total number of features is not that big, so I guess my computer will be able to process it. - SL2017

1 Answer

I would suggest using NLTK and specifically nltk.classify.naivebayes. Take a look at the example here: http://www.nltk.org/book_1ed/ch06.html. You will need to build a feature extractor. I would do something like the following (untested) code:

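from nltk.classify import NaiveBayesClassifier

# Word lists from the question
L1 = ['cat', 'dog', 'home', 'house']
L2 = ['fish', 'cat', 'dog']

def word_feats(words):
    # Feature extractor: map each lowercased word to True ("word is present")
    return {word.lower(): True for word in words}

train_data = [(word_feats(L1), 'label1'), (word_feats(L2), 'label2')]

classifier = NaiveBayesClassifier.train(train_data)

test_data = ["foo"]

# classify() expects a feature dict, so run the test words through word_feats too
print(classifier.classify(word_feats(test_data)))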
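
Since the question mentions scikit-learn, the same idea could also be sketched there. This is only an illustration under assumptions: it uses MultiLabelBinarizer (suggested in the comments) to turn the word lists into 0/1 vectors, and BernoulliNB as the classifier, which is my own pick and not something from the question. The labels 'label1' and 'label2' and the new word list ['fish', 'dog'] are placeholders.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Word lists from the question, with placeholder labels (assumption)
train_words = [['cat', 'dog', 'home', 'house'], ['fish', 'cat', 'dog']]
train_labels = ['label1', 'label2']

# Learn the vocabulary and encode each list as a fixed-length 0/1 vector
binarizer = MultiLabelBinarizer()
X_train = binarizer.fit_transform(train_words)

# Naive Bayes variant for binary (present/absent) features
classifier = BernoulliNB()
classifier.fit(X_train, train_labels)

# Encode a new list with the SAME binarizer, then predict its label
X_new = binarizer.transform([['fish', 'dog']])
predicted = classifier.predict(X_new)

print(binarizer.classes_)  # the vocabulary, i.e. the column order of the vectors
print(X_train)
print(predicted)
```

The important design point is that the binarizer is fitted once on the training lists and reused for new data, so every vector shares the same columns. Words that were never seen during fitting have no column; scikit-learn warns about them and ignores them.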