I am using NLTK to classify documents. Each document has exactly one label, and there are 10 types of documents.

For text extraction, I am cleaning the text (punctuation removal, HTML tag removal, lowercasing) and removing nltk.corpus.stopwords as well as my own collection of stopwords.
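
For reference, a minimal sketch of that cleaning step might look like the following; the function name and the custom stopword values are placeholders of my own, not from any existing corpus:

    import re
    import string

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Hypothetical custom stopwords: substitute your own collection
    CUSTOM_STOPWORDS = {"responsibilities", "candidate"}
    ALL_STOPWORDS = set(stopwords.words("english")) | CUSTOM_STOPWORDS

    def clean_text(raw):
        """Remove HTML tags and punctuation, lowercase, tokenize, and drop stopwords."""
        text = re.sub(r"<[^>]+>", " ", raw)  # strip HTML tags
        text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
        tokens = word_tokenize(text.lower())  # lowercase and tokenize
        return [t for t in tokens if t not in ALL_STOPWORDS]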

For my document features, I look across all 50k documents and gather the top 2k words by frequency (frequency_words); then, for each document, I identify which of its words also appear in the global frequency_words.
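
A sketch of that feature-selection step, assuming the 50k documents are already available as (token_list, label) pairs in a list called documents (a name I am assuming here):

    import nltk

    # documents: assumed list of (token_list, label) pairs for all 50k documents
    all_words = nltk.FreqDist(token for tokens, _ in documents for token in tokens)
    frequency_words = [word for word, _ in all_words.most_common(2000)]  # top 2k by frequency

    def document_features(tokens):
        """For each of the global top-2k words, record whether it occurs in this document."""
        token_set = set(tokens)
        return {word: (word in token_set) for word in frequency_words}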

I am then passing each document into nltk.NaiveBayesClassifier(...) as a hashmap of {word: boolean}. I use a 20:80 test:training split of the total number of documents.
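
Putting it together, the split and training might look roughly like this (a sketch only, reusing the documents list and document_features from above):

    import random

    import nltk

    random.shuffle(documents)
    feature_sets = [(document_features(tokens), label) for tokens, label in documents]

    split = int(0.8 * len(feature_sets))  # 80% training, 20% testing
    train_set, test_set = feature_sets[:split], feature_sets[split:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(20)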

The issues I am having:

  1. Is this NLTK classifier suitable for multi-labelled data? All the examples I have seen involve 2-class classification, such as deciding whether something is positive or negative.
  2. The documents should each contain a set of key skills; unfortunately, I don't have a corpus telling me where those skills lie. So I have taken this approach on the understanding that a raw word count per document would not be a good feature extractor. Is this correct? Each document has been written by a different individual, so I need to allow for individual variation between documents. I am aware that scikit-learn's MultinomialNB deals with word counts.
  3. Is there an alternative library I should be using, or a variation of this algorithm?

Thanks!

By multi-labelled data, do you mean that one document can have multiple labels, or do you mean that there are more than two labels? – Hans Then
@HansThen There are more than 2 labels; there are around 10. Each document has only one label. – redrubia
Okay. I did not understand your second question very well. What do you mean by "a word count per document"? Do you mean that you perform some form of dimensionality reduction ("gathering the top 2k words")? – Hans Then
So when extracting features from each document, you can just count the words to create a hash map using nltk.FreqDist, i.e. {'manage': 1029, ...}; this can be passed to the NaiveBayes algorithm for training the classifier. – redrubia
NLTK can classify into multiple categories; just provide a training set with multiple labels. You can probably ignore individual variation between authors and use the FreqDist as you describe. – Hans Then

1 Answer

Terminology: documents are to be classified into 10 different classes, which makes this a multi-class classification problem. If, in addition, you want to assign multiple labels to each document, then it is called multi-class, multi-label classification.

For the issues you are facing:

  1. nltk.NaiveBayesClassifier() is an out-of-the-box multi-class classifier, so yes, you can use it to solve this problem. As for multi-labelled data, if your labels are a, b, c, d, e, f, g, h, i, j, then you would represent the label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.

  2. Feature extraction is the hardest part of classification (machine learning). I recommend you look into different algorithms to understand them and select the one that best suits your data (without looking at your data, it is hard to recommend a specific algorithm/implementation).

  3. There are many different libraries out there for classification. I personally used scikit-learn, and I can say it was a good out-of-the-box classifier; a minimal sketch is included at the end of this answer.

Note: Using scikit-learn, I was able to achieve results within a week, even though the data set was huge and there were other setbacks.
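
If you do try scikit-learn (e.g. the MultinomialNB mentioned in the question, which works on word counts), a minimal sketch might look like this; texts and labels are placeholder names I am assuming for your raw documents and their 10 class labels:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    # texts: raw document strings; labels: one of the 10 class names per document (assumed to exist)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)

    vectorizer = CountVectorizer(stop_words="english", max_features=2000)  # word counts, top 2k terms
    X_train_counts = vectorizer.fit_transform(X_train)
    X_test_counts = vectorizer.transform(X_test)

    clf = MultinomialNB().fit(X_train_counts, y_train)  # handles all 10 classes directly
    print(classification_report(y_test, clf.predict(X_test_counts)))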