12
votes

Problem Statement:

To classify a text document to which category it belongs and also to classify up to two levels of the category.

Sample Training Set:

Description Category    Level1  Level2
The gun shooting that happened in Vegas killed two  Crime | High    Crime   High
Donald Trump elected as President of America    Politics | High Politics    High
Rian won in football qualifier  Sports | Low    Sports  Low
Brazil won in football final    Sports | High   Sports  High

Initial Attempt:

I tried to create a classifier model which would try to classify the Category using Random forest method and it gave me 90% overall.

Code1:

import pandas as pd
#import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem

from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score

stop = stopwords.words('english')
data_file = "Training_dataset_70k"

#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()

#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace('[^\w\s]',' ').replace('\s+',' ')
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print "Overall_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except:
            continue

Problem:

I want to add two more level to the model they are Level1 and Level2 the reasons for adding them is when I ran classification for Level1 alone I got 96% accuracy. I am stuck at splitting training and test dataset and to train a model which has three classifications.

Is it possible to create a model with three classification or should I create three models? How to split train and test data?

Edit1: import string import codecs import pandas as pd import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score


stop = stopwords.words('english')

data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace('[^\w\s]',' ').replace('\s+',' ')

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
print len(train_data), len(train_label)
print train_label
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print test_data_feature
Output_predict = RF.predict(test_data_feature)
print "BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except:
            continue
1
Could you clarify what are the two levels supposed to be? In the sample training set that you provided, your category is something like "Crime | High", and then your levels are just the first and the second word in the category (so it doesn't provide any new information). Also, just to make sure- the category always consists of two words?Miriam Farber
@MiriamFarber yes category always contains two words separated by pipe. The reason for adding level1 and level2 is I am getting higher accuracy for level1 so it would reduce the downward process even if category is wrong.The6thSense
OK so just to make sure- when you run the model with one target you get 90% sucess if this target is the category column, and you get 96% sucess if this target is the level 1 column, and you want to construct a model where you have 3 targets (which are the three columns corresponding to description, level 1 and level 2), right?Miriam Farber
Slight change description is the input and the target values are category, level1,level2. Other than that everything you said is right.The6thSense
On a side note, I suggest to change your RF parameters (ex: increase the number of estimators)nbeuchat

1 Answers

1
votes

The scikit-learn Random Forest Classifier natively supports multiple outputs (see this example). Therefore, you do not need to create three separate models.

From the documentation of RandomForestClassifier.fit, the inputs to fit functions are:

X : array-like or sparse matrix of shape = [n_samples, n_features]

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Therefore, you need an array y (your labels) of size N x 3 as your input to your RandomForestClassifier. In order to split your training and test set, you can do:

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data[['Category','Level 1','Level 2']], test_size=0.3, random_state=100)

Your train_label and test_label should be arrays of size N x 3 that you can use to fit your model compare your predictions (NB: I have not tested it here, you might need to do some transposes).