I am working on a project to classify presidential debate tweets as neutral, positive, or negative for each candidate (not the current debate dataset). I am training a decision tree, a decision tree ensemble, and AdaBoost. The problem is that I am getting an accuracy of 100%, which is extremely suspicious and should not be possible.

The data is in a bag-of-words representation: each word in the vocabulary is encoded as 0 or 1, depending on whether it appears in a given tweet. I have included the data stats at the end of the question. df_Obama is a DataFrame with all the tweets about Obama.

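import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the bag-of-words matrix; the first CSV column is used as the index
df_Obama = pd.read_csv("../data/Obama_BagOfWords.csv", index_col=0)
# Shuffle the rows, then reset the index
df_Obama = df_Obama.reindex(np.random.permutation(df_Obama.index)).reset_index()
# allAttribs_Obama (the list of feature column names) is defined elsewhere and not shown
dataFeatures = df_Obama[allAttribs_Obama]
targetVar = list(df_Obama['Class'])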

splitRatio = 0.9
splitPoint = int(splitRatio*len(dataFeatures))
dataFeatures_train = dataFeatures[:splitPoint]
dataFeatures_test = dataFeatures[splitPoint:]

targetVar_train = targetVar[:splitPoint]
targetVar_test = targetVar[splitPoint:]

clfObj = tree.DecisionTreeClassifier()
clfObj.fit(dataFeatures_train,targetVar_train)
preds = list(clfObj.predict(dataFeatures_test))
accScore = accuracy_score(targetVar_test,preds)
labels = [1,-1,0]

precision = precision_score(targetVar_test,preds,average=None,labels=labels)
recall = recall_score(targetVar_test,preds,average=None,labels=labels)
f1Score = f1_score(targetVar_test,preds,average=None,labels=labels)

print("Overall Accuracy",accScore)
print("precision",precision)
print("recall",recall)
print("f1Score",f1Score)

Overall Accuracy 1.0
precision [ 1.  1.  1.]
recall [ 1.  1.  1.]
f1Score [ 1.  1.  1.]

I just cannot figure out why this is the case. Is there a reason the metrics are so high? I also tried different train-test split ratios, and the result is no different.

Note: Here is the data info:

df_Obama.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5465 entries, 0 to 5464
Columns: 13078 entries, level_0 to zzzzzzzzzz
dtypes: int64(13078)
memory usage: 545.3 MB

df_Obama.head(3)
0023Washington  08hayabusa  09Its   .... 09what 1000000th   
0               1           0            1       0
1               0           0            0       0
0               0           0            0       0

1 Answer

Is it possible that the classifier can see the target value? Is df_Obama['Class'] included in the array of features? It is not clear because you do not show the value of allAttribs_Obama.
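One way to check for this kind of leak is to scan the feature list for the label column and for index-like columns. The sketch below uses a toy DataFrame in place of df_Obama; the word columns and the all_attribs list are made up for illustration, since the real allAttribs_Obama is not shown. The names "index" and "level_0" are what reset_index() adds to the frame, and whether they actually leak the label depends on how the CSV rows were ordered, so treat them as suspects rather than proven culprits.

```python
import pandas as pd

# Toy stand-in for df_Obama; the word columns are hypothetical
df = pd.DataFrame({
    "level_0": [0, 1, 2, 3],
    "hope":    [1, 0, 1, 0],
    "tax":     [0, 1, 0, 0],
    "Class":   [1, -1, 1, 0],
})

# Hypothetical feature list that accidentally includes every column
all_attribs = list(df.columns)

target_col = "Class"
index_like = {"index", "level_0"}  # column names produced by reset_index()

# Flag the label itself and any index-like column in the feature list
suspects = [c for c in all_attribs if c == target_col or c in index_like]
print(suspects)

# Feature list with the suspects removed
safe_attribs = [c for c in all_attribs if c not in suspects]
print(safe_attribs)
```

If suspects is non-empty for your real feature list, retrain on the filtered list and see whether the accuracy drops to a more believable value.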

Also check out the documentation for decision trees on scikit-learn, specifically:

"Decision trees tend to overfit on data with a large number of features."

You might want to try reducing your feature space (check out scikit-learn's documentation on feature selection).
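As a rough sketch of what reducing the feature space could look like, the example below keeps only the highest-scoring columns with SelectKBest and the chi-squared test before fitting the tree. The data is synthetic, and k=50 is an arbitrary choice for illustration, not a recommended value; chi2 is used here because it works with non-negative features such as 0/1 word indicators.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500))  # synthetic 0/1 bag-of-words matrix
y = rng.choice([-1, 0, 1], size=200)     # synthetic class labels

# Select the 50 best columns by chi-squared score, then fit the tree on them
model = make_pipeline(
    SelectKBest(chi2, k=50),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# Inspect the reduced feature matrix produced by the selection step
X_reduced = model.named_steps["selectkbest"].transform(X)
print(X_reduced.shape)  # (200, 50)
```

Putting the selector inside the pipeline means it is fitted only on whatever data is passed to fit, so the test set does not influence which features are kept.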

On a side note, you can use sklearn.model_selection.train_test_split to create training and testing sets.
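A minimal sketch of that call on synthetic data follows. test_size=0.1 mirrors the 0.9 split ratio in the question; stratify and random_state are optional extras added here to keep class proportions similar in both sets and to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # synthetic feature matrix, 100 rows
y = np.array([1, -1, 0, 1] * 25)    # synthetic labels for three classes

# Shuffle and split in one call, holding out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

This replaces the manual permutation and slicing, and it splits the features and the labels together so they cannot get out of step.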