I am working on a project to predict a user's personality from their tweets.
For training, I have a large corpus of 350,000 tweets from users who have already taken the personality test, each tweet being linked to a specific personality type. There are 16 different personality types (1-16).
I have pre-processed these tweets by removing stop words, stemming, and POS tagging.
I have a dictionary of the 500 most frequent words that I use as my features for training. I then performed TF-IDF vectorization on each tweet, using this predefined dictionary of 500 words, to create a word vector for each tweet:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary=mydict, min_df=1)
x = vectorizer.fit_transform(corpus).toarray()  # dense array, one row per tweet
Where corpus is a list of all the tweets.
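To illustrate what vectorizing against a fixed vocabulary produces, here is a minimal sketch on made-up data. The three-word vocabulary and the toy tweets are assumptions for illustration only, not the real dictionary or corpus. It shows the output shape and counts how many rows end up all zero because the tweet contains no vocabulary word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the real 500-word dictionary and tweet corpus.
vocab = ["coffee", "love", "monday"]
toy_corpus = [
    "love coffee",
    "monday monday coffee",
    "nothing in the vocabulary here",
]

vectorizer = TfidfVectorizer(vocabulary=vocab)
x_sparse = vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix

# One row per tweet, one column per vocabulary word.
print(x_sparse.shape)  # (3, 3)

# Rows that are entirely zero carry no information for a classifier.
dense = x_sparse.toarray()
empty_rows = int((dense.sum(axis=1) == 0).sum())
print(empty_rows)  # 1
```

Running the same zero-row count on the real matrix would show how many tweets are left with no features at all after restricting to 500 words.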
I then combine x and y (the class label, 1-16, for each tweet) using:
import numpy as np
import pandas as pd
result = np.append(x, y, axis=1)  # y must be 2-D, shape (n_tweets, 1)
X = pd.DataFrame(result)
X.to_csv('vectorized500.csv')
I use this 350000 x 500 dataframe as X and the personality types, numbered 1-16, as Y (350000 x 1). The data is split into training and test sets (80/20, given frac=0.8) using:
X = pd.read_csv('vectorized500.csv')
train = X.sample(frac=0.8, random_state=200)
test = X.drop(train.index)
y_train = train["501"]  # 501 is the column name where Y is in the csv file
y_test = test["501"]
xtrain = train.drop("501", axis=1)
xtest = test.drop("501", axis=1)
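To make the save/load round trip and the split concrete, here is a small sketch on made-up data. The 4-row array and the file name toy.csv are assumptions for illustration, not the real data. It prints the column names that come back from read_csv after a default to_csv, which is worth checking before selecting the label column by name, and then applies the same sample/drop split.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the real data: 4 rows, 2 feature columns, 1 label column.
x = np.array([[0.1, 0.0], [0.0, 0.7], [0.3, 0.3], [0.9, 0.0]])
y = np.array([1, 2, 1, 3]).reshape(-1, 1)  # 2-D so it can be appended on axis=1

result = np.append(x, y, axis=1)

path = os.path.join(tempfile.mkdtemp(), "toy.csv")  # hypothetical file name
pd.DataFrame(result).to_csv(path)  # default index=True also writes the row index

df = pd.read_csv(path)
columns = list(df.columns)
print(columns)  # ['Unnamed: 0', '0', '1', '2']

# Same split as in the post: 80% sampled for training, the rest for testing.
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(len(train), len(test))  # 3 1
```

In this toy case the saved row index comes back as an extra 'Unnamed: 0' column, so unless it is dropped (or the file is written with index=False) it stays among the feature columns after the label is removed.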
However, no matter what algorithm I run, I'm getting very poor results:
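import pickle
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(xtrain, y_train)
with open('rf1000.sav', 'wb') as f:  # close the file once the model is saved
    pickle.dump(model, f)
print(model.score(xtest, y_test))  # mean accuracy on the test set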
If I run RandomForestClassifier, I get 52% accuracy.
If I run Naive Bayes, Logistic Regression or Linear SVM, I get below 20% accuracy.
Is there any efficient way to run this kind of multiclass text classification or is there something I am doing wrong? The accuracy is too low and I want to improve it.
Number of samples for each classification:
0. 56887 INFP
1. 54607 INFJ
2. 52511 INTJ
3. 52028 ENFP
4. 24294 INTP
5. 19032 ENTJ
6. 14284 ENFJ
7. 12502 ISFJ
8. 12268 ISTP
9. 10713 ISTJ
10. 10523 ESFP
11. 8103 ESTP
12. 7436 ESFJ
13. 7016 ESTJ
14. 6725 ISFP
- Rohil
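For context on the accuracy figures, the counts above are enough to work out a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent type. The sketch below uses only the numbers listed in this comment; the variable names are illustrative.

```python
# Per-class tweet counts, copied from the list above.
counts = {
    "INFP": 56887, "INFJ": 54607, "INTJ": 52511, "ENFP": 52028,
    "INTP": 24294, "ENTJ": 19032, "ENFJ": 14284, "ISFJ": 12502,
    "ISTP": 12268, "ISTJ": 10713, "ESFP": 10523, "ESTP": 8103,
    "ESFJ": 7436, "ESTJ": 7016, "ISFP": 6725,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline = counts[majority_label] / total

print(total)               # 348929
print(majority_label)      # INFP
print(round(baseline, 3))  # 0.163
```

Always guessing INFP would score about 16% on data with this distribution, so the sub-20% results from the linear models are close to that baseline, and the classes are far from balanced.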