2
votes

I am working on a project to predict a user's personality from their tweets.

For training, I have a large corpus of 350000 tweets from users who have already taken the personality test, with each tweet linked to a specific personality type. There are 16 different personality types (1-16).

I have pre-processed these tweets with stop-word removal, stemming and POS tagging.

I have a dictionary of the 500 most frequent words, which I use as my features for training. I then performed TF-IDF vectorization on each tweet using this predefined dictionary of 500 words to create a word vector for each tweet.

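from sklearn.feature_extraction.text import TfidfVectorizer

# Restrict the features to the predefined 500-word vocabulary
vectorizer = TfidfVectorizer(vocabulary=mydict, min_df=1)
x = vectorizer.fit_transform(corpus).toarray()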

Here corpus is a list of all the tweets. I then combine x and y (the class, 1-16, for each tweet) using:

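import numpy as np
import pandas as pd

# y must be a column of shape (n_samples, 1) for axis=1 to work
result = np.append(x, y, axis=1)
X = pd.DataFrame(result)
X.to_csv('vectorized500.csv')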

I am using this (350000*500) dataframe as X and my personality types, numbered 1-16, as my Y dataframe (350000*1). The data is divided into training and testing equally using:

import pandas as pd

X = pd.read_csv('vectorized500.csv')
# Randomly sample rows for training; the remaining rows form the test set
train = X.sample(frac=0.8, random_state=200)
test = X.drop(train.index)
y_train = train["501"]  # 501 is the column name where Y is in the csv file
y_test = test["501"]
xtrain = train.drop("501", axis=1)
xtest = test.drop("501", axis=1)

However, no matter what algorithm I run, I'm getting very poor results:

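import pickle
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(xtrain, y_train)
with open('rf1000.sav', 'wb') as f:
    pickle.dump(model, f)
print(model.score(xtest, y_test))  # mean accuracy on the test set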

If I run RandomForestClassifier, I get 52% accuracy.

If I run Naive Bayes, Logistic Regression or Linear SVM, I get below 20% accuracy.

Is there any efficient way to run this kind of multiclass text classification or is there something I am doing wrong? The accuracy is too low and I want to improve it.

How many examples from each class do you have? - Giorgos Myrianthous
Haven't counted that, but I'm sure there are at least 15000 from each class. I'll compute it and get back to you. - Rohil
Also, it would be better to use more examples for training than for testing. You mentioned that the training and testing data points are equally divided. Try 80:20, 70:30, etc. (training:testing ratio). Furthermore, have you tried tuning the parameters for each algorithm? - Giorgos Myrianthous
Number of samples for each class: 0. 56887 INFP, 1. 54607 INFJ, 2. 52511 INTJ, 3. 52028 ENFP, 4. 24294 INTP, 5. 19032 ENTJ, 6. 14284 ENFJ, 7. 12502 ISFJ, 8. 12268 ISTP, 9. 10713 ISTJ, 10. 10523 ESFP, 11. 8103 ESTP, 12. 7436 ESFJ, 13. 7016 ESTJ, 14. 6725 ISFP. - Rohil
Since the four factors of Myers-Briggs are meant to be independent, why not train a classifier for each, and then give the combined result? I.e. train a classifier for Introvert vs Extravert, another for Intuitive vs Sensing, and so on. - Nameless One
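The per-axis idea in the last comment can be sketched as a label transformation. This is only an illustration: the helper names `type_to_axis_labels` and `axis_labels_to_type` and the `AXES` constant are hypothetical, and it assumes the labels are available as four-letter type strings rather than the numbers 1-16.

```python
# Hypothetical helpers (not from the original post): split a four-letter type
# into one binary label per axis, and rebuild the type from four predictions.
AXES = (("I", "E"), ("N", "S"), ("F", "T"), ("P", "J"))


def type_to_axis_labels(mbti_type):
    """Return four 0/1 labels; 0 means the first letter of the AXES pair."""
    if len(mbti_type) != 4:
        raise ValueError("expected a four-letter type, got %r" % (mbti_type,))
    labels = []
    for letter, (first, second) in zip(mbti_type.upper(), AXES):
        if letter == first:
            labels.append(0)
        elif letter == second:
            labels.append(1)
        else:
            raise ValueError("unexpected letter %r in %r" % (letter, mbti_type))
    return labels


def axis_labels_to_type(labels):
    """Combine four per-axis 0/1 predictions back into a type string."""
    return "".join(AXES[i][label] for i, label in enumerate(labels))


example = type_to_axis_labels("INTJ")
rebuilt = axis_labels_to_type(example)
```

Each of the four label columns could then be used as the target of its own binary classifier, with the four predictions recombined by `axis_labels_to_type`. Whether this helps on this dataset is an open question.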

1 Answer

4
votes

The problem might be the imbalanced dataset you are using.

 0. 56887 INFP
 1. 54607 INFJ
 2. 52511 INTJ
 3. 52028 ENFP
 4. 24294 INTP
 5. 19032 ENTJ
 6. 14284 ENFJ
 7. 12502 ISFJ
 8. 12268 ISTP
 9. 10713 ISTJ
10. 10523 ESFP
11. 8103 ESTP
12. 7436 ESFJ
13. 7016 ESTJ
14. 6725 ISFP

Imbalanced data refers to a problem where the classes are not equally represented. There are many techniques for dealing with this:
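To get a feel for how skewed the distribution is, the counts posted in the comments can be summarized in a few lines. This is a minimal sketch; the variable names are illustrative, and the numbers are copied from the list above.

```python
# Class counts as posted by the asker (15 classes were listed).
counts = {
    "INFP": 56887, "INFJ": 54607, "INTJ": 52511, "ENFP": 52028,
    "INTP": 24294, "ENTJ": 19032, "ENFJ": 14284, "ISFJ": 12502,
    "ISTP": 12268, "ISTJ": 10713, "ESFP": 10523, "ESTP": 8103,
    "ESFJ": 7436, "ESTJ": 7016, "ISFP": 6725,
}

total = sum(counts.values())
largest = max(counts.values())
smallest = min(counts.values())

# Accuracy of a trivial model that always predicts the most frequent class,
# assuming the test set follows the same distribution.
majority_share = largest / total

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = largest / smallest

print("total samples:", total)
print("majority-class baseline accuracy: %.3f" % majority_share)
print("largest/smallest class ratio: %.2f" % imbalance_ratio)
```

With these counts, always predicting the largest class already gives about 16% accuracy, so a score below 20% is close to that trivial baseline.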

  1. Collect more data

    If possible, try to collect more data for the classes with few examples.

  2. Use other performance metrics

    Accuracy is not a reliable metric when your dataset is imbalanced. Imagine that you have two classes (0 and 1) where 99 examples belong to class 0 and just 1 example to class 1. If you build a model that always assigns class 0 to every testing point, you will end up with 99% accuracy, but obviously this is not what you want. Some useful metrics other than accuracy are the following:

    • Precision/Recall/F-score (extracted from a confusion matrix)
    • ROC curves

  3. Undersampling

    Try to discard examples from your most popular classes, so that all the classes have approximately the same number of examples. Throwing data away is often not a good idea, so treat undersampling as a last resort.
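Two of the points above can be made concrete with a small, dependency-free sketch. The first part reproduces the 99-to-1 example from point 2; the second shows plain random undersampling as described in point 3. The function names `precision_recall_f1` and `undersample` are hypothetical, and treating an undefined precision (no positive predictions) as 0.0 is a convention chosen here, not something the answer specifies.

```python
import random
from collections import Counter, defaultdict


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class; undefined ratios become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Point 2: 99 examples of class 0, one of class 1, model always predicts 0.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision_1, recall_1, f1_1 = precision_recall_f1(y_true, y_pred, positive=1)
print("accuracy:", accuracy)
print("class 1 precision/recall/F1:", precision_1, recall_1, f1_1)


def undersample(samples, labels, seed=0):
    """Point 3: randomly keep as many items per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_keep = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_keep):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


balanced_x, balanced_y = undersample(list(range(100)), y_true)
print("class counts after undersampling:", Counter(balanced_y))
```

The accuracy looks excellent while recall and F1 for class 1 are zero, which is the mismatch the answer warns about. The undersampling step also shows its cost: in this toy case 98 of the 100 examples are discarded.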