
For a movie reviews dataset, I'm creating a multinomial Naive Bayes model. In the training dataset, the reviews are labeled per genre. Instead of creating a generic model for the dataset that ignores the genre feature, how do I train a model that takes the genre feature into consideration in addition to the tf-idf scores of the words that occur in the review? Do I need to create one model per genre, or can I incorporate genre into a single model?

Training Dataset Sample:
genre, review, classification
Romantic, The movie was really emotional and touched my heart!, Positive
Action, It was a thrilling movie, Positive
....

Test Data Set:
Genre, review
Action, The movie sucked bigtime. The action sequences didnt fit into the plot very well

2 Answers


From the documentation: "The multinomial distribution normally requires integer feature counts." Categorical variables provided as inputs, especially when encoded as integers, may not improve the predictive capacity of a multinomial Naive Bayes model. So you could either consider a different model type, such as a neural network, or drop the genre column entirely. If, after fitting, the model shows sufficient predictive capability on the text features alone, adding the categorical variable may not even be necessary.

The way I would try this task is by stacking the dummy-encoded categorical values with the text features and feeding the stacked array to an SGD model, along with the target labels. You would then run a grid search to find the optimal hyperparameters.
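A minimal sketch of that stacking approach. The sample rows, the choice of `SGDClassifier` defaults, and the parameter grid are illustrative assumptions, not taken from your dataset:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the training dataset described in the question
df = pd.DataFrame({
    "genre": ["Romantic", "Action", "Action", "Romantic"],
    "review": [
        "The movie was really emotional and touched my heart!",
        "It was a thrilling movie",
        "The plot made no sense at all",
        "A dull and predictable romance",
    ],
    "classification": ["Positive", "Positive", "Negative", "Negative"],
})

# tf-idf features from the review text
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df["review"])

# dummy-encoded genre, stacked next to the text features
X_genre = pd.get_dummies(df["genre"]).to_numpy(dtype=float)
X = hstack([X_text, X_genre]).tocsr()

# grid search over an illustrative SGD hyperparameter grid
grid = GridSearchCV(
    SGDClassifier(random_state=0),
    param_grid={"alpha": [1e-4, 1e-3]},
    cv=2,
)
grid.fit(X, df["classification"])
```

After fitting, `grid.best_estimator_` can be applied to a test matrix built the same way (same fitted vectorizer, same dummy columns).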


Consider treating genre as a categorical variable, probably with dummy encoding (see pd.get_dummies(df['genre'])), and feeding that as well as the tf-idf scores into your model.
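For example, the encoding-and-combining step could look like this (the two sample rows are just a stand-in for your data):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "genre": ["Romantic", "Action"],
    "review": [
        "The movie was really emotional and touched my heart!",
        "It was a thrilling movie",
    ],
})

genre_dummies = pd.get_dummies(df["genre"])  # one indicator column per genre
tfidf_scores = TfidfVectorizer().fit_transform(df["review"])

# combine both feature sets into a single matrix for the classifier
features = hstack([tfidf_scores, genre_dummies.to_numpy(dtype=float)])
```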

Also consider other model types besides Naive Bayes: a neural network models more interaction between variables, which may help capture differences between genres. Scikit-learn has an MLPClassifier implementation that is worth a look.
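Swapping in MLPClassifier is mostly a drop-in change; the random matrix here is just a stand-in for the combined tf-idf + genre features, and the hidden-layer size and iteration count are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the combined tf-idf + genre feature matrix
X = np.random.RandomState(0).rand(8, 10)
y = ["Positive", "Negative"] * 4

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```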