I have been reading more recent posts about sentiment classification (sentiment analysis), such as this.
Taking the IMDB dataset as an example, I get a similar accuracy using Doc2Vec (88%), but a far better result (91%) using a simple TF-IDF vectoriser with tri-grams for feature extraction. I think this is similar to Table 2 in Mikolov's 2015 paper.
I thought this would change with a bigger dataset, so I re-ran my experiment using a split of 1 million training and 1 million test examples from here. Unfortunately, in that case the accuracy with TF-IDF features increased to 93%, while Doc2Vec fell to 85%.
Is this to be expected? Do others also find TF-IDF to be superior to Doc2Vec, even for a large corpus?
My data cleaning is simple:
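from bs4 import BeautifulSoup

def clean_review(review):
    # Strip HTML markup, keeping only the visible text
    temp = BeautifulSoup(review, "lxml").get_text()
    punctuation = """.,?!:;(){}[]"""
    # Pad each punctuation mark with spaces so it becomes its own token
    for char in punctuation:
        temp = temp.replace(char, ' ' + char + ' ')
    # Lowercase and collapse runs of whitespace into single spaces
    words = " ".join(temp.lower().split()) + "\n"
    return words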
And I have tried using 400 and 1200 features for the Doc2Vec model:
model = Doc2Vec(min_count=2, window=10, size=model_feat_size, sample=1e-4, negative=5, workers=cores)
My TF-IDF vectoriser, by contrast, uses a maximum of 40,000 features:
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
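To make the two less obvious settings concrete, here is a minimal pure-Python sketch of what `ngram_range = (1, 3)` and `sublinear_tf = True` do to a single document. The helper names `word_ngrams` and `sublinear_tf` are my own, for illustration only. The sketch assumes the text is already split into tokens, and it leaves out the IDF weighting, the normalisation and the `max_features` cut-off that the real vectoriser also applies.

```python
import math
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def sublinear_tf(tokens, min_n=1, max_n=3):
    """Map each n-gram to 1 + ln(count), i.e. sublinear term-frequency scaling."""
    counts = Counter(word_ngrams(tokens, min_n, max_n))
    return {gram: 1.0 + math.log(c) for gram, c in counts.items()}

# Hypothetical example document, already tokenised on whitespace
tokens = "not good not good at all".split()
weights = sublinear_tf(tokens)

print(len(weights))         # number of distinct uni-, bi- and tri-grams
print(weights["not good"])  # seen twice, so weighted 1 + ln(2) rather than 2
print(weights["at all"])    # seen once, so weighted 1.0
```

The point of the sketch is that a bigram such as "not good" becomes a feature in its own right, so a linear classifier can weight it directly, while the log scaling stops a phrase repeated many times in one review from dominating that review's feature vector.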
For classification I experimented with a few linear methods, but found that simple logistic regression does OK.