
I am trying to apply sentiment analysis (predicting negative and positive tweets) to a relatively large dataset (10,000 rows). So far I have achieved only ~73% accuracy using Naive Bayes and the method called "final" shown below to extract features. I want to add POS tags to help with the classification, but I am unsure how to implement them. I tried writing a simple function called "pos" (also posted below) and using the tags on my cleaned dataset as features, but only got around 52% accuracy that way. Can anyone point me in the right direction for adding POS features to my model? Thank you.

import nltk

def pos(word):
    # Keeps only the tags; note that nltk.pos_tag expects a list of
    # tokens, so calling this on a single word gives it no context.
    return [t for w, t in nltk.pos_tag(word)]


def final(text):

    """
    I have code here to remove URLs, hashtags,
    stopwords, usernames, numerals, and punctuation;
    it produces the token list `clean`.
    """

    # lemmatization (lem is an nltk.stem.WordNetLemmatizer instance)
    finished = []
    for x in clean:
        finished.append(lem.lemmatize(x))

    return finished
In your pos(x), is x an individual word or the whole tweet? POS tagging individual words can be very inaccurate. - 0x5050
I applied it to each word with this line: clean_text = clean_text.apply(pos), where "clean_text" is a tokenized version of all the tweets. How and where should I apply pos then? I apologize; I am completely new to this. @PradipPramanick - jest3r

1 Answer


You should first split the tweets into sentences and then tokenize. NLTK provides a method for this.

   from nltk.tokenize import sent_tokenize
   sents = sent_tokenize(tweet)

After this, word-tokenize each sentence and pass the resulting token list to nltk.pos_tag (the tagger expects tokens, not a raw string). That should give accurate POS tags.