I have a dataset of social media posts that looks like the example below, but it's in Farsi, and I could not find a readily available R/Python sentiment analysis package for it.

post/tweet
"we are tired of this regime and need to make a change happen now"

Ideally, I want to classify each statement as having a negative, positive, or neutral sentiment. To do that, I created a small dictionary that classifies words as either negative or positive.

library(tidyverse)
library(stringr)
library(readxl)
#install.packages("tidyverse")
#install.packages("stringr")
#install.packages("readxl")

I then label each post with one of three categories: positive, negative, or neutral.

Raw_data_on_posts %>%
  mutate(p_count = str_count(post, str_c(Dictionary$positive, collapse = '|')),
         n_count = str_count(post, str_c(Dictionary$negative, collapse = '|'))) %>%
  mutate(label = case_when(p_count > n_count ~ 'positive',
                           p_count < n_count ~ 'negative',
                           TRUE ~ 'neutral')) %>%
  select(post, label)

Most statements ended up labelled as neutral, although from my own reading the posts are almost all either for or against the Iranian regime. I believe this happens because words that I classified as neither negative nor positive are being treated as neutral. Is it possible to instead compare only whether a statement has more negative or more positive words?

post/tweet                                                            sentiment
"we are tired of this regime and need to make a change happen now"    neutral

Furthermore, I wonder whether it would make more sense to use binary rather than multi-class sentiment classification in this case.

1 Answer

Your code doesn't appear to account for neutral words at any point, so they are unlikely to be making a difference. Tagging posts that contain no words registering as positive or negative may reveal a problem with your data or with your positive/negative word lists. See whether this edit to your code flags any posts that aren't recognised:

mutate(label = case_when(p_count > n_count ~ 'positive',
                         p_count < n_count ~ 'negative',
                         n_count == 0 & p_count == 0 ~ 'Not recognised',
                         TRUE ~ 'neutral'))
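
To see what that extra branch does, here is a minimal self-contained sketch on made-up data. The English posts and the two word lists are hypothetical stand-ins for your Farsi data and your own dictionary; only the object and column names (Raw_data_on_posts, Dictionary, post) are taken from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in posts (English placeholders for the Farsi text)
Raw_data_on_posts <- tibble(
  post = c("we are tired of this regime and need to make a change happen now",
           "what a good and hopeful day",
           "this is a bad but hopeful moment",
           "the meeting is on tuesday")
)

# Hypothetical dictionary; the real one would hold your Farsi terms
Dictionary <- list(positive = c("good", "hopeful"),
                   negative = c("tired", "bad"))

labelled <- Raw_data_on_posts %>%
  mutate(p_count = str_count(post, str_c(Dictionary$positive, collapse = "|")),
         n_count = str_count(post, str_c(Dictionary$negative, collapse = "|"))) %>%
  mutate(label = case_when(p_count > n_count ~ "positive",
                           p_count < n_count ~ "negative",
                           # no dictionary word matched at all
                           p_count == 0 & n_count == 0 ~ "Not recognised",
                           # a genuine tie: equal, non-zero counts
                           TRUE ~ "neutral"))

# How many posts fall into each label
count(labelled, label)
```

If most of your real posts land in 'Not recognised' rather than 'neutral', the likely cause is that the dictionary terms are not matching the text at all, not that positive and negative words are balancing out.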