
I have a data frame sent with sentences in sent$words and a dictionary of positive/negative words in the wordsDF data frame (wordsDF[x,1]). Positive words have the value 1 and negative words -1 (wordsDF[x,2]). The words in wordsDF are sorted in decreasing order of string length, so that longer phrases are matched before the shorter words they contain; my function below relies on this ordering.

How this function works:

1) Count the occurrences of each word stored in wordsDF in each sentence.
2) Compute the sentiment score: the number of occurrences of a particular word (from wordsDF) in a particular sentence * the sentiment value of that word (positive = 1, negative = -1); see the worked example below.
3) Remove the matched word from the sentence before the next iteration.
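For example, the first dummy sentence "great just great right size and i love this notebook" contains great twice, right once and love once (all positive), so its score is 2*1 + 1*1 + 1*1 = 4, which matches the desired output further down.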

Original solution using the stringr package:

library(stringr)

scoreSentence_01 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- str_count(sentence, wordsDF[x,1])                 # occurrences of the x-th dictionary word
    score <- score + (count * wordsDF[x,2])                    # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ")   # remove matched word before the next iteration
  }
  score
}

Faster solution - the stri_count()/sapply() pair below replaces the single str_count() call of the original:

library(stringi)   # stri_count()
library(stringr)   # str_replace_all()

scoreSentence_02 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    sd <- function(text) {stri_count(text, regex=wordsDF[x,1])}
    results <- sapply(sentence, sd, USE.NAMES=F)
    score <- score + (results * wordsDF[x,2])                  # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ")   # remove matched word before the next iteration
  }
  score
}
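A minimal sketch of a stringi-only variant I am considering, assuming stri_count_regex() and stri_replace_all_regex() can stand in for the stringr calls above (scoreSentence_03 is just a working name):

library(stringi)

scoreSentence_03 <- function(sentence){
  score <- numeric(length(sentence))
  for(x in 1:nrow(wordsDF)){
    # stri_count_regex is already vectorised over sentence, so no sapply() is needed
    score <- score + stri_count_regex(sentence, wordsDF[x,1]) * wordsDF[x,2]
    # drop the matched word so shorter dictionary entries cannot re-match it
    sentence <- stri_replace_all_regex(sentence, wordsDF[x,1], " ")
  }
  score
}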

The function is called like this:

scoreSentence_Score <- scoreSentence_01(sent$words)

In reality I'm using a data set with 300,000 sentences and a dictionary of positive and negative words - 7,000 words overall. This approach is far too slow for that, and because of my beginner knowledge of R programming I'm at the end of my efforts.

Could anyone help me rewrite this function into a vectorized or parallel solution, please? Any help or advice is very much appreciated. Thank you very much in advance.

Dummy data:

sent <- data.frame(words = c("great just great right size and i love this notebook", "benefits great laptop at the top",
                         "wouldnt bad notebook and very good", "very good quality", "bad orgtop but great",
                         "great improvement for that great improvement bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
                          stringsAsFactors=F)

posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
          "wouldnt bad")

negWords <- c("hate","bad","not good","horrible")

library(zoo)   # for coredata()

# Replicate original data.frame - big data simulation (70,000 rows of sentences)
df.expanded <- as.data.frame(replicate(10000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 10000),]
# pad each sentence with a leading and trailing space so the padded dictionary entries also match at sentence boundaries
sent$words <- paste(c(""), sent$words, c(""), collapse = NULL)
rownames(sent) <- NULL

# Build the dictionary and order words by decreasing length
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors=F)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)
wordsDF <- wordsDF[order(-wordsDF$lengths),]
# pad each dictionary entry with spaces so only whole words/phrases are matched
wordsDF$words <- paste(c(""), wordsDF$words, c(""), collapse = NULL)
rownames(wordsDF) <- NULL

Desired output is:

                                                                        words user scoreSentence_Score
                         great just great right size and i love this notebook    1                   4
                                             benefits great laptop at the top    2                   2
                                           wouldnt bad notebook and very good    3                   2
                                                            very good quality    4                   1
                                                         bad orgtop but great    5                   0
 great improvement for that great improvement bad product but overall is not good    6                   0
                                   notebook is not good but i love batterytop    7                   0

2 Answers


Okay, now that I know you have to handle both phrases and single words... here's another shot at it. Basically, you have to split out the phrases first, score them, remove them from the string, and then score the single words...

library(stringr)
sent <- data.frame(words = c("great just great right size and i love this notebook", "benefits great laptop at the top",
                             "wouldnt bad notebook and very good", "very good quality", "bad orgtop but great",
                             "great improvement for that great improvement bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)

posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
              "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
              "wouldnt bad")

negWords <- c("hate","bad","not good","horrible")
sent$words2 <- sent$words
# split the negative entries into single words and multi-word phrases...
bad_phrases <- negWords[grepl(" ", negWords)]
bad_words <- negWords[!negWords %in% bad_phrases]
bad_words <- paste0("\\b", bad_words, "\\b")
# ...and the same for the positive entries
pos_phrases <- posWords[grepl(" ", posWords)]
pos_words <- posWords[!posWords %in% pos_phrases]
pos_words <- paste0("\\b", pos_words, "\\b")   # word boundaries so e.g. "top" does not match inside "orgtop"
# score negative phrases, then remove them from the working copy
score <- -str_count(sent$words2, paste(bad_phrases, collapse="|"))
sent$words2 <- gsub(paste(bad_phrases, collapse="|"), "", sent$words2)
# score positive phrases, then remove them
score <- score + str_count(sent$words2, paste(pos_phrases, collapse="|"))
sent$words2 <- gsub(paste(pos_phrases, collapse="|"), "", sent$words2)
# finally score the remaining single words
score <- score + str_count(sent$words2, paste(pos_words, collapse="|")) - str_count(sent$words2, paste(bad_words, collapse="|"))
score
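If you then want the result in the shape of the desired output from the question, one way (just a usage sketch) is:

sent$scoreSentence_Score <- score
sent[, c("words", "user", "scoreSentence_Score")]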

Can't you just do:

library("stringr")
scoreSentence_Score <- str_count(sent$words, wordsDF[,1]) - str_count(sent$words, wordsDF[,2])
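Note that this counts raw substring matches, so e.g. "top" is also counted inside "orgtop"; one possible refinement (only a sketch, borrowing the word-boundary idea from the other answer) is to anchor the patterns:

pos_pat <- paste0("\\b(", paste(posWords, collapse="|"), ")\\b")
neg_pat <- paste0("\\b(", paste(negWords, collapse="|"), ")\\b")
scoreSentence_Score <- str_count(sent$words, pos_pat) - str_count(sent$words, neg_pat)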