
I have a data frame of words (the tweets have been tokenised), the number of uses of each word, the sentiment score attached to it, and the total score (n * value). I have created another data frame of all the words in my corpus that follow a negation (I made bigrams and filtered for word_1 being a negation word).

I want to subtract the number of negated occurrences from the counts in the original data frame, so that it shows the net count for each word.

library(tidyverse)
library(tidyr)
library(tidytext)
tweets <- read_csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv")

custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"), 
      lexicon = c("custom")), stop_words)


tweet_tokens <- tweets %>% 
  select(user_id, user_key, text, created_str) %>% 
  na.omit() %>% 
  mutate(row= row_number()) %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!word %in% custom_stop_words$word)

sentiment <- tweet_tokens %>% 
  count(word, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = n * value)
#df showing each word's contribution to overall sentiment

negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't") 

bigrams <- tweets %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams. 

bigrams_separated <- bigrams %>% 
  separate(bigram, c("word_1", "word_2"), sep = " ")

not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
  count(word_2, value, sort = TRUE) %>% 
  mutate(value = value * -1) %>% 
  mutate(contribution = value * n)

I would like the outcome to be one data frame. So if sentiment shows that 'matter' appears 696 times, but the not_words data frame shows it was preceded by a negation 274 times, then the new data frame should show an n value of 422 for 'matter'.
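For illustration, the naive count-based version of what I'm after would be something like this (a rough, untested sketch; negated_counts, negated_n and net_sentiment are just names I've made up):

negated_counts <- not_words %>% 
  group_by(word_2) %>% 
  summarise(negated_n = sum(n)) %>%   #how often each word follows a negation
  rename(word = word_2)

net_sentiment <- sentiment %>% 
  left_join(negated_counts, by = "word") %>% 
  mutate(negated_n = replace_na(negated_n, 0),   #words never negated get 0
         net_n = n - negated_n,                  #e.g. 696 - 274 = 422 for 'matter'
         net_total = net_n * value)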

Please update your code with the libraries you are using. – Manuel F
Have done @ManuelF. – alex_stephenson
Sorry! I keep getting: Using 'to_lower = TRUE' with 'token = 'tweets'' may not preserve URLs. Error: Evaluation error: object 'custom_stop_words' not found. when running tweet_tokens <- tweets %>% .... – Manuel F
Ah, that's my fault, I'll edit my original post. – alex_stephenson

1 Answer


(Without really knowing the specifics) I think you did a good job massaging the tweet_tokens and not_words datasets. Nevertheless, you'll have to modify them slightly for them to work as you (probably?) want.

  1. Comment out the mutate(row = ... line in your tweet_tokens <- ... pipeline, as it will give you trouble later when matching rows between the two data frames. Also re-run your sentiment <- ... data frame afterwards, just to be on the safe side.
tweet_tokens <- tweets %>% 
   select(user_id, user_key, text, created_str) %>% 
   na.omit() %>% 
   #mutate(row= row_number()) %>% 
   unnest_tokens(word, text, token = "tweets") %>% 
   filter(!word %in% custom_stop_words$word)
  2. Cut the last three lines of your not_words <- ... pipeline, since that count(... summary collapses the individual occurrences and you'd no longer be able to match them against tweet_tokens. The select(user_id, user_key, created_str, word = word_2) line gives you a data frame with the same layout as your tweet_tokens data frame. Note also how the "word_2" column is now called "word" (in the new not_words data frame).
not_words <- bigrams_separated %>%
   filter(word_1 %in% negation_words) %>%
   inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
   select(user_id,user_key,created_str,word = word_2)
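
With both data frames now sharing the same four columns, the whole-corpus version of the fix is a single anti-join. Roughly (net_tokens and net_sentiment are just illustrative names; the 'matter' walkthrough below shows what this does in detail):

#drop every tokenised occurrence that also appears in not_words
net_tokens <- anti_join(tweet_tokens, not_words,
                        by = c("user_id", "user_key", "created_str", "word"))

#re-count and re-score to get the net sentiment table
net_sentiment <- net_tokens %>% 
   count(word, sort = TRUE) %>% 
   inner_join(get_sentiments("afinn"), by = "word") %>% 
   mutate(total_score = n * value)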

Now, for your particular example, when filtering for the word "matter" (in tweet_tokens) we indeed get a data frame of 696 rows...

> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696   4

and when filtering for the word "matter" (in not_words) we end up with a data frame of 274 rows.

> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274   4

So if we just subtracted matter_not from matter_tweet we would have those 422 rows you're looking for.
Well... not so fast... and strictly speaking I'm also sure that's not what you really want.

  • The simple and accurate answer is:
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
  • Now allow me to explain why it is that you end up with 429 rows when you asked for 422.
> #-not taking into account NAs in the 'user_id' column (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256   4
> #-the above dataframe also contains duplicates, which we (have to?) get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250   4

#-you'd be tempted to say that 696-250=446 are the rows you'd want now;
#-...which is not true, as some of the 250 rows from 'matter' are also duplicated in
#-...'matter_tweet', but that should not worry you. You can delete them later... if that's what you want.

> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)

> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267   4

These 267 rows are the ones you want to get rid of! Hence you're looking for a dataframe of 696 - 267 = 429 rows!
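
(If the jump from 250 rows in 'mm' to 267 matches seems odd, here is a minimal illustration of how an X[Y, on = ...] join in data.table repeats rows when X contains duplicates; X and Y are just toy tables:)

library(data.table)
X <- data.table(id = c(1, 1, 2, 3))   #a duplicated row, like in 'matter_tweet'
Y <- data.table(id = c(1, 2))         #a deduplicated lookup, like 'mm'
X[Y, on = "id"]    #3 rows: Y's single '1' matches both duplicates in X
X[!Y, on = "id"]   #1 row: only id 3 survives the anti-join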

> #-the above implies that there are indeed duplicates... but this doesn't mean that all 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
          user_id       user_key         created_str   word
  1: 1.518857e+09   nojonathonno 2016-11-08 10:36:14 matter
  2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
  3: 1.617939e+09      paulinett 2017-01-14 16:33:38 matter
  4: 1.617939e+09      paulinett 2017-03-05 18:16:48 matter
  5: 1.617939e+09      paulinett 2017-04-03 03:21:34 matter
 ---                                                       
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09    blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09  trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17  margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE

> #-in this way we keep the duplicates from/in 'matter_tweet' 
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429   4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
        user_id        user_key         created_str   word
  1: 1671234620         hyddrox 2016-10-17 07:22:47 matter
  2: 1623180199  jeffreykahunas 2016-09-14 12:53:37 matter
  3: 1594887416  jery_robertsyo 2016-10-21 14:24:05 matter
  4: 1680366068   willisbonnerr 2017-02-14 09:14:24 matter
  5: 2533221819   lazykstafford 2015-12-25 13:41:12 matter
 ---                                                      
411: 4508630900  thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147   melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144    datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620         hyddrox 2017-02-19 19:40:39 matter

> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)

> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE

CONCLUSION 1: do anti_join(matter_tweet,matter) if you want to keep the duplicated rows of your tweet_tokens dataframe; do setdiff(matter_tweet,matter) if you want them removed.

CONCLUSION 2: as you may have noticed, anti_join(matter_tweet,matter_not) and anti_join(matter_tweet,matter) give you the same answer. This means that the NAs (and duplicates) in matter_not make no difference to anti_join here: the NA user_ids have nothing to match in matter_tweet, and duplicates in the second dataframe never change the result of an anti-join.
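
Both conclusions are easy to check on a toy example (x and y are made-up stand-ins for matter_tweet and matter_not):

library(dplyr)
x <- tibble(id = c(1, 2, 2, 3))   #note the duplicated 2, as in matter_tweet
y <- tibble(id = c(1, NA))        #an NA that matches nothing, as in matter_not
anti_join(x, y, by = "id")        #3 rows (2, 2, 3): duplicates in x are kept
setdiff(x, y)                     #2 rows (2, 3): the output is deduplicated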