
I have a data frame of words (the tweets have been tokenised), the number of uses of each word, the sentiment score attached to it, and the total score (n * value). I have created another data frame of all the words in my corpus that follow a negation (I made bigrams and filtered for word_1 being a negation word).

I want to subtract the number of negated occurrences from the counts in the original data frame, so that it shows the net count for each word.

library(tidyverse)
library(tidyr)
library(tidytext)
tweets <- read_csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv")

custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"), 
      lexicon = c("custom")), stop_words)


tweet_tokens <- tweets %>% 
  select(user_id, user_key, text, created_str) %>% 
  na.omit() %>% 
  mutate(row= row_number()) %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!word %in% custom_stop_words$word)

sentiment <- tweet_tokens %>% 
  count(word, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = n * value)
#df showing each word's contribution to overall sentiment

negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't") 

bigrams <- tweets %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams. 

bigrams_separated <- bigrams %>% 
  separate(bigram, c("word_1", "word_2"), sep = " ")

not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
  count(word_2, value, sort = TRUE) %>% 
  mutate(value = value * -1) %>% 
  mutate(contribution = value * n)

I would like the outcome to be one data frame. So if sentiment shows that 'matter' appears 696 times, but the not_words data frame shows it was preceded by a negation 274 times, then the new data frame should show an n value of 422 for 'matter'.
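For illustration, the naive count-based version of what I'm after would be something like this (a rough, untested sketch; negated_counts, negated_n and net_sentiment are just names I've made up):

negated_counts <- not_words %>% 
  group_by(word_2) %>% 
  summarise(negated_n = sum(n)) %>%   #how often each word follows a negation
  rename(word = word_2)

net_sentiment <- sentiment %>% 
  left_join(negated_counts, by = "word") %>% 
  mutate(negated_n = replace_na(negated_n, 0),   #words never negated get 0
         net_n = n - negated_n,                  #e.g. 696 - 274 = 422 for 'matter'
         net_total = net_n * value)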

Please update your code with the libraries you are using. – Manuel F
Have done @ManuelF. – alex_stephenson
Sorry! I keep getting: Using 'to_lower = TRUE' with 'token = 'tweets'' may not preserve URLs. Error: Evaluation error: object 'custom_stop_words' not found. when running tweet_tokens <- tweets %>% .... – Manuel F
Ah, that's my fault, I'll edit my original post. – alex_stephenson

1 Answer


(Without really knowing the specifics) I think you did a good job massaging the tweet_tokens and not_words datasets. Nevertheless, you'll have to modify them slightly for them to work as you (probably?) want.

  1. Comment out the mutate(row = ... line in your tweet_tokens <- ... pipeline, as it will give you trouble later when matching rows between the two data frames. Also re-run your sentiment <- ... data frame afterwards, just to be on the safe side.
tweet_tokens <- tweets %>% 
   select(user_id, user_key, text, created_str) %>% 
   na.omit() %>% 
   #mutate(row= row_number()) %>% 
   unnest_tokens(word, text, token = "tweets") %>% 
   filter(!word %in% custom_stop_words$word)
  2. Cut the last three lines of your not_words <- ... pipeline, since that count(... summary collapses the individual occurrences and you'd no longer be able to match them against tweet_tokens. The select(user_id, user_key, created_str, word = word_2) line gives you a data frame with the same layout as your tweet_tokens data frame. Note also how the "word_2" column is now called "word" (in the new not_words data frame).
not_words <- bigrams_separated %>%
   filter(word_1 %in% negation_words) %>%
   inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
   select(user_id,user_key,created_str,word = word_2)
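
With both data frames now sharing the same four columns, the whole-corpus version of the fix is a single anti-join. Roughly (net_tokens and net_sentiment are just illustrative names; the 'matter' walkthrough below shows what this does in detail):

#drop every tokenised occurrence that also appears in not_words
net_tokens <- anti_join(tweet_tokens, not_words,
                        by = c("user_id", "user_key", "created_str", "word"))

#re-count and re-score to get the net sentiment table
net_sentiment <- net_tokens %>% 
   count(word, sort = TRUE) %>% 
   inner_join(get_sentiments("afinn"), by = "word") %>% 
   mutate(total_score = n * value)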

Now, for your particular example, when filtering for the word "matter" (in tweet_tokens) we indeed get a data frame of 696 rows...

> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696   4

and when filtering for the word "matter" (in not_words) we end up with a data frame of 274 rows.

> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274   4

So if we just subtracted matter_not from matter_tweet we would have those 422 rows you're looking for.
Well... not so fast... and strictly speaking I'm also sure that's not what you really want.

  • The simple and accurate answer is:
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
  • Now allow me to explain why it is that you end up with 429 rows when you asked for 422.
> #-not taking into account NAs in the 'user_id' column (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256   4
> #-the above dataframe also contains duplicates, which we (have to?) get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250   4

#-you'd be tempted to say that 696-250=446 are the rows you'd want now;
#-...which is not true, as some of the 250 rows from 'matter' are also duplicated in
#-...'matter_tweet', but that should not worry you. You can delete them later... if that's what you want.

> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)

> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267   4

These 267 rows are the ones you want to get rid of! Hence you're looking for a dataframe of 696 - 267 = 429 rows!
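
(If the jump from 250 rows in 'mm' to 267 matches seems odd, here is a minimal illustration of how an X[Y, on = ...] join in data.table repeats rows when X contains duplicates; X and Y are just toy tables:)

library(data.table)
X <- data.table(id = c(1, 1, 2, 3))   #a duplicated row, like in 'matter_tweet'
Y <- data.table(id = c(1, 2))         #a deduplicated lookup, like 'mm'
X[Y, on = "id"]    #3 rows: Y's single '1' matches both duplicates in X
X[!Y, on = "id"]   #1 row: only id 3 survives the anti-join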

> #-the above implies that there are indeed duplicates... but this doesn't mean that all 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
          user_id       user_key         created_str   word
  1: 1.518857e+09   nojonathonno 2016-11-08 10:36:14 matter
  2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
  3: 1.617939e+09      paulinett 2017-01-14 16:33:38 matter
  4: 1.617939e+09      paulinett 2017-03-05 18:16:48 matter
  5: 1.617939e+09      paulinett 2017-04-03 03:21:34 matter
 ---                                                       
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09    blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09  trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17  margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE

> #-in this way we keep the duplicates from/in 'matter_tweet' 
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429   4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
        user_id        user_key         created_str   word
  1: 1671234620         hyddrox 2016-10-17 07:22:47 matter
  2: 1623180199  jeffreykahunas 2016-09-14 12:53:37 matter
  3: 1594887416  jery_robertsyo 2016-10-21 14:24:05 matter
  4: 1680366068   willisbonnerr 2017-02-14 09:14:24 matter
  5: 2533221819   lazykstafford 2015-12-25 13:41:12 matter
 ---                                                      
411: 4508630900  thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147   melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144    datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620         hyddrox 2017-02-19 19:40:39 matter

> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)

> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE

CONCLUSION 1: do anti_join(matter_tweet,matter) if you want to keep the duplicated rows of your tweet_tokens dataframe; do setdiff(matter_tweet,matter) if you want them removed.

CONCLUSION 2: as you may have noticed, anti_join(matter_tweet,matter_not) and anti_join(matter_tweet,matter) give you the same answer. This means that the NAs (and duplicates) in matter_not make no difference to anti_join here: the NA user_ids have nothing to match in matter_tweet, and duplicates in the second dataframe never change the result of an anti-join.
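
Both conclusions are easy to check on a toy example (x and y are made-up stand-ins for matter_tweet and matter_not):

library(dplyr)
x <- tibble(id = c(1, 2, 2, 3))   #note the duplicated 2, as in matter_tweet
y <- tibble(id = c(1, NA))        #an NA that matches nothing, as in matter_not
anti_join(x, y, by = "id")        #3 rows (2, 2, 3): duplicates in x are kept
setdiff(x, y)                     #2 rows (2, 3): the output is deduplicated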