(Without really knowing the specifics) I think you did a good job massaging the tweet_tokens and not_words datasets. Nevertheless, you'll have to modify them slightly for them to work as you (probably?) want.
- Comment out the mutate(row = ... line in your tweet_tokens <- ... pipeline, as it causes trouble if you leave it in. Also re-run your sentiment <- ... dataframe, just to be on the safe side.
tweet_tokens <- tweets %>%
select(user_id, user_key, text, created_str) %>%
na.omit() %>%
#mutate(row= row_number()) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)
- Cut the last three lines of your not_words <- ... pipeline, because that summarising count(... step won't let you match the two dataframes against each other later. The select(user_id, user_key, created_str, word = word_2) line gives you a dataframe with the same columns as your tweet_tokens dataframe. Note also that my "word_2" column is now called "word" (in the new not_words dataframe).
not_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
select(user_id,user_key,created_str,word = word_2)
Now, for your particular case: filtering tweet_tokens for the word "matter" does indeed give a dataframe of 696 rows...
> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696 4
and filtering not_words for the word "matter" gives a dataframe of 274 rows.
> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274 4
So if we just subtract matter_not from matter_tweet, you would have those 422 rows (696 - 274) you're looking for.
Well... not so fast. Strictly speaking, I'm also sure that's not what you really want.
- The simple and accurate answer is:
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
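To see why the row count from anti_join(...) need not equal "rows in the first dataframe minus rows in the second", here is a minimal sketch on made-up data. The tibbles x and y are hypothetical stand-ins for matter_tweet and matter_not, not your real tweets.

```r
library(dplyr)

# Hypothetical stand-ins: x plays the role of matter_tweet, y of matter_not.
x <- tibble(user_id = c(1, 1, 2, 3, 4), word = "matter")
y <- tibble(user_id = c(1, 2, 2, NA), word = "matter")

# Naive arithmetic on row counts.
naive_count <- nrow(x) - nrow(y)

# What anti_join() actually keeps: rows of x with no match in y.
kept <- anti_join(x, y, by = c("user_id", "word"))

naive_count  # 1
nrow(kept)   # 2 (the rows for user_id 3 and 4)
```

The two numbers differ because a single row of y can knock out several duplicated rows of x, while duplicated or NA rows in y remove nothing extra.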
- Now allow me to explain why you end up with 429 rows when you asked for 422.
> #-not taking into account NAs in the 'user_id' column (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256 4
> #-the above dataframe also contains duplicates, which we 'have to?' get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250 4
> #-you'd be tempted to say that 696-250=446 are the rows you'd want now;
> #-...which is not true, as some of the 250 rows from 'matter' appear more than once in
> #-...'matter_tweet', but that should not worry you. You can delete them later... if that's what you want.
> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)
> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267 4
These 267 rows are the ones you want to get rid of. Hence you're looking for a dataframe of 696 - 267 = 429 rows.
> #-the above implies that there are indeed duplicates... but this doesn't mean that all of 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
user_id user_key created_str word
1: 1.518857e+09 nojonathonno 2016-11-08 10:36:14 matter
2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
3: 1.617939e+09 paulinett 2017-01-14 16:33:38 matter
4: 1.617939e+09 paulinett 2017-03-05 18:16:48 matter
5: 1.617939e+09 paulinett 2017-04-03 03:21:34 matter
---
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09 blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09 trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17 margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE
> #-in this way we keep the duplicates from/in 'matter_tweet'
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429 4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
user_id user_key created_str word
1: 1671234620 hyddrox 2016-10-17 07:22:47 matter
2: 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3: 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4: 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5: 2533221819 lazykstafford 2015-12-25 13:41:12 matter
---
411: 4508630900 thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147 melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144 datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620 hyddrox 2017-02-19 19:40:39 matter
> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)
> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE
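For readers less used to data.table, the join and "not-join" used in the transcript can be sketched on made-up data like this. The small mt and mm tables below are hypothetical and only reuse the names from above for readability.

```r
library(data.table)

# Hypothetical toy tables; user_id 1 is duplicated in mt on purpose.
mt <- data.table(user_id = c(1, 1, 2, 3), word = "matter")
mm <- data.table(user_id = c(1, 2), word = "matter")

# Join: for every row of mm, return the matching rows of mt.
matched <- mt[mm, on = names(mt)]

# Not-join: rows of mt that match no row of mm.
unmatched <- mt[!mm, on = names(mt)]

nrow(matched)    # 3 (two rows for user_id 1, one for user_id 2)
nrow(unmatched)  # 1 (user_id 3)
```

Because every row of mm is present in mt here, nrow(mt) - nrow(matched) equals nrow(unmatched), which mirrors the 696 - 267 = 429 arithmetic above.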
CONCLUSION 1: do anti_join(matter_tweet, matter) if you want to keep the duplicated rows of your tweet_tokens dataframe (429 rows); do setdiff(matter_tweet, matter) if you want them removed (415 rows).
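A toy sketch of how the two calls treat duplicates, again on made-up data rather than your tweets: anti_join() leaves duplicated rows of the first dataframe in place, while dplyr's setdiff() returns each remaining row only once.

```r
library(dplyr)

# Hypothetical data: user_id 2 and 3 each appear twice in x.
x <- tibble(user_id = c(1, 2, 2, 3, 3), word = "matter")
y <- tibble(user_id = 1, word = "matter")

# Keeps every non-matching row of x, duplicates included.
with_dups <- anti_join(x, y, by = c("user_id", "word"))

# Set difference on whole rows; the result has no duplicated rows.
no_dups <- dplyr::setdiff(x, y)

nrow(with_dups)  # 4
nrow(no_dups)    # 2
```

In this sketch, distinct(with_dups) gives the same rows as no_dups, so you can also start from the anti_join() result and decide about duplicates later.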
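CONCLUSION 2: as you may have noticed, anti_join(matter_tweet, matter_not) and anti_join(matter_tweet, matter) give you the same answer. The rows of matter_not with an NA user_id have nothing to match in matter_tweet (na.omit() already dropped such rows from tweet_tokens), and duplicated rows in the second dataframe make no difference to anti_join(..., so cleaning matter_not first does not change the result.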
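A small sketch, on made-up data, of how NA keys behave in anti_join() with its default settings (as far as I know for current dplyr versions): an NA in the second dataframe only removes rows whose key is also NA in the first one. The tibbles below are hypothetical.

```r
library(dplyr)

# Hypothetical second table with NA keys and a duplicate.
y <- tibble(user_id = c(1, NA, NA), word = "matter")
y_clean <- distinct(filter(y, !is.na(user_id)))

# First table without NA keys (as after na.omit()): cleaning y changes nothing.
x_no_na <- tibble(user_id = c(1, 2), word = "matter")
same <- isTRUE(all.equal(
  anti_join(x_no_na, y, by = c("user_id", "word")),
  anti_join(x_no_na, y_clean, by = c("user_id", "word"))
))

# First table with an NA key: by default the NA in y matches it and removes it.
x_with_na <- tibble(user_id = c(NA, 2), word = "matter")
res_raw <- anti_join(x_with_na, y, by = c("user_id", "word"))
res_clean <- anti_join(x_with_na, y_clean, by = c("user_id", "word"))

same             # TRUE
nrow(res_raw)    # 1 (the NA row was matched and dropped)
nrow(res_clean)  # 2 (nothing matches the NA row any more)
```

So whether the NA rows matter depends on whether the first dataframe still contains NA keys; in your case na.omit() has already removed them.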
I get "Using 'to_lower = TRUE' with 'token = 'tweets'' may not preserve URLs. Error: Evaluation error: object 'custom_stop_words' not found." when running tweet_tokens <- tweets %>% ... – Manuel F