I have a data set of a few hundred thousand rows. Below is an example of what they look like.
  X user_id      screen_name     name                location
1 1 1.732895e+09 DROPPSScience   DROPPS Consortium
2 2 1.172266e+18 Lamy40283167    Alex lamy precious  Washington, USA
3 3 3.773702e+08 cdockjr         Calvin Wilborn      Alabama, USA
4 4 7.040063e+07 xmtl2           Felicio
5 5 3.929519e+08 DeleceWrites    Delece Smith-Barrow Washington, DC
6 6 1.130459e+18 evabrooke_26    Eva
7 7 1.067302e+08 MitchellHortert Mitchell R. Hortert Pittsburgh,PA
I have a second data set found at https://github.com/jasonong/List-of-US-States/blob/master/states.csv
I am trying to use str_detect() to find any matches between the "location" column and either column in the states.csv file. I would then like to create a new variable that stores the matched pattern for each observation.
So far I have tried:
library(dplyr)
library(stringr)
data.set %>%
  filter(str_detect(location, paste(states$State)))
This returns some matches, but it omits many observations that should match and gives this warning:
Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
longer object length is not a multiple of shorter object length
states$State is a factor with 51 levels, one for each state plus DC. What causes this warning, and why does the filter still return a few matches instead of failing outright?
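To narrow this down, here is a small sketch with made-up data (the `location` and `patterns` vectors below are hypothetical, not my real data). It calls `stringi::stri_detect_regex()` directly, since that is the function named in the warning. My guess, which I have not confirmed against the documentation, is that the pattern vector is recycled and compared element by element, so each location is tested against only one state:

```r
library(stringi)

# Hypothetical toy vectors, chosen so that 5 is not a multiple of 2.
location <- c("Washington, USA", "Alabama, USA", "Pittsburgh,PA",
              "Washington, DC", "Texas")
patterns <- c("Alabama", "Washington")

# Record whether a warning is raised, then let the call continue.
warned <- FALSE
pairwise <- withCallingHandlers(
  stri_detect_regex(location, patterns),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# location[i] is compared only with the recycled patterns[i]:
# "Alabama", "Washington", "Alabama", "Washington", "Alabama".
print(data.frame(location, pattern_used = rep_len(patterns, 5), pairwise))
print(warned)
```

In this toy case only the fourth row matches, even though three of the five locations contain one of the two patterns, which looks like the same behaviour I see on the full data.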
Finally, how would I create a new variable that stores the matched pattern for every observation where a match occurs?
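The kind of result I am after looks roughly like the sketch below. It assumes the state names and abbreviations can be collapsed into a single alternation pattern and passed to `str_extract()`. The small `data.set` and `states` frames are stand-ins I made up, and the `Abbreviation` column name is an assumption about states.csv. I do not know whether this is the recommended approach:

```r
library(stringr)

# Hypothetical stand-ins for the real data set and for states.csv.
data.set <- data.frame(
  screen_name = c("a", "b", "c", "d"),
  location = c("Washington, USA", "Alabama, USA", "Pittsburgh,PA", NA),
  stringsAsFactors = FALSE
)
states <- data.frame(
  State = factor(c("Alabama", "Pennsylvania", "Washington")),
  Abbreviation = factor(c("AL", "PA", "WA"))  # assumed column name
)

# Collapse every name and abbreviation into ONE regex, so that each
# location is tested against all of them rather than a single element.
state_terms <- c(as.character(states$State),
                 as.character(states$Abbreviation))
state_pattern <- paste0("\\b(", paste(state_terms, collapse = "|"), ")\\b")

# New variable: the first matching term, or NA when nothing matches.
data.set$matched_state <- str_extract(data.set$location, state_pattern)

# Keep only the rows where a match was found.
matched_rows <- data.set[!is.na(data.set$matched_state), ]
print(matched_rows)
```

One caveat I can already see: a location such as "Washington, DC" would be labelled "Washington" by this pattern, so the ordering or wording of the terms may need more care.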