I have a data set of a few hundred thousand rows. Below is an example of what they look like.

  X      user_id     screen_name                name        location
  1 1 1.732895e+09   DROPPSScience   DROPPS Consortium                
  2 2 1.172266e+18    Lamy40283167  Alex lamy precious Washington, USA
  3 3 3.773702e+08         cdockjr      Calvin Wilborn    Alabama, USA
  4 4 7.040063e+07           xmtl2             Felicio                
  5 5 3.929519e+08    DeleceWrites Delece Smith-Barrow  Washington, DC
  6 6 1.130459e+18    evabrooke_26                 Eva                
  7 7 1.067302e+08 MitchellHortert Mitchell R. Hortert   Pittsburgh,PA

I have a second data set found at https://github.com/jasonong/List-of-US-States/blob/master/states.csv

I am trying to use str_detect() to find any matches between the "location" column and either column in the states.csv file. I would then like to create a new variable that stores the matched pattern for each observation.

So far I have tried using

data.set %>%
    filter(str_detect(location, paste(states$State)))

This returns some matches, but omits many observations and gives the warning

Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
longer object length is not a multiple of shorter object length

states$State is a factor variable with 51 levels, one for each state plus DC. What causes this warning, and why does the filter still return a few matches?

Finally, how would I create a new variable that stores the matching pattern whenever a match occurs?

Did you try using mutate()? - Hansel Palencia

1 Answer

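The warning arises because 'location' and 'State' are not of the same length: the pattern argument is vectorised, so the 51 state names are recycled along the rows and each location is compared with only one state. An option is to use collapse in paste() to join all of the patterns into a single regular expression, where "|" serves as OR: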

library(stringr)
library(dplyr)
data.set %>%
    filter(str_detect(location, paste(states$State, collapse = "|")))
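
To see why the uncollapsed pattern "works on some level", here is a minimal sketch on made-up toy vectors (the data are hypothetical, not the asker's). It calls stringi::stri_detect_regex() directly, since that is the function named in the warning; as far as I know, newer stringr releases reject mismatched lengths with an error rather than recycling.

```r
library(stringi)
library(stringr)

# Toy vectors (hypothetical): 4 locations, 3 state names
location <- c("Alabama, USA", "Washington, USA", "Pittsburgh,PA", "Austin, Texas")
state    <- c("Alabama", "Washington", "Texas")

# Element-wise comparison: row i is tested against pattern i only, and the
# patterns are recycled (Alabama, Washington, Texas, Alabama). This raises the
# "longer object length is not a multiple of shorter object length" warning,
# which is silenced here.
recycled <- suppressWarnings(stri_detect_regex(location, state))

# Single alternation "Alabama|Washington|Texas": every row sees every state
collapsed <- str_detect(location, str_c(state, collapse = "|"))

print(data.frame(location, recycled, collapsed))
```

The first two rows match in both cases only because they happen to line up with their own state; the "Austin, Texas" row is found only by the collapsed pattern.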

As we are already using stringr, str_c() can replace paste():

data.set %>%
    filter(str_detect(location, str_c(states$State, collapse = "|")))

Or, as @HanselPalencia mentioned, if the full names in 'State' cause confusion, use the 'Abbreviation' column for pattern detection instead:

data.set %>%
  filter(str_detect(location, str_c(states$Abbreviation, collapse = "|")))
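
For the second part of the question (storing the matched pattern in a new variable), one possible approach is mutate() with str_extract(), which returns the first matching substring or NA. The sketch below uses small hypothetical stand-ins for 'data.set' and 'states'; the column name state_match and the "\\b" word-boundary wrapping are my own assumptions, added so that a two-letter abbreviation does not match inside a longer word.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-ins for the asker's data.set and states.csv
data.set <- tibble(
  screen_name = c("Lamy40283167", "cdockjr", "xmtl2", "MitchellHortert"),
  location    = c("Washington, USA", "Alabama, USA", "", "Pittsburgh,PA")
)
states <- tibble(
  State        = c("Alabama", "Pennsylvania", "Washington"),
  Abbreviation = c("AL", "PA", "WA")
)

# as.character() guards against factor columns, as in the question
alternatives <- c(as.character(states$State), as.character(states$Abbreviation))

# One alternation over both columns, wrapped in word boundaries
pattern <- str_c("\\b(", str_c(alternatives, collapse = "|"), ")\\b")

# state_match holds the first matched name or abbreviation, NA when none matches
matched <- data.set %>%
  mutate(state_match = str_extract(location, pattern))

print(matched)
```

Adding filter(!is.na(state_match)) afterwards keeps only the matched rows. Note that a location such as "Washington, DC" would still be matched by the state name "Washington", so some manual checking is probably needed.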