I have a large data.frame / tibble with several character columns and am in the process of cleaning the data. One column contains city names. Sometimes a row does not contain a city name (i.e. city is "" or NA), and sometimes cities are marked with a degree symbol ("°", i.e. '\u00B0').
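For illustration, a small tibble with the kinds of values I mean could look like this (the city names here are just made up):

library(tibble)
df.small <- tibble(city = c("Berlin", "", NA, "Hamburg \u00B0", "Munich"))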
Example situation using tidyverse / dplyr and stringr:
nrow(df) #5000
df.degree <- df %>% filter(str_detect(city, '\u00B0'))
nrow(df.degree) #30
df.withoutdegree <- df %>% filter(!str_detect(city, '\u00B0'))
nrow(df.withoutdegree) #4500
My goal is to remove only the 30 rows that contain the degree symbol in the city column. If I look for those rows, filter() with str_detect() finds exactly those 30. Negating str_detect(), however, removes many more rows than just those 30: only 4500 rows remain instead of the expected 4970.
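One diagnostic I can think of is to tabulate the detection result itself (just a sketch, using the same df and city column as above):

library(stringr)
# how many rows fall into each detection group
table(str_detect(df$city, '\u00B0'), useNA = "ifany")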
This seems to be a case where I've missed some obvious piece of documentation, a parameter I need to set, or a different approach. However, I can't seem to find it. Can you point me in the right direction?
Any hints with code examples on making this even more elegant (maybe with "contains()"?) are also very much appreciated.
Thanks! :)
PS: The following works just fine btw:
df.withoutdegree <- df %>% filter(!(grepl('\u00B0', city, ignore.case = TRUE)))
nrow(df.withoutdegree) #4970
However, I find that code harder for peers to read, and I'm generally interested in learning why negating str_detect() doesn't work in this case.
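For reference, here is a sketch of how one could pull out the city values that the grepl() filter keeps but the str_detect() filter does not (same df and city column as above; the %in% TRUE part simply treats non-TRUE results as "not kept"):

library(stringr)
kept_grepl  <- !grepl('\u00B0', df$city, ignore.case = TRUE)
kept_detect <- !str_detect(df$city, '\u00B0')
# city values kept by the grepl() version but dropped by the str_detect() version
unique(df$city[kept_grepl & !(kept_detect %in% TRUE)])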
Could you use str_which() to identify some of the cases that it is excluding but which, as per your logic, should not be excluded? - Aramis7d