How do I use a data frame subset without namesakes of other columns being included?

Question

(Apologies for the bad title; English is not my native language and I couldn't think of a good way to summarise the question.)

I have a dataset of various US county variables and a shapefile of US counties. I've merged the two without any problem, and now I'm trying to illustrate a variable across the counties of a particular state. But when I try to limit my data to the counties in that state, it selects not only those counties but also every county in another state that shares a name with one of them. I don't understand why it does that. From what I can tell, it should select only the counties in the specified state.

I'm using the sf, tmap, tmaptools, dplyr, ggplot2, and leaflet packages. Here's the code I'm using:

library(sf)     # st_read
library(dplyr)  # inner_join
library(tmap)   # tm_shape, tm_polygons

mydata <- readr::read_csv("county_facts.csv")

mymap <- st_read("cb_2014_us_county_500k.shp")

map_and_data <- inner_join(mymap, mydata, by = c("NAME" = "area_name"))

tm_shape(map_and_data[map_and_data$state_abbreviation == "SC",])+
  tm_polygons("AGE135214", id = "NAME", palette = "Greens")

(The column for county names is "NAME" in the shapefile and "area_name" in the data set.)

Here AGE135214 is the variable I'm plotting and NAME holds the county names. In this example I'm trying to plot it for South Carolina. I attempted a workaround by changing how the data and the shapefile are merged:

map_and_data2 <- inner_join(mymap, mydata[mydata$state_abbreviation=="SC",], by = c("NAME" = "area_name"))

But the new merged data frame still includes the erroneous namesakes.

I'm new to programming so apologies if there is a super obvious solution. Any help is greatly appreciated!

The data and shapefiles are from https://www.kaggle.com/benhamner/2016-us-election, if that helps.

Austin Graves · Accepted Answer · 2020-11-18T07:16:22

Welcome. I had that issue when I first started working with county-level data. The problem is that "NAME" and "area_name" are drastically different (just take a quick look and you'll see that area_name has lots of extra words in it, like 'county', that stop the join from working). On top of that, county names are not unique across states, so a join on the name alone can pair a county with its namesakes in other states. I find it's best practice when using county data to join on the FIPS code, which identifies each county uniquely. The map data does not have a fips column ready, but one can be constructed easily from the STATEFP and COUNTYFP columns. I've implemented it below and it seems to work for me. I hope you have a good day, and I wish you luck on your project.

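# Build a county FIPS code from the state and county parts,
# as a number so that it can match the fips column in mydata
mymap$fips <- as.numeric(paste0(mymap$STATEFP, mymap$COUNTYFP))

# Join on the FIPS code instead of the county name
map_and_data <- left_join(mymap, mydata, by = "fips")

# Plot only the South Carolina counties
tm_shape(map_and_data %>% filter(state_abbreviation == "SC")) +
  tm_polygons("AGE135214", id = "NAME", palette = "Greens")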
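
To see why a name-based join pulls in namesakes, here is a minimal sketch on toy data. The tables `toy_map` and `toy_data` and their values are made up for illustration (they are plain tibbles standing in for the shapefile attributes and the CSV, not the real files), so treat it as a demonstration of the mechanism rather than of the actual data set.

```r
library(dplyr)

# Hypothetical stand-in for the shapefile's attribute table:
# two counties named "Lee" in different states, plus one other county.
toy_map <- tibble(
  NAME     = c("Lee", "Lee", "York"),
  STATEFP  = c("45", "01", "45"),
  COUNTYFP = c("061", "081", "091")
)

# Hypothetical stand-in for the county data (illustrative values only).
toy_data <- tibble(
  fips               = c(45061, 1081, 45091),
  area_name          = c("Lee", "Lee", "York"),
  state_abbreviation = c("SC", "AL", "SC"),
  value              = c(10, 20, 30)
)

# Join on the name: each "Lee" row in the map matches BOTH "Lee" rows in
# the data. Recent dplyr versions may warn about a many-to-many match here.
by_name <- suppressWarnings(
  inner_join(toy_map, toy_data, by = c("NAME" = "area_name"))
)

# Filtering on the state keeps the other state's "Lee" row as well,
# because it was paired with the SC data row during the join.
by_name_sc <- filter(by_name, state_abbreviation == "SC")

# Join on the FIPS code: every map row matches at most one data row.
by_fips <- toy_map %>%
  mutate(fips = as.numeric(paste0(STATEFP, COUNTYFP))) %>%
  left_join(toy_data, by = "fips")

by_fips_sc <- filter(by_fips, state_abbreviation == "SC")

print(by_name_sc)  # includes a row whose STATEFP is "01"
print(by_fips_sc)  # only rows whose STATEFP is "45"
```

The same thing happens when the data is filtered to one state before the join, as in the question's workaround: the filtered data still contains a "Lee" row, and it matches every "Lee" in the map regardless of state.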
