Using dplyr to fill in missing values (through a join?)

Question

I have a data frame (df1) that has some missing values (city, state):

SiteID    City          StateBasedIn   Lat  Lon   Var1 Var2
4227      Richmond      KY            -39  -113   6    0
4987      Nashville     TN            -33  -97    7    0
4000      Newark        NJ            -39  -95    8    0
4925      Miami         FL            -40  -99    0    0
4437      Montgomery    AL            -32  -117   4    1
4053      Jonesboro     AR            -30  -98    8    1

df1 <- structure(list(SiteID = c(4227L, 4987L, 4000L, 4925L, 4437L, 
4053L, 4482L, 4037L, 4020L, 1787L, 2805L, 3025L, 3027L, 3028L, 
3029L, 3030L, 3031L, 3033L), City = structure(c(10L, 7L, 8L, 
5L, 6L, 4L, 2L, 9L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"Arcata", "Jackson", "Jonesboro", "Miami", "Montgomery", "Nashville", 
"Newark", "Portland", "Richmond"), class = "factor"), StateBasedIn = structure(c(6L, 
10L, 8L, 5L, 2L, 3L, 4L, 9L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("", "AL", "AR", "CA", "FL", "KY", "MS", "NJ", 
"OR", "TN"), class = "factor"), Lat = c(-39L, -33L, -39L, -40L, 
-32L, -30L, -38L, -31L, -35L, -38L, -30L, -39L, -38L, -32L, -39L, 
-31L, -38L, -34L), Lon = c(-113L, -97L, -95L, -99L, -117L, -98L, 
-98L, -95L, -112L, -120L, -114L, -81L, -117L, -90L, -109L, -115L, 
-81L, -104L), Var1 = c(6L, 7L, 8L, 0L, 4L, 8L, 1L, 8L, 0L, 3L, 
3L, 7L, 4L, 8L, 0L, 8L, 1L, 3L), Var2 = c(0L, 0L, 0L, 0L, 1L, 
1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("SiteID", 
"City", "StateBasedIn", "Lat", "Lon", "Var1", "Var2"), class = "data.frame", row.names = c(NA, 
-18L))

And I would like to fill those values in by merging with another data frame (df2) that has 3 of the same columns, but not all of the columns that are in df1:

SiteID       City         StateBasedIn
1787         Lusby        MD
2805         Springdale   AR
3025         Saukville    WI
3027         Saukville    WI
3028         Saukville    WI
3029         Saukville    WI

df2 <- structure(list(SiteID = c(1787L, 2805L, 3025L, 3027L, 3028L, 
3029L, 3030L, 3031L, 3033L), City = structure(c("Lusby", "Springdale", 
"Saukville", "Saukville", "Saukville", "Saukville", "Saukville", 
"Mequon", "Mequon"), .Dim = c(9L, 1L)), StateBasedIn = structure(c("MD", 
"AR", "WI", "WI", "WI", "WI", "WI", "WI", "WI"), .Dim = c(9L, 
1L))), row.names = c(NA, -9L), class = "data.frame", .Names = c("SiteID", 
"City", "StateBasedIn"))

So basically I would retain all of the information in df1, and input the missing values that are available from df2. As I'm not too familiar with all of the dplyr options yet, I tried the different 'join' options but had no luck. I also tried to use 'merge' in the base package but still no success. Is there another way to do this (preferably with dplyr)?

FYI: your df2 is not a proper data.frame; the second two columns are matrices. — r2evans
Jason, etiquette on SO dictates that you accept one of the answers provided. Not only does it acknowledge to others who follow with the same dilemma as you that the problem was resolved by one or more of the answers, but it also "rewards" those who volunteer their time to help you. — r2evans

Jake Kaupp Jake Kaupp · Accepted Answer · 2017-08-15T01:19:00

You can use a full_join from dplyr, along with replace and coalesce to put together a pretty concise solution.

library(dplyr)
library(purrr) 

# Cleaning from r2evans (if you want to keep it to dplyr just use r2evans lapply method

df1 <- mutate_if(df1, is.factor, as.character)
df2 <- dmap(df2, as.vector)

full_join(df1, df2, by = "SiteID") %>% 
  mutate_at(vars(matches("City","StateBased")), funs(replace(., . == "", NA))) %>% 
  mutate(City = coalesce(City.y, City.x),
         StateBasedIn = coalesce(StateBasedIn.y, StateBasedIn.x)) %>% 
  select(-contains("."))

Using dplyr to fill in missing values (through a join?)

3 Answers