3
votes

If I read in a Stata or SAS dataset with labels using haven, it will be (at least in haven 0.2.0) read with the following format:

library(dplyr)
df1 <- data_frame(fips = structure(c(1001, 1001, 1001, 1001, 1001),
                                   label = "FIPS (numeric)"),
                  id = structure(letters[1:5], label = "ID"))
df2 <- data_frame(fips = structure(c(1001, 1003, 1005, 1007, 1009),
                                   label = "FIPS (numeric)"),
                  state = structure("AL", label = "State Abbreviation"))

(If necessary, I can post some Stata data that produces this, but this should be easy to verify using any labeled Stata/SAS dataset.)

When I try to use any of the dplyr join functions to join on a labeled column, I am sorely disappointed:

df1 %>% inner_join(df2)

returns the error

Error in eval(expr, envir, enclos) : cannot join on columns 'fips' x 'fips': Can't join on 'fips' x 'fips' because of incompatible types (numeric / numeric)

The only way to avoid it seems to be to remove the labels on the join variables:

df1 %>%
  mutate(fips = `attr<-`(fips, 'label', NULL)) %>% 
  inner_join(df2 %>% mutate(fips = `attr<-`(fips, 'label', NULL)))

which raises the question of why the labels were read in the first place. (The join also obliterates the labels in df2.)

This would seem to be a bug in the way haven and dplyr interact. Is there a better solution?

3
More precisely, isn't it an issue with how dplyr interacts with structures, given that it's reproducable without using haven?hoyland

3 Answers

0
votes

Try converting the columns into a character string. This seems to work

df1$fips<-as.character(df1$fips)
df2$fips<-as.character(df2$fips)
df1 %>% inner_join(df2)

The help page for inner_join does state: "a character vector of variables to join by"

0
votes

When dplyr joins on a variable that is a factor in one dataset and a character in the other it sends out a warning but completes the join. numeric and character vectors are not compatible classes so it errors out. By converting them both to character the join works fine

library(dplyr)
df1 %>% 
  mutate(fips = as.character(fips)) %>%
  inner_join(
    df2 %>%
      mutate(fips = as.character)
    )
0
votes

This was fixed at some point, and works in dplyr 0.7.4. I can't track down the exact version where it was fixed.