To make sure that your second data frame is unique on the join column(s), you can use my package safejoin (a wrapper around dplyr
's join functions) which will give you an explicit error if it's not the case.
Current situation :
df1 <- data.frame(column1 = c("a","b","b"), X = 1:3)
df2 <- data.frame(column1 = c("a","b"), Y = 4:5)
df3 <- data.frame(column1 = c("a","a","b"), Y = 4:6)
merge(df1,df2, by="column1",all.x=TRUE)
# column1 X Y
# 1 a 1 4
# 2 b 2 5
# 3 b 3 5
merge(df1,df3, by="column1",all.x=TRUE)
# column1 X Y
# 1 a 1 4
# 2 a 1 5
# 3 b 2 6
# 4 b 3 6
Some values were duplicated by mistake.
Using safejoin :
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
safe_left_join(df1, df2, check= "V")
# column1 X Y
# 1 a 1 4
# 2 b 2 5
# 3 b 3 5
safe_left_join(df1, df3, check= "V")
# Error: y is not unique on column1
# Call `rlang::last_error()` to see a backtrace
check = "V"
controls that the join columns are unique on the right hand side (check = "U"
like Unique checks that they are unique on the left hand side, "V"
is the next letter in the alphabet).
table(df2$column1)
. If you find a value from column1 with a count > 1 then this is the reason. - RDizzl3sqldf
which allows you to use sql like queries on your data frames! - RDizzl3