0
votes

I have 2 reproducible dataframes over here. I am trying to identify which column contain values that are similar to another column. I hope my code will randomly select 1 value from each column, and loop through every single column in df2.

df1 <- data.frame(fruit=c("Apple", "Orange", "Pear"), location = c("Japan", "China", "Nigeria"), price = c(32,53,12))
df2 <- data.frame(grocery = c("Durian", "Apple", "Watermelon"), place=c("Korea", "Japan", "Malaysia"), invoice = c("XD1", "XD2", "XD3"))

df1$source <- "DF1"
df2$source <- "DF2"

df1

   fruit location price source
1  Apple    Japan    32    DF1
2 Orange    China    53    DF1
3   Pear  Nigeria    12    DF1

df2  
     grocery    place invoice source
1     Durian    Korea     XD1    DF2
2      Apple    Japan     XD2    DF2
3 Watermelon Malaysia     XD3    DF2

This is the output I hope to obtain under a new dataframe called df3.

df3
         grocery    place    invoice source
    1     fruit    location     NA    DF1

The source column will allow the user to identify where the respective columns (fruit/location) comes from. The column name of df3 = the column names from df2, whereas the values under row1 = column names from df1.

The column Grocery is matched with fruit as there is a matching value i.e. "Apple" and "Japan" can be found in both place and location columns respectively.

Thank you!

2
Not actually clear about your output. Does that mean you will always have only 1 row in output? Would it always be column names from df2 as column names for df3 and column names from df1 as 1st row? - Ronak Shah
Hello Ronak, yes it will only have 1 row. Yes the column name under df3 will always be from df2 and and the row values from df1 - Javier
So something like setNames(data.frame(t(names(df1))), names(df2)) ? - Ronak Shah
Nope, it has to loop through the dataframe to identify which columns match up to which, not via a setNames method. You can see that under invoices, it is NA, as the values under invoices cannot be found in df1 - Javier
This is clearly doable but you should define "similar values". What do you mean if you say that the values of "grocery" are similar to those in "fruit"? Is one matching value enough? Or two? Or all? Or if one contains "DF1" and "DF2" and the other variable contains "DF3" and "DF4", are they similar?a - lebatsnok

2 Answers

2
votes

This might not be the most optimal solution but one way using double sapply (since for every column of df2 we want to find similar columns in df1)

sapply(names(df2), function(x) {
     temp <- sapply(names(df1), function(y) 
         if(any(match(df2[[x]], df1[[y]], nomatch = FALSE))) y else NA)
     ifelse(all(is.na(temp)), NA, temp[which.max(!is.na(temp))])
   }
)

# grocery     place     invoice     source 
# "fruit" "location"         NA         NA 

This would give you all the columns in df2 which match atleast in one value with the ones in df1. You can then change the source column manually later since if there is a row present you know it is from df1.

1
votes

Made a change. It's messy but it works.

#create data frame of column combinations
col_combos <- expand.grid(names(df1), names(df2))

#identify like columns
like_cols <- na.omit(col_combos[as.logical(apply(col_combos, 1, function(x) intersect(df1[, x[1] ],df2[, x[2] ]) > 1 )), ])

#match like columns
rbind(names(df2), as.character(like_cols$Var1)[match(names(df2), as.character(like_cols$Var2))])

     [,1]      [,2]       [,3]      [,4]    
[1,] "grocery" "place"    "invoice" "source"
[2,] "fruit"   "location" NA        NA