1
votes

I have two data frame with same row size and different column number, the name of the columns is also different, however the content may be similar in some of them.

i.e. df1:

df1<- data.frame("a"=c("0","1","0","1","0","0","0"),
                "b"=c("1","1","1","1","1","0","0"),
                "c"=c("1","1","0","0","1","0","0"),
                "d"=c("1","1","1","1","1","1","1"))

df2:

df2<- data.frame("e"=c("1","1","0","1","0","0","0"),
                "f"=c("1","1","1","1","1","0","0"),
                "g"=c("0","0","0","0","1","0","0"),
                "h"=c("0","0","0","0","1","1","1"))

If you see, the column "b" of df1 and "f" of df2 are equal. Therefore, the result I want is a new dataframe looking like this:

df3 <- data.frame("a"=c("0","1","0","1","0","0","0"),
                  "c"=c("1","1","0","0","1","0","0"),
                  "d"=c("1","1","1","1","1","1","1"),
                  "e"=c("1","1","0","1","0","0","0"),
                  "g"=c("0","0","0","0","1","0","0"),
                  "h"=c("0","0","0","0","1","1","1"))

NOTE: column "b" and "f" (that were similar) are not in the new df3. I have looked in the web but I did not find an example for this. I think the major complexity is that the merge is by content and not by column name.

4
could you not merge then remove them by using df3[, -c(2, 3)], the numbers in the brackets suggest which columns to remove. Although, you may want an all-in-one function for your suggestion? - Lime
Hi Lime, the problem is that my data frames are bigger than this simplified example (around 2000 rows by 10000 columns df1, and 2000 rows time 100 columns df2). So I cannot identify visually which columns are similar. - marb_021

4 Answers

1
votes

This would do the job:

df3 <- cbind(df1,df2)
df3 <- t(t(df3)[!(duplicated(t(df3)) | duplicated(t(df3), fromLast = TRUE)),])
df3

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

this will give you a matrix, you can save the result as a df if so desired

1
votes

We can use sapply to check for the columns that perfectly match.

mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
mat

#      a     b     c     d
#e FALSE FALSE FALSE FALSE
#f FALSE  TRUE FALSE FALSE
#g FALSE FALSE FALSE FALSE
#h FALSE FALSE FALSE FALSE

Here we can see column b from df1 and column f from df2 should be removed. We can do this by :

m2 <- which(mat, arr.ind = TRUE)
cbind(df1[-m2[, 2]], df2[-m2[, 1]])

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
1
votes

Here is a more tidyverse solution.

library(dplyr)
library(tidyr)
# based on Ronak's sapply approach
matches <- as.data.frame(sapply(df1, function(x) sapply(df2, function(y) identical(x, y)))) %>%
  rownames_to_column(var = "df2") %>%
  pivot_longer(-df2, names_to = "df1") %>% # pivot longer
  filter(value) # keep only the matches

# programmatically build list of names to remove
vars_remove <- c(matches$df1, matches$df2) # will remove var names that are matches
df1 %>% bind_cols(df2) %>%
  select(-any_of(vars_remove))

  a c d e g h
1 0 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 1 0 0 0
4 1 0 1 1 0 0
5 0 1 1 0 1 1
6 0 0 1 0 0 1
7 0 0 1 0 0 1
1
votes

We can use outer from base R

mat <- outer(df1, df2, FUN = Vectorize(function(x, y) all(x == y)))
mat
#      e     f     g     h
#a FALSE FALSE FALSE FALSE
#b FALSE  TRUE FALSE FALSE
#c FALSE FALSE FALSE FALSE
#d FALSE FALSE FALSE FALSE

Now, we can get the row/column names

m2 <- as.matrix(subset(as.data.frame.table(mat), Freq, select = -Freq))

Now, we use the 'm2' to get remove the column names from 'df1', 'df2' and cbind

cbind(df1[setdiff(names(df1), m2[,1])], df2[setdiff(names(df2), m2[,2])])
#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1