Column overlap in two binary R dataframes: calculate overlap/non-overlap for each column

Question

My two dataframes are as follows:

df1 <- structure(list(
  species = structure(1:4, .Label = c("a", "b", "c", "d"), class = "factor"),
  sample1 = c(1L, 1L, 1L, 1L),
  sample2 = c(0L, 0L, 1L, 1L)
), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(
  species = structure(c(1L, 5L, 6L, 7L, 2L, 3L, 4L),
                      .Label = c("a", "b", "c", "d", "x", "y", "z"), class = "factor"),
  sample1 = c(1L, 1L, 0L, 1L, 0L, 1L, 1L),
  sample2 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L)
), class = "data.frame", row.names = c(NA, -7L))

A value of 1 indicates presence and 0 indicates absence.

Now I want to compare each sample column of df1 with the corresponding column of df2, matching rows by species, and save two counts for each column of df1:

TP - the number of non-zero df1 values in the column whose matching df2 value is also non-zero, and
FP - the number of non-zero df1 values in the column whose matching df2 value is zero.

The output dataframe (df3) should be:

df3 <- structure(list(
  species = structure(c(1L, 2L, 3L, 4L, 6L, 5L),
                      .Label = c("a", "b", "c", "d", "FP", "TP"), class = "factor"),
  sample1 = c(1L, 1L, 1L, 1L, 3L, 1L),
  sample2 = c(0L, 0L, 1L, 1L, 2L, 0L)
), class = "data.frame", row.names = c(NA, -6L))

I tried to use setdiff to get the differences in df1:

overlap <- for ( i in 1:colnames(df1)){
     data.frame(setdiff(df1[,i], df2[,i]) >0)
  }

But clearly this is not the right way.

Thanks for your help!

Rui Barradas · Accepted Answer · 2020-06-18T19:36:07

Something like this?

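# Row positions in df2 of the species listed in df1
i <- match(df1$species, df2$species)

# Per column: df1 presences that df2 also reports (TP) or does not report (FP)
TP <- colSums((df2[i, -1] == df1[-1]) & (df1[-1] == 1))
FP <- colSums((df2[i, -1] != df1[-1]) & (df1[-1] == 1))

# Turn the counts into one-row data frames and append them to df1
TP <- cbind.data.frame(species = 'TP', t(TP))
FP <- cbind.data.frame(species = 'FP', t(FP))
res <- rbind(df1, TP, FP)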

res
#  species sample1 sample2
#1       a       1       0
#2       b       1       0
#3       c       1       1
#4       d       1       1
#5      TP       3       2
#6      FP       1       0
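
One caveat with the code above: if a species in df1 does not occur in df2, match() returns NA for it and the column sums can come out as NA. The following is only a minimal sketch of one way to handle that, assuming a species missing from df2 should count as absent (0). The helper name count_tp_fp and its arguments are made up for illustration, and the data are the same values as in the question, rebuilt with data.frame() so the block runs on its own.

```r
# Same values as df1/df2 in the question
df1 <- data.frame(species = c("a", "b", "c", "d"),
                  sample1 = c(1L, 1L, 1L, 1L),
                  sample2 = c(0L, 0L, 1L, 1L))
df2 <- data.frame(species = c("a", "x", "y", "z", "b", "c", "d"),
                  sample1 = c(1L, 1L, 0L, 1L, 0L, 1L, 1L),
                  sample2 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L))

# Hypothetical helper: per-column TP/FP counts of `query` against `truth`
count_tp_fp <- function(query, truth, key = "species") {
  cols <- setdiff(names(query), key)          # sample columns, matched by name
  idx  <- match(query[[key]], truth[[key]])   # NA where a species is not in truth
  q    <- as.matrix(query[cols])
  ref  <- as.matrix(truth[idx, cols, drop = FALSE])
  ref[is.na(ref)] <- 0                        # assumption: unmatched species are absent
  rbind(TP = colSums(q == 1 & ref == 1),      # present in both
        FP = colSums(q == 1 & ref != 1))      # present in query only
}

counts <- count_tp_fp(df1, df2)
counts
#    sample1 sample2
# TP       3       2
# FP       1       0
```

The result is a matrix with one row per count. Selecting columns by name rather than by position also keeps the comparison correct if the two data frames list their samples in a different order.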
