1
votes

I have two data frames in R with same columns & data types in each. Some columns are text based & others are numbers & some others are dates. However, same columns have the same sort of data in both data frames. The unique identifier is also the same in both i.e., the primary keys match.

Now, I want to create a third data frame which essentially captures for each primary key, what is the difference between the values in DF1 and DF2 for the corresponding columns. When the columns to be checked is character, we can simple say 1 or 0 indicating a difference. When it is numeric, we can have the difference amount being captured or perhaps simply 1 or 0 again.

What's the most efficient way to do this in R? I do not want to do a row by row comparison as it is slow. Column by column comparison would be fine but that too seems like too much manual oversight required. Ideally, looking for a few data frame level functions that would help me do this.

Reproducible & editable example:

Dataframe1:
ID    val1     date1     chrval1    val3
A1    400      3/4/2017  DR9912YS   -43
A2    230      3/4/2017  ER9F4YS    -43
A3    500      31/2/2015  FFR99S     -49

Dataframe2:
ID    val1     date1     chrval1    val3
A1    400      3/4/2017  DR9912YS   -43
A2    400      3/4/2017  DR9912YS   -43
A3    400      31/4/2017  DR9912YS   -43

Ideally this is what I am looking for:
Difference Dataframe:
ID    val1     date1     chrval1    val3
A1    0        0         True        0
A2    170      0         False       0
A3    -100     0/2/2     False       5
2
interesting question. As far as I know it is generally difficult to perform test on data. There is the function identical that might be of some help. Have a look hereDJJ
@DJJ I saw that, but that essentially tells me if they are an overall exact match or not. I need more detail. Anything else will be helpful? I'm open to breaking my dataframes into two or three splits based on data types..Dr Confuse
What do you want to do with factor variables? Also, the date difference does not make sense to me. The difference between 1/1/2017 and 1/2/2017 is 0/1/0. The dfference1/2/2017 and 1/3/2017 is also 0/1/0. Obviously, the lengths of these two periods are different.amatsuo_net

2 Answers

0
votes

In base R:

# merge the two dataframes
dfm <- merge(df1, df2, by = 'ID')

# create numeric vectors for the column-names ending with '.x' and '.y'
xvec <- grep('.x', names(dfm), fixed = TRUE)
yvec <- grep('.y', names(dfm), fixed = TRUE)
# determine which columns are not of the character class
non_char <- which(sapply(dfm, class) != 'character')

# create a new dataframe by binding the 'ID' column
# with the difference of the '.x' & '.y' columns
dfnew <- cbind.data.frame(ID = dfm$ID, 
                          dfm[, intersect(yvec, non_char)] - dfm[, intersect(xvec, non_char)], 
                          chrval1 = dfm$chrval1.x == dfm$chrval1.y)

# remove the '.y' from the column-names of the new dataframe
names(dfnew) <- gsub('.y','',names(dfnew),fixed=TRUE)

which gives:

> dfnew
  ID val1    date1 val3 chrval1
1 A1    0   0 days    0    TRUE
2 A2  170   0 days    0   FALSE
3 A3 -100 733 days    6   FALSE

With regard to memory-efficiency and speed, the data.table-package is probably the best choice. You could then do:

library(data.table)
setDT(df1)
setDT(df2)

df1[df2, on = 'ID', `:=` (val1 = i.val1 - x.val1, 
                          dat1 = as.numeric(i.date1) - as.numeric(x.date1), 
                          chrval1 = i.chrval1 == x.chrval1, 
                          val3 = i.val3 - x.val3)][, date1:= NULL][]

which gives:

> df1
   ID val1      date1 chrval1 val3
1: A1    0 1970-01-01    TRUE    0
2: A2  170 1970-01-01   FALSE    0
3: A3 -100 1972-01-04   FALSE    6

Used data:

df1 <- structure(list(ID = c("A1", "A2", "A3"), 
                      val1 = c(400L, 230L, 500L), 
                      date1 = structure(c(17290, 17290, 16557), class = "Date"), 
                      chrval1 = c("DR9912YS", "ER9F4YS", "FFR99S"), 
                      val3 = c(-43L, -43L, -49L)), 
                 .Names = c("ID", "val1", "date1", "chrval1", "val3"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(ID = c("A1", "A2", "A3"), 
                      val1 = c(400L, 400L, 400L), 
                      date1 = structure(c(17290, 17290, 17290), class = "Date"), 
                      chrval1 = c("DR9912YS", "DR9912YS", "DR9912YS"), 
                      val3 = c(-43L, -43L, -43L)), 
                 .Names = c("ID", "val1", "date1", "chrval1", "val3"), row.names = c(NA, -3L), class = "data.frame")
0
votes

Just cooked something quickly. It does not treat the date case.

I'm using a macro from the library gtools. I'm not sure it is necessary but my mind got hook on that.

library(gtools)

The tables. read.table is nice for reproducing the data easily.

aa <- read.table(header=TRUE,text="
ID    val1     date1     chrval1    val3
A1    400      3/4/2017  DR9912YS   -43
A2    230      3/4/2017  ER9F4YS    -43
A3    500      31/2/2015  FFR99S     -49")

bb <- read.table(header=TRUE,text="
ID    val1     date1     chrval1    val3
A1    400      3/4/2017  DR9912YS   -43
A2    400      3/4/2017  DR9912YS   -43
A3    400      31/4/2017  DR9912YS   -43")

This is a simple macro that does the following. if fn1 produces an error it will catch it and use fn2. It is probably not the most suitable way to do it. feel free to improve.

expect_error <- defmacro(fn1,fn2,expr={
    tryCatch({fn1(x,y)},
             error=function(e) {mytc(fn2(x,y))}
)})

It have in store a this function as well. which uses fn1 or fn2 depending on the crit. for example crit=is.numeric. If crit is true fn1 is used, if not fn2 is used.

condlapply <- function(lst, crit, fn1, fn2){
       lapply(lst, function(x) if(crit(x)) {
                                   fn1(x)} else {fn2(x)})

       }

Some simple functions

myequal <-  function(x,y=1){ 
    `==`(x,y)
    }

mydiff <-  function(x,y){
    `-`(x,y)
    }


res <- data.frame(sapply(Map(function(x,y) expect_error(mydiff,myequal),aa,bb),c))

Got lazy here with the ID.

res$ID <- aa$ID


## res                         
##   ID val1 date1 chrval1 val3
## 1 A1    0     1       1    0
## 2 A2 -170     1       0    0
## 3 A3  100     0       0   -6

we can wrap up in a function

check_df <- function(df1,df2){
### DD
    ## df1, df2 . data.frames
    res <- data.frame(sapply(Map(function(x,y) expect_error(mydiff,myequal),df1,df2),c))
    res$ID <- aa$ID
    res
    }