Merge/Join prioritizing non-missing values

Question

Is there a merging function that prioritizes non-missing values from common variables?

Consider the following example.

First we generate two data.frames with the same IDs but complementary missing values on a particular variable:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

Applying merge or any of the join functions from the dplyr package produces results similar to the below:

> merge(df1, df2, by = 'ID')
  ID      V1.x       V1.y
1  a        NA -1.5399500
2  b 1.3297993         NA
3  c 0.4146414         NA
4  d        NA -0.9285670
5  e        NA -0.2947204
6  f 1.2724293         NA

We'd like to join these two data.frames in a "smarter" way that ignores missing values in one data.frame when not missing in the other to obtain the below output:

> output <- df1
> output$V1[is.na(df1$V1)]  <- df2$V1[!(is.na(df2$V1))]
> output
  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

We can assume that df1 and df2 have totally complementary missing values of V1.

EDIT

A solution that would work for an arbitrary number of variables would be ideal.

But what if they aren't complementary? If an ID does have non-missing values in both df1 and df2 do you want to retain both, or prioritize one? SQL would typically have you prioritize one using the coalesce function - see here for implementations of coalesce in R. Of course it will still work if they are complementary as well. — Gregor Thomas
@Gregor The devel version of dplyr has an implementation of coalesce, so you can simply do dplyr::coalesce(df1, df2). — Steven Beaupré
Correct--you would want to retain both in such a situation. That's not my situation, but coalesce is sounding right--thank you both. Perhaps a more general (better?) question would ask how to implement a merge that selects one of two values based on some condition, not just missingness... — Richard Border
@Gregor got it. The issue that comes to mind is that there may be many such variables (my current situation) and it would be great to automate the coalescing! — Richard Border
That's a good point. In a case like this though when you're not really joining something like na.omit(rbind(df1, df2)) works as well (equivalently to merge(na.omit(df1), na.omit(df2), by = 'ID'), which you more-or-less show). Not sure why that method is unsatisfactory. Are there potentially missing values in other columns you want to keep around? — Gregor Thomas
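The na.omit(rbind(df1, df2)) idea from the last comment can be sketched as below, using small made-up data frames rather than the seeded ones from the question. It relies on each row being either fully observed or discarded, so it only works when all shared variables are missing in the same rows. When different variables are missing in different rows of the same data frame, every row contains at least one NA and na.omit drops them all.

```r
# Illustrative data (not from the question): complementary NAs in one variable.
df1 <- data.frame(ID = c("a", "b", "c"), V1 = c(NA, 2, NA))
df2 <- data.frame(ID = c("a", "b", "c"), V1 = c(1, NA, 3))

# Stack both data frames and drop every row that contains an NA.
stacked <- na.omit(rbind(df1, df2))

# rbind keeps the original row order, so sort by ID to line the rows up again.
stacked <- stacked[order(stacked$ID), ]
stacked
```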

Richard Border · Accepted Answer · 2016-06-09T00:26:54

Thanks to the very helpful comments of @Gregor and @StevenBeaupré, I came up with a solution using coalesce.na from the kimisc package that extends to arbitrary numbers of variables:

mapply(function(x,y) coalesce.na(x,y), df1$V1, df2$V1)
[1] -1.5399500  1.3297993  0.4146414 -0.9285670 -0.2947204  1.2724293
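For readers without kimisc installed, the same element-wise choice can be sketched in base R. The helper name coalesce2 is hypothetical, not part of any package, and the vectors are illustrative. It assumes plain atomic vectors, since ifelse drops attributes such as factor levels.

```r
# Hypothetical helper: keep x where it is present, otherwise fall back to y.
# Assumes x and y are atomic vectors of the same length and type (not factors).
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

x <- c(NA, 1.5, NA)
y <- c(2.0, NA, NA)

# Position 1 comes from y, position 2 from x, position 3 stays NA
# because it is missing in both inputs.
filled <- coalesce2(x, y)
filled
```

Because ifelse is vectorized, no mapply over individual elements is needed here.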

Notice that df1$V1 and df2$V1 could be replaced by lists of variables, allowing for something like:

> set.seed(1)
> missings  <- sample.int(6, 3)
> df1  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df2  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df1$V1[missings]  <- rnorm(3)
> df2$V1[setdiff(1:6, missings)]  <- rnorm(3)
> df1$V2[setdiff(1:6, missings)]  <- rnorm(3)
> df2$V2[missings]  <- rnorm(3)

> cbind(df1, df2)
  ID        V1           V2 ID         V1         V2
1  a        NA -0.005767173  a -1.5399500         NA
2  b 1.3297993           NA  b         NA -0.7990092
3  c 0.4146414           NA  c         NA -0.2894616
4  d        NA  2.404653389  d -0.9285670         NA
5  e        NA  0.763593461  e -0.2947204         NA
6  f 1.2724293           NA  f         NA -1.1476570

> dfMerged <- merge(df1, df2, by = 'ID')
> xList <- dfMerged[grep("\\.x$", names(dfMerged))]
> yList <- dfMerged[grep("\\.y$", names(dfMerged))]

> mapply(function(x,y) coalesce.na(x,y), xList, yList)
           V1.x         V2.x
[1,] -1.5399500 -0.005767173
[2,]  1.3297993 -0.799009249
[3,]  0.4146414 -0.289461574
[4,] -0.9285670  2.404653389
[5,] -0.2947204  0.763593461
[6,]  1.2724293 -1.147657009

A full solution would thus look something like:

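library(kimisc)
smartMergeList <- function(dfList, idVar) {
    # Full outer join of all data frames in dfList on idVar
    merged <- Reduce(x = dfList, 
                     f = function(x,y) merge(x, y, by = idVar, all = TRUE))
    # Clashing columns carry the .x / .y suffixes added by merge
    xList <- merged[grep("\\.x$", names(merged))]
    yList <- merged[grep("\\.y$", names(merged))]
    # Fill NAs in each .x column from its .y counterpart
    merged[names(xList)] <- mapply(function(x,y) coalesce.na(x,y),
                            xList, yList)
    # Drop the now-redundant .y columns
    merged[names(yList)] <- NULL
    merged
}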

I would love to see something prettier though!
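One possible tidier variant, offered only as a sketch: a base R function that merges two data frames and coalesces every column they share, restoring the unsuffixed column names. The name coalesceMerge and the sample data are hypothetical. It assumes the shared columns are atomic and not factors, and it prefers the value from the first data frame when both are present.

```r
# Hypothetical helper: outer-join x and y on idVar, then collapse each
# shared column into one, taking x's value unless it is NA.
coalesceMerge <- function(x, y, idVar) {
    merged <- merge(x, y, by = idVar, all = TRUE)
    # Columns present in both inputs (other than the key) get .x/.y suffixes.
    shared <- setdiff(intersect(names(x), names(y)), idVar)
    for (v in shared) {
        xv <- merged[[paste0(v, ".x")]]
        yv <- merged[[paste0(v, ".y")]]
        merged[[v]] <- ifelse(is.na(xv), yv, xv)
        # Remove the suffixed originals once they are combined.
        merged[[paste0(v, ".x")]] <- NULL
        merged[[paste0(v, ".y")]] <- NULL
    }
    merged
}

# Illustrative data with complementary NAs in two variables.
df1 <- data.frame(ID = c("a", "b", "c"), V1 = c(NA, 2, NA), V2 = c(10, NA, NA))
df2 <- data.frame(ID = c("a", "b", "c"), V1 = c(1, NA, 3), V2 = c(NA, 20, 30))

result <- coalesceMerge(df1, df2, "ID")
result
```

Since each call returns unsuffixed column names, the function can be folded over a longer list with Reduce(function(x, y) coalesceMerge(x, y, "ID"), dfList). Coalescing pairwise matters because merge only adds the .x/.y suffixes when names clash, so a third data frame's columns would otherwise arrive unsuffixed and be skipped by a suffix-based search.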
