3
votes

Is there a merging function that prioritizes non-missing values from common variables?

Consider the following example.

First we generate two data.frames with the same IDs but complementary missing values on a particular varuiable:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

Applying merge or any of the join functions from the dplyr package produces results similar to the below:

> merge(df1, df2, by = 'ID')
  ID      V1.x       V1.y
1  a        NA -1.5399500
2  b 1.3297993         NA
3  c 0.4146414         NA
4  d        NA -0.9285670
5  e        NA -0.2947204
6  f 1.2724293         NA

We'd like to join these two data.frames in a "smarter" way that ignores missing values in one data.frame when not missing in the other to obtain the below output:

> output <- df1
> output$V1[is.na(df1$V1)]  <- df2$V1[!(is.na(df2$V1))]
> output
  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

We can assume that df1 and df2 have totally complementary missing values of V1.

EDIT

A solution that would work for an arbitrary number of variables would be ideal.

4
But what if they aren't complementary? If an ID does have non-missing values in both df1 and df2 do you want to retain both, or prioritize one? SQL would typically have you prioritize one using the coalesce function - see here for implementations of coalesce in R. Of course it will still work if they are complementary as well. - Gregor Thomas
@Gregor The devel version of dplyr has an implementation of coalsece so you can simply do dplyr::coalesce(df1, df2): - Steven Beaupré
Correct--you would want to retain both in such a situation. That's not my situation, but coalesce is sounding right--thank you both. Perhaps a more general (better?) question would ask how to implement a merge that selects one of two values based on some condition, not just missingness... - Richard Border
@Gregor got it. The issue that comes to mind is that there may be many such variables (my current situation) and it would be great to automate the coalescing! - Richard Border
That's a good point. In a case like this though when you're not really joining something like na.omit(rbind(df1, df2)) works as well (equivalently to merge(na.omit(df1), na.omit(df2), by = 'ID'), which you more-or-less show). Not sure why that method is unsatisfactory. Are there potentially missing values in other columns you want to keep around? - Gregor Thomas

4 Answers

3
votes

Thanks to the very helpful comments of @Gregor and @StevenBeaupré, I came up with a solution using coalesce.na from the kimisc package that extends to arbitrary numbers of variables:

mapply(function(x,y) coalesce.na(x,y), df1$V1, df2$V1)
[1] -1.5399500  1.3297993  0.4146414 -0.9285670 -0.2947204  1.2724293

Notice that df1$V1 and df2$V1 could be replaced lists of variables, allowing for something like:

> set.seed(1)
> missings  <- sample.int(6, 3)
> df1  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df2  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df1$V1[missings]  <- rnorm(3)
> df2$V1[setdiff(1:6, missings)]  <- rnorm(3)
> df1$V2[setdiff(1:6, missings)]  <- rnorm(3)
> df2$V2[missings]  <- rnorm(3)

> cbind(df1, df2)
  ID        V1           V2 ID         V1         V2
1  a        NA -0.005767173  a -1.5399500         NA
2  b 1.3297993           NA  b         NA -0.7990092
3  c 0.4146414           NA  c         NA -0.2894616
4  d        NA  2.404653389  d -0.9285670         NA
5  e        NA  0.763593461  e -0.2947204         NA
6  f 1.2724293           NA  f         NA -1.1476570

> dfMerged <- merge(df1, df2, by = 'ID')
> xList <- dfMerged[grep("\\.x$", names(dfMerged))]
> yList <- dfMerged[grep("\\.y$", names(dfMerged))]

> mapply(function(x,y) coalesce.na(x,y), xList, yList)
           V1.x         V2.x
[1,] -1.5399500 -0.005767173
[2,]  1.3297993 -0.799009249
[3,]  0.4146414 -0.289461574
[4,] -0.9285670  2.404653389
[5,] -0.2947204  0.763593461
[6,]  1.2724293 -1.147657009

A full solution would thus look something like:

library(kimisc)
smartMergeList <- function(dfList, idVar) {
    merged <- Reduce(x = dfList, 
                     f = function(x,y) merge(x, y, by = idVar, all = T))
    xList <- merged[grep("\\.x$", names(merged))]
    yList <- merged[grep("\\.y$", names(merged))]
    merged[names(xList)] <- mapply(function(x,y) coalesce.na(x,y),
                            xList, yList)
    merged[names(yList)] <- NULL
    merged
})

I would love to see something prettier though!

2
votes

If avoiding specifying the columns is the only issue with output$V1[is.na(df1$V1)] <- df2$V1[!(is.na(df2$V1))], then you just need to use na.omit() instead of specifying the variables individually. We can also do this only for overlapping columns.

Let's modify the original data so that the columns aren't a perfect overlap and the extra columns have some data we want to keep:

set.seed(1)

missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA, V2 = c(NA, 2, 3, NA, 5, 6))
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

# now df1 looks like this:
df1
#   ID        V1 V2
# 1  a        NA NA
# 2  b 1.3297993  2
# 3  c 0.4146414  3
# 4  d        NA NA
# 5  e        NA  5
# 6  f 1.2724293  6


common_cols = intersect(names(df1), names(df2))
result = na.omit(rbind(df1[common_cols], df2[common_cols]))
result = merge(result, df1, all.x = T)
result = merge(result, df2, all.x = T)
  # the merges are only necessary if there are additional columns to pick up

result
# ID           V1 V2
# 1  a -1.5399500 NA
# 2  b  1.3297993  2
# 3  c  0.4146414  3
# 4  d -0.9285670 NA
# 5  e -0.2947204 NA
# 6  f  1.2724293  6
1
votes

Based on the discussion and answers above, here is my take using dplyr. Not the cleanest code, yes I do have suppressWarnings().

For the reproducible example by OP:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

Simple solution:

library(dplyr)
library(reshape2)
coalesce <- function(...) {
  apply((...), 1, function(x) {
    x[which(!is.na(suppressWarnings(as.numeric(x))))[1]]
  })
}

full_join(df1, df2, by = 'ID') %>% mutate(V1 = coalesce(.)) %>% select(.,ID,V1)

  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

For a general solution with multiple (here shown with 3) variables:

set.seed(1)
df1  <- data.frame(ID = letters[1:6], V1 = NA, V2 = NA, V3 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA, V2 = NA, V3 = NA)
df1$V1[sample.int(6, 3)]  <- rnorm(3)
df2$V1[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)
df1$V2[sample.int(6, 3)]  <- rnorm(3)
df2$V2[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)
df1$V3[sample.int(6, 3)]  <- rnorm(3)
df2$V3[setdiff(1:6, sample.int(6, 3))]  <- rnorm(3)

Same coalesce() function, more elaborate dplyr logic:

library(dplyr)
library(reshape2)
coalesce <- function(...) {
  apply((...), 1, function(x) {
    x[which(!is.na(suppressWarnings(as.numeric(x))))[1]]
  })
}

full_join(df1, df2, by = "ID") %>% 
  melt(., id.vars = "ID") %>%
  mutate(var = substr(as.character(variable),0,2)) %>%
  group_by(var,value) %>% 
  dcast(.,ID + var ~ variable, value.var = "value") %>%
  mutate(c = coalesce(.)) %>%
  dcast(.,ID ~ var, value.var = "c")

  ID         V1         V2          V3
1  a -1.5399500  0.3898432        <NA>
2  b -0.9285670 -0.3053884  0.80418951
3  c -0.8356286       <NA>   0.5939013
4  d  0.1836433 -0.4115108 -0.05710677
5  e       <NA>       <NA>   0.8212212
6  f -0.6264538  1.5117812   0.9189774

The coalesce() function picks the first non-NA value (if it exists). You can choose max or something else per your problem. The ID column or any other columns in the merged dataset must be non-numeric. I hope this is some help toward solving your real problem.

0
votes

Here is a possible data.table way to do it:

library(data.table)
setDT(df1); setDT(df2);

df1[df2, V1 := ifelse(is.na(V1), i.V1, V1), on = "ID"]

df1
#    ID         V1
# 1:  a -1.5399500
# 2:  b  1.3297993
# 3:  c  0.4146414
# 4:  d -0.9285670
# 5:  e -0.2947204
# 6:  f  1.2724293