0
votes

Could you please tell me how I could transform the data frame like this:

    tg  qr  loc a1  a2  a3  b1  b2  b3  c1  c2  c3
1   A   1   89  NA  NA  NA  1   2   3   1   2   3
2   A   1   61  1   2   3   NA  NA  NA  1   2   3
3   A   2   38  4   5   6   NA  NA  NA  NA  NA  NA
4   B   1   40  4   5   6   NA  NA  NA  NA  NA  NA
5   B   1   3   NA  NA  NA  NA  NA  NA  4   5   6

into this:

    tg  qr  loc a1  a2  a3  b1  b2  b3  c1  c2  c3
1   A   1   15  1   2   3   1   2   3   1   2   3
2   A   2   95  4   5   6   NA  NA  NA  NA  NA  NA
3   B   1   42  4   5   6   NA  NA  NA  4   5   6

The function should:

  • merge all rows with same values in columns 'tg' AND 'qr' into one row
  • while merging, replace all "NAs" with existing values - never in the opposite direction
  • Often there will be cases, when a variable is present in both rows being merged, but its value will be always equal (then it doesn't matter from which row it would be taken)
  • 'loc' column values differ, but are not relevant, the column could be even deleted

Code for these sample dataframes is:

df = rbind(c("A","1",floor(runif(1,1,100)),c(NA,NA,NA),c(1,2,3),c(1,2,3)),
           c("A","1",floor(runif(1,1,100)),c(1,2,3),c(NA,NA,NA),c(1,2,3)),
           c("A","2",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
           c("B","1",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
           c("B","1",floor(runif(1,1,100)),c(NA,NA,NA),c(NA,NA,NA),c(4,5,6)))
df = as.data.frame(df)
colnames(df) = c("target","query","loc",c("a1","a2","a3"),c("b1","b2","b3"),c("c1","c2","c3"))

df2 = rbind(c("A","1",floor(runif(1,1,100)),c(1,2,3),c(1,2,3),c(1,2,3)),
            c("A","2",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
            c("B","1",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(4,5,6)))
df2 = as.data.frame(df2)
colnames(df2) = c("target","query","loc",c("a1","a2","a3"),c("b1","b2","b3"),c("c1","c2","c3"))

Thank you for your support.

2
merge(df,df2,by=c("target","query"))? - Metrics
@Metrics Not really. I do not have df2. I posted it as the sample output I'd like to obtain. - mjktfw

2 Answers

2
votes

Use na.omit:

library(data.table)
dt = data.table(df)

dt[, lapply(.SD, function(x) na.omit(x)[1]), by = list(target, query)]
#   target query loc a1 a2 a3 b1 b2 b3 c1 c2 c3
#1:      A     1  21  1  2  3  1  2  3  1  2  3
#2:      A     2  71  4  5  6 NA NA NA NA NA NA
#3:      B     1  25  4  5  6 NA NA NA  4  5  6
1
votes

Something like this maybe?

library(data.table)
dt <- data.table(df)
dt <- dt[,lapply(.SD, as.numeric), by = c("target","query")]
dt2 <- dt[,lapply(.SD, mean, na.rm = TRUE), by = c("target","query")]
dt2[is.na(dt2)] <- NA

dt2

> dt2
   target query loc a1 a2 a3 b1 b2 b3 c1 c2 c3
1:      A     1 2.0  1  1  1  1  1  1  1  1  1
2:      A     2 2.0  2  2  2 NA NA NA NA NA NA
3:      B     1 2.5  2  2  2 NA NA NA  2  2  2