Join multiple columns in data.frame to single column in reference data.frame

Question

I am trying to join two dataframes. However unlike the normal join I want to match a series of columns from the first with the second. Basically I have a site list with a reference to the nearest surrounding sites. I need to look up the nearest sites full Gauge and LTA IDs which I have in a separate dataframe. I have provided some example dataframes, including an example output, but the real thing isn't nearly this neat (And has a lot more columns and rows) which is why I need to look up the Surrogate gauges in TestRefList, instead of creating in the method below.

library(plyr)
library(tidyverse)

TestRefList <- data.frame(Site = paste0("sl",1:10,".1"), Gauge = paste0(1:10,".1","/110.00/1"), LTA = paste0(1:10,".1","/110.99/1"), stringsAsFactors = F)
Surrogates <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), stringsAsFactors = F)
HopefulOutput <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), 
                    nearest1Gauge = paste0(1:3,".1","/110.00/1"), nearest1LTA = paste0(1:3,".1","/110.99/1"), 
                    nearest2Gauge = paste0(7:9,".1","/110.00/1"), nearest2LTA = paste0(7:9,".1","/110.99/1"), stringsAsFactors = F)

I thought I could use some combination of plyr::ldply and dplyr::left_join such as: Out <- ldply(names(Surrogates)[2:3], function(x) left_join(Surrogates,TestRefList, by = c(paste0(x, '="Site"'))))

however I can't get the joining to working using a name from the list. I've tried outside the list, with a few arrangements of " and ' around the equals eg:

left_join(Surrogates,TestRefList, by = c(paste0('"',names(Surrogates)[2],'"' , '="Site"')))

Even if I could get this part to work, I'm not sure how it'll work within the ldply.

Any ideas? I'm happy to come at this an entirely different way if necessary although I'm much more comfortable with data.frames and tidyverse than data.table options

If any of the answers meets your needs, please "accept" it! Thanks. — r2evans

r2evans r2evans · Accepted Answer · 2017-12-12T07:07:12

Reduce(function(x, fld) merge(x, TestRefList, by.x=fld, by.y="Site"),
       c("nearest1", "nearest2"), init = Surrogates)
#   nearest2 nearest1 Primary      Gauge.x        LTA.x      Gauge.y        LTA.y
# 1    sl7.1    sl1.1   sl2.1 1.1/110.00/1 1.1/110.99/1 7.1/110.00/1 7.1/110.99/1
# 2    sl8.1    sl2.1   sl4.1 2.1/110.00/1 2.1/110.99/1 8.1/110.00/1 8.1/110.99/1
# 3    sl9.1    sl3.1   sl6.1 3.1/110.00/1 3.1/110.99/1 9.1/110.00/1 9.1/110.99/1

You can rename the columns as needed. This can be done with dplyr::left_join as well with little change:

Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
       c("nearest1", "nearest2"), init = Surrogates)

or within a pipeline with:

Surrogates %>% 
  Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
         c("nearest1", "nearest2"), init = .)

Join multiple columns in data.frame to single column in reference data.frame

3 Answers