1
votes

I am trying to join two dataframes. However unlike the normal join I want to match a series of columns from the first with the second. Basically I have a site list with a reference to the nearest surrounding sites. I need to look up the nearest sites full Gauge and LTA IDs which I have in a separate dataframe. I have provided some example dataframes, including an example output, but the real thing isn't nearly this neat (And has a lot more columns and rows) which is why I need to look up the Surrogate gauges in TestRefList, instead of creating in the method below.

library(plyr)
library(tidyverse)

TestRefList <- data.frame(Site = paste0("sl",1:10,".1"), Gauge = paste0(1:10,".1","/110.00/1"), LTA = paste0(1:10,".1","/110.99/1"), stringsAsFactors = F)
Surrogates <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), stringsAsFactors = F)
HopefulOutput <- data.frame(Primary = paste0("sl",c(2,4,6),".1"), nearest1=paste0("sl",1:3,".1"), nearest2=paste0("sl",7:9,".1"), 
                    nearest1Gauge = paste0(1:3,".1","/110.00/1"), nearest1LTA = paste0(1:3,".1","/110.99/1"), 
                    nearest2Gauge = paste0(7:9,".1","/110.00/1"), nearest2LTA = paste0(7:9,".1","/110.99/1"), stringsAsFactors = F)

I thought I could use some combination of plyr::ldply and dplyr::left_join such as: Out <- ldply(names(Surrogates)[2:3], function(x) left_join(Surrogates,TestRefList, by = c(paste0(x, '="Site"'))))

however I can't get the joining to working using a name from the list. I've tried outside the list, with a few arrangements of " and ' around the equals eg:

left_join(Surrogates,TestRefList, by = c(paste0('"',names(Surrogates)[2],'"' , '="Site"')))

Even if I could get this part to work, I'm not sure how it'll work within the ldply.

Any ideas? I'm happy to come at this an entirely different way if necessary although I'm much more comfortable with data.frames and tidyverse than data.table options

3
If any of the answers meets your needs, please "accept" it! Thanks. - r2evans

3 Answers

1
votes
Reduce(function(x, fld) merge(x, TestRefList, by.x=fld, by.y="Site"),
       c("nearest1", "nearest2"), init = Surrogates)
#   nearest2 nearest1 Primary      Gauge.x        LTA.x      Gauge.y        LTA.y
# 1    sl7.1    sl1.1   sl2.1 1.1/110.00/1 1.1/110.99/1 7.1/110.00/1 7.1/110.99/1
# 2    sl8.1    sl2.1   sl4.1 2.1/110.00/1 2.1/110.99/1 8.1/110.00/1 8.1/110.99/1
# 3    sl9.1    sl3.1   sl6.1 3.1/110.00/1 3.1/110.99/1 9.1/110.00/1 9.1/110.99/1

You can rename the columns as needed. This can be done with dplyr::left_join as well with little change:

Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
       c("nearest1", "nearest2"), init = Surrogates)

or within a pipeline with:

Surrogates %>% 
  Reduce(function(x, fld) left_join(x, TestRefList, by = setNames("Site", fld)),
         c("nearest1", "nearest2"), init = .)
1
votes

I am offering a data.table based solution. Certainly your task can be done using dplyr as you requested. However I don't know dplyr well enough to work it out. Plus, I think the data.table solution below is very elegant and fast, so long as you are willing to add another package to your workflow. Also, this code is already general to any number of "nearest n" columns in your data.

library(data.table)

# Melt the Surrogate data, providing useful column names.
surrogate_dat = melt(data.table(Surrogates), 
                id.vars="Primary", 
                value.name="Site", 
                variable.name="nearest_site_group")
#    Primary nearest_site_group  Site
# 1:   sl2.1           nearest1 sl1.1
# 2:   sl4.1           nearest1 sl2.1
# 3:   sl6.1           nearest1 sl3.1
# 4:   sl2.1           nearest2 sl7.1
# 5:   sl4.1           nearest2 sl8.1
# 6:   sl6.1           nearest2 sl9.1

# Merge melted surrogate data with reference list data.
merged_dat = merge(x=surrogate_dat, 
                   y=data.table(TestRefList), 
                   by="Site")
#     Site Primary nearest_site_group        Gauge          LTA
# 1: sl1.1   sl2.1           nearest1 1.1/110.00/1 1.1/110.99/1
# 2: sl2.1   sl4.1           nearest1 2.1/110.00/1 2.1/110.99/1
# 3: sl3.1   sl6.1           nearest1 3.1/110.00/1 3.1/110.99/1
# 4: sl7.1   sl2.1           nearest2 7.1/110.00/1 7.1/110.99/1
# 5: sl8.1   sl4.1           nearest2 8.1/110.00/1 8.1/110.99/1
# 6: sl9.1   sl6.1           nearest2 9.1/110.00/1 9.1/110.99/1

# 'Cast' merged data back to wide form, specifying 3 value variables.
results= dcast(data=merged_dat, 
               formula=Primary ~ nearest_site_group, 
               value.var=c("Site", "Gauge", "LTA"))
#    Primary Site_nearest1 Site_nearest2 Gauge_nearest1 Gauge_nearest2
# 1:   sl2.1         sl1.1         sl7.1   1.1/110.00/1   7.1/110.00/1
# 2:   sl4.1         sl2.1         sl8.1   2.1/110.00/1   8.1/110.00/1
# 3:   sl6.1         sl3.1         sl9.1   3.1/110.00/1   9.1/110.00/1
#    LTA_nearest1 LTA_nearest2
# 1: 1.1/110.99/1 7.1/110.99/1
# 2: 2.1/110.99/1 8.1/110.99/1
# 3: 3.1/110.99/1 9.1/110.99/1
0
votes

Here is a generalised solution for any number of "nearest" columns in Surrogates. It starts by getting a vector of the "nearest" columns and goes from there.

# get list of columns matching "nearest"
nearestCols <- colnames(Surrogates) %>%
  `[`(grepl("nearest", .))

# output data.frame
out <- Surrogates

# for each "nearest" column, merge Gauge and LTA
for (n in nearestCols) {
  out <- merge(out, TestRefList, by.x = n, by.y = "Site", all.x = TRUE)
  colnames(out)[(ncol(out)-1):ncol(out)] <- paste0(n, c("Gauge", "LTA"))
}

# re-order the columns
out <- out[, c(length(nearestCols) + 1, length(nearestCols):1, (length(nearestCols)+2):ncol(out))]

Output:

> out
  Primary nearest1 nearest2 nearest1Gauge  nearest1LTA nearest2Gauge  nearest2LTA
1   sl2.1    sl1.1    sl7.1  1.1/110.00/1 1.1/110.99/1  7.1/110.00/1 7.1/110.99/1
2   sl4.1    sl2.1    sl8.1  2.1/110.00/1 2.1/110.99/1  8.1/110.00/1 8.1/110.99/1
3   sl6.1    sl3.1    sl9.1  3.1/110.00/1 3.1/110.99/1  9.1/110.00/1 9.1/110.99/1
> identical(out, HopefulOutput)
[1] TRUE