0
votes

I am stuck in merging two data-sets the simple complexity of which goes well-beyond my fluency in R. I tried to learn from here and here but could not solve my problem. I am trying to merge the following two data frames:

df1

No    County       Route      Number
1     Anderson       SR009       6150
2     Anderson       SR061       5880
3     Bedford        SR016       9500
4     Bedford        SR130       320
5       .
6       .
7       .
8       .

df2

No.  County        Route     Number1    abc      def
1    Clay          02264     4500        50       789
2     Dickson       01544     5870       45       33
3     Anderson      01421     981        70       65
4     Anderson      SR009     10000      56       56
5     Anderson      SR009     6145       32       53
6     Bedford       SR016     7500       23       32
7     Anderson      SR061     4400       12       24
8     Anderson      SR061     5875       87       26
9     Anderson      SR061     15000      45       45
10     Bedford       SR016     22000     71       75
11     Bedford       SR016     9450      145      615
12     Bedford       SR130     900       7854     76
13     Bedford       SR130     310       124      25
14     Anderson      SR061     5865      312      123
       .
       .
       .

First, the "county" and "Route" columns in df1 and df2 should be compared, and if they exactly match, then specific row of df2$Number1 should be selected whose value is NEAREST to df1$Number, and as such all unique df2 columns should be added to df1

Here is the pseudo-code of what i am trying to achieve:

if(df1$County == Anderson & df2$County == Anderson) && if(df1$Route == SR009 & df2$Route == SR009) 
then select specific row from df2$Number1 whose value is nearest to the df1$Number value, 
and add all subsequent columns of df2 to corresponding row in df1

An example:

Based on "County" and "Route" columns, row No. 1 in df1 matches with row 4 and 5 in df2. Now out of the two df2 rows that match with first row in df1, i want to select that specific row in df2 whose "Number1" value is closest to the "Number" value in df1 i.e. 6150. Said this, i want to select row 5 in df2 because "Number1" value is 6145 which is closest to 6150, and add all subsequent columns from df2 to df1...

The final output would look like this:

No      County           Route       Number     Number1     abc    def  .  .    
1       Anderson         SR009       6150       6145        32     53   .  .
2       Anderson         SR061       5880       5875        87     26   .  .
3       Bedford          SR016       9500       9450        145    615  .  .
4       Bedford          SR139       320        310         124    25   .  .
.          .
.          .

I highly appreciate any help on this. Sorry for the long post.

2

2 Answers

0
votes

Your question is a bit confusing. That withstanding, from your desired output I think the following dplyr approach will work for you.

library(dplyr)

d1%>%
  full_join(d2, by = c("County", "Route")) %>%
  group_by(County, Route) %>%
  mutate(myDiff = abs(Number - Number1)) %>%
  slice(which.min(myDiff))
0
votes

Using library(data.table)

setkey(dt1, County, Route)
setkey(dt2, County, Route)
dt3 = dt1[dt2]
dt3[, Number.close := Number1[which.min(abs(Number1-Number))], by = .(County, Route)]
dt3 = dt3[Number.close == Number1, ][, Number.close:=NULL][]

#    No   County Route Number No. Number1 abc def
# 1:  1 Anderson SR009   6150   5    6145  32  53
# 2:  2 Anderson SR061   5880   8    5875  87  26
# 3:  3  Bedford SR016   9500  11    9450 145 615
# 4:  4  Bedford SR130    320  13     310 124  25

The data:

dt1 = structure(list(No = 1:4, County = c("Anderson", "Anderson", "Bedford", 
"Bedford"), Route = c("SR009", "SR061", "SR016", "SR130"), Number = c(6150L, 
5880L, 9500L, 320L)), .Names = c("No", "County", "Route", "Number"
), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County", 
"Route"))

dt2 = structure(list(No. = c(3L, 4L, 5L, 7L, 8L, 9L, 14L, 6L, 10L, 
11L, 12L, 13L, 1L, 2L), County = c("Anderson", "Anderson", "Anderson", 
"Anderson", "Anderson", "Anderson", "Anderson", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "Clay", "Dickson"), Route = c("01421", 
"SR009", "SR009", "SR061", "SR061", "SR061", "SR061", "SR016", 
"SR016", "SR016", "SR130", "SR130", "02264", "01544"), Number1 = c(981L, 
10000L, 6145L, 4400L, 5875L, 15000L, 5865L, 7500L, 22000L, 9450L, 
900L, 310L, 4500L, 5870L), abc = c(70L, 56L, 32L, 12L, 87L, 45L, 
312L, 23L, 71L, 145L, 7854L, 124L, 50L, 45L), def = c(65L, 56L, 
53L, 24L, 26L, 45L, 123L, 32L, 75L, 615L, 76L, 25L, 789L, 33L
)), .Names = c("No.", "County", "Route", "Number1", "abc", "def"
), row.names = c(NA, -14L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County", 
"Route"))