R from SAS PROC SQL Conditional Join

Question

I'm trying to join two very large tables based on a condition. I want to left join df2 onto df1 within each group (x), but attach df2's columns only to the rows of df1 whose y falls between y_min and y_max in df2 (inclusive).

df1 <- data.frame(x = c(1,1,1,1,2,2,2,2,2,3), y = seq(1,10))
df2 <- data.frame(x2 = c(1,1,2,2,2), y_min = c(1, 1, 6, 6, 6), y_max = c(3,3,9,9,9), cat = c("A",'A','S','S','S'))

The result I'm looking for is

df3 <- data.frame(x = c(1,1,1,1,2,2,2,2,2,3), y = seq(1,10), y_min = c(1,1,1,NA,NA,6,6,6,6,NA), y_max = c(3,3,3,NA,NA,9,9,9,9,NA), cat = c('A','A','A',NA,NA,'S','S','S','S',NA))

   x  y y_min y_max  cat
1  1  1     1     3    A
2  1  2     1     3    A
3  1  3     1     3    A
4  1  4    NA    NA <NA>
5  2  5    NA    NA <NA>
6  2  6     6     9    S
7  2  7     6     9    S
8  2  8     6     9    S
9  2  9     6     9    S
10 3 10    NA    NA <NA>

This was originally written as a SAS PROC SQL script, but I'm having trouble converting it to R. The PROC SQL statement looked something like this:

PROC SQL;
SELECT a.*, b.*
FROM tbl1 a
LEFT JOIN tbl2 b
   on (a.col1 - b.col1) >= 0 and (a.col1 - b.col2) <= 0
     and a.id = b.id;
QUIT;

I've tried base::merge and data.table::merge, but haven't had any luck. Any help would be greatly appreciated.

Mike Mike · Accepted Answer · 2019-04-09T14:49:39

You can use the sqldf package to run SQL code on R objects. As a side note, the names in your SAS code were different from the names you used in R. In future, make sure they match so people can reproduce your problem.

library(sqldf)
df1 <- data.frame(x = c(1,1,1,1,2,2,2,2,2,3), y = seq(1,10))
df2 <- data.frame(x2 = c(1,1,2,2,2), y_min = c(1, 1, 6, 6, 6), y_max = c(3,3,9,9,9), cat = c("A",'A','S','S','S'))

sqldf('SELECT a.*, b.*
FROM df1 a
LEFT JOIN df2 b
   on (a.y - b.y_min) >= 0 and (a.y - b.y_max) <= 0
     and a.x = b.x2')
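
For comparison, here is a rough base R sketch of the same join logic, with no extra packages. It is not part of the original answer, and the helper names (rng, row_id, cand, hits, res) are made up for illustration. Note that df2 in the example repeats each range row, so a plain left join returns each match more than once. The sketch assumes those repeats are unintended and drops them with unique() first.

```r
df1 <- data.frame(x = c(1,1,1,1,2,2,2,2,2,3), y = seq(1,10))
df2 <- data.frame(x2 = c(1,1,2,2,2), y_min = c(1, 1, 6, 6, 6),
                  y_max = c(3,3,9,9,9), cat = c("A",'A','S','S','S'))

# Assumption: repeated range rows in df2 are unintended, so keep one of each
rng <- unique(df2)

# Tag each df1 row so matches can be attached back to it later
df1$row_id <- seq_len(nrow(df1))

# Step 1: pair every df1 row with every range in the same group (inner join)
cand <- merge(df1, rng, by.x = "x", by.y = "x2")

# Step 2: keep only the pairs where y lies inside [y_min, y_max]
hits <- cand[cand$y >= cand$y_min & cand$y <= cand$y_max,
             c("row_id", "y_min", "y_max", "cat")]

# Step 3: left join the surviving matches back; unmatched rows get NA
res <- merge(df1, hits, by = "row_id", all.x = TRUE)
res <- res[order(res$row_id), c("x", "y", "y_min", "y_max", "cat")]
rownames(res) <- NULL
res
```

On the example data this should give the same 10 rows as df3 in the question. The intermediate cand table holds every df1 row paired with every range in its group, so for very large tables it may use a lot of memory, and the sqldf query above may be the more practical option.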
