Using plyr to join two massive dataframes on two columns

Question

I have a very large dataframe that I need to join to another dataframe on two columns. I've been using merge to accomplish it, but R runs out of memory as the tables get larger. Is there a similar solution using dplyr or plyr? I hear they require substantially less memory. I know how to use the join function in plyr generally; what I am struggling with is joining by two columns. The merge syntax I've been using is below:

Correlation_Table <- merge(Correlation_Table, inter, by.x = c(1,2), by.y = c(1,2), all.x = TRUE, all.y = TRUE)

So for example if I have the following two dataframes:

> head(df1)
  x y         z          a
1 1 2 429.57410  43.746670
2 2 3 717.98184 524.288886
3 3 4 601.66938 640.245469
4 4 5  87.41476 318.964765
5 5 6 586.22234 196.759991
6 6 7 619.82194   3.308136
> head(df2)
   b  c        d
1  5  8 152.2855
2  6  9 191.5406
3  7 10 197.0520
4  8 11 175.4209
5  9 12 157.6239
6 10 13 136.3286

Columns x and y of df1 are dimensions, as are columns b and c of df2; the other columns are measures. My goal is to create a new dataframe containing all three measures, matching records where df1.x and df1.y equal df2.b and df2.c.

Is this possible using plyr?

I was just reading about full_join. How do you specify which columns you want to join using full_join? From the examples I've seen it doesn't look like you can. — ben890
But how do I join via two columns in each table? In merge I can say by.x = c(1,2) and by.y = c(1,2). What is the equivalent syntax using plyr? — ben890

akrun · Accepted Answer · 2015-01-31T18:55:32

You can try

library(dplyr)
res1 <- full_join(df1, df2, by=c('x'='b', 'y'='c'))

According to ?full_join

by: a character vector of variables to join by. If ‘NULL’, the default, ‘join’ will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right. To join by different variables on x and y use a named vector. For example, ‘by = c("a" = "b")’ will match ‘x.a’ to ‘y.b’.
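As a small illustration of the named-vector form with two key columns, here is a sketch on made-up data (the toy values below are assumptions for illustration, not the asker's tables). Each name on the left refers to a column of the first table and each value on the right to a column of the second.

```r
library(dplyr)

# Toy data, invented for illustration only
df1 <- data.frame(x = c(1, 2, 3), y = c(2, 3, 4), z = c(10, 20, 30))
df2 <- data.frame(b = c(2, 3, 9), c = c(3, 4, 10), d = c(100, 200, 300))

# Match df1$x to df2$b and df1$y to df2$c,
# keeping unmatched rows from both sides (filled with NA)
res1 <- full_join(df1, df2, by = c("x" = "b", "y" = "c"))

names(res1)  # key columns keep df1's names: "x" "y" "z" "d"
nrow(res1)   # 4: two matched rows plus one unmatched row from each table
```

Note that the key columns in the result take their names from the first table, so b and c do not appear as separate columns.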

and compare the results with

res2 <-  merge(df1, df2, by.x = c(1,2), by.y = c(1,2),
                           all.x = TRUE, all.y = TRUE)

NOTE: The order of rows will be different
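Since only the row order differs, one way to check that the two approaches agree is to sort both results by the key columns before comparing. The following is a minimal sketch on invented toy data (not the asker's tables); it refers to the key columns by name rather than by position, which is an equivalent choice here.

```r
library(dplyr)

# Toy data, invented for illustration only
df1 <- data.frame(x = c(1, 2, 3), y = c(2, 3, 4), z = c(10, 20, 30))
df2 <- data.frame(b = c(2, 3, 9), c = c(3, 4, 10), d = c(100, 200, 300))

res1 <- full_join(df1, df2, by = c("x" = "b", "y" = "c"))
res2 <- merge(df1, df2, by.x = c("x", "y"), by.y = c("b", "c"),
              all.x = TRUE, all.y = TRUE)

# Put both results in the same row order before comparing
res1_sorted <- arrange(res1, x, y)
res2_sorted <- arrange(res2, x, y)

# Same column names, and same values once sorted
identical(names(res1_sorted), names(res2_sorted))
all.equal(res1_sorted, res2_sorted, check.attributes = FALSE)
```

On the real tables it may be worth running this check on a small subset first, since merge is the step that was running out of memory.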
