Mutate a new column in dataframe 1 by matching values from dataframe 2 based on variable combinations

Question

I am trying to mutate a new column into a very large dataframe by using combinations of two variables to draw a value from another dataframe. I know I can do this with a for loop, but the original dataframe is ~2.7 million rows, so a for loop takes prohibitively long.

Essentially, I have two dataframes like such:

df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df1
   years sites        var1
1   2001     1 -0.01419947
2   2002     2  0.53729311
3   2003     3  0.89222231
4   2004     4  0.20600240
5   2005     5  0.24541548

df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
df2
      2001      2002      2003      2004      2005
1 -0.1217767 0.6703649 3.6074038 2.3048512 3.0022530
2  2.6167986 1.7673236 0.9467254 0.9903685 1.8533297
3  0.3192424 2.2183726 0.8783466 2.7741821 0.1847018
4  2.3599459 0.5653315 3.8843616 3.3171480 2.9135520
5  1.5399871 2.8127713 1.2168152 2.1788167 2.1153329

where for df2 the column names are years, the rownames are sites, and in the actual version of df1, each year/site combination is present many times.

I would like to end up with:

   years sites        var1    NewVar
1   2001     1 -0.01419947  1.322451
2   2002     2  0.53729311  3.083238
3   2003     3  0.89222231  1.106300
4   2004     4  0.20600240  2.723593
5   2005     5  0.24541548  2.4919137

Such that the variable NewVar is made based on the combination of the values in years and sites which indicate the appropriate value to draw from df2.

What I want to do is mutate a column in df1 that, for each row, uses the combination of the site and year variables to find the appropriate value in df2. As I mentioned before, I know this is possible with a for loop, but on the computer I'm using, a similar foreach loop took 6 hours running on 3 cores, so I'm really hoping for a quicker mutate version.

The only other thought for a solution I've had so far is to use a combination of indexing and a ton of ifelse statements, but the real versions of the dataframes I'm working with have 702 unique combinations of site and years.

Any help would be greatly appreciated!

TinglTanglBob · Accepted Answer · 2019-05-24T19:16:03

You could try to solve this using the data.table package. It is a very fast package for larger amounts of data.

The idea is to melt df2 to long format, so that each year x site value is stored in a separate row of the dataframe. After that, df1 and df2_long can be joined on the key columns years and sites.

edit: you can leave out all the data.table:: prefixes if you load the data.table package. I only used them to mark the data.table functions.

set.seed(123)
df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))

df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)

# helper column holding the site ids, needed to melt the data
df2$site = rownames(df2)
# convert df2 to a data.table first: data.table::melt is meant for data.tables
# (on a plain data.frame it falls back to reshape2 or fails, depending on the version)
data.table::setDT(df2)
# melt the data and change the variable names
df2_long = data.table::melt(df2, id.vars = "site")
names(df2_long) = c("sites", "years", "NewVar")
# set df1 as data.table
data.table::setDT(df1)
# convert the melted factor/character columns to numeric, as @Gregor suggested in his post (this way you don't have to deal with the common factor struggles)
df2_long$sites = as.numeric(as.character(df2_long$sites))
df2_long$years = as.numeric(as.character(df2_long$years))
# set the key columns on which the join should be made
data.table::setkey(df1, years, sites)
data.table::setkey(df2_long, years, sites)
# left join: keeps every row of df1 and adds the matching NewVar
df2_long[df1]
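
If the goal is to add NewVar to df1 itself rather than to return a new joined table, a data.table update join is one option. The following is a minimal sketch, not part of the original answer: it rebuilds small stand-in data with the same column names, and it assumes a reasonably recent data.table version (for the `on = .(...)` syntax and the `variable.factor` argument of `melt`).

```r
library(data.table)

set.seed(123)
# Stand-in for df1: one row per observation, keyed by years and sites
df1 <- data.table(years = 2001:2005, sites = 1:5, var1 = rnorm(5))

# Stand-in for df2: one column per year, one row per site
df2 <- as.data.table(
  matrix(rnorm(25, mean = 2), nrow = 5, dimnames = list(NULL, 2001:2005))
)
df2[, sites := 1:5]

# Long format: one row per site/year combination
df2_long <- melt(df2, id.vars = "sites", variable.name = "years",
                 value.name = "NewVar", variable.factor = FALSE)
df2_long[, years := as.integer(years)]

# Update join: look up each (years, sites) pair of df1 in df2_long
# and write the matching value into df1 by reference
df1[df2_long, on = .(years, sites), NewVar := i.NewVar]
df1
```

Because `:=` modifies df1 by reference, the row order of df1 is left untouched and no copy of the large table is made.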

Thanks for your input, @Gregor, on converting the factors to numeric rather than the other way round.
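
As a cross-check, or as an alternative without data.table, the same lookup can be sketched in base R with matrix indexing. This is a different technique from the join above. It assumes the lookup table is held as a numeric matrix `m` (a name introduced here for illustration) whose row names are the site ids and whose column names are the years.

```r
set.seed(123)
df1 <- data.frame(years = 2001:2005, sites = 1:5, var1 = rnorm(5))

# Lookup matrix: rows are sites, columns are years
m <- matrix(rnorm(25, mean = 2), nrow = 5,
            dimnames = list(as.character(1:5), as.character(2001:2005)))

# Two-column index matrix: (row position, column position) for each row of df1
idx <- cbind(match(df1$sites, rownames(m)),
             match(df1$years, colnames(m)))

# Indexing a matrix with a two-column matrix picks one cell per row of idx
df1$NewVar <- m[idx]
df1
```

The lookup is vectorised over all rows of df1, so repeated site/year combinations need no special handling.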
