I am trying to mutate a new column into a very large dataframe by using combinations of two variables to draw a value from another dataframe. I know I can do this without using a for loop, but the original dataframe is ~2.7 million rows and so a for loop takes prohibitively long.
Essentially, I have two dataframes like such:
df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df1
years sites var1
1 2001 1 -0.01419947
2 2002 2 0.53729311
3 2003 3 0.89222231
4 2004 4 0.20600240
5 2005 5 0.24541548
df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
`2002` = rnorm(n = 5, mean = 2, sd = 1),
`2003` = rnorm(n = 5, mean = 2, sd = 1),
`2004` = rnorm(n = 5, mean = 2, sd = 1),
`2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
df2
2001 2002 2003 2004 2005
1 -0.1217767 0.6703649 3.6074038 2.3048512 3.0022530
2 2.6167986 1.7673236 0.9467254 0.9903685 1.8533297
3 0.3192424 2.2183726 0.8783466 2.7741821 0.1847018
4 2.3599459 0.5653315 3.8843616 3.3171480 2.9135520
5 1.5399871 2.8127713 1.2168152 2.1788167 2.1153329
where for df2
the column names are years, the rownames are sites, and in the actual version of df1
, each year/site combination is present many times.
I would like to end up with:
years sites var1 NewVar
1 2001 1 -0.01419947 1.322451
2 2002 2 0.53729311 3.083238
3 2003 3 0.89222231 1.106300
4 2004 4 0.20600240 2.723593
5 2005 5 0.24541548 2.4919137
Such that the variable NewVar
is made based on the combination of the values in years
and sites
which indicate the appropriate value to draw from df2.
What I want to do, is mutate
a column in df1 that, for each row, uses the combination of the site and year variables to find the appropriate value in df2
. As I mentioned before, I know this is possible to do with a for loop but on the computer I'm running on, a similar foreach loop took 6 hours running on 3 cores, so I'm really hoping for a quicker mutate version.
The only other thought for a solution I've had so far is to use a combination of indexing and a ton of ifelse
statements, but the real versions of the dataframes I'm working with have 702 unique combinations of site and years.
Any help would be greatly appreciated!