0
votes

I am trying to mutate a new column into a very large dataframe by using combinations of two variables to draw a value from another dataframe. I know I can do this without using a for loop, but the original dataframe is ~2.7 million rows and so a for loop takes prohibitively long.

Essentially, I have two dataframes like such:

df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df1
   years sites        var1
1   2001     1 -0.01419947
2   2002     2  0.53729311
3   2003     3  0.89222231
4   2004     4  0.20600240
5   2005     5  0.24541548

df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
df2
      2001      2002      2003      2004      2005
1 -0.1217767 0.6703649 3.6074038 2.3048512 3.0022530
2  2.6167986 1.7673236 0.9467254 0.9903685 1.8533297
3  0.3192424 2.2183726 0.8783466 2.7741821 0.1847018
4  2.3599459 0.5653315 3.8843616 3.3171480 2.9135520
5  1.5399871 2.8127713 1.2168152 2.1788167 2.1153329

where for df2 the column names are years, the rownames are sites, and in the actual version of df1, each year/site combination is present many times.

I would like to end up with:

   years sites        var1    NewVar
1   2001     1 -0.01419947  1.322451
2   2002     2  0.53729311  3.083238
3   2003     3  0.89222231  1.106300
4   2004     4  0.20600240  2.723593
5   2005     5  0.24541548  2.4919137

Such that the variable NewVar is made based on the combination of the values in years and sites which indicate the appropriate value to draw from df2.

What I want to do, is mutate a column in df1 that, for each row, uses the combination of the site and year variables to find the appropriate value in df2. As I mentioned before, I know this is possible to do with a for loop but on the computer I'm running on, a similar foreach loop took 6 hours running on 3 cores, so I'm really hoping for a quicker mutate version.

The only other thought for a solution I've had so far is to use a combination of indexing and a ton of ifelse statements, but the real versions of the dataframes I'm working with have 702 unique combinations of site and years.

Any help would be greatly appreciated!

3

3 Answers

2
votes

You could try to solve this using the data.table package. It is a very fast package for larger amounts of data.

The idea is to melt df2 to long format, so each year X site value is stored in a separat line in the dataframe. After doing so df1 und df2_long can be joined together by the key-elements years and sites.

edit: you can leaf out all the data.table:: if you load the data.table package. I just used them to indicate the data.table functions.

set.seed(123)
df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))

df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)

# helpercolum to melt the data
df2$site = rownames(df2)
# melt data and change varnames
df2_long = data.table::melt(df2, id.vars = "site")
names(df2_long) = c("sites", "years", "NewVar")
# set df1 as data.table
data.table::setDT(df1)
# set df2 as data.table and convert the factors to numerics, as @Gregor suggested in his post (this way you dont have to deal with common factor-struggles)
data.table::setDT(df2_long)
df2_long$sites = as.numeric(as.character(df2_long$sites))
df2_long$years = as.numeric(as.character(df2_long$years))
# set key-columns on which the join should be made
data.table::setkey(df1, years, sites)
data.table::setkey(df2_long, years, sites)
# leftjoin the data
df2_long[df1]

Thanks for your input @Gregor on rather changing factors to numeric then other way round.

2
votes

I think data.table is probably better option here, however just to illustrate the same logic in tidyverse:

library(tidyverse)

df2 %>%                                            # pipe in df2 
  rowid_to_column('sites') %>%                     # assign rownames to 'sites'
  gather(key = years, value = newVar, -sites) %>%  # transworm df2 to long form
  mutate(years = as.numeric(years)) %>%            # convert 'years' into numeric    
  right_join(df1, by = c('years', 'sites')) %>%    # join df1 and df2 
  select(years, sites, var1, newVar)               # rearrange columns

#   years sites       var1   newVar
# 1  2001     1 -0.2324031 3.652280
# 2  2002     2 -1.6015391 4.144123
# 3  2003     3 -1.9813792 3.514144
# 4  2004     4 -0.6039213 2.334821
# 5  2005     5  0.3302109 3.416026
0
votes

An one-liner without reshaping and using the efficient row/column indexing from base R would be

df1$newvar <- df2[cbind(df1$sites, match(df1$years,names(df2)))]  
df1
#  years sites        var1     newvar
#1  2001     1 -0.56047565 3.71506499
#2  2002     2 -0.23017749 2.35981383
#3  2003     3  1.55870831 0.03338284
#4  2004     4  0.07050839 1.27110877
#5  2005     5  0.12928774 3.25381492

data

set.seed(123)
df1 <- data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))

df2 <- data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) <- 2001:2005
rownames(df2) <- 1:5