dplyr + “meta”-columns: when a column contains names of other columns to use instead of the data

4

votes

I wonder if the following question has an elegant solution in dplyr.

To provide a simple reproducible example, consider the following data.frame:

df <- data.frame( a=1:5, b=2:6, c=3:7,
                  ref=c("a","a","b","b","c"), 
                  stringsAsFactors = FALSE )

Here a,b,c are regular numeric variables while ref is meant to reference which column is the "main" value for that observation. For example:

  a b c ref
1 1 2 3   a
2 2 3 4   a
3 3 4 5   b
4 4 5 6   b
5 5 6 7   c

For example, for observation 3, ref==b and thus column b contains the main value. While for observation 1, ref==a and thus column a contains the main value.

Having this data.frame the question is to create the new column with main values for each observation using dplyr.

  a b c ref main
1 1 2 3   a    1
2 2 3 4   a    2
3 3 4 5   b    4
4 4 5 6   b    5
5 5 6 7   c    7

I'll probably need to use dplyr for that since this one operation is a part of a longer dplyr %>% data transformation chain.

r dplyr

6

votes

Here's a simple, fast way that allows you to stick with dplyr chaining:

require(data.table)
df %>% setDT %>% .[,main:=get(ref),by=ref]
#    a b c ref main
# 1: 1 2 3   a    1
# 2: 2 3 4   a    2
# 3: 3 4 5   b    4
# 4: 4 5 6   b    5
# 5: 5 6 7   c    7

Thanks to @akrun for the idea for the fastest way and benchmarking to show it (see his answer).

setDT modifies the class of df so you won't have to convert to data.table again in future chains.

The conversion should work with any future code in the chain, but both dplyr and data.table are under active development, so to be on the safe side, one could instead use

df %>% data.table %>% .[,main:=get(ref),by=ref]

5

votes

We could do this in base R with row/column index. We get the column index with match, cbind with row index (1:nrow(df)) and extract the elements. The indexing is very fast.

df$main <- df[-4][cbind(1:nrow(df),match(df$ref,names(df)[-4]))]
df
#    a b c ref main
#1 1 2 3   a    1
#2 2 3 4   a    2
#3 3 4 5   b    4
#4 4 5 6   b    5
#5 5 6 7   c    7

The analogous dplyr chain is

df %>% 
  `[[<-.data.frame`(.,"main",value=.[-4][
           cbind(1:nrow(.),match(.$ref,names(.)[-4]))])

Benchmarks

set.seed(24)
df <- data.frame(a= sample(10, 1e6, replace=TRUE), b= sample(20, 1e6, 
replace=TRUE), c= sample(40,1e6, replace=TRUE), ref= sample(letters[1:3],
 1e6, replace=TRUE), stringsAsFactors=FALSE)
df2 <- copy(df)
df3 <- copy(df)
df4 <- copy(df)

akrun <- function() {df$main <- df[-4][cbind(1:nrow(df),match(df$ref,names(df)[-4]))]}
akrun2 <- function(){setDT(df3)[, main:=get(ref), ref]}
Frank <- function() {df2 %>% data.table %>% .[,main:=.SD[[ref]],by=ref]}
Frank2 <- function() {setDT(df4)[, main:= .SD[[ref]], by =ref]}
MrFlick <- function() {getval <- . %>%
                                  mutate(id=factor(1:n())) %>% 
                                  gather(col, val, a:c) %>% 
                                  group_by(id) %>% 
                                  summarize(val=first(val[col==ref])) %>% 
                                  select(val)
                       df2 %>%
                          cbind(., getval(.))}

akhmed <- function() {df %>%
                        group_by(ref) %>%
                        do({
                        eval(parse(text=sprintf("main <- .$%s",.$ref[1])))
                        data.frame(., main = main, stringsAsFactors=FALSE)
                         }) %>% 
                        ungroup()
         }

 system.time(akrun())
 #user  system elapsed 
 #0.07    0.00    0.07

 system.time(akrun2())
#user  system elapsed 
# 0.018   0.000   0.018 

system.time(Frank())
# user  system elapsed 
# 0.028   0.000   0.028 

system.time(Frank2())
# user  system elapsed 
# 0.018   0.000   0.018 

system.time(MrFlick())
#  user  system elapsed 
#42.725   0.066  42.777 

 system.time(akhmed())
 #user  system elapsed 
 # 1.125   0.004   1.169 


library(microbenchmark)
microbenchmark(akrun(), akrun2(), Frank(), Frank2(), unit='relative', times=20L)
#Unit: relative
# expr      min       lq     mean   median        uq      max neval cld
# akrun() 3.732126 3.822714 3.768758 3.784908 3.7490118 3.448839    20   c
#akrun2() 1.000000 1.000000 1.000000 1.000000 1.0000000 1.000000    20 a  
# Frank() 1.457337 1.455412 1.672008 1.493600 1.6575381 3.697565    20  b 
# Frank2() 1.001986 1.005541 1.003171 1.003474 0.9980418 1.013425    20 a

2

votes

The dplyr mutates operates on entire columns at a time which this type of operation does not lend itself nicely to. A different strategy might be use the tidyr library as well to make "tidy" data in a long format then do the subsetting. Here's how you might do this.

library(tidyr)
library(dplyr)

getval <- . %>% mutate(id=factor(1:n())) %>% 
    gather(col, val, a:c) %>% group_by(id) %>% 
    summarize(val=first(val[col==ref])) %>% select(val)

df %>% cbind(., getval(.))

This does assume that each ref value corresponds to a column that exists.

1

votes

Self-Answer: Here is one solution I found that may not be the most elegant but it seems to work:

library(dplyr)

df2 <- df %>%
  group_by(ref) %>%
  do({
    eval(parse(text=sprintf("main <- .$%s",.$ref[1])))
    data.frame(., main = main, stringsAsFactors=FALSE)
  }) %>% ungroup()

df2

which gives me this:

  a b c ref main
1 1 2 3   a    1
2 2 3 4   a    2
3 3 4 5   b    4
4 4 5 6   b    5
5 5 6 7   c    7

I am still wondering if this could be done with some easy mutate_ instead?

dplyr + “meta”-columns: when a column contains names of other columns to use instead of the data

4 Answers

Benchmarks