0
votes

Packages:

library(data.table)
library(dplyr)

Dataset:

iris

data.table workflow:

When I want to add a column to a data.table that displays the mean for specific rows (in this case: when Sepal.Length < 5) and grouped by Species, I can do the following:

iris_table <- as.data.table(iris)
iris_table[Sepal.Length < 5, `:=`(new_var = mean(Sepal.Length)), by = Species] %>% as_tibble()

Output:

# A tibble: 150 x 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
 1          5.1         3.5          1.4         0.2 setosa    NA   
 2          4.9         3            1.4         0.2 setosa     4.67
 3          4.7         3.2          1.3         0.2 setosa     4.67
 4          4.6         3.1          1.5         0.2 setosa     4.67
 5          5           3.6          1.4         0.2 setosa    NA   
 6          5.4         3.9          1.7         0.4 setosa    NA   
 7          4.6         3.4          1.4         0.3 setosa     4.67
 8          5           3.4          1.5         0.2 setosa    NA   
 9          4.4         2.9          1.4         0.2 setosa     4.67
10          4.9         3.1          1.5         0.1 setosa     4.67
# ... with 140 more rows

Here, new_var displays the mean for when Sepal.Length < 5 in the first 10 rows for setosa.

Question:

How can I do the same operation with dplyr in a similar efficient way?

I can achieve the same result by using an intermediate variable (new_var) in order to calculate the desired variable(new_var2), but it looks rather complicated and not as concise as with data.table:

iris %>% 
  group_by(Species) %>% 
  mutate(new_var = case_when(Sepal.Length < 5 ~ Sepal.Length,
                         TRUE ~ NA_real_),
         new_var2 = case_when(Sepal.Length < 5 ~ mean(new_var, na.rm = TRUE),
                          TRUE ~ NA_real_)) %>% 
  select(-new_var) %>% as_tibble()

Output:

# A tibble: 150 x 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var2
          <dbl>       <dbl>        <dbl>       <dbl> <fct>      <dbl>
 1          5.1         3.5          1.4         0.2 setosa     NA   
 2          4.9         3            1.4         0.2 setosa      4.67
 3          4.7         3.2          1.3         0.2 setosa      4.67
 4          4.6         3.1          1.5         0.2 setosa      4.67
 5          5           3.6          1.4         0.2 setosa     NA   
 6          5.4         3.9          1.7         0.4 setosa     NA   
 7          4.6         3.4          1.4         0.3 setosa      4.67
 8          5           3.4          1.5         0.2 setosa     NA   
 9          4.4         2.9          1.4         0.2 setosa      4.67
10          4.9         3.1          1.5         0.1 setosa      4.67
# ... with 140 more rows

Is there a simpler way of adding a variable to a data.frame in the form of a conditional aggregation using dplyr?

1

1 Answers

1
votes

You can use ifelse :

library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(new_var = ifelse(Sepal.Length < 5, 
                      mean(Sepal.Length[Sepal.Length < 5]), NA))  %>%
  ungroup()


#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
# 1          5.1         3.5          1.4         0.2 setosa    NA   
# 2          4.9         3            1.4         0.2 setosa     4.67
# 3          4.7         3.2          1.3         0.2 setosa     4.67
# 4          4.6         3.1          1.5         0.2 setosa     4.67
# 5          5           3.6          1.4         0.2 setosa    NA   
# 6          5.4         3.9          1.7         0.4 setosa    NA   
# 7          4.6         3.4          1.4         0.3 setosa     4.67
# 8          5           3.4          1.5         0.2 setosa    NA   
# 9          4.4         2.9          1.4         0.2 setosa     4.67
#10          4.9         3.1          1.5         0.1 setosa     4.67
# … with 140 more rows