Packages:
library(data.table)
library(dplyr)
Dataset:
iris
data.table workflow:
When I want to add a column to a data.table that displays the mean for specific rows (in this case: when Sepal.Length < 5) and grouped by Species, I can do the following:
iris_table <- as.data.table(iris)
iris_table[Sepal.Length < 5, `:=`(new_var = mean(Sepal.Length)), by = Species] %>% as_tibble()
Output:
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3 1.4 0.2 setosa 4.67
3 4.7 3.2 1.3 0.2 setosa 4.67
4 4.6 3.1 1.5 0.2 setosa 4.67
5 5 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa 4.67
8 5 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa 4.67
10 4.9 3.1 1.5 0.1 setosa 4.67
# ... with 140 more rows
Here, new_var displays the mean for when Sepal.Length < 5 in the first 10 rows for setosa.
Question:
How can I do the same operation with dplyr in a similar efficient way?
I can achieve the same result by using an intermediate variable (new_var) in order to calculate the desired variable(new_var2), but it looks rather complicated and not as concise as with data.table:
iris %>%
group_by(Species) %>%
mutate(new_var = case_when(Sepal.Length < 5 ~ Sepal.Length,
TRUE ~ NA_real_),
new_var2 = case_when(Sepal.Length < 5 ~ mean(new_var, na.rm = TRUE),
TRUE ~ NA_real_)) %>%
select(-new_var) %>% as_tibble()
Output:
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var2
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3 1.4 0.2 setosa 4.67
3 4.7 3.2 1.3 0.2 setosa 4.67
4 4.6 3.1 1.5 0.2 setosa 4.67
5 5 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa 4.67
8 5 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa 4.67
10 4.9 3.1 1.5 0.1 setosa 4.67
# ... with 140 more rows
Is there a simpler way of adding a variable to a data.frame in the form of a conditional aggregation using dplyr?