Quite often, I find myself manually combining select() and mutate() functions within dplyr. This is usually because I'm tidying up a dataframe, want to create new columns based on the old columns, and only want keep the new columns.
For example, if I had data about heights and widths but only wanted to use them to calculate and keep the area then I would use:
library(dplyr)
df <- data.frame(height = 1:3, width = 10:12)
df %>%
mutate(area = height * width) %>%
select(area)
When there are a lot of variables being created in the mutate step it can be difficult to make sure they're all in the select step. Is there a more elegant way to only keep the variables defined in the mutate step?
One workaround I've been using is the following:
df %>%
mutate(id = row_number()) %>%
group_by(id) %>%
summarise(area = height * width) %>%
ungroup() %>%
select(-id)
This works but is pretty verbose, and the use of summarise() means there's a performance hit:
library(microbenchmark)
microbenchmark(
df %>%
mutate(area = height * width) %>%
select(area),
df %>%
mutate(id = row_number()) %>%
group_by(id) %>%
summarise(area = height * width) %>%
ungroup() %>%
select(-id)
)
Output:
min lq mean median uq max neval cld
868.822 954.053 1258.328 1147.050 1363.251 4369.544 100 a
1897.396 1958.754 2319.545 2247.022 2549.124 4025.050 100 b
I'm thinking there's another workaround where you can compare the original dataframe names with the new dataframe names and take the right complement, but maybe there's a better way?
I feel like I'm missing something really obvious in the dplyr documentation, so apologies if this is trivial!
dplyr::transmute
? - Natewith(df, data.frame(area = height*width))
is much faster. Ordf %$% data.frame(area = height*width)
if pipes are essential. However, benchmarking on such a tiny data set is not really meaningful. - Frankdplyr::transmute
is perfect. I had it in my mind that it only dropped variables that were referenced in transmute, but I was mistaken. Thanks! - mdpead