4
votes

I would like to have a tidyverse solution for the following problem. In my dataset, I have data on various factor levels. I would like to create a new factor level "Total" that is the sum of all values Y at existing factor levels of X. This can be done, for example, with:

mutate(Data, X = fct_collapse(X, Total = c("A", "B", "C", "D"))) %>%
  group_by(X) %>% 
  summarize(Y = sum(Y))

However, this also necessarily overwrites the original factor levels. I would have to combine the original dataset with the new collapsed dataset in an additional step.

One solution I have used in the past to retain the original levels is to bring data in the wide format and proceed with rowwise() and mutate() to create a new variable with the "Total" and then reshape back to long.

spread(Data, key = X, value = Y) %>%
  rowwise() %>%
  mutate(Total = sum(A, B, C, D)) %>%
  gather(1:5, key = "X", value = "Y")

However, I am very unhappy with this solution since using rowwise() is not considered good practice. It would be great if you could point me to an available alternative solution how to combine data under different factor levels while retaining original levels.

Minimal reproducible example:

Data<-data.frame(
X = factor(c("A", "B", "C", "D")),
Y = c(1000, 2000, 3000, 4000))

Expected result:

# A tibble: 5 x 2
  X         Y
  <chr> <dbl>
1 A      1000
2 B      2000
3 C      3000
4 D      4000
5 Total 10000
3
df %>% janitor::adorn_totals("row") does this work for you? It requires loading an additional package and Total will not be added as a factor.M--
In principle, the function does exactly what I've been looking for (with the slight qualification that it would be great to have "Total" as a factor level). However, I would indeed prefer a tidyverse solution to the problem. But I guess would then have to define my own function e.g. based on @Rui Barradas suggestions.miwin
I do something similar to this often enough that I added a function bind_self to a package I wrote for work: github.com/camille-s/camiller/blob/master/R/bind_self.R It's a little beyond the scope of an SO answercamille
@miwin if you write your own function, is it really a tidyverse solution? I am asking conceptually, not arguing about what you may prefer, which is, obviously, totally up to you.M--
@M-M Your point is well taken :) What I meant was to have an answer that works without additional packages. But your solution is certainly the most concise one and very helpful.miwin

3 Answers

5
votes

Using library, this would be straightforward.

Data %>% janitor::adorn_totals("row") %>% mutate(X=factor(X))

  # X     Y
  # A     1000
  # B     2000
  # C     3000
  # D     4000
  # Total 10000

Looking at the output structure:

str(output)

# 'data.frame': 5 obs. of  2 variables:
#  $ X: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ Y: num  1000 2000 3000 4000 10000
4
votes

Using the suggestion in @M--'s first version of his comment to the question, now edited, I have added bind_rows.
I have also changed the input dataset a bit. Following the OP's and @camille's comment, this dataset has a factor level "Z" but keeps the original order and adds level "Total" at the end.

Data <- data.frame(
  X = factor(c("A", "B", "C", "Z")),
  Y = c(1000, 2000, 3000, 4000))

Data %>%
  mutate(lvl = levels(X),
         X = fct_collapse(X, Total = c("A", "B", "C", "Z")),
         X = as.character(X)) %>%
  bind_rows(mutate(Data, X = as.character(X)), .) %>%
  mutate(X = factor(X, levels = c(lvl, "Total"))) %>%
  group_by(X) %>% 
  summarize(Y = sum(Y)) -> d

d
## A tibble: 5 x 2
#  X         Y
#  <fct> <dbl>
#1 A      1000
#2 B      2000
#3 C      3000
#4 Z      4000
#5 Total 10000

Check the output factor levels.

levels(d$X)
#[1] "A"     "B"     "C"     "Z"     "Total"
1
votes

This solution can also be used in this case:

library(dplyr)

Data %>%
  add_row(X = "Total", Y = sum(.$Y)) %>%
  mutate(X = factor(X))

      X     Y
1     A  1000
2     B  2000
3     C  3000
4     D  4000
5 Total 10000

Data %>%
  add_row(X = "Total", Y = sum(.$Y)) %>%
  mutate(X = factor(X)) %>%
  {levels(.$X)}

[1] "A"     "B"     "C"     "D"     "Total"