Compute sum and relative proportion by group for any number of columns with random names using dplyr

Question

I want to calculate the relative proportion by group for every column, except the grouping column, of a data frame. The code should be written once and then reused with different data frames, which will have a different number of columns with different names. Because I rely heavily on dplyr in this project, I want to achieve this with dplyr.

I have read this topic on a similar but less complex problem: Use dynamic variable names in `dplyr`, and also vignette("programming", "dplyr"), but I am still not able to get the quoting right. I am really stuck at this point and would like some advice from more experienced developers.

To reproduce the problem, I have set up a minimal example with a data frame with randomly created data columns and a grouping column.

library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)), 
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))

group <- c("group1","group2","group3")

df <- cbind(df, group)

The following function should achieve two things:

calculate the sum of every column in the data frame by group
calculate the relative proportions of every column in the data frame by group

propsum <- function(df, expr){

  expr_quo <- enquo(expr)

  sum <- paste(quo_name(expr), "sum", sep = ".")
  prop <- paste(quo_name(expr), "prop", sep = ".")

  df %>%
    group_by(., group) %>%
    mutate(., !! sum :=  sum(!! expr_quo),
              !! prop := expr / !! sum * 100) -> df

  return(df)
}

for(i in length(df)-1){
  propsum(df, names(df)[i]) -> df_new
}

The expected result is a data frame with the initial columns, the sums by group for every initial column, and the relative proportions by group for every initial column. So in the example, the data frame should have 10 columns (1 grouping column, 3 initial data columns, 3 columns with sums by group, 3 columns with relative proportions by group).

However, I am getting the following error:

Error in sum(~names(df)[i]) : invalid 'type' (character) of argument
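The error message suggests that the captured expression `names(df)[i]` evaluates to a character string, so `sum()` receives the column name itself instead of the column. A minimal sketch of one way around this, assuming a dplyr version that provides the `.data` pronoun; the data frame `toy` and the names `col` and `x_sum` are hypothetical and only for illustration:

```r
library(dplyr)

# Hypothetical toy data; "x" stands in for one of the randomly named columns
toy <- data.frame(x = c(1, 2, 3, 4), group = c("a", "a", "b", "b"))
col <- "x"  # a column name held as a string, like names(df)[i]

# sum(col) would fail because it tries to add up the string "x" itself.
# The .data pronoun looks the string up as a column of the data instead.
res <- toy %>%
  group_by(group) %>%
  mutate(x_sum = sum(.data[[col]])) %>%
  ungroup()

res$x_sum
# 3 3 7 7
```

Here the string is only ever used as a lookup key, so no quoting or unquoting of the column is needed on the right-hand side.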

In the vignette, the code example for a similar task is:

my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))

  mutate(df,
    !! mean_name := mean(!! expr),
    !! sum_name := sum(!! expr)
  )
}

my_mutate(df, a)
#> # A tibble: 5 x 6
#>      g1    g2     a     b mean_a sum_a
#>   <dbl> <dbl> <int> <int>  <dbl> <int>
#> 1     1     1     5     4      3    15
#> 2     1     2     3     2      3    15
#> 3     2     1     4     1      3    15
#> 4     2     2     1     3      3    15
#> # … with 1 more row

I have tried many different things so far, but I am not able to get the RHS to use the correct column. What am I doing wrong?

How do you define relative proportions of every column? Relative with respect to? — NelsonGon
Let's assume we have a column with four values, each of which is 25. The sum of this column would be 100, so the proportion of every value is 0.25 or 25%. If we split this column into two groups of 2 values each, the relative proportion of each value within its group would be 0.5 or 50%. I hope it is clear now; I am not a native English speaker. I am sorry if I made a mistake here. — CoCoL0r3s
What proportions do you expect for the sample data? Check my "answer" below and let me know if it works as you expect. — NelsonGon
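The arithmetic described in the comments can be checked with a short base R sketch. The vectors `values` and `grp` are hypothetical and simply mirror the four-values-of-25 example:

```r
values <- c(25, 25, 25, 25)
grp <- c("g1", "g1", "g2", "g2")

# Proportion of each value relative to the whole column
overall <- values / sum(values)
overall
# 0.25 0.25 0.25 0.25

# Proportion of each value relative to the sum of its own group
by_group <- values / ave(values, grp, FUN = sum)
by_group
# 0.5 0.5 0.5 0.5
```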

CoCoL0r3s · Accepted Answer · 2019-11-13T14:54:10

I have found a solution, which I want to share in case somebody faces a similar task. The solution is to call rlang::parse_expr() explicitly to turn the variable names into expressions.

Here is the working example:

library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)), 
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))

group <- c("group1","group2","group3")

df <- cbind(df, group)

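# Add the group-wise sum and the within-group proportion of one column.
# a_var, p_var and sum_var are expressions (symbols), not strings.
gpercentage <- function(df, a_var, p_var, sum_var){

  df %>%
    group_by(group) %>%
    mutate(!! sum_var := sum(!! a_var),
           !! p_var := !! a_var / sum(!! a_var))
}

# Loop over the data columns only; the grouping column comes last.
# The index range is fixed before the loop starts, so the columns
# appended in each iteration are not processed again.
for(i in seq_len(ncol(df) - 1)){

  # Turn the column name strings into expressions
  a_var <- rlang::parse_expr(names(df)[i])
  p_var <- rlang::parse_expr(paste(names(df)[i], "P", sep = "."))
  sum_var <- rlang::parse_expr(paste(names(df)[i], "SUM", sep = "."))

  df <- gpercentage(df, a_var, p_var, sum_var)
}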
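As an alternative to the loop, the same kind of result can probably be obtained in a single step with `across()`. This is a different technique from the `parse_expr()` approach above, not part of the original solution, and it assumes dplyr 1.0.0 or later. The data frame `toy` and its column names are hypothetical:

```r
library(dplyr)

# Hypothetical toy data with two value columns and a grouping column
toy <- data.frame(
  a = c(1, 3, 2, 2),
  b = c(10, 30, 5, 15),
  group = c("g1", "g1", "g2", "g2")
)

res <- toy %>%
  group_by(group) %>%
  # everything() skips the grouping column inside a grouped mutate();
  # .names builds "<column>.SUM" and "<column>.P" for each input column
  mutate(across(
    everything(),
    list(SUM = sum, P = ~ .x / sum(.x)),
    .names = "{.col}.{.fn}"
  )) %>%
  ungroup()

names(res)
# "a" "b" "group" "a.SUM" "a.P" "b.SUM" "b.P"
```

Because the columns are selected rather than named one by one, this sketch does not depend on how many data columns there are or what they are called.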
