0
votes

I have a dataset (example) as below,

 data <- data.frame(pc = c("A","A","A","A","A","A", "B","B","B","B"),   #categorical 
            index = c(1, 1, 2, 2, 2, 3, 4, 5, 5, 5),                    #categorical 
            g= c(1, 2, 4, 3, 6, 7, 8, 5, 9, 3),                         #numeric 
            h= c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4))                         #categorical 

I want to group by 'pc', iterate over all combinations based on 'index' to get the summation of values in 'g' and number of categories in 'h' columns, and keep the rows of the combination that yields the highest summation value from 'g' + number of categories from 'h'.

For example, in pc=A group, index=1 has two rows, index=2 has three, index=3 has one, so in total I have 2x3x1= 6 combinations (each combination has three rows, one with index=1, one with index=2, one with index=3). I want to keep the rows (one row from each unique index) that yields the highest (summation value from 'g' + number of categories from 'h'). The number of index and length of each index are all different in each pc group.

Just an example to visualise the combination for pc=A group,

 combination sum_of_values_in_g    number_of_categories_in_h
          #1                 12                     2
          #2                 11                     3
          #3                 14                     3
          #4                 13                     2
          #5                 12                     3
          #6                 15                     3

My desired result in this example will be

 pc index g h
  A     1 2 1
  A     2 6 2
  A     3 7 3
  B     4 8 3
  B     5 9 3

I have done some research on how to get combinations (Iterate over unique combination of groups in a data frame, How to iterate through all combinations of columns and apply function by group in R? and Combinations by group in R).. but I couldn't figure out how to get the right combination in each group and run further operation in each combination... Any input or direction will be appreciated!

1
What if the largest sum and the greatest number of categories are found in more than one combinations? Then, which one do you want to get, first, last, or all? Also, what if no combination has both the largest sum and the greatest number of categories? Then which condition has a higher priority?ekoam
Oh, thanks for clarity. I'm looking for the largest sum of (the sum from g and number of category from h), so in the example, #6 combination give me a value of 18 so I want to keep the three rows (each from index1, 2, 3) in final output. If two combinations yields the same final value, the one yields the higher number of category will have higher priority.user10301352
What is number_of_categories_in_h? How do you calculate it? For example, why it is 2 for #1 in your example?ThomasIsCoding
#1 combination has the first row from index=1, the first row from index=2, and the only one row from index=3. So the categories (from column h) for each row is category 1, 1, and 3, so there are 2 categories in total for #1 combination.user10301352

1 Answers

0
votes

Here is a brute force solution. The run time could be really long given a large dataset.

We need functions from these packages:

library(tidyr)
library(dplyr)
library(purrr)

This is the first step, we need a function to first split your data into several groups (split(transpose(df), df[[split_by]])), then find all possible row combinations across them (cross(...)), and finally merge each of them into a single dataframe (lapply(..., bind_rows)).

perm_all <- function(df, split_by){
  lapply(cross(split(transpose(df), df[[split_by]])), bind_rows)
}

(transpose turns an n-row dataframe into an n-element list of single-row dataframes)

This is the second step, we loop through all dataframes in that list to see which one satisfies your requirements.

which_max <- function(ls_of_df, numer, categ) {
  test_stats <- vapply(
    ls_of_df, 
    function(df) {
      temp <- length(unique(df[[categ]]))
      c(sum(df[[numer]]) + temp, temp)
    }, 
    double(2L)
  )
  # You could have multiple maxima for those sums
  out <- which(test_stats[1L, ] == max(test_stats[1L, ]))
  # but after the second test (i.e. find the greatest number of categories), you should have one and only one combination left
  out[[which.max(test_stats[2L, out])]]
}

Now, we use a single function to perform these two steps.

max_of_all_perm <- function(df, group_var, numer, categ) {
  l <- perm_all(df, group_var)
  l[[which_max(l, numer, categ)]]
}

And run it across all groups defined by pc

data %>% 
  nest(data = -pc) %>% 
  mutate(data = lapply(data, max_of_all_perm, "index", "g", "h")) %>% 
  unnest(data)

Output

# A tibble: 5 x 4
  pc    index     g     h
  <chr> <dbl> <dbl> <dbl>
1 A         1     2     1
2 A         2     6     2
3 A         3     7     3
4 B         4     8     3
5 B         5     9     3