4
votes

I am trying to achieve the following: I have a dataset, and a function that subsets this dataset and then performs a series of operations on the subset. Subsetting happens based on row names. I am able to do it step by step (i.e. running this function for each subset separately), but I have a list of desired subsets, and I would like to loop over this list. It sounds complicated, so please check the example below. This is what I can do:

#dataframe with rownames
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6), 
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) = c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")

# two different non-overlapping subsets
his <- c("HTA1", "HTA2", "HTB2")
cse <- c("CSE1", "CSE2")

#this is the function I have
fav_complex <- function (data, complex) {
  small_data<- data[complex,] #subset only the rows that you need 
  sum.all<-colSums(small_data) #calculate sum of columns
  return(sum.all)
}

#I generate two separate named vectors
his_data <- fav_complex(data = whole_dataset, complex = his)
cse_data <- fav_complex(data = whole_dataset, complex = cse)

#and merge them
merged_data<- rbind(his_data,cse_data)

It looks like this:

> merged_data
         wt1 wt2
his_data   6   9
cse_data  12   6

I would like to generate the merged_data dataframe without having to call the 'fav_complex' function once per subset. In real life I have about 20 subsets, and that is a lot of code. This is my attempt, which doesn't work:

#I first have a character vector listing all the variable names
subset_list <- c("his", "cse")

#then create a loop that goes over this list

#make an empty dataframe
merged_data2 <- data.frame()

#fill it with a for loop output
for (element in subset_list) {
  result <- fav_complex(data = whole_dataset, element)
  merged_data2 <-rbind(merged_data2, result)
}

I know this is wrong. In this loop, 'element' is just a string, rather than a variable with stuff in it. But I don't know how to make it a variable. noquote(element) didn't work. I tried reading about non-standard evaluation and eval(), substitute(), but it is too abstract for me. I think I am not there yet with my R expertise.

2
There are errors. 1) In the function it's data not whole_dataset. 2) In the loop use result <- fav_complex(data = whole_dataset, get(element)) - Rui Barradas
I'd propose a modified workflow: having a function both subset a data frame and perform a series of operations seems like making the function more complex than it needs to be. I'd recommend simplifying the function to just do the series of operations, and using standard tools to split the data into pieces, apply the function, and combine the results. In base, you can use split, lapply, do.call(rbind), or, if you don't mind extra dependencies, purrr or similar. (Or, more simply, dplyr / data.table grouped operations if the operations really are as simple as "sum all columns") - Gregor Thomas
@joran - thank you, this simple advice worked. However, the output of the for loop is different than the manually created merged_data, in that it lacks colnames and rownames. Would you have any suggestion how to introduce them? I would also be grateful if you could tell me why you don't think using get is a good idea. @RuiBarradas, thank you, I have corrected the error. This solution also produces a dataframe without row names and column names. @Gregor, this is a very simplified example and I find this weird way more convenient, but I might try to re-write it if necessary! - Wera
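
Regarding the missing row and column names mentioned in the last comment, here is one possible sketch. It assumes the subsets can be kept in a named list (called subsets here, a name that is not in the original post) instead of being looked up by name with get(). Combining the named result vectors with do.call(rbind, ...) should keep both sets of names; note that the result is a matrix, not a data frame.

```r
# Data and function as in the question
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) <- c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")

fav_complex <- function(data, complex) {
  small_data <- data[complex, ]  # keep only the requested rows
  colSums(small_data)            # named vector of column sums
}

# Assumption: subsets live in a named list instead of separate variables
subsets <- list(his = c("HTA1", "HTA2", "HTB2"),
                cse = c("CSE1", "CSE2"))

# Apply the function to every subset; list names carry over to the results
results <- lapply(subsets, fav_complex, data = whole_dataset)

# Stack the named vectors: list names become row names,
# vector names become column names
merged_data2 <- do.call(rbind, results)
merged_data2
```

If the subsets already exist as separate variables, mget(c("his", "cse")) should build the same named list, so nothing has to be retyped.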

2 Answers

2
votes

Consider by() to run the needed operation across all subsets. But first create a group column:

# ANY FUNCTION TO APPLY ON SUBSETS (REMOVE GROUP COL)
fav_complex_new <- function (sub) {  
  sum.all <- colSums(transform(sub, group=NULL)) 
  return(sum.all)
}

# ASSIGN GROUPING
whole_dataset$group <- ifelse(row.names(whole_dataset) %in% his, "his",
                              ifelse(row.names(whole_dataset) %in% cse, "cse", NA))

# BY CALL
df_list <- by(whole_dataset, whole_dataset$group, FUN=fav_complex_new)
# COMBINE ALL DFs IN LIST
merged_data <- do.call(rbind, df_list)

Rextester demo (includes OP's original and above solution)
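
With about 20 subsets, the nested ifelse() calls get long. As a rough sketch of the same grouping idea, the group labels could be built from a named list with stack() and match(), and, if the operation really is just a column sum, base R's rowsum() could do the aggregation. The names subsets, lookup, group and merged_by_group are made up for this illustration.

```r
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) <- c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")

# Assumption: all subsets are collected in one named list
subsets <- list(his = c("HTA1", "HTA2", "HTB2"),
                cse = c("CSE1", "CSE2"))

# Long-format lookup table: 'values' holds row names, 'ind' the subset name
lookup <- stack(subsets)

# Group label for each row of the data (NA if a row is in no subset)
group <- as.character(lookup$ind[match(row.names(whole_dataset),
                                       lookup$values)])

# Column sums per group; the row names of the result are the group labels
merged_by_group <- rowsum(whole_dataset, group)
merged_by_group
```

This only covers non-overlapping subsets, as in the question; a row that belongs to several subsets would need the split/lapply style instead.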

1
votes

Following @Gregor's suggestion of a modified workflow, would you consider this solution, including some bonus data wrangling?

  1. Put the data that's currently in row names in its own column.
  2. Add a column for complex. We can do this programmatically in case the data are large.
  3. Use dplyr to create split-apply-combine summaries of data grouped by complex.

It could work like this

library(dplyr)

whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
                        wt2 = c(2, 3, 4, 4, 2),
                        id = factor(c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")))

whole_dataset <- mutate(whole_dataset,
                        complex = case_when(
                          grepl("^HT", id) ~ "his",
                          grepl("^CSE", id) ~ "cse")
                        ) %>%
  group_by(factor(complex))

whole_dataset %>% summarize(sum_wt1 = sum(wt1),
                            sum_wt2 = sum(wt2))

# # A tibble: 2 x 3
#   `factor(complex)` sum_wt1 sum_wt2
#   <fct>               <dbl>   <dbl>
# 1 cse                    12       6
# 2 his                     6       9
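
If there are many measurement columns, spelling out one sum() per column gets tedious. A possible variation, assuming dplyr 1.0 or later (for across() and the .groups argument) and that all columns to be summed start with "wt", might look like this. The name summary_by_complex is only illustrative.

```r
library(dplyr)

whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
                        wt2 = c(2, 3, 4, 4, 2),
                        id  = c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2"))

summary_by_complex <- whole_dataset %>%
  # Label each row with its complex, based on the id prefix
  mutate(complex = case_when(grepl("^HT", id)  ~ "his",
                             grepl("^CSE", id) ~ "cse")) %>%
  group_by(complex) %>%
  # Sum every column whose name starts with "wt" within each group
  summarise(across(starts_with("wt"), sum), .groups = "drop")

summary_by_complex
```

Grouping by the plain complex column rather than factor(complex) also gives the grouping column a cleaner name in the output.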