How to loop data from multiple data frame through a summarise dplyr function

Question

Hello I am writing because I am trying to place a group_by and summarise function through a loop tied to variables in a second dataset. I tried to do this through both a for loop and an apply loop.

I have one dataset that is a list of Species and attributes. d1 looks like

Species Height
Cenjac    67
Cirarv    24

d2 is patch data that I normally summarise which has the presence absence of the species in each patch, the nearest patch (Target), and the size of the patch.

Patch  Target  Size   Cenjac Cirarv 
  a       c    250      0      1
  b       a    18       1      0
  c       a    20       1      0

My normal method of summarising is manually through group_by and summarise to create a new variable which is the Height from d1 the Size and presence/absence from d2. I need to write the Height in each time. (Note:This is not my real equation)

DfullCJ<- group_by(d2, Patch, Target) %>% summarise(Cenjacmax=(67*Size*Cenjac))

I would then need to re-write the code each time for each species

 DfullCA<- group_by(d2, Patch, Target) %>% summarise(Cirarvmax=(24*Size*Cirarv))

Ideally, I would be able to automate this process through either a for loop or apply. Is there no way to set the Species name as a variable and then pull from d1 both the Height and the corresponding Species name (which is also the name of presence absence column in d2) to plug into the group_by summarise function. Or or run the function through a loop with d1 as a list.

Thanks to any one who can help me.

DfullCJ<- group_by(d2, Patch, Target) %>% summarise(Cenjacmax=(67*Size*Cenjac), Cirarvmax=(24*Size*Cirarv)) You can summarise multiple in one command, would this help you out? Note that you would need to have merged your dataframes into one. — Bart
Thanks! I definitely should have seen that you can summarise multiple.. But that is really only a small part of my issue. My two dataframes are incompatable really. One are species attributes and one is patch data. I suppose I could just have one data sheet with each Species Height as a column just copy and pasted for every patch row, but that seems very inelegant as I have >150 species (in other words an extra 150 columns for each species attribute) and >3000 patches. And either way I would need to manually write out all 150 equations. But maybe that is the easiest way to do it... — Kevin
Usually one uses an aggregate function with summarise. Did you just need mutate? — Parfait
An earlier answer (which must have been deleted) suggested using mutate and spreading. "d2 %>% rownames_to_column('rn') %>% gather(Species, val, Cenjac:Cirarv) %>% left_join(d1) %>% mutate(val = HeightSizeval)) %>% select(-Height) %>% spread(Species, val)" However, as I mentioned my dataset is too large and while this would work, when I try to include each species it overruns the memory limits. I may need to break the dataset apart — Kevin

Parfait Parfait · Accepted Answer · 2019-08-13T15:33:07

Consider reshaping your data from wide to long to create Species and Indicator columns then merge to height data for your needed calculation or aggregation. Usually long format is the preferred format in data science as aggregation, merging, plotting, modeling, and other methods is much easier without looping across hundreds of indicator columns.

reshape

d2_long <- reshape(d2, varying = list(names(d2)[4:ncol(d2)]), v.names = "Indicator",
                   times = names(d2)[4:ncol(d2)], timevar = "Species",
                   new.row.names = 1:1E5, direction = "long")
d2_long
#   Patch Target Size Species Indicator id
# 1     a      c  250  Cenjac         0  1
# 2     b      a   18  Cenjac         1  2
# 3     c      a   20  Cenjac         1  3
# 4     a      c  250  Cirarv         1  1
# 5     b      a   18  Cirarv         0  2
# 6     c      a   20  Cirarv         0  3

merge

merge_df <- merge(d2_long, d1, by="Species")
merge_df$Value <- with(merge_df, Size*Height*Indicator)

merge_df

#   Species Patch Target Size Indicator id Height Value
# 1  Cenjac     a      c  250         0  1     67     0
# 2  Cenjac     b      a   18         1  2     67  1206
# 3  Cenjac     c      a   20         1  3     67  1340
# 4  Cirarv     a      c  250         1  1     24  6000
# 5  Cirarv     b      a   18         0  2     24     0
# 6  Cirarv     c      a   20         0  3     24     0

aggregate

agg_raw <- aggregate(Value ~ Patch + Target, merge_df, 
                    function(x) c(count=length(x), min=min(x), median=median(x), 
                                  mean=mean(x), max=max(x)))

agg_df <- do.call(data.frame, agg_raw)
agg_df

#   Patch Target Value.count Value.min Value.median Value.mean Value.max
# 1     b      a           2         0          603        603      1206
# 2     c      a           2         0          670        670      1340
# 3     a      c           2         0         3000       3000      6000

Rextester demo

How to loop data from multiple data frame through a summarise dplyr function

2 Answers