1
votes

I have a data frame made of 420 rows and 37 columns from insect field sampling data. I am trying to sum column values every 5 rows so that every 5 rows becomes just 1. In addition I am trying to make the non numerical values in my dataframe($site, $date, $plot) "collapse" so they fit the single new row made of the sums I mentioned.

I have tried using advice from this thread:

Summing columns on every nth row of a data frame in R

Including using package dplyr's summarize_each function as well as gl and colSums that are part of data.table.

I have tried these on my whole dataframe

library(data.table) setDT(FinalData)[, as.list(colSums(.SD)), by = gl(ceiling(420/5), 5, 420)]

this gives me:

Error in colSums(.SD) : 'x' must be numeric

and 

library(dplyr)
 FinalData %>%
   group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
   summarise_each(funs(sum))

which gives me error: 
Error in Summary.factor(c(4L, 4L, 4L, 4L, 4L), na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

  Site.Date.Plot  CarA  CarB    CarC...
1.SL.VI.1          0     0        1
2.SL.VI.1          0     0        0
3.SL.VI.1          0     6        0
4.SL.VI.1          0     0        3
5.SL.VI.1          1     0        0
...
every 5 rows has a different $Site.Date.Plot. I expect this:

  Site.Date.Plot  CarA  CarB    CarC...
1. SL.VI.1          1     6       4
2. SL.VI.2         ...   ...      ...

But I get the above error messages from above.
3
Hi, and welcome to SO! Could you provide a sample of data so we can see the situation? Take a look at how to make a reproducible exampleCalum You
Hi Calum, I provided a small portion of my data in my post, should I post more than that? Do you need my whole datasheet?cjag
It's better to use dput so people can copy and paste. Also, including 20ish rows may have been better just to verify that every 5 rows there is a different Site.Date.Plot. So, dput(FinalData[1:20, 1:4]) would have been great.Cole

3 Answers

1
votes

Does this work?

FinalData %>%
   group_by(Site.Date.Plot) %>% 
   summarise_all(list(sum))

This will preserve the first column and sum the rest.

0
votes

You could try using mutate_at if you want to keep the first column Site.Date.Plot

library(dplyr)

FinalData %>%
  group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
  mutate_at(2:37, sum, na.rm = TRUE) %>%
  slice(1)

Or it will also work with summarise_at but you'll loose the first column in this case

FinalData %>%
  group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
  summarise_at(-1, sum, na.rm = TRUE)
0
votes

The first issue is that in your data.table and dplyr solutions, the sum function is operating on the factor of Site.Date.Plot. That's why you are receiving the error.

The second improvement is that your factor Site.Date.Plot changes every 5 rows. In other words, we can simply group by that instead of creating a different index.

These solutions are untested since there isn't enough sample data.

library(tidyverse)

 FinalData %>%
   group_by(Site.Date.Plot) %>%
   summarise_all(sum)

library(data.table)

setDT(FinalData)
FinalData[, lapply(.SD, sum), by = 'Site.Date.Plot']

#base R
aggregate(x = FinalData[, -1], by = list(FinalData$Site.Date.Plot), FUN = sum)