1
votes

I know there is a similar question (r aggregate dataframe: some columns unchanged, some columns aggregated), but it doesn't work for what I need.

I have a dataframe with observations of different species, but many observations are from the same point (same longitude and latitude).

df=data.frame(sampleID=c("a","b","c","d","e","f"),latitude=c(46.4,46.4,73.2,36.6,36.6,55.3), longitud=c(66.4,66.4,34.6,46.4,46.4,50.7), material=c("soil","soil","water","water","water","water"), biome=c("forest", "forest", "sea","sea","sea","lake"), sp1=c(2,1,0,4,0,0),sp2=c(0,2,3,1,0,1), sp3=c(0,1,1,4,3,0))

df
  sampleID latitude longitud material  biome sp1 sp2 sp3
1        a     46.4     66.4     soil forest   2   0   0
2        b     46.4     66.4     soil forest   1   2   1
3        c     73.2     34.6    water    sea   0   3   1
4        d     36.6     46.4    water    sea   4   1   4
5        e     36.6     46.4    water    sea   0   0   3
6        f     55.3     50.7    water   lake   0   1   0



I would like to sum the observations (columns sp1,sp2, and sp3, in the example provided) from the same location and same type of material, but leaving the other columns the same. In the case of "biome" it should be the same value, so just the same, and in the case of sample ID keep one ID. Something like:

  sampleID latitude longitud material  biome sp1 sp2 sp3
1        a     46.4     66.4     soil forest   3   2   1
3        c     73.2     34.6    water    sea   0   3   1
4        d     36.6     46.4    water    sea   4   1   7
6        f     55.3     50.7    water   lake   0   1   0

I have tried with aggregate, with ddply, and with data.table and lapply. But I don't know how to apply the sum function only to a set of columns (6:8), therefore it fails. The aggregate gives me an error and ddply deletes the other columns which it cannot sum (ID and biome). And it orders the observations according to latitude, so finding out which ID was each one is a mess.

Some of the things I tried:

ddply(df, .(latitude, longitude, material), numcolwise(sum))   #This one deletes the columns which it cannot sum, ID and biome (and many other variables I haven't shown in this example)

aggregate(.~latitude+longitude+material,data=df,sum) #Gives an error
aggregate(cbind(df[,c(6:8)]) ~ as[,c(2,3,4)], data=df, sum) #Gives an error

Also, this is a simplified version of the data frame I have, I have more variables and more than 200 species, each one with its actual name in latin. I can't be typing all of them, I need the selection by df[,c(18:220)], or something like that, to work.

Any help would be appreciated.

1

1 Answers

2
votes

We can do this in base R:

do.call(rbind, 
        DF |>
          split(DF$longitud, DF$latitude, DF$material) |> 
          lapply(function(x) { 
            cbind(x[1:5][1,], t(apply(x[6:8], 2, sum, na.rm = TRUE)))})) |>
  `rownames<-`(1:4)

  sampleID latitude longitud material  biome sp1 sp2 sp3
1        d     36.6     46.4    water    sea   4   1   7
2        a     46.4     66.4     soil forest   3   2   1
3        f     55.3     50.7    water   lake   0   1   0
4        c     73.2     34.6    water    sea   0   3   1

The above solution without the native pipe:

DF <- do.call(rbind, 
              lapply(split(df, df$longitud, df$latitude, df$material), 
                     function(x) { 
                       cbind(x[1:5][1,], t(apply(x[6:8], 2, sum, na.rm = TRUE)))})) 

`rownames<-`(DF, seq_len(nrow(DF)))

  sampleID latitude longitud material  biome sp1 sp2 sp3
1        c     73.2     34.6    water    sea   0   3   1
2        d     36.6     46.4    water    sea   4   1   7
3        f     55.3     50.7    water   lake   0   1   0
4        a     46.4     66.4     soil forest   3   2   1

We can also do this with aggregate:

aggregate(DF[startsWith(names(DF), "sp")], 
          list(latitude = DF$latitude, 
               longitude = DF$longitud, 
               material = DF$material), 
          sum, na.rm = TRUE)

  latitude longitude material sp1 sp2 sp3
1     46.4      66.4     soil   3   2   1
2     73.2      34.6    water   0   3   1
3     36.6      46.4    water   4   1   7
4     55.3      50.7    water   0   1   0

Or if you would like to use a formula, we make use of cbind for numeric variables since they are more than one:

aggregate(cbind(DF$sp1, DF$sp2, DF$sp3) ~ longitud + latitude + material, 
          data = DF, sum)

  longitud latitude material V1 V2 V3
1     66.4     46.4     soil  3  2  1
2     46.4     36.6    water  4  1  7
3     50.7     55.3    water  0  1  0
4     34.6     73.2    water  0  3  1

And also if you are willing to use package dplyr, it would make the job much easier:

library(dplyr)

df %>%
  group_by(latitude, longitud, material) %>%
  mutate(across(starts_with("sp"), ~ sum(.x))) %>%
  slice_head(n = 1) %>%
  arrange(sampleID)

# A tibble: 4 x 8
# Groups:   latitude, longitud, material [4]
  sampleID latitude longitud material biome    sp1   sp2   sp3
  <chr>       <dbl>    <dbl> <chr>    <chr>  <dbl> <dbl> <dbl>
1 a            46.4     66.4 soil     forest     3     2     1
2 c            73.2     34.6 water    sea        0     3     1
3 d            36.6     46.4 water    sea        4     1     7
4 f            55.3     50.7 water    lake       0     1     0