I am trying to explore the values of one variable in my grouped dataset around the peaks of another variable. My dataset is quite large (4000 groups). For clarity, let's look at an example with `iris`. I would first like to identify the peaks in sepal length for each species.
id <- as.numeric(iris$Species)
iris2 <- cbind(iris, id)
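library(purrr)  # provides map() and re-exports %>%
# within-group positions of the Sepal.Length peaks, one list element per id
result <- iris2 %>%
  split(.$id) %>%
  map(~ quantmod::findPeaks(.$Sepal.Length, thresh = 0))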
Using the above, I managed to identify the row numbers at which there are peaks in sepal length for every group id:
> $`1` [1] 7 9 12 14 16 20 22 25 30 33 35 38 41 46 48 50
>
> $`2` [1] 4 6 8 10 15 17 20 24 28 35 38 43 49
>
> $`3` [1] 4 7 9 11 14 20 22 24 27 33 37 39 41 43 45 47 49
Next, I want to identify the values of sepal width at the positions identified earlier. Basically, I would like to find the max and min sepal width values in each group and examine whether they are close to the peaks in sepal length, specifically within 5 rows before or after. I would like to add a TRUE/FALSE column that flags each id based on this criterion.
The nested list appeared to be a complicated structure to use, so I transformed it into a dataframe:
library(data.table)
dfs <- lapply(result, data.frame, stringsAsFactors = FALSE)
r_df <- rbindlist(dfs, use.names = TRUE, fill=TRUE, idcol = "file")
`r_df` is a 2-column data frame containing the species id and the within-group row numbers of the peaks in sepal length. The next step is to identify the max and min values of sepal width.
library(dplyr)
# bare column names so max()/min() are computed per Species group, not over all rows
iris2 <- iris2 %>% group_by(Species) %>%
  mutate(max_sep = max(Sepal.Width), min_sep = min(Sepal.Width))
What I have not managed to do, however, is to examine whether the max and min sepal widths are within 5 rows of a peak. E.g.: for species 1, max_sep = 4.4, in row 16. Looking at the results of `findPeaks` earlier, it looks like the flag would be TRUE, since it is close to a peak (exactly on one of the peaks, actually).
[![example of max_sep][1]][1]
I have been trying solutions using `group_by`, since I am more familiar with dplyr, but I haven't made much progress. An additional issue is that in both the nested list and the data frame the row numbers refer to the within-group row number and not the overall one. An example of what I tried:
r_df <- r_df %>% group_by(file) %>% mutate(frame= case_when(nrow(iris2)== r_df$X..i.. & file== id ~ iris2$max_sep))
This gives the following warnings:
1: In file == id : longer object length is not a multiple of shorter object length
2: In nrow(iris2) == r_df$X..i.. & file == id : longer object length is not a multiple of shorter object length
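For context, here is a minimal sketch of where this warning seems to come from. It assumes that inside the grouped `mutate()` the name `file` is the current group's slice, while `id` resolves to the full 150-element vector created at the top (it is not a column of `r_df`). The group size of 16 is taken from the `findPeaks` output for id 1, and `msg` is a name used only for illustration.

```r
# Lengths taken from the question: id 1 has 16 peak rows, iris has 150 rows
file <- rep("1", 16)
id <- as.numeric(iris$Species)

msg <- NULL
# 150 is not a multiple of 16, so R recycles `file` and raises a warning
res <- withCallingHandlers(
  file == id,
  warning = function(w) {
    msg <<- conditionMessage(w)   # keep the warning text for inspection
    invokeRestart("muffleWarning")
  }
)

length(res)  # length of the longer vector, not of the group
msg
```

The comparison itself does not fail: the shorter vector is recycled against the longer one, so the result no longer lines up with the rows of the group.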
Any ideas will be really appreciated! Many thanks

[1]: https://i.stack.imgur.com/s4ces.png
`Map("[", split(iris2$Sepal.Width, iris2$id), result)`? - GKi

`[` is the typical way to subset. E.g. `x[4]` will give you the 4th element of `x`. With `Map` you can use `[`, where in this case the second argument stands for the `x` and the third for the `4` in `x[4]`. - GKi
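
A minimal sketch of how the `Map("[", ...)` suggestion could be combined with the 5-row check from the question. The peak positions are copied from the `findPeaks` output shown above so that the example runs without quantmod. `near_peak()` and the column names `max_near_peak` and `min_near_peak` are hypothetical names introduced here, and `which.max()` / `which.min()` only look at the first max/min when there are ties.

```r
# Peak positions copied from the findPeaks output in the question
result <- list(
  `1` = c(7, 9, 12, 14, 16, 20, 22, 25, 30, 33, 35, 38, 41, 46, 48, 50),
  `2` = c(4, 6, 8, 10, 15, 17, 20, 24, 28, 35, 38, 43, 49),
  `3` = c(4, 7, 9, 11, 14, 20, 22, 24, 27, 33, 37, 39, 41, 43, 45, 47, 49)
)
id <- as.numeric(iris$Species)

# Sepal.Width per group; positions inside each element are within-group rows,
# so they line up with the positions stored in `result`
width_by_id <- split(iris$Sepal.Width, id)

# Map("[", x, i) evaluates x[[k]][i[[k]]] for every group k
width_at_peaks <- Map("[", width_by_id, result)

# Hypothetical helper: is position `pos` within `window` rows of any peak?
near_peak <- function(pos, peaks, window = 5) {
  any(abs(peaks - pos) <= window)
}

# One row per id with a TRUE/FALSE flag for the max and the min width
check <- data.frame(
  id = names(result),
  max_near_peak = unname(mapply(function(w, p) near_peak(which.max(w), p),
                                width_by_id, result)),
  min_near_peak = unname(mapply(function(w, p) near_peak(which.min(w), p),
                                width_by_id, result))
)
check
```

For id 1 the max width (4.4) sits at within-group row 16, which is itself one of the listed peaks, so `max_near_peak` is `TRUE` there, matching the example in the question. Because everything is computed per group, the within-group row numbers never need to be converted to overall row numbers.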