0
votes

I have a dataframe. For simplicity, I am leaving out many columns and rows:

    Distance     Type
1      162         A
2      27182       A
3      212         C
4       89         B
5       11         C

I need to find 6 consecutive rows in the dataframe, such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification, one may think to filter out all Type C rows, and then proceed, but then the rows that were not originally consecutive will become consecutive upon filtering, and that's no good.

For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.

1
Also, there may be multiple regions, and the number of consecutive rows with an average distance of 1000 may exceed 6. There could be 6 rows (with avg of 1000), then 100 in between, then another 8 (with avg of 1000), then 2 in between, then another 18 (with avg of 1000) - user41912

1 Answers

0
votes

Maybe a solution with data.table library ?

For reproducibility, here is a data sample, based on what you wrote.

library(data.table)

# data orig (with row numbers...)
DO<-"Distance     Type
1      162         A
2      27182       A
3      212         C
4       89         B
5       11         C
6      1234       A"

# data : sep by comma
DS<-gsub('[[:blank:]]+',';',DO)
# data.frame
DF<-read.table(textConnection(DS),header=T,sep=';',stringsAsFactors = F)
#data.table
DT<-as.data.table(DF)

Then, make a function to increment a counter each time a sequence of identical value is found :

# function to set sequencial group number
mkGroupRep<-function(x){
  cnt=1L
  grp=1L
  lx<-length(x)
  ne<- x[-lx] != x[-1L] #next not equal
  for(i in seq_along(ne)){if(ne[i])cnt=cnt+1;grp[i+1]=cnt}
  grp
}

And use it with data.table 'multiple assignment by reference' :

# update dat : set group number based on sequential type
DT[,grp:=mkGroupRep(Type)]

# calc sum of distance and number of item in group, by group
DT[,`:=`(
  distMean=mean(Distance),
  grpLength=.N
),by=grp]


# filter what you want :
DT[Type != 'C' & distMean >100 & grpLength==2 | grpLength==3]

Output :

   Distance Type grp distMean grpLength
1:      162    A   1    13672         2
2:    27182    A   1    13672         2