
Hi, I have a data frame (~4 million rows) with time series data for different sites and events.

Here is a rough idea of my data (obviously on a different scale). I have several similar time series, so I've kept it general, as I want to be able to apply it in different cases:

Data1 <- data.frame(DateTimes =as.POSIXct("1988-04-30 13:20:00")+c(1:10,12:15,20:30,5:13,16:20,22:35)*300,
                      Site = c(rep("SiteA",25),rep("SiteB",28)),
                      Quality = rep(25,53),
                      Value = round(runif(53,0,5),2),
                      Othermetadata =  c(rep("E1",10),rep("E2",15),rep("E1",10),rep("E2",18)))

What I'm looking for is a simple way to group and aggregate this data to different time steps while keeping metadata that doesn't vary within the group.

I have tried using the zoo library and zoo::aggregate, i.e.:

library(zoo)
library(dplyr)  # select() below comes from dplyr
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg  <- aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)

However, when I do this I lose all my metadata and merge different sites' data.

I wondered about using plyr or dplyr to split up the data and then applying the aggregate, but I'm still going to lose my other columns.

Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution there either.

Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step, or a somewhat regular time step with missing points. The FUN applied may also vary (mostly sum or mean), as may the fields I want to split by. The sketch below shows the kind of call I have in mind.
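To make that concrete, here is a rough sketch of the kind of parameterised helper I'm after (aggTS, fields and step are names I've made up for illustration; note this version still drops the metadata columns, which is exactly the part I'm missing):

library(dplyr)

# Hypothetical helper: bin DateTimes down to an arbitrary step (in seconds),
# split by arbitrary fields, and apply an arbitrary FUN to Value
aggTS <- function(df, fields, step = 3600, FUN = sum) {
  df %>%
    mutate(timeagg = DateTimes - as.numeric(DateTimes) %% step) %>%
    group_by(timeagg, across(all_of(fields))) %>%
    summarise(Value = FUN(Value), .groups = "drop")
}

# e.g. hourly sums per site, or 15-minute means per site and event:
aggTS(Data1, "Site", step = 3600, FUN = sum)
aggTS(Data1, c("Site", "Othermetadata"), step = 900, FUN = mean)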


Edit: I found the solution after Hercules Apergis pushed me in the right direction.

Data1     <- Data1 %>% mutate(timeagg = as.POSIXct(trunc(DateTimes, "hours")))  # hourly grouping variable
newData   <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newData) %>% select(-DateTimes, -Value) %>% distinct()

The original DateTimes column wasn't a grouping variable; it was the time series itself. So I added a grouping variable for my aggregated time (here, the time truncated to the hour) and summarised on that. The problem was that if I joined on this new column alone, I missed any points that fell within the hour but not exactly on the hour, hence the inner_join %>% select %>% distinct approach. Now hopefully it works with my real data, not just example data!
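A quick sanity check on the example data (my addition; timeagg is the hourly grouping column created above): after distinct(), any Site/timeagg pair appearing more than once means some metadata did vary within that hour.

finaldata %>% count(Site, timeagg) %>% filter(n > 1)
# zero rows if every metadata column is constant within each group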

Could you show what the desired output is? – Edgar Santos

1 Answer


Given the function that you are using for aggregation:

aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)

You want to sum the values by groups of times and not lose the other columns. You can do this simply with the dplyr package:

library(dplyr)
# Sum Value within each group of identical DateTimes
newdata   <- Data1 %>% group_by(DateTimes) %>% summarise(Total = sum(Value))
# Attach the group totals back onto the original rows
finaldata <- inner_join(Data1, newdata, by = "DateTimes")

newdata is a data.frame in which the Values are summed for each group of DateTimes. inner_join then merges the parts that are common to those two datasets by the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
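One caveat: grouping on the raw DateTimes only combines rows whose timestamps are exactly identical. If the goal is a coarser time step (say hourly, as in the question's zoo attempt), a hedged variant would first derive a binned time column, reusing the question's %% 3600 trick (Hour is a name I've introduced), and group on that:

library(dplyr)
Data1     <- Data1 %>% mutate(Hour = DateTimes - as.numeric(DateTimes) %% 3600)  # floor to the hour
newdata   <- Data1 %>% group_by(Hour, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newdata, by = c("Hour", "Site"))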