I have a data file containing readings sampled at 30-second intervals. The file is organized as follows:
> head(dframe)
            timestamp    power
1 2015-08-01 00:00:04 584.1379
2 2015-08-01 00:00:34 585.8087
3 2015-08-01 00:01:04 584.9335
4 2015-08-01 00:01:34 584.4366
5 2015-08-01 00:02:04 584.2829
To aggregate the 30-second readings into hourly means, I use the following R command:
df = aggregate(list(power = dframe$power),
               by = list(timestamp = cut(as.POSIXct(dframe$timestamp), "hour")),
               mean)
This works correctly, but the real issue is the time it takes on large files (a year of data). Can I speed up the conversion? In other words, is there a faster alternative for aggregating second-level data to hourly means in R?
UPDATE: As suggested by @akrun and @Joshua, I tried four different methods for the same problem. For other Stack Overflow users, here is the usage of each method and the time each one takes.
library(data.table)
library(dplyr)
library(xts)

dframe <- read.csv(path, header = TRUE, sep = ",")
dframe$timestamp <- as.POSIXct(dframe$timestamp)
xframe <- dframe  # used for the dplyr run below
# using aggregate
system.time(
  df1 <- aggregate(list(power = dframe$power),
                   by = list(timestamp = cut(dframe$timestamp, "hour")),
                   mean)
)
# using data.table
system.time(
  dfx <- setDT(dframe)[, .(power = mean(power)),
                       by = .(timestamp = cut(timestamp, "hour"))]
)
# using dplyr
system.time(
  df2 <- xframe %>%
    group_by(timestamp = cut(timestamp, "hour")) %>%
    summarise(power = mean(power))
)
# using xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp = trunc(index(h), "hours"), power = coredata(h))
})
Respective times taken on the two datasets (one month and three months of data) are below. For the one-month dataset:
Method       user  system  elapsed
Aggregate   0.137   0.005    0.142
data.table  0.031   0.001    0.032
dplyr       0.035   0.001    0.036
xts         0.053   0.000    0.053
For the three-month dataset:
Method       user  system  elapsed
Aggregate   0.456   0.019    0.475
data.table  0.099   0.002    0.102
dplyr       0.099   0.004    0.103
xts         0.158   0.004    0.161
Caveat: All approaches except xts change the type of timestamp from POSIXct to factor, because cut() returns a factor. That means you have to convert the timestamp column back afterwards, which costs some additional CPU cycles. In short: if you need a POSIXct timestamp at the end, xts is the best choice; otherwise go with data.table.
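For reference, a minimal sketch of that back-conversion, assuming dfx is the data.table result from above:

# cut() produced factor labels such as "2015-08-01 00:00:00"; converting
# via as.character() turns them back into POSIXct timestamps.
dfx$timestamp <- as.POSIXct(as.character(dfx$timestamp))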
DATASET: The dataset used can be found at link.
Comments:
dplyr: library(dplyr); dframe %>% group_by(timestamp = cut(as.POSIXct(timestamp), 'hour')) %>% summarise(power = mean(power)) – akrun
cut is causing a lot of the slow-down – Rich Scriven
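Following up on that last comment, here is a hedged sketch that avoids cut() entirely: flooring the numeric timestamp to the hour keeps the grouping column as POSIXct, so no back-conversion is needed. The variable name dfz and this variant are my own illustration, not a benchmarked answer from the thread.

# Floor each timestamp to the start of its hour (3600 s) instead of cut().
# Note: flooring epoch seconds assumes a whole-hour UTC offset.
library(data.table)
dfz <- setDT(dframe)[, .(power = mean(power)),
                     by = .(timestamp = as.POSIXct(floor(as.numeric(timestamp) / 3600) * 3600,
                                                   origin = "1970-01-01"))]

If lubridate is available, lubridate::floor_date(timestamp, "hour") expresses the same truncation more readably.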