0
votes

I have a hourly dataset from 1996 - 2010 in the following format:

             date         value  
1- - -1996-01-01 00:00:00- - -  NA  
2- - -1996-01-01 01:00:00- - -  38  
3 - - -1996-01-01 02:00:00- - - 44  
4- - -1996-01-01 03:00:00- - -  48  
5- - -1996-01-01 04:00:00- - -  42  
6- - -1996-01-01 05:00:00- - -  44  
7- - - 1996-01-01 06:00:00- - - 38  
8- - - 1996-01-01 07:00:00- - - 42  
9- - -1996-01-01 08:00:00- - -  44  
10- - -1996-01-01 09:00:00- - - 44  

I have lot of missing hours data and I am tring to calculate the missing hour values by taking the average of previous and next hour, and if many hours missing I am trying to calculate by taking the average of that hour for every day that year, just wondering if this is possible?

I have tried the following but this gives me average of the complete dataset:

a = c(NA, 1, 2, 3, 10)   
a[which(is.na(a)==TRUE)] = mean(a,na.rm = T)

I would really appreciate any advice about how I should proceed with this calculation.

2

2 Answers

2
votes

na.aggregate in zoo does that. Its only one line of code to fill in the missing values:

# read in the data

Lines <- "1996-01-01 00:00:00 NA  
1996-01-01 01:00:00 38  
1996-01-01 02:00:00 43
1997-01-01 00:00:00 44  
1997-01-01 01:00:00 45"

library(zoo)
library(chron)
z <- read.zoo(text = Lines, index = 1:2, FUN = paste, FUN2 = as.chron)

# fill in the missing values

na.aggregate(z, hours, FUN = mean)
1
votes

You could probably do this using some handy function from the zoo package. For instance, na.approx, with maxgap = 1 should linearly interpolate all the gaps of length one. Then you'd probably want to use na.aggregate, splitting by year and hour, to fill longer gaps with that periods' mean.

Here's a simple example to give you a sense of how these functions work:

set.seed(124)
tt <- as.POSIXct("2000-01-01 10:00:00") + 3600*c(1:100,10000:10100)
dd <- runif(201)

aa <- data.frame(x1 = tt,x2 = dd)
aa$x2[sample(201,30)] <- NA
aa$x3 <- na.approx(aa$x2,maxgap = 1)
aa$x4 <- na.aggregate(aa$x3,by = format(aa$x1,"%Y-%H"))

Note that if your series has leading or trailing NAs, you might get errors, since the "linear interpolation" piece doesn't make much sense in that case. So you'd have to fill those in some other way.