I have a multi-stage question that I have been trying to address using existing threads, but I haven't been fully effective yet, so I wanted to pose it here. I am transitioning to R from excel because of the volume of data I am trying to work with. I have some background in R, but most of what comes below is cobbled from various stack posts so if there is a different way to approach the problem I am all ears.
My raw data comes in 25 columns (Date and 24 individual hours of the day) for multiple years in a row as in the sample below:
date_seq <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2015-02-01"),
by=("hour"))
df <- data.frame(Date = strftime(date_seq, format="%Y-%m-%d"),
replicate(24,sample(1:9,27049,rep=TRUE)))
headers<-c("Date", "1:00 AM", "2:00 AM","3:00 AM", "4:00 AM","5:00 AM", "6:00 AM","7:00 AM", "8:00 AM","9:00 AM", "10:00 AM","11:00 AM", "12:00 PM","1:00 PM", "2:00 PM","3:00 PM", "4:00 PM","5:00 PM", "6:00 PM","7:00 PM", "8:00 PM","9:00 PM", "10:00 PM","11:00 PM", "12:00 AM")
colnames(df)<-headers
End Goal: Calculate average hourly values by month to create an “average day” for each month so I can eventually compare the “average days” month to month, season to season, and year to year. For example with the finished data I could do something like the chart below (just an example of the final data even though I need to do some other calculations with it).
Chart to demonstrate final data form
To that end here is what I have done so far and the associated problems:
library(readr)
library(lubridate)
library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)
library(reshape2)
library(chron)
df2<-melt(df,variable.name="Time",value.name = "Load",id.vars = c("Date"))
times<- as.POSIXct(df2$Time, format = "%I:%M %p", tz = "GMT")
df2$Time<-times(strftime(times, format = "%H:%M:%S", tz = "GMT"))
df3<-as.data.frame(df2)
df3<-separate(data = df3, col = Date, into = c("Year", "Month","Day"), sep = "\\-",remove=FALSE)
Seems good to this point, but now trying to create those average days is where I get stuck. When I run the below code it creates the hourly averages by month. It also unfortunately makes the day and date columns turn to NAs which isn’t a big deal for the end comparisons and following calculations, but clearly I am not doing something right. I have tried to minus out columns but I end up getting errors.
df_month<- df3 %>%
group_by(Month, Year, Time) %>%
summarise_each(funs(mean(.,na.rm=TRUE)))
After this I have really struggled to build back the average days from this long format. Essentially I need to take the average hour 1 for January 2012 and combine it again with the averages for the other hours and repeat for all months.
I tried going back to the wide format of the raw data, but that became problematic for doing line graphs in ggplot2 even though I can graph those lines from the wide format in excel. I also tried messing with some for loops to create vectors for the average days, but to no avail.
Sorry for the long post I would really appreciate your insight both on the approach I have taken so far and how I should proceed.