2
votes

I have a multi-stage question that I have been trying to address using existing threads, but I haven't been fully effective yet, so I wanted to pose it here. I am transitioning to R from excel because of the volume of data I am trying to work with. I have some background in R, but most of what comes below is cobbled from various stack posts so if there is a different way to approach the problem I am all ears.

My raw data comes in 25 columns (Date and 24 individual hours of the day) for multiple years in a row as in the sample below:

date_seq <- seq(as.POSIXct("2012-01-01"), 
                as.POSIXct("2015-02-01"), 
                by=("hour"))
df <- data.frame(Date = strftime(date_seq, format="%Y-%m-%d"),
                 replicate(24,sample(1:9,27049,rep=TRUE)))

headers<-c("Date", "1:00 AM", "2:00 AM","3:00 AM", "4:00 AM","5:00 AM", "6:00 AM","7:00 AM", "8:00 AM","9:00 AM", "10:00 AM","11:00 AM", "12:00 PM","1:00 PM", "2:00 PM","3:00 PM", "4:00 PM","5:00 PM", "6:00 PM","7:00 PM", "8:00 PM","9:00 PM", "10:00 PM","11:00 PM", "12:00 AM")
colnames(df)<-headers

End Goal: Calculate average hourly values by month to create an “average day” for each month so I can eventually compare the “average days” month to month, season to season, and year to year. For example with the finished data I could do something like the chart below (just an example of the final data even though I need to do some other calculations with it).

Chart to demonstrate final data form

To that end here is what I have done so far and the associated problems:
library(readr)
library(lubridate)
library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)
library(reshape2)
library(chron)

df2<-melt(df,variable.name="Time",value.name = "Load",id.vars = c("Date"))
times<- as.POSIXct(df2$Time, format = "%I:%M %p", tz = "GMT")
df2$Time<-times(strftime(times, format = "%H:%M:%S", tz = "GMT"))

df3<-as.data.frame(df2)
df3<-separate(data = df3, col = Date, into = c("Year", "Month","Day"), sep = "\\-",remove=FALSE)

Seems good to this point, but now trying to create those average days is where I get stuck. When I run the below code it creates the hourly averages by month. It also unfortunately makes the day and date columns turn to NAs which isn’t a big deal for the end comparisons and following calculations, but clearly I am not doing something right. I have tried to minus out columns but I end up getting errors.

df_month<- df3 %>% 
  group_by(Month, Year, Time) %>%
  summarise_each(funs(mean(.,na.rm=TRUE)))

After this I have really struggled to build back the average days from this long format. Essentially I need to take the average hour 1 for January 2012 and combine it again with the averages for the other hours and repeat for all months.

I tried going back to the wide format of the raw data, but that became problematic for doing line graphs in ggplot2 even though I can graph those lines from the wide format in excel. I also tried messing with some for loops to create vectors for the average days, but to no avail.

Sorry for the long post I would really appreciate your insight both on the approach I have taken so far and how I should proceed.

1

1 Answers

1
votes

I don't see any gross mistakes in your code, so it's just a matter of clearing it up.

For example you can use tidyr's gather instead of the old melt, I'd use dplyr::mutate with lubridate's year(), month() and hour() instead of separate, and finally summarize_at, instead of summarize_each (which is now deprecated and creates the NAs).

library(dplyr)
library(tidyr)
library(lubridate)

df_month <- df %>% 
    gather(hours, Load, -Date) %>% 
    mutate(year  = year(Date),
           month = month(Date, label = TRUE),
           hour  = hour(as.POSIXct(hours, format = '%I:%M %p'))) %>% 
    group_by(year, month, hour) %>% 
    summarise_at(vars(Load), mean, na.rm = TRUE)

To create a ggplot from this data.frame is easy, the only particular thing to pay attention is the fact that the color aes have to be mapped to two variables, and that's why we use a interaction:

library(ggplot2)
ggplot(df_month) +
    geom_line(aes(hour, Load, color = interaction(month, year, sep = '-'))) +
    scale_y_continuous(limits = c(2, NA)) +
    scale_colour_discrete('')

(Data has been filtered in the plot to match the example, and also because data is random noise that makes an ugly graph)