I'm trying to create a histogram from time-series data in R, similar to this question. Each bin should show the total duration of the values falling within it. I have non-integer sample times in a zoo object with thousands of rows. The timestamps are irregular, and the data is assumed to be constant between consecutive timestamps (sample-and-hold).

Example data:

library(zoo)
library(ggplot2)
timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5", "2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3", "2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data  = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))
ggplot(x.df, aes(x = Date, y = Value)) + geom_step() + scale_x_datetime(labels = scales::date_format("%H:%M:%OS"))

Please see the time-series plot here. Creating a histogram with hist(z, freq = TRUE) ignores the timestamps: Plot from hist method.

My desired output is a histogram with duration in seconds on the y-axis, something like this: Histogram with non-integer duration on y-axis.

Edit:

I should point out that the data values are not integers, and that I want to be able to control the bin width(s). I could use diff(timestamp) to create a (non-integer) column with the duration of each point, and plot a bar graph as suggested by @MKR:

x.df = data.frame(DurationSecs = as.numeric(diff(timestamp), units = "secs"), Value = data[-length(data)])
ggplot(x.df, aes(x = Value, y = DurationSecs)) + geom_bar(stat = "identity")

This gives a histogram with the right bar heights for the example, but it fails when the values are floating-point numbers.

2 Answers

0
votes

Since you want duration (in seconds) on the y-axis, you should add a duration column to x.df. A histogram with stat = "sum" will then fit your needs. The steps are:

library(zoo)
library(dplyr)
library(ggplot2)
timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5",
 "2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3", 
 "2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data  = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))

# DurationSecs is added as numeric. It is the time in seconds until the
# next sample; the last sample has no successor and gets 0.

x.df <- x.df %>% arrange(Date) %>%
  mutate(DurationSecs = coalesce(as.numeric(difftime(lead(Date), Date, units = "secs")), 0))


# Draw the plot now
ggplot(x.df, aes(x = Value, y = DurationSecs)) + geom_histogram(stat="sum") 


#The data
#                 Date Value DurationSecs
#1 2018-02-21 15:00:00     0          2.5
#2 2018-02-21 15:00:02     3          2.7
#3 2018-02-21 15:00:05     5          1.8
#4 2018-02-21 15:00:07     1          2.3
#5 2018-02-21 15:00:09     3          0.7
#6 2018-02-21 15:00:10     0          2.0
#7 2018-02-21 15:00:12     2          0.0
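
The bar heights can be cross-checked without plotting. The following is a minimal base-R sketch, not part of the plotting code above: it rebuilds the example data so it runs on its own, and the names value, duration_secs and totals are chosen here for illustration only.

```r
# Rebuild the example data (same timestamps and values as above).
timestamp <- as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5",
                          "2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0",
                          "2018-02-21 15:00:09.3", "2018-02-21 15:00:10.0",
                          "2018-02-21 15:00:12.0"), tz = "GMT")
value <- c(0, 3, 5, 1, 3, 0, 2)

# Time each sample is held: the gap to the next timestamp, in seconds.
# The last sample has no successor, so it gets 0.
duration_secs <- c(as.numeric(diff(timestamp), units = "secs"), 0)

# Total duration per distinct value; these should match the bar heights.
totals <- tapply(duration_secs, value, sum)
print(round(totals, 1))
# Totals per value: 0 -> 4.5, 1 -> 2.3, 2 -> 0, 3 -> 3.4, 5 -> 1.8
```

The value 0 occurs twice (held for 2.5 s and 2.0 s), which is why its bar reaches 4.5.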
0
votes

After some trial and error I found a solution. The answer provided by MKR sort of works, but I could not set the number of bins and it failed for floating-point values.

I came across the wonderful functions cut and xtabs in this question: How to plot an histogram with y as a sum of the x values for every bin in ggplot2. The solution provided there was painfully slow, because it draws the duration of each data point as a separate stacked bar.
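
To illustrate what the two functions do, here is a small sketch on made-up data (the toy data frame and its column names are hypothetical, not taken from the linked question): cut assigns each value to an interval, and xtabs with a left-hand side sums that variable within each interval.

```r
# Toy data (hypothetical): a value and a weight for each observation.
toy <- data.frame(value  = c(0.2, 0.7, 1.1, 1.9, 2.0),
                  weight = c(1.0, 2.0, 0.5, 1.5, 3.0))

# cut() maps each value to an interval. right = FALSE gives [a, b) intervals,
# and include.lowest = TRUE closes the last one, so 2.0 falls in [1,2].
toy$bin <- cut(toy$value, breaks = c(0, 1, 2),
               right = FALSE, include.lowest = TRUE)

# xtabs() with a left-hand side sums that variable within each bin.
sums <- xtabs(weight ~ bin, data = toy)
print(sums)
# Bin [0,1) sums to 3 and bin [1,2] sums to 5.
```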

I don't need separate bars for each data-point, I just need the sum of the durations within each bin. This is my solution:

library(dplyr)
library(magrittr)
library(zoo)
library(ggplot2)

timestamp = as.POSIXct(c("2018-02-21 15:00:00.0", "2018-02-21 15:00:02.5",
"2018-02-21 15:00:05.2", "2018-02-21 15:00:07.0", "2018-02-21 15:00:09.3",
"2018-02-21 15:00:10.0", "2018-02-21 15:00:12.0"), tz = "GMT")
data  = c(0,3,5,1,3,0,2)
z = zoo(data, order.by = timestamp)
x.df <- data.frame(Date = index(z), Value = as.numeric(coredata(z)))

# DurationSecs is added as numeric. It is the time in seconds until the
# next data point; the last data point has no successor and gets 0.
x.df <- x.df %>% arrange(Date) %>%
  mutate(DurationSecs = coalesce(as.numeric(difftime(lead(Date), Date, units = "secs")), 0))

# Adding a column of bins to the dataframe:
BinCount <- 7
x.df$bins = cut(x.df$Value, pretty(x.df$Value, n = BinCount), include.lowest = TRUE, right = FALSE)

# Creating a new dataframe containing bins and the sum of DurationSecs for each bin.
y.df = data.frame(xtabs(DurationSecs ~ bins, x.df))

# Ready to plot
ggplot(y.df, aes(x = bins, y = Freq)) +
    geom_bar(stat = "identity") +
    ylab("Duration") +
    xlab("Value") +
    scale_x_discrete(drop = FALSE) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1)) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) 

The result is shown here. As a bonus, the labels on the x-axis are really beautiful, and I have the frequency table available for further analysis.
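
If a fixed bin width is preferred over a bin count, the breaks can be built with seq instead of pretty. This is a sketch under assumptions: the floating-point values below are invented for illustration, the durations are reused from the example, and bin_width is a hypothetical parameter.

```r
# Hypothetical floating-point samples with the time (in seconds) each is held.
value         <- c(0.12, 3.47, 4.98, 1.03, 3.51, 0.40, 2.26)
duration_secs <- c(2.5, 2.7, 1.8, 2.3, 0.7, 2.0, 0.0)

# Fixed-width breaks that cover the whole value range.
bin_width <- 0.5
breaks <- seq(floor(min(value) / bin_width) * bin_width,
              ceiling(max(value) / bin_width) * bin_width,
              by = bin_width)

# Assign each value to a [a, b) bin; the last bin is closed on the right.
bins <- cut(value, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Sum of durations per bin; empty bins are kept with a total of 0.
totals <- xtabs(duration_secs ~ bins)
print(totals[totals > 0])
```

Because xtabs keeps the empty levels, the resulting table can be passed to data.frame and plotted with the same ggplot call as above without gaps on the x-axis.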