R histogram showing time spent in each bin

Question

I'm trying to create a plot similar to the ones here:

Cadence distribution

Basically I want a histogram where each bin shows how much time was spent in that range of cadence (e.g. 1 hour in 0-20rpm, 3 hours in 21-40rpm, etc.).

library("rjson") # 3rd party library, so: install.packages("rjson")

# Load data from Strava API.
# Ride used for example is http://app.strava.com/rides/13542320
url <- "http://app.strava.com/api/v1/streams/13542320?streams[]=cadence,time"
d <- fromJSON(paste(readLines(url), collapse=""))

Each value in d$cadence (rpm) is paired with the same index in d$time (the number of seconds from the start).

The samples are not necessarily evenly spaced in time (as can be seen if you compare plot(x=d$time, y=d$cadence, type='l') with plot(d$cadence, type='l')).

If I do the simplest possible thing:

hist(d$cadence)

This produces something very close, but the Y value is "frequency" instead of time, and it ignores the time between each data point (so the 0rpm segment in particular will be underrepresented).

What do you mean that the Y value is "frequency"? You mean the label on the Y axis? That can be fixed with hist(d$cadence, ylab="Time"). — David Robinson
Are you saying the desired behavior is to ignore the times in between each data point? i.e. the bar originating from 0 represents the number of samples which were exactly 0? — Jeff Allen
When I tried your code I found that the number == 0 was 4425, and the number < 10 was 4439. So I don't think you'd see a drastic change in the bars when you in/exclude the "time between each data point." — Jeff Allen
In the same vein as @DavidRobinson 's comment, you can specify the bin edges with additional arguments to hist . That will solve your difficulty with zero-rpm values not being plotted. — Carl Witthoft
I've worked with data like this before. I believe what the OP wants is to account for uneven time between samples, e.g. if one sample was taken at t=10 with cadence of 95, and the next sample isn't until t=16 with cadence=93, then the metrics should act as if the cadence was 95 for all 10<=t<16. i.e. cadence was 95 for 6 seconds. The standard hist() call treats the "count" of this event as 1, not 6. — mac

mac · Accepted Answer · 2012-07-17T20:18:22

You need to create a new column to account for the time between samples.

I prefer data.frames to lists for this kind of thing, so:

d <- as.data.frame(fromJSON(paste(readLines(url), collapse="")))
# Seconds since the previous sample; the first sample has no predecessor, so it gets 0.
d$sample.time <- c(0, diff(d$time))
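
To see what this column ends up holding, here is a minimal sketch on toy data. The values are invented for illustration and are not taken from the Strava ride; diff() is used as shorthand for the same element-by-element subtraction.

```r
# Toy data, invented for illustration (not taken from the Strava ride).
toy <- data.frame(time    = c(0, 1, 2, 8, 9),
                  cadence = c(90, 92, 0, 0, 95))

# Seconds elapsed since the previous sample; the first sample gets 0.
toy$sample.time <- c(0, diff(toy$time))

print(toy)
```

The fourth row gets a sample time of 6 because no reading was recorded between t=2 and t=8, so that one reading stands in for six seconds of riding.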

Now that you have the sample times, you can simply "repeat" each cadence measurement once for every second of its sample time, and plot a histogram of the result:

hist(rep(x=d$cadence, times=d$sample.time),
     main="Histogram of Cadence", xlab="Cadence (RPM)",
     ylab="Time (presumably seconds)")

There's bound to be a more elegant solution that wouldn't fall apart for non-integer sample times (rep() truncates non-integer values of times towards zero), but this works with your sample data.

EDIT: Regarding a more elegant, generalized solution: you can handle non-integer sample times with something like new.d <- aggregate(sample.time~cadence, data=d, FUN=sum). The problem then becomes plotting a histogram from something that looks like a frequency table, but with non-integer frequencies. After some poking around, I've come to the conclusion that you'd have to roll your own histogram for this case: aggregate the data further into bins, then display the bins with a bar chart.
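
A rough sketch of that roll-your-own approach, using only base R. The data frame d2, its values, and the 20 rpm bin width are all assumptions made up for illustration; with the real data you would use d$cadence and d$sample.time and pick bin edges that suit the ride.

```r
# Toy data, invented for illustration; sample.time holds non-integer seconds.
d2 <- data.frame(
  cadence     = c(0, 0, 15, 35, 38, 62, 95, 97),
  sample.time = c(2.5, 4.0, 1.2, 0.8, 1.0, 3.3, 1.5, 0.7)
)

# Bin edges are an arbitrary choice here: 20 rpm wide, covering 0-120 rpm.
breaks <- seq(0, 120, by = 20)

# Assign each sample to a bin; right = FALSE makes bins closed on the left,
# so a cadence of exactly 0 falls into the first bin.
d2$bin <- cut(d2$cadence, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Sum the time spent in each bin; bins with no samples come back as NA.
time.per.bin <- tapply(d2$sample.time, d2$bin, sum)
time.per.bin[is.na(time.per.bin)] <- 0

# space = 0 removes the gaps between bars so the result looks like a histogram.
barplot(time.per.bin, space = 0, main = "Time spent per cadence bin",
        xlab = "Cadence (RPM)", ylab = "Time (seconds)")
```

Because the bar heights are sums of sample.time rather than counts, this works the same way for integer and non-integer sample times.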
