9
votes

I have a data frame that has a format like the following:

Month       Frequency
2007-08     2
2010-11     5
2011-01     43
2011-02     52
2011-03     31
2011-04     64
2011-05     73

I would like to create a histogram from this data, using X bins (X will probably be around 15, but the actual data has over 200 months), and using the data from the frequency column as the frequency for each bin of the histogram. How can I accomplish this?

I've tried two approaches so far, with the hist() and barplot() commands. The problem with hist() is that it does not seem to give me any way to specify that the Frequency column should supply the counts for the histogram. The problem with barplot() is that I don't have any flexibility in choosing X bins, and if there are omitted months, then the resulting graph is not a true histogram because the x-axis isn't continuous.

The only idea I have right now is to go with the barplot() approach, fill in the missing months with a value of 0 for Frequency, and use space=0 to remove the spacing between the bars. The problem with that is that it's not particularly easy to choose an arbitrary number of bins.


4 Answers

4
votes

To get this kind of flexibility, you may have to replicate your data. Here is one way of doing it with rep:

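n <- 10
# Example data: n distinct x values, each with a frequency f
dat <- data.frame(
    x = sort(sample(1:50, n)),
    f = sample(1:100, n))
dat

# Repeat each row f times, keeping only column x (as a data.frame)
expdat <- dat[rep(1:n, times=dat$f), "x", drop=FALSE]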

Now you have your data replicated in the data.frame expdat, allowing you to call hist with different numbers of bins:

par(mfcol=c(1, 2))
hist(expdat$x, breaks=50, col="blue", main="50 bins")
hist(expdat$x, breaks=5, col="blue", main="5 bins")
par(mfcol=c(1, 1))

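As a rough sketch of how the same rep idea might look on data shaped like the question's, the block below converts each "YYYY-MM" label to a fractional year so the x-axis is continuous, then replicates each value by its frequency. The data frame df and the helper column t are assumptions made for illustration. Note that a single number passed to breaks is only a suggestion, so hist may pick a slightly different bin count.

```r
# Hypothetical data in the same format as the question
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Turn "YYYY-MM" into a continuous value (fractional year)
yr <- as.integer(substr(df$Month, 1, 4))
mo <- as.integer(substr(df$Month, 6, 7))
df$t <- yr + (mo - 1) / 12

# Repeat each month's value Frequency times, then bin as usual
expanded <- rep(df$t, times = df$Frequency)
h <- hist(expanded, breaks = 15, col = "blue",
          xlab = "Year", main = "Approx. 15 bins")
```

Months that are missing from the data simply contribute nothing, so the gaps show up as empty bins rather than being skipped.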

4
votes

Take a gander at ggplot2.

If your data is in a data.frame called df:

library(ggplot2)
ggplot(df, aes(x=Month, y=Frequency)) + geom_bar(stat='identity')

Or, if you want a continuous time axis:

df$Month <- as.POSIXct(paste(df$Month, '01', sep='-'), format='%Y-%m-%d')
ggplot(df, aes(x=Month, y=Frequency)) + geom_bar(stat='identity')
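If the goal is to control the number of bins rather than draw one bar per month, one possible variation is to let ggplot2 do the binning and pass the counts through the weight aesthetic. This is only a sketch: the example data and the Date column are assumptions, and bins = 15 is taken from the question's rough target.

```r
library(ggplot2)

# Hypothetical data in the same format as the question
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Use the first day of each month as a continuous Date value
df$Date <- as.Date(paste0(df$Month, "-01"))

# weight makes each row count Frequency times when binned
p <- ggplot(df, aes(x = Date, weight = Frequency)) +
  geom_histogram(bins = 15) +
  labs(x = "Month", y = "Frequency")
print(p)
```

Because the weighting happens inside the binning step, no replicated copy of the data is needed.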

3
votes

Yes, rep-based solutions will waste too much memory in most interesting (large) cases. The HistogramTools CRAN package includes an efficient PreBinnedHistogram function, which creates a base R histogram object directly from pre-computed breaks and counts, which is what the original question provides.
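For readers who want to see the idea without an extra package, here is a minimal base R sketch that assembles a histogram object by hand. It is not the HistogramTools implementation; the example data, the month index, and n_bins are assumptions for illustration. The list components follow what hist() itself returns (breaks, counts, density, mids, xname, equidist).

```r
# Hypothetical data in the same format as the question
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Continuous month index: consecutive months differ by exactly 1
yr <- as.integer(substr(df$Month, 1, 4))
mo <- as.integer(substr(df$Month, 6, 7))
idx <- yr * 12 + (mo - 1)

# Equal-width bins spanning the whole observed range
n_bins <- 5
breaks <- seq(min(idx), max(idx) + 1, length.out = n_bins + 1)

# Sum the frequencies that fall in each bin; empty bins become 0
bin <- cut(idx, breaks = breaks, right = FALSE, include.lowest = TRUE)
counts <- as.vector(tapply(df$Frequency, bin, sum))
counts[is.na(counts)] <- 0

# Assemble an object with the same components hist() returns
h <- structure(
  list(
    breaks = breaks,
    counts = counts,
    density = counts / (sum(counts) * diff(breaks)),
    mids = (head(breaks, -1) + tail(breaks, -1)) / 2,
    xname = "Month index",
    equidist = TRUE
  ),
  class = "histogram"
)
plot(h, main = "Pre-binned monthly frequencies")
```

Memory use here depends on the number of rows and bins, not on the size of the frequencies.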

0
votes

Another possibility is to scale down your frequency variable by some large factor so that rep doesn't have as much work to do. Then adjust the vertical axis scale of the histogram by that same factor.
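A small sketch of this scaling idea, assuming made-up data and an arbitrary factor of 1000. Rounding the scaled frequencies loses some precision, so the restored counts are approximate.

```r
set.seed(1)

# Hypothetical data with large frequencies
dat <- data.frame(x = 1:20, f = sample(1e5:1e6, 20))

# Assumed scale factor; larger values save more memory but round more coarsely
scale <- 1000
f_small <- round(dat$f / scale)

# Replicate using the reduced frequencies only
x_rep <- rep(dat$x, times = f_small)

# Bin without plotting, then rescale the counts before drawing
h <- hist(x_rep, breaks = 5, plot = FALSE)
h$counts <- h$counts * scale
plot(h, col = "blue", xlab = "x", main = "Counts rescaled by 1000")
```

Rescaling h$counts before plotting keeps the vertical axis in the original units, instead of relabeling the axis by hand.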